MultiMedia Modeling: 30th International Conference, MMM 2024, Amsterdam, The Netherlands, January 29 – February 2, 2024, Proceedings, Part III (Lecture Notes in Computer Science, 14556) 3031533100, 9783031533105


English Pages [552]

Table of contents :
Preface
Organization
Contents – Part III
Global-to-Local Feature Mining Network for RGB-Infrared Person Re-Identification
1 Introduction
2 Related Work
2.1 RGB-Infrared Person Re-Identification
2.2 Intermediate Modality Learning
3 Proposed Method
3.1 Overview
3.2 Attention-Aware Feature Mining Module
3.3 Local Information Mining Module
3.4 Objective Function
4 Experiments
4.1 Datasets and Settings
4.2 Comparison with State-of-the-Art Methods
4.3 Ablation Study and Visualization
5 Conclusion
References
Semantic Transition Detection for Self-supervised Video Scene Segmentation
1 Introduction
2 Related Work
2.1 Long Video Scene Segmentation
2.2 Self-supervised Learning in Videos
3 Method
3.1 Pseudo-Boundary Extraction
3.2 Shot Duration
3.3 Pre-training
3.4 Fine-Tuning
4 Experiments
4.1 Experimental Setup
4.2 Comparison with State-of-the-Art Methods
4.3 Ablation Studies
4.4 Visualization of Pseudo-Boundary Results
4.5 Visualization of Context Embedding Distribution
5 Conclusion
References
Multi-task Collaborative Network for Image-Text Retrieval
1 Introduction
2 Related Work
2.1 Image-Text Retrieval
2.2 Multi-task Learning
3 Methodology
3.1 Shared Feature Representation
3.2 Multi-task Collaborative Learning
4 Experiments
4.1 Dataset and Protocols
4.2 Comparison with Existing Methods
4.3 Ablation Study
4.4 Effect of Different Hyper-parameter
4.5 Visualization of Retrieval Results
5 Conclusions
References
FGENet: Fine-Grained Extraction Network for Congested Crowd Counting
1 Introduction
2 Related Work
2.1 Methods Under Density-Map Framework
2.2 Methods Under the Point Framework
3 Our Method
3.1 Network Design
3.2 FGFP Module
3.3 TTC Loss
4 Experiments
4.1 Datasets
4.2 Model Evaluation
4.3 Ablation Study
5 Conclusion
References
MSMV-UNet: A 2.5D Stroke Lesion Segmentation Method Based on Multi-slice Feature Fusion
1 Introduction
2 Method
2.1 Multi-slice Dense Feature Fusion
2.2 Inter-slice Attention Module
2.3 Multi-view Soft Voting Strategy
3 Experiments
3.1 Datasets
3.2 Implementation Details
3.3 Comparison of Different Methods
3.4 Influence of Consecutive Slice Quantity
3.5 Influence of Multi-view Soft Voting Strategy
4 Conclusion
References
Non-Local Spatial-Wise and Global Channel-Wise Transformer for Efficient Image Super-Resolution
1 Introduction
2 Related Work
3 Proposed Method
3.1 Overall Architecture
3.2 Non-local Spatial-Wise and Global Channel-Wise Transformer
3.3 Loss Function
4 Experiments
4.1 Experimental Setup
4.2 Comparisons with State-of-the-Art Methods
4.3 Ablation Studies
5 Conclusions
References
MobileViT-FocR: MobileViT with Fixed-One-Centre Loss and Gradient Reversal for Generalised Fake Face Detection
1 Introduction
2 Related Work
3 The Proposed Method
3.1 Model Overview
3.2 Loss Function Improvement
4 Experiment
4.1 Experiment Settings
4.2 Dataset
4.3 Choose a Good Base Model
5 Conclusion
References
ASF-Conformer: Audio Scoring Conformer with FFC for Speaker Verification in Noisy Environments
1 Introduction
2 Related Work
3 Method
3.1 Network Architecture
3.2 Downsampling Module with Audio Scoring (D-AS)
3.3 F-Conformer Block
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Evaluation Metrics
4.4 Ablation Study
4.5 Qualitative Results
5 Conclusions
References
Prior-Knowledge-Free Video Frame Interpolation with Bidirectional Regularized Implicit Neural Representations
1 Introduction
2 Related Work
2.1 Video Frame Interpolation
2.2 Implicit Neural Representations
3 Method
3.1 Implicit Neural Representation
3.2 Latent Code Interpolation
3.3 Bidirectional Regularization Framework (BiRF)
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Results
4.4 Ablation Study
5 Conclusion
References
Two-Stage Reasoning Network with Modality Decomposition for Text VQA
1 Introduction
2 Related Work
2.1 Representation Learning in Text VQA
2.2 Feature Interaction in Text VQA
3 Methodology
3.1 Multimodal Feature Extraction
3.2 Modality-Specific Attention Module
3.3 Semantic-Guided Modality Interaction
3.4 Answer Prediction
4 Experiments
4.1 Datasets and Metrics Settings
4.2 Implementation Details
4.3 Comparison with State-of-the-Art Methods
4.4 Ablation Study
4.5 Case Study and Visualization
5 Conclusion
References
Localization and Local Motion Magnification of Pulsatile Regions in Endoscopic Surgery Videos
1 Introduction
2 Related Work
3 Method
3.1 Frequency Map Estimation
3.2 Pulsatile Region Localization
3.3 Local Motion Magnification
4 Experiments Setting
4.1 Dataset
4.2 Comparison and Evaluation
5 Results
5.1 Synthetic Video
5.2 Clinical Surgical Video
6 Conclusion
References
Co-speech Gesture Generation with Variational Auto Encoder
1 Introduction
2 Related Works
2.1 Variational Autoencoder
2.2 Co-speech Gesture Generation
2.3 Audio2Gesture (A2G)
2.4 Human Motion Synthesis
2.5 Spatial Temporal Graph Convolutional Network (ST-GCN)
3 Speaker-Aware Audio2Gesture (SA2G)
4 Experiments
4.1 Experimental Setting
4.2 Comparison on TED Dataset
4.3 Qualitative Comparison
4.4 Ablation Study
5 Conclusion
References
Differentiable Neural Architecture Search Based on Efficient Architecture for Lightweight Image Super-Resolution
1 Introduction
2 Background
2.1 Hand-Crafted Lightweight SR Methods
2.2 NAS Based Lightweight SR Methods
2.3 Attention Mechanism
2.4 Hierarchical Architecture
3 Methodology
3.1 Network Architecture
3.2 Search Space
3.3 Search Strategy
4 Experiments
4.1 Datasets and Implementation Details
4.2 Search Settings
4.3 Train Settings
4.4 Search Results
4.5 Ablation Study
4.6 Comparison with Hand-Crafted Lightweight Methods
4.7 Comparison with NAS-Based Lightweight Methods
5 Conclusion
References
Learning Collaborative Reinforcement Attention for 3D Face Reconstruction and Dense Alignment
1 Introduction
2 Proposed Method
2.1 Regional Noise Injection and Image Composition Module
2.2 Collaborative Reinforcement Attention Module
2.3 Objective Loss Function
3 Experiments
3.1 Datasets
3.2 Implementation Details
3.3 3D Face Alignment and Reconstruction
3.4 Ablation Study
3.5 Evaluation of Generalization Performance
4 Conclusion
References
Exploring Multi-modal Fusion for Image Manipulation Detection and Localization
1 Introduction
2 Related Work
3 Methods
3.1 Encoder-Decoder Architecture
3.2 Auxiliary Modalities
3.3 Late Fusion
3.4 Fusion by Early Convolutions
4 Experiments
4.1 Experimental Setup
4.2 Comparisons
4.3 Ablation Study
4.4 Robustness Analysis
5 Conclusion
References
Appearance-Motion Dual-Stream Heterogeneous Network for VideoQA
1 Introduction
2 Related Work
2.1 Video Question Answering
2.2 Visual-Linguistic Interaction
3 Method
3.1 Video-Text Feature Extraction Module
3.2 Object Relation Inference Module
3.3 V-Q Heterogeneous Interaction Module
3.4 Answer Preference Module
3.5 Answer Prediction and Loss Function
4 Experiments
4.1 Experimental Details
4.2 Datasets
4.3 Comparison with State-of-the-Arts
4.4 Ablation Study
4.5 Qualitative Results
5 Conclusion
References
Adaptive Token Selection and Fusion Network for Multimodal Sentiment Analysis
1 Introduction
2 Related Work
3 Methodology
3.1 Problem Definition
3.2 Modality Feature Extraction
3.3 Unimodal Token Selection
3.4 Multimodal Token Fusion
3.5 Sentiment Prediction
4 Experiments
4.1 Datasets
4.2 Evaluation Metrics
4.3 Training Details
4.4 Baselines
5 Results and Analysis
5.1 Summary of Results
5.2 Ablation Analysis
5.3 Case Study
6 Conclusion
References
Exploring Imperceptible Adversarial Examples in YCbCr Color Space
1 Introduction
2 Related Work
3 Methodology
3.1 Overview
3.2 Luma-Chroma Optimization
3.3 Spectrum Transformation
3.4 Low-Frequency Constraint
4 Experiments
4.1 Experimental Setup
4.2 White-Box Attacks
4.3 Robustness
4.4 Transferability
4.5 Ablation Study
5 Conclusion
References
Fractional-Order Image Moments and Applications
1 Introduction
2 Related Work
2.1 Fractional Fourier Transform
2.2 Zernike Moment
2.3 Fourier-Mellin Moment
2.4 Exponential Fourier Moment
3 Fractional-Order Moments
3.1 Zernike Fractional Fourier Moment
3.2 Mellin Fractional Fourier Moment
3.3 Exponential Fractional Fourier Moment
4 Zero Watermarking Algorithm
4.1 Watermark Embedding
4.2 Watermark Extract
5 Experiments and Analysis
5.1 Watermark Capacity
5.2 Robustness
5.3 Experiments
6 Conclusions
References
Time-Quality Tradeoff of MuseHash Query Processing Performance
1 Introduction
2 Hash-Based Representation and MuseHash
3 Approximate Indexes and ANN Benchmarks
4 Multi-Core and GPU Processing
5 Experiments
5.1 Datasets
5.2 Experimental Settings
5.3 Experiment 1: Impact of Approximate Indexes
5.4 Experiment 2: Query Parallelism Vs. Data Parallelism
5.5 Experiment 3: Comparison of GPU and CPU Processing
5.6 Experiment 4: Query Parallelism with PyNNDescent Indexing
5.7 Experiment 5: Curse of Dimensionality
5.8 Discussion
6 Conclusion
References
Dual-Fisheye Image Stitching via Unsupervised Deep Learning
1 Introduction
2 Related Work
2.1 Feature-Based Method
2.2 Grid-Based Method
2.3 Deep Learning-Based Method
3 Unsupervised Deep Learning Dual-Fisheye Image Stitching System
3.1 Fisheye Image Distortion Correction
3.2 Image Stitching
3.3 Image Rectangularization
4 Experiment and Analysis
4.1 Dataset
4.2 Result Analysis
5 Conclusion
References
CA-GAN: Conditional Adaptive Generative Adversarial Network for Text-to-Image Synthesis
1 Introduction
2 Related Work
3 Method
3.1 Model Overview
3.2 CARBlock
3.3 Attention Block
3.4 Objective Function
4 Experiment
4.1 Quantitative Evaluation
4.2 Qualitative Evaluation
4.3 Ablation Study
5 Conclusions
References
RDC-YOLOv5: Improved Safety Helmet Detection in Adverse Weather
1 Introduction
2 Related Work
2.1 Object Detection in Adverse Weather
2.2 YOLOv5 Model
3 Our Proposed Method
3.1 Set up a Restoration Network
3.2 Adding a 152×152 Detection Layer for Micro-Scale
3.3 Adding a Cross-Layer Connection
4 Experimental Results and Analysis
4.1 Dataset and Environment Construction
4.2 Evaluation Criteria
4.3 Ablation Studies
4.4 Comparison Experiments
5 Conclusions
References
Sustainable Commercial Fishery Control Using Multimedia Forensics Data from Non-trusted, Mobile Edge Nodes
1 Introduction
1.1 Related Work
2 System Overview
2.1 Secure Storage Platform
2.2 Data Distribution Layer
2.3 Distributed Log Layer
2.4 Inference Layer
2.5 Policies and Trust
2.6 Encryption
3 Experiments and Results
3.1 File Storage
3.2 Policy-Based Data Replication
3.3 Satellite Network Connection
4 Discussion
5 Conclusion
References
MC-TCMNER: A Multi-modal Fusion Model Combining Contrast Learning Method for Traditional Chinese Medicine NER
1 Introduction
2 Related Work
3 Method
3.1 Multi-modal Fusion NER Model
3.2 Training Strategies Combining Contrastive Learning
4 Experiments
4.1 Compared with Commonly Used Chinese NER Methods in TCMNER
4.2 Compared the Performance on C-CLUE Benchmark
4.3 Ablation Experiment
5 Conclusions
References
C3-PO: A Convolutional Neural Network for COVID Onset Prediction from Cough Sounds
1 Introduction
2 Related Work
2.1 Feasibility
2.2 Cough Classification
3 Method
3.1 Cough Segmentation
3.2 Data Augmentation
3.3 Split Majority Set and Ensemble Models
3.4 Model Architecture
3.5 Feature Selection
4 Experiments and Results
4.1 Dataset
4.2 Data Preprocessing
4.3 Feature Extraction
4.4 Data Analysis
4.5 Feature Selection
4.6 Train and Test Models
4.7 Results
4.8 Ablation Study
5 Conclusion
References
Pseudo-label Based Unsupervised Momentum Representation Learning for Multi-domain Image Retrieval
1 Introduction
2 Related Work
2.1 Domain Adaption
2.2 Cross-Domain Hashing Image Retrieval
3 Proposed PUMR Method
3.1 Problem Setup
3.2 Feature Extraction with Momentum Contrast Mechanism
3.3 Pseudo-label Based Contrastive Learning
4 Experiments
4.1 Dataset
4.2 Results Analysis
4.3 Ablation Study
5 Conclusion
References
DFGait: Decomposition Fusion Representation Learning for Multimodal Gait Recognition
1 Introduction
2 Related Work
2.1 Multimodal Gait Recognition
2.2 Decomposition Representation Learning
3 Method
3.1 Pipeline
3.2 Feature Encoder
3.3 Multimodal Feature Decoupling
3.4 Modality Alignment and Fusion
3.5 Objective Optimization
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Comparison with State-of-the-Art Methods
4.4 Ablation Study
5 Conclusion
References
MoPE: Mixture of Pooling Experts Framework for Image-Text Retrieval
1 Introduction
2 Related Work
3 Mixture of Pooling Experts Framework
3.1 Route Gating Module
3.2 Aggregation Expert Module
3.3 Loss Function Module
4 Experiment
4.1 Experiment Setup
4.2 Evaluation Results
4.3 Ablation Analysis
4.4 Case Study
5 Conclusion
References
Multi-modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation
1 Introduction
2 Related Work
3 Neural Video Topic Segmentation
3.1 Problem Definition
3.2 Model Architecture
4 Long Video Adaptation
4.1 Simple Sliding Window Inference
4.2 Dual-Contrastive Adaptation
5 Experimental Setup
5.1 Intra-domain Dataset – YouTube
5.2 Datasets for Cross-Domain Inference
5.3 Baselines
5.4 Evaluation Metrics
5.5 Implementation Details
6 Results and Discussion
7 Conclusion and Future Work
References
Unsupervised Multi-collaborative Learning Network for 3D Face Reconstruction
1 Introduction
2 Method
2.1 FCN-UNet Collaborative Network
2.2 Multi-resolution Co-optimization Module
3 Experiments
3.1 Setup
3.2 Comparison
3.3 Ablation Study
4 Conclusion
References
A Region Based Non-overlapping Reference Speech Estimation Method for Speaker Extraction
1 Introduction
2 Methodology
2.1 Overall Network Structure
2.2 Region Proposal for Speaker Reference Selection
2.3 Speaker Extraction
3 Experiments
3.1 Dataset
3.2 Experimental Settings
3.3 Baselines
3.4 Evaluation Metrics
3.5 Experimental Results
4 Conclusion
References
Self-supervised Edge Structure Learning for Multi-view Stereo and Parallel Optimization
1 Introduction
2 Method
2.1 Depth Estimation Network
2.2 Edge Structure Learning Network
2.3 Masking Mechanism
2.4 Loss Function
2.5 Parallel Optimization
3 Experiments
3.1 Implementation Details
3.2 Ablation Study
3.3 Parallel Results
4 Conclusion
References
Prototype-Enhanced Hypergraph Learning for Heterogeneous Information Networks
1 Introduction
2 Related Work
2.1 Heterogeneous Information Networks
2.2 Hypergraph Learning
3 Methodology
3.1 Feature Preprocessing
3.2 Hypergraph Attention Layer
3.3 Learnable Prototype Classifier
3.4 Hyperedge Prototype Regularization
4 Experimental Setup
4.1 Datasets
4.2 Implementation Details
5 Experimental Results
5.1 Heterogeneous Hypergraph Modelling
5.2 Learnable Prototype Classifier and Prototype Regularization for Hypergraph Modeling
5.3 Prototype for Interpreting HINs
6 Future Work and Conclusion
References
A Language-Based Solution to Enable Metaverse Retrieval
1 Introduction
2 Related Work
2.1 Background on Metaverse-Related Research
2.2 Cross-Modal Understanding and Retrieval Applications
3 Proposed Methodology
3.1 Network Architecture
3.2 Dataset Collection
4 Experimental Results
4.1 How to Model the Metaverses?
4.2 How to Model the Descriptions?
4.3 Limitations
4.4 Implementation Details
5 Conclusions
References
Part-Aware Prompt Tuning for Weakly Supervised Referring Expression Grounding
1 Introduction
2 Related Work
2.1 Referring Expression Grounding (REG)
2.2 Weakly Supervised Referring Expression Grounding (WSREG)
3 Method
3.1 Problem Setting and Overview
3.2 Pre-trained Multi-modal Model Pipeline
3.3 Part-Aware Prompt Tuning
3.4 Train and Inference
4 Experiment
4.1 Dataset
4.2 Evaluation Metric
4.3 Implementation Details
4.4 Results
4.5 Ablation Study
4.6 Qualitative Analysis
5 Conclusion
References
Adversarially Robust Deepfake Detection via Adversarial Feature Similarity Learning
1 Introduction
2 Related Work
2.1 Deepfake Creation and Detection
2.2 Adversarial Examples
3 Adversarial Feature Similarity Learning
3.1 Overview
3.2 Deepfake Classification Loss
3.3 Adversarial Similarity Loss
3.4 Similarity Regularization Loss
3.5 Final Loss Function
4 Experimental Description
4.1 Implementation Details
4.2 Victim Models: Deepfake Detectors
4.3 Robust Cross-Manipulation Generalization
4.4 Evaluation on Frame-Based Detectors
4.5 Robustness to Common Distortions
5 Ablation Study
6 Conclusion and Future Work
References
A Multidimensional Taxonomy Model for Music Tangible User Interfaces
1 Introduction
2 Music Tangible User Interfaces
3 A Multidimensional Taxonomy for TUIs
4 The Taxonomy in Practice
5 Conclusions and Future Work
References
Author Index
