MultiMedia Modeling: 30th International Conference, MMM 2024, Amsterdam, The Netherlands, January 29 – February 2, 2024, Proceedings, Part III (Lecture Notes in Computer Science, 14556) 3031533100, 9783031533105

English Pages [552]

Table of contents:
Preface
Organization
Contents – Part III
Global-to-Local Feature Mining Network for RGB-Infrared Person Re-Identification
1 Introduction
2 Related Work
2.1 RGB-Infrared Person Re-Identification
2.2 Intermediate Modality Learning
3 Proposed Method
3.1 Overview
3.2 Attention-Aware Feature Mining Module
3.3 Local Information Mining Module
3.4 Objective Function
4 Experiments
4.1 Datasets and Settings
4.2 Comparison with State-of-the-Art Methods
4.3 Ablation Study and Visualization
5 Conclusion
References
Semantic Transition Detection for Self-supervised Video Scene Segmentation
1 Introduction
2 Related Work
2.1 Long Video Scene Segmentation
2.2 Self-supervised Learning in Videos
3 Method
3.1 Pseudo-Boundary Extraction
3.2 Shot Duration
3.3 Pre-training
3.4 Fine-Tuning
4 Experiments
4.1 Experimental Setup
4.2 Comparison with State-of-the-Art Methods
4.3 Ablation Studies
4.4 Visualization of Pseudo-Boundary Results
4.5 Visualization of Context Embedding Distribution
5 Conclusion
References
Multi-task Collaborative Network for Image-Text Retrieval
1 Introduction
2 Related Work
2.1 Image-Text Retrieval
2.2 Multi-task Learning
3 Methodology
3.1 Shared Feature Representation
3.2 Multi-task Collaborative Learning
4 Experiments
4.1 Dataset and Protocols
4.2 Comparison with Existing Methods
4.3 Ablation Study
4.4 Effect of Different Hyper-parameters
4.5 Visualization of Retrieval Results
5 Conclusions
References
FGENet: Fine-Grained Extraction Network for Congested Crowd Counting
1 Introduction
2 Related Work
2.1 Methods Under Density-Map Framework
2.2 Methods Under the Point Framework
3 Our Method
3.1 Network Design
3.2 FGFP Module
3.3 TTC Loss
4 Experiments
4.1 Datasets
4.2 Model Evaluation
4.3 Ablation Study
5 Conclusion
References
MSMV-UNet: A 2.5D Stroke Lesion Segmentation Method Based on Multi-slice Feature Fusion
1 Introduction
2 Method
2.1 Multi-slice Dense Feature Fusion
2.2 Inter-slice Attention Module
2.3 Multi-view Soft Voting Strategy
3 Experiments
3.1 Datasets
3.2 Implementation Details
3.3 Comparison of Different Methods
3.4 Influence of Consecutive Slice Quantity
3.5 Influence of Multi-view Soft Voting Strategy
4 Conclusion
References
Non-Local Spatial-Wise and Global Channel-Wise Transformer for Efficient Image Super-Resolution
1 Introduction
2 Related Work
3 Proposed Method
3.1 Overall Architecture
3.2 Non-local Spatial-Wise and Global Channel-Wise Transformer
3.3 Loss Function
4 Experiments
4.1 Experimental Setup
4.2 Comparisons with State-of-the-Art Methods
4.3 Ablation Studies
5 Conclusions
References
MobileViT-FocR: MobileViT with Fixed-One-Centre Loss and Gradient Reversal for Generalised Fake Face Detection
1 Introduction
2 Related Work
3 The Proposed Method
3.1 Model Overview
3.2 Loss Function Improvement
4 Experiment
4.1 Experiment Settings
4.2 DataSet
4.3 Choose a Good Base Model
5 Conclusion
References
ASF-Conformer: Audio Scoring Conformer with FFC for Speaker Verification in Noisy Environments
1 Introduction
2 Related Work
3 Method
3.1 Network Architecture
3.2 Downsampling Module with Audio Scoring (D-AS)
3.3 F-Conformer Block
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Evaluation Metrics
4.4 Ablation Study
4.5 Qualitative Results
5 Conclusions
References
Prior-Knowledge-Free Video Frame Interpolation with Bidirectional Regularized Implicit Neural Representations
1 Introduction
2 Related Work
2.1 Video Frame Interpolation
2.2 Implicit Neural Representations
3 Method
3.1 Implicit Neural Representation
3.2 Latent Code Interpolation
3.3 Bidirectional Regularization Framework (BiRF)
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Results
4.4 Ablation Study
5 Conclusion
References
Two-Stage Reasoning Network with Modality Decomposition for Text VQA
1 Introduction
2 Related Work
2.1 Representation Learning in Text VQA
2.2 Feature Interaction in Text VQA
3 Methodology
3.1 Multimodal Feature Extraction
3.2 Modality-Specific Attention Module
3.3 Semantic-Guided Modality Interaction
3.4 Answer Prediction
4 Experiments
4.1 Datasets and Metrics Settings
4.2 Implementation Details
4.3 Comparison with State-of-the-Art Methods
4.4 Ablation Study
4.5 Case Study and Visualization
5 Conclusion
References
Localization and Local Motion Magnification of Pulsatile Regions in Endoscopic Surgery Videos
1 Introduction
2 Related Work
3 Method
3.1 Frequency Map Estimation
3.2 Pulsatile Region Localization
3.3 Local Motion Magnification
4 Experiments Setting
4.1 Dataset
4.2 Comparison and Evaluation
5 Results
5.1 Synthetic Video
5.2 Clinical Surgical Video
6 Conclusion
References
Co-speech Gesture Generation with Variational Auto Encoder
1 Introduction
2 Related Works
2.1 Variational Autoencoder
2.2 Co-speech Gesture Generation
2.3 Audio2Gesture (A2G)
2.4 Human Motion Synthesis
2.5 Spatial Temporal Graph Convolutional Network (ST-GCN)
3 Speaker-Aware Audio2Gesture (SA2G)
4 Experiments
4.1 Experimental Setting
4.2 Comparison on TED Dataset
4.3 Qualitative Comparison
4.4 Ablation Study
5 Conclusion
References
Differentiable Neural Architecture Search Based on Efficient Architecture for Lightweight Image Super-Resolution
1 Introduction
2 Background
2.1 Hand-Crafted Lightweight SR Methods
2.2 NAS Based Lightweight SR Methods
2.3 Attention Mechanism
2.4 Hierarchical Architecture
3 Methodology
3.1 Network Architecture
3.2 Search Space
3.3 Search Strategy
4 Experiments
4.1 Datasets and Implementation Details
4.2 Search Settings
4.3 Train Settings
4.4 Search Results
4.5 Ablation Study
4.6 Comparison with Hand-Crafted Lightweight Methods
4.7 Comparison with NAS-Based Lightweight Methods
5 Conclusion
References
Learning Collaborative Reinforcement Attention for 3D Face Reconstruction and Dense Alignment
1 Introduction
2 Proposed Method
2.1 Regional Noise Injection and Image Composition Module
2.2 Collaborative Reinforcement Attention Module
2.3 Objective Loss Function
3 Experiments
3.1 Datasets
3.2 Implementation Details
3.3 3D Face Alignment and Reconstruction
3.4 Ablation Study
3.5 Evaluation of Generalization Performance
4 Conclusion
References
Exploring Multi-modal Fusion for Image Manipulation Detection and Localization
1 Introduction
2 Related Work
3 Methods
3.1 Encoder-Decoder Architecture
3.2 Auxiliary Modalities
3.3 Late Fusion
3.4 Fusion by Early Convolutions
4 Experiments
4.1 Experimental Setup
4.2 Comparisons
4.3 Ablation Study
4.4 Robustness Analysis
5 Conclusion
References
Appearance-Motion Dual-Stream Heterogeneous Network for VideoQA
1 Introduction
2 Related Work
2.1 Video Question Answering
2.2 Visual-Linguistic Interaction
3 Method
3.1 Video-Text Feature Extraction Module
3.2 Object Relation Inference Module
3.3 V-Q Heterogeneous Interaction Module
3.4 Answer Preference Module
3.5 Answer Prediction and Loss Function
4 Experiments
4.1 Experimental Details
4.2 Datasets
4.3 Comparison with State-of-the-Arts
4.4 Ablation Study
4.5 Qualitative Results
5 Conclusion
References
Adaptive Token Selection and Fusion Network for Multimodal Sentiment Analysis
1 Introduction
2 Related Work
3 Methodology
3.1 Problem Definition
3.2 Modality Feature Extraction
3.3 Unimodal Token Selection
3.4 Multimodal Token Fusion
3.5 Sentiment Prediction
4 Experiments
4.1 Datasets
4.2 Evaluation Metrics
4.3 Training Details
4.4 Baselines
5 Results and Analysis
5.1 Summary of Results
5.2 Ablation Analysis
5.3 Case Study
6 Conclusion
References
Exploring Imperceptible Adversarial Examples in YCbCr Color Space
1 Introduction
2 Related Work
3 Methodology
3.1 Overview
3.2 Luma-Chroma Optimization
3.3 Spectrum Transformation
3.4 Low-Frequency Constraint
4 Experiments
4.1 Experimental Setup
4.2 White-Box Attacks
4.3 Robustness
4.4 Transferability
4.5 Ablation Study
5 Conclusion
References
Fractional-Order Image Moments and Applications
1 Introductory
2 Related Work
2.1 Fractional Fourier Transform
2.2 Zernike Moment
2.3 Fourier-Mellin Moment
2.4 Exponential Fourier Moment
3 Fractional-Order Moments
3.1 Zernike Fractional Fourier Moment
3.2 Mellin Fractional Fourier Moment
3.3 Exponential Fractional Fourier Moment
4 Zero Watermarking Algorithm
4.1 Watermark Embedding
4.2 Watermark Extraction
5 Experiments and Analysis
5.1 Watermark Capacity
5.2 Robustness
5.3 Experiments
6 Conclusions
References
Time-Quality Tradeoff of MuseHash Query Processing Performance
1 Introduction
2 Hash-Based Representation and MuseHash
3 Approximate Indexes and ANN Benchmarks
4 Multi-Core and GPU Processing
5 Experiments
5.1 Datasets
5.2 Experimental Settings
5.3 Experiment 1: Impact of Approximate Indexes
5.4 Experiment 2: Query Parallelism Vs. Data Parallelism
5.5 Experiment 3: Comparison of GPU and CPU Processing
5.6 Experiment 4: Query Parallelism with PyNNDescent Indexing
5.7 Experiment 5: Curse of Dimensionality
5.8 Discussion
6 Conclusion
References
Dual-Fisheye Image Stitching via Unsupervised Deep Learning
1 Introduction
2 Related Work
2.1 Feature-Based Method
2.2 Grid-Based Method
2.3 Deep Learning-Based Method
3 Unsupervised Deep Learning Dual-Fisheye Image Stitching System
3.1 Fisheye Image Distortion Correction
3.2 Image Stitching
3.3 Image Rectangularization
4 Experiment and Analysis
4.1 Dataset
4.2 Result Analysis
5 Conclusion
References
CA-GAN: Conditional Adaptive Generative Adversarial Network for Text-to-Image Synthesis
1 Introduction
2 Related Work
3 Method
3.1 Model Overview
3.2 CARBlock
3.3 Attention Block
3.4 Objective Function
4 Experiment
4.1 Quantitative Evaluation
4.2 Qualitative Evaluation
4.3 Ablation Study
5 Conclusions
References
RDC-YOLOv5: Improved Safety Helmet Detection in Adverse Weather
1 Introduction
2 Related Work
2.1 Object Detection in Adverse Weather
2.2 YOLOv5 Model
3 Our Proposed Method
3.1 Set up a Restoration Network
3.2 Adding a 152×152 Detection Layer for Micro-Scale
3.3 Adding a Cross-Layer Connection
4 Experimental Results and Analysis
4.1 Dataset and Environment Construction
4.2 Evaluation Criteria
4.3 Ablation Studies
4.4 Comparison Experiments
5 Conclusions
References
Sustainable Commercial Fishery Control Using Multimedia Forensics Data from Non-trusted, Mobile Edge Nodes
1 Introduction
1.1 Related Work
2 System Overview
2.1 Secure Storage Platform
2.2 Data Distribution Layer
2.3 Distributed Log Layer
2.4 Inference Layer
2.5 Policies and Trust
2.6 Encryption
3 Experiments and Results
3.1 File Storage
3.2 Policy-Based Data Replication
3.3 Satellite Network Connection
4 Discussion
5 Conclusion
References
MC-TCMNER: A Multi-modal Fusion Model Combining Contrast Learning Method for Traditional Chinese Medicine NER
1 Introduction
2 Related Work
3 Method
3.1 Multi-modal Fusion NER Model
3.2 Training Strategies Combining Contrastive Learning
4 Experiments
4.1 Comparison with Commonly Used Chinese NER Methods in TCMNER
4.2 Performance Comparison on the C-CLUE Benchmark
4.3 Ablation Experiment
5 Conclusions
References
C3-PO: A Convolutional Neural Network for COVID Onset Prediction from Cough Sounds
1 Introduction
2 Related Work
2.1 Feasibility
2.2 Cough Classification
3 Method
3.1 Cough Segmentation
3.2 Data Augmentation
3.3 Split Majority Set and Ensemble Models
3.4 Model Architecture
3.5 Feature Selection
4 Experiments and Results
4.1 Dataset
4.2 Data Preprocessing
4.3 Feature Extraction
4.4 Data Analysis
4.5 Feature Selection
4.6 Train and Test Models
4.7 Results
4.8 Ablation Study
5 Conclusion
References
Pseudo-label Based Unsupervised Momentum Representation Learning for Multi-domain Image Retrieval
1 Introduction
2 Related Work
2.1 Domain Adaption
2.2 Cross-Domain Hashing Image Retrieval
3 Proposed PUMR Method
3.1 Problem Setup
3.2 Feature Extraction with Momentum Contrast Mechanism
3.3 Pseudo-label Based Contrastive Learning
4 Experiments
4.1 Dataset
4.2 Results Analysis
4.3 Ablation Study
5 Conclusion
References
DFGait: Decomposition Fusion Representation Learning for Multimodal Gait Recognition
1 Introduction
2 Related Work
2.1 Multimodal Gait Recognition
2.2 Decomposition Representation Learning
3 Method
3.1 Pipeline
3.2 Feature Encoder
3.3 Multimodal Feature Decoupling
3.4 Modality Alignment and Fusion
3.5 Objective Optimization
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.3 Comparison with State-of-the-Art Methods
4.4 Ablation Study
5 Conclusion
References
MoPE: Mixture of Pooling Experts Framework for Image-Text Retrieval
1 Introduction
2 Related Work
3 Mixture of Pooling Experts Framework
3.1 Route Gating Module
3.2 Aggregation Expert Module
3.3 Loss Function Module
4 Experiment
4.1 Experiment Setup
4.2 Evaluation Results
4.3 Ablation Analysis
4.4 Case Study
5 Conclusion
References
Multi-modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation
1 Introduction
2 Related Work
3 Neural Video Topic Segmentation
3.1 Problem Definition
3.2 Model Architecture
4 Long Video Adaptation
4.1 Simple Sliding Window Inference
4.2 Dual-Contrastive Adaptation
5 Experimental Setup
5.1 Intra-domain Dataset – YouTube
5.2 Datasets for Cross-Domain Inference
5.3 Baselines
5.4 Evaluation Metrics
5.5 Implementation Details
6 Results and Discussion
7 Conclusion and Future Work
References
Unsupervised Multi-collaborative Learning Network for 3D Face Reconstruction
1 Introduction
2 Method
2.1 FCN-UNet Collaborative Network
2.2 Multi-resolution Co-optimization Module
3 Experiments
3.1 Setup
3.2 Comparison
3.3 Ablation Study
4 Conclusion
References
A Region Based Non-overlapping Reference Speech Estimation Method for Speaker Extraction
1 Introduction
2 Methodology
2.1 Overall Network Structure
2.2 Region Proposal for Speaker Reference Selection
2.3 Speaker Extraction
3 Experiments
3.1 Dataset
3.2 Experimental Settings
3.3 Baselines
3.4 Evaluation Metrics
3.5 Experimental Results
4 Conclusion
References
Self-supervised Edge Structure Learning for Multi-view Stereo and Parallel Optimization
1 Introduction
2 Method
2.1 Depth Estimation Network
2.2 Edge Structure Learning Network
2.3 Masking Mechanism
2.4 Loss Function
2.5 Parallel Optimization
3 Experiments
3.1 Implementation Details
3.2 Ablation Study
3.3 Parallel Results
4 Conclusion
References
Prototype-Enhanced Hypergraph Learning for Heterogeneous Information Networks
1 Introduction
2 Related Work
2.1 Heterogeneous Information Networks
2.2 Hypergraph Learning
3 Methodology
3.1 Feature Preprocessing
3.2 Hypergraph Attention Layer
3.3 Learnable Prototype Classifier
3.4 Hyperedge Prototype Regularization
4 Experimental Setup
4.1 Datasets
4.2 Implementation Details
5 Experimental Results
5.1 Heterogeneous Hypergraph Modelling
5.2 Learnable Prototype Classifier and Prototype Regularization for Hypergraph Modeling
5.3 Prototype for Interpreting HINs
6 Future Work and Conclusion
References
A Language-Based Solution to Enable Metaverse Retrieval
1 Introduction
2 Related Work
2.1 Background on Metaverse-Related Research
2.2 Cross-Modal Understanding and Retrieval Applications
3 Proposed Methodology
3.1 Network Architecture
3.2 Dataset Collection
4 Experimental Results
4.1 How to Model the Metaverses?
4.2 How to Model the Descriptions?
4.3 Limitations
4.4 Implementation Details
5 Conclusions
References
Part-Aware Prompt Tuning for Weakly Supervised Referring Expression Grounding
1 Introduction
2 Related Work
2.1 Referring Expression Grounding (REG)
2.2 Weakly Supervised Referring Expression Grounding (WSREG)
3 Method
3.1 Problem Setting and Overview
3.2 Pre-trained Multi-modal Model Pipeline
3.3 Part-Aware Prompt Tuning
3.4 Train and Inference
4 Experiment
4.1 Dataset
4.2 Evaluation Metric
4.3 Implementation Details
4.4 Results
4.5 Ablation Study
4.6 Qualitative Analysis
5 Conclusion
References
Adversarially Robust Deepfake Detection via Adversarial Feature Similarity Learning
1 Introduction
2 Related Work
2.1 Deepfake Creation and Detection
2.2 Adversarial Examples
3 Adversarial Feature Similarity Learning
3.1 Overview
3.2 Deepfake Classification Loss
3.3 Adversarial Similarity Loss
3.4 Similarity Regularization Loss
3.5 Final Loss Function
4 Experimental Description
4.1 Implementation Details
4.2 Victim Models: Deepfake Detectors
4.3 Robust Cross-Manipulation Generalization
4.4 Evaluation on Frame-Based Detectors
4.5 Robustness to Common Distortions
5 Ablation Study
6 Conclusion and Future Work
References
A Multidimensional Taxonomy Model for Music Tangible User Interfaces
1 Introduction
2 Music Tangible User Interfaces
3 A Multidimensional Taxonomy for TUIs
4 The Taxonomy in Practice
5 Conclusions and Future Work
References
Author Index

Author / Uploaded
Stevan Rudinac
Alan Hanjalic
Cynthia Liem
Marcel Worring
Björn Þór Jónsson
Bei Liu
Yoko Yamakata
