Lecture Notes on Data Engineering and Communications Technologies 187
Nhu-Ngoc Dao · Tran Ngoc Thinh · Ngoc Thanh Nguyen, Editors
Intelligence of Things: Technologies and Applications The Second International Conference on Intelligence of Things (ICIT 2023), Ho Chi Minh City, Vietnam, October 25–27, 2023, Proceedings, Volume 1
Lecture Notes on Data Engineering and Communications Technologies Series Editor Fatos Xhafa, Technical University of Catalonia, Barcelona, Spain
The aim of the book series is to present cutting-edge engineering approaches to data technologies and communications. It publishes the latest advances on the engineering task of building and deploying distributed, scalable, and reliable data infrastructures and communication systems. The series has a prominent applied focus on data technologies and communications, with the aim of promoting the bridging from fundamental research on data science and networking to data engineering and communications that lead to industry products, business knowledge, and standardisation. Indexed by SCOPUS, INSPEC, EI Compendex. All books published in the series are submitted for consideration in Web of Science.
Editors

Nhu-Ngoc Dao, Sejong University, Seoul, Korea (Republic of)
Tran Ngoc Thinh, Ho Chi Minh City University of Technology (HCMUT), Vietnam National University Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam
Ngoc Thanh Nguyen, Wroclaw University of Science and Technology, Wrocław, Poland
ISSN 2367-4512  ISSN 2367-4520 (electronic)
Lecture Notes on Data Engineering and Communications Technologies
ISBN 978-3-031-46572-7  ISBN 978-3-031-46573-4 (eBook)
https://doi.org/10.1007/978-3-031-46573-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Paper in this product is recyclable.
Preface
This volume contains the proceedings of the Second International Conference on Intelligence of Things (ICIT 2023), hosted by Ho Chi Minh City University of Technology (HCMUT) in Ho Chi Minh City, Vietnam, October 25–27, 2023. The conference was co-organized by Ho Chi Minh City University of Technology (HCMUT), Hanoi University of Mining and Geology (HUMG), Vietnam National University of Agriculture (VNUA), Ho Chi Minh City Open University, and Quy Nhon University, Vietnam.

In recent years, we have witnessed the important changes and innovations that the Internet of Things (IoT) enables for emerging digital transformations in human life. Continuing the impressive successes of IoT paradigms, things now require intelligent abilities while connecting to the Internet. To this end, the integration of artificial intelligence (AI) technologies into the IoT infrastructure has been considered a promising solution, which defines the next generation of the IoT, i.e., the intelligence of things (AIoT). The AIoT is expected to achieve more efficient IoT operations in many respects, such as flexible adaptation to environmental changes, optimal tradeoff decisions among various resources and constraints, and friendly human–machine interactions. In this regard, ICIT 2023 was held to gather scholars who address the current state of technology and the outcomes of ongoing research in the area of AIoT.

The organizing committee received 159 submissions from 15 countries. Each paper was reviewed by at least two members of the program committee (PC) and external reviewers. Finally, we selected the 71 best papers for oral presentation and publication. We would like to express our thanks to the keynote speakers: Ngoc Thanh Nguyen from Wroclaw University of Science and Technology, Poland; Emanuel Popovici from University College Cork, Ireland; and Koichiro Ishibashi from the University of Electro-Communications, Japan, for their world-class plenary speeches.
Many people contributed toward the success of the conference. First, we would like to recognize the work of the PC co-chairs for taking good care of the organization of the reviewing process, an essential stage in ensuring the high quality of the accepted papers. In addition, we would like to thank the PC members for performing their reviewing work with diligence. We thank the local organizing committee chairs, publicity chair, multimedia chair, and technical support chair for their fantastic work before and during the conference. Finally, we cordially thank all the authors, presenters, and delegates for their valuable contributions to this successful event. The conference would not have been possible without their support. Our special thanks are also due to Springer for publishing the proceedings and to all the other sponsors for their kind support.
Finally, we hope that ICIT 2023 contributed significantly to the academic excellence of the field and will lead to the even greater success of ICIT events in the future.

October 2023
Nhu-Ngoc Dao
Tran Ngoc Thinh
Ngoc Thanh Nguyen
Organization
Organizing Committee

Honorary Chairs

Mai Thanh Phong, HCMC University of Technology, Vietnam
Thanh Hai Tran, Hanoi University of Mining and Geology, Vietnam
Thanh Thuy Nguyen, Vietnam National University, Vietnam
Nguyen Minh Ha, Ho Chi Minh City Open University, Vietnam
Do Ngoc My, Quy Nhon University, Vietnam
Nguyen Thi Lan, Vietnam National University of Agriculture, Vietnam
General Chairs

Nhu-Ngoc Dao, Sejong University, South Korea
Quang-Dung Pham, Vietnam National University of Agriculture, Vietnam
Hong Anh Le, Hanoi University of Mining and Geology, Vietnam
Tran Vu Pham, HCMC University of Technology, Vietnam
Koichiro Ishibashi, The University of Electro-Communications, Japan
Truong Hoang Vinh, Ho Chi Minh City Open University, Vietnam
Program Chairs

Takayuki Okatani, Tohoku University, Japan
Pham Quoc Cuong, HCMC University of Technology, Vietnam
Tran Ngoc Thinh, HCMC University of Technology, Vietnam
Ing-Chao Lin, National Cheng Kung University, Taiwan
Shin Nakakima, National Institute of Informatics, Japan
Le Xuan Vinh, Quy Nhon University, Vietnam
Steering Committee

Ngoc Thanh Nguyen (Chair), Wroclaw University of Science and Technology, Poland
Hoang Pham, Rutgers University, USA
Sungrae Cho, Chung-Ang University, South Korea
Hyeonjoon Moon, Sejong University, South Korea
Jiming Chen, Zhejiang University, China
Dosam Hwang, Yeungnam University, South Korea
Gottfried Vossen, Muenster University, Germany
Manuel Nunez, Universidad Complutense de Madrid, Spain
Torsten Braun, University of Bern, Switzerland
Schahram Dustdar, Vienna University of Technology, Austria
Local Organizing Chairs

Pham Hoang Anh, HCMC University of Technology, Vietnam
Le Trong Nhan, HCMC University of Technology, Vietnam
Phan Tran Minh Khue, Ho Chi Minh City Open University, Vietnam
Publication Chairs

Vo Nguyen Quoc Bao, Posts and Telecommunications Institute of Technology, Vietnam
Laihyuk Park, Seoul National University of Science & Technology, South Korea
Ho Van Lam, Quy Nhon University, Vietnam
Finance Chair

Nguyen Cao Tri, HCMC University of Technology, Vietnam
Publicity Chairs

Tran Trung Hieu, University of Stuttgart, Germany
Phu Huu Phung, University of Dayton, USA
Woongsoo Na, Kongju National University, South Korea
Trong-Hop Do, University of Information Technology VNU-HCM, Vietnam
Track Chairs

Quan Thanh Tho, HCMC University of Technology, Vietnam
Ngoc Thanh Dinh, Soongsil University, South Korea
Tran Minh Quang, HCMC University of Technology, Vietnam
Nguyen Huu Phat, Hanoi University of Science and Technology, Vietnam
Mai Dung Nguyen, Hanoi University of Mining and Geology, Vietnam
Le Tuan Ho, Quy Nhon University, Vietnam
Pham Trung Kien, HCMC International University, Vietnam
Khac-Hoai Nam Bui, Viettel Cyberspace Center, Vietnam
Thuy Nguyen Hong, RMIT University, Vietnam
Phan Thi Thu, FPT University, Vietnam
Van-Phuc Hoang, Le Quy Don Technical University, Vietnam
Kien Nguyen, Chiba University, Japan
Program Committee

Daisuke Ishii, Takako Nakatani, Kozo Okano, Kazuhiko Hamamoto, Tran Van Hoai, Hiroshi Ishii, Cong-Kha Pham, Man Van Minh Nguyen, Shigenori Tomiyama, Le Thanh Sach, Quan Thanh Tho, Minh-Triet Tran, Pham Tran Vu, Nguyen Duc Dung, Nguyen An Khuong, Nguyen-Tran Huu-Nguyen, Nguyen Le Duy Lai, Van Sinh Nguyen, Tuan Duy Anh Nguyen, Jae Young Hur, Minh Son Nguyen, Truong Tuan Anh, Tran Minh Quang, Vo Thi Ngoc Chau, Le Thanh Van, Hoa Dam, Surin Kittitornkun, Tran Manh Ha, Luong Vuong Nguyen, Tri-Hai Nguyen, Nguyen Tien Dat, Duong Huu Thanh, Le Viet Tuan, Denis Hamad, Fadi Dornaika, Thongchai Surinwarangkoon
Kittikhun Meethongjan, Daphne Teck Ching Lai, Meirambek Zhaparov, Minh Ngoc Dinh, Tsuchiya Takeshi, Thuy Nguyen, Trang T. T. Do, Ha X. Dang, Nguyen Trong Kuong, Nguyen Hoang Huy, Quang-Dung Pham, Nguyen Doan Dong, Tran Duc Quynh, Phan Thi Thu Hong, Nguyen Huu Du, Nguyen Van Hanh, Hirokazu Doi, John Edgar S. Anthony, Long Tan Le, Tran Vinh Duc, Le Duc Hung
Contents

State-of-the-Art and Theoretical Analyses

FPGA/AI-Powered Data Security for IoT Edge Computing Platforms: A Survey and Open Issues ..... 3
Cuong Pham-Quoc

A Review in Deep Learning-Based Thyroid Cancer Detection Techniques Using Ultrasound Images ..... 15
Le Chieu Long, Y. Bui Hoang, Nguyen Luong Trung, Bui Tuan Dung, Thi-Thao Ha, and Luong Vuong Nguyen

Bio-Inspired Clustering: An Ensemble Method for User-Based Collaborative Filtering ..... 26
Luong Vuong Nguyen, Tri-Hai Nguyen, Ho-Trong-Nguyen Pham, Quoc-Trinh Vo, Huu-Thanh Duong, and Tram-Anh Nguyen-Thi

Deep Reinforcement Learning-Based Sum-Rate Maximization for Uplink Multi-user SIMO-RSMA Systems ..... 36
Thanh Phung Truong, Tri-Hai Nguyen, Anh-Tien Tran, Si Van-Tien Tran, Van Dat Tuong, Luong Vuong Nguyen, Woongsoo Na, Laihyuk Park, and Sungrae Cho

Multiobjective Logistics Optimization for Automated ATM Cash Replenishment Process ..... 46
Bui Tien Thanh, Dinh Van Tuan, Tuan Anh Chi, Nguyen Van Dai, Nguyen Tai Quang Dinh, Nguyen Thu Thuy, and Nguyen Thi Xuan Hoa

Adaptive Conflict-Averse Multi-gradient Descent for Multi-objective Learning ..... 57
Dinh Van Tuan, Tran Anh Tuan, Nguyen Duc Anh, Bui Khuong Duy, and Tran Ngoc Thang

Multicriteria Portfolio Selection with Intuitionistic Fuzzy Goals as a Pseudoconvex Vector Optimization ..... 68
Vuong D. Nguyen, Nguyen Kim Duyen, Nguyen Minh Hai, and Bui Khuong Duy

Research and Develop Solutions to Traffic Data Collection Based on Voice Techniques ..... 80
Ty Nguyen Thi and Quang Tran Minh

Using Machine Learning Algorithms to Diagnosis Melasma from Face Images ..... 91
Van Lam Ho, Tuan Anh Vu, Xuan Viet Tran, Thi Hoang Bich Diu Pham, Xuan Vinh Le, Ngoc Huan Nguyen, and Ngoc Dung Nguyen

Reinforcement Learning for Portfolio Selection in the Vietnamese Market ..... 102
Bao Bui Quoc, Quang Truong Dang, and Anh Son Ta

AIoT Technologies

A Systematic CL-MLP Approach for Online Forecasting of Multiple Key Performance Indicators ..... 117
Pha Le, Triet Le, Thien Pham, and Tho Quan

Neutrosophic Fuzzy Data Science and Addressing Research Gaps in Geographic Data and Information Systems ..... 128
A. A. Salama, Roheet Bhatnagar, N. S. Alharthi, R. E. Tolba, and Mahmoud Y. Shams

Inhibitory Control during Visual Perspective Taking Revealed by Multivariate Analysis of Event-Related Potentials ..... 140
Hirokazu Doi

A Novel Custom Deep Learning Network Combining 1D-Convolution and LSTM for Rapid Wine Quality Detection in Small and Average-Scale Applications ..... 148
Quoc Duy Nam Nguyen, Hoang Viet Anh Le, Le Vu Trung Duong, Sang Duong Thi, Hoai Luan Pham, Thi Hong Tran, and Tadashi Nakano

IoT-Enabled Wearable Smart Glass for Monitoring Intraoperative Anesthesia Patients ..... 160
B. Gopinath, V. S. Yugesh, T. Sobeka, and R. Santhi

Traffic Density Estimation at Intersections via Image-Based Object Reference Method ..... 171
Hieu Bui Minh and Quang Tran Minh

Improving Automatic Speech Recognition via Joint Training with Speech Enhancement as Multi-task Learning ..... 182
Nguyen Hieu Nghia Huynh, Huy Nguyen-Gia, Tran Hoan Duy Nguyen, Vo Hoang Thi Nguyen, Tuong Nguyen Huynh, Duc Dung Nguyen, and Hung T. Vo

Solving Feature Selection Problem by Quantum Optimization Algorithm ..... 192
Anh Son Ta and Huy Phuc Nguyen Ha

A Methodology of Extraction DC Model for a 65 nm Floating-Gate Transistor ..... 202
Thinh Dang Cong and Trang Hoang

imMeta: An Incremental Sub-graph Merging for Feature Extraction in Metagenomic Binning ..... 214
Hong Thanh Pham, Van Hoai Tran, and Van Vinh Le

Virtual Sensor to Impute Missing Data Using Data Correlation and GAN-Based Model ..... 224
Nguyen Thanh Quan, Nguyen Quang Hung, and Nam Thoai

An Edge AI-Based Vehicle Tracking Solution for Smart Parking Systems ..... 234
Doan Viet Tu, Pham Minh Quang, Huynh Phuc Nghi, and Tran Ngoc Thinh

Low-Light Image Enhancement Using Quaternion CNN ..... 244
Truong Quang Vinh, Tran Quang Duy, and Nguyen Quang Luc

Leverage Deep Learning Methods for Vehicle Trajectory Prediction in Chaotic Traffic ..... 255
Tan Chau, Duc-Vu Ngo, Minh-Tri Nguyen, Anh-Duc Nguyen-Tran, and Trong-Hop Do

AIoT System Architectures

Wireless Sensor Network to Collect and Forecast Environment Parameters Using LSTM ..... 269
Phat Nguyen Huu, Loc Bui Dang, Trong Nguyen Van, Thao Dao Thu Le, Chau Nguyen Le Bao, Anh Tran Ha Dieu, and Quang Tran Minh

SCBM: A Hybrid Model for Vietnamese Visual Question Answering ..... 280
Hieu Le Trung, Tuyen Dao Cong, Trung Nguyen Quoc, and Vinh Truong Hoang

A High-Performance Pipelined FPGA-SoC Implementation of SHA3-512 for Single and Multiple Message Blocks ..... 288
Tan-Phat Dang, Tuan-Kiet Tran, Trong-Thuc Hoang, Cong-Kha Pham, and Huu-Thuan Huynh

Optimizing ECC Implementations Based on SoC-FPGA with Hardware Scheduling and Full Pipeline Multiplier for IoT Platforms ..... 299
Tuan-Kiet Tran, Tan-Phat Dang, Trong-Thuc Hoang, Cong-Kha Pham, and Huu-Thuan Huynh

Robust Traffic Sign Detection and Classification Through the Integration of YOLO and Deep Learning Networks ..... 310
D. Anh Nguyen, Nhat Thanh Luong, Tat Hien Le, Duy Anh Nguyen, and Hoang Tran Ngoc

OPC-UA/MQTT-Based Multi M2M Protocol Architecture for Digital Twin Systems ..... 322
Le Phuong Nam, Tran Ngoc Cat, Diep Tran Nam, Nguyen Van Trong, Trong Nhan Le, and Cuong Pham-Quoc

Real-Time Singing Performance Improvement Through Pitch Correction Using Apache Kafka Stream Processing ..... 339
Khoi Bui and Trong-Hop Do

An Implementation of Human-Robot Interaction Using Machine Learning Based on Embedded Computer ..... 348
Thanh-Truc Tran, Thanh Vo-Minh, and Kien T. Pham

DIKO: A Two-Stage Hybrid Network for Knee Osteoarthritis Diagnosis Using Deep Learning ..... 360
Trung Hieu Phan, Thiet Su Nguyen, Trung Tuan Nguyen, Tan Loc Le, Duc Trung Mai, and Thanh Tho Quan

Shallow Convolutional Neural Network Configurations for Skin Disease Diagnosis ..... 370
Ngoc Huynh Pham, Hai Thanh Nguyen, and Tai Tan Phan

Design an Indoor Positioning System Using ESP32 Ultra-Wide Band Module ..... 382
Ton Nhat Nam Ho, Van Su Tran, and Ngoc Truong Minh Nguyen

Towards a Smart Parking System with the Jetson Xavier Edge Computing Platform ..... 394
Cuong Pham-Quoc and Tam Bang

AlPicoSoC: A Low-Power RISC-V Based System on Chip for Edge Devices with a Deep Learning Accelerator ..... 403
Thai Ngo, Tran Ngoc Thinh, and Huynh Phuc Nghi

A Transparent Scalable E-Voting Protocol Based on Open Vote Network Protocol and Zk-STARKs ..... 414
Ngan Nguyen and Khuong Nguyen-An

DarkMDE: Excavating Synthetic Images for Nighttime Depth Estimation Using Cross-Domain Supervision ..... 428
Thai Tran Trung, Huy Le Xuan, Minh Huy Vu Nguyen, Hiep Nguyen The, Nhat Huy Tran Hoang, and Duc Dung Nguyen

Author Index ..... 439
State-of-the-Art and Theoretical Analyses
FPGA/AI-Powered Data Security for IoT Edge Computing Platforms: A Survey and Open Issues

Cuong Pham-Quoc1,2(B)

1 Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam
[email protected]
2 Vietnam National University - Ho Chi Minh City (VNU-HCM), Thu Duc, Ho Chi Minh City, Vietnam

Abstract. In recent years, the Internet of Things has been widely applied in many application domains, such as environment monitoring, healthcare, or industry. Although design approaches, technologies, and frameworks for IoT-based applications have been introduced, the security of IoT-based systems still needs more study from academia and industry. As one of the most suitable technologies for IoT edge computing devices, FPGAs offer many advantages compared to traditional processors. Moreover, AI-based data processing for IoT systems has shown more and more benefits in recent years. In this paper, we first present the IoT security threats that many studies have tried to cope with in recent years. We then survey FPGA/AI-powered security proposals in the literature for IoT edge computing platforms. We classify the studies on this topic into three categories for comparison: FPGA-based security approaches, using AI for security with traditional processors, and AI-based security built on FPGA platforms. Finally, based on these proposals in the literature, we introduce open issues for future research on this topic.

Keywords: FPGA · Secured IoT devices · Security for Edge computing · AI-based security for IoTs

1 Introduction
According to the statistics in [37], there will be more than 75 billion IoT devices connected to the Internet in 2025. The statistics also report that up to 1.1 trillion USD will be spent globally on IoT by 2023. Along with the increase in the number of IoT devices and money spent, IoT-based application domains have also grown. For example, the healthcare industry is one of the top domains using wearable IoT devices for patient monitoring, such as blood pressure monitors, connected inhalers, surgery robots, and intelligent hearing aids [14]. The smart home is another domain that mainly requires IoT devices, for smart door locks, smart light controllers, surveillance video, or smart appliances [18]. Recently, IoT platforms used for smart cities have increased dramatically, including smart street and traffic lights and air quality monitoring [7].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 3–14, 2023. https://doi.org/10.1007/978-3-031-46573-4_1
Despite the success of IoT in many sectors, IoT devices and platforms suffer from resource constraints and energy limitations. In addition, these devices usually lack security solutions in order to reduce building and operating costs. Hence, consolidating IoT devices with security approaches is a strong demand [16]. However, the main requirement for these approaches is to cope with hardware, networking, and software limitations. In recent years, due to this high demand, many studies published in the literature have proposed numerous systems for protecting IoT devices and platforms. They use different computing platforms, such as micro-controllers/processors, FPGAs (Field-Programmable Gate Arrays), or ASICs (Application-Specific Integrated Circuits), and various techniques, such as AI (Artificial Intelligence) or pattern-based approaches. As one of the most suitable and modern platforms to overcome the limitations of IoT devices, FPGA-based platforms have been used for deploying IoT-based applications in recent years [4]. Therefore, this paper surveys studies in the literature focusing on FPGA platforms and AI approaches.

1.1 Related Work

As one of the most attractive research topics in IoT, security for IoT systems has been considered in many studies. Recently published articles also survey IoT security and privacy proposals in the literature. P. Williams et al. [40] presented a survey of emerging technologies to counter IoT threats, focusing on machine learning and blockchain approaches. A survey of challenges and sources of threats for IoT was introduced in [15]. K. Najmi et al. [24] summarized threats and countermeasures in IoT systems targeting the confidentiality and reliability of users. One of the oldest surveys of threats and vulnerabilities in the IoT world and their solutions was introduced in [2]. O. Abiodun et al. [1] reviewed IoT security requirements and current research challenges in security and proposed ideas for potential solutions. D. Swessi et al. [38] introduced a comprehensive taxonomy of IoT security issues and the countermeasures used against these threats. A survey of AI-based intrusion detection approaches was presented in [9].

1.2 Contributions

Unlike the surveys mentioned above, which focus on security threats and general countermeasure approaches, our work targets FPGA platforms for deploying techniques to counteract data security issues. More precisely, our survey focuses on specific systems designed and implemented for FPGA-based IoT edge computing instead of general proposals like other surveys. The main contribution of our paper is threefold:

1. We summarize the surveys of IoT security in the literature;
2. We survey FPGA-based designs and implementations for IoT security on edge computing platforms with an emphasis on AI-based approaches, i.e., at the platform layer;
3. We present open issues and challenges for IoT security and threats at the platform layer.
1.3 Outline

The rest of the paper is organized as follows. Section 2 presents the security background for IoT edge computing systems. We introduce FPGA-based security studies for IoT edge devices in Sect. 3. Section 4 surveys the AI-based security studies for edge computing systems in the literature. Open issues for research on FPGA/AI-powered security for edge devices are discussed in Sect. 5. Finally, Sect. 6 concludes our paper.
2 Preliminary

In this section, we first present the IoT architecture layers and the associated security issues and threats. We then summarize the spectrum of publication sources from which we collected articles for this work.

2.1 IoT Layers and Threats

Currently, there is no standard architecture for IoT; proposed architectures can be classified into three, four, or six layers [33]. Therefore, we consider the architecture of an IoT system as three layers. Figure 1 illustrates the 3-layer architecture of an IoT system, including a platform layer with sensors and computing boards, a storage and processing layer, and a users/administrators layer for interacting with people. Each layer is responsible for particular purposes of the system and is associated with potential threats. The figure also presents the potential security issues from which each layer suffers.

2.2 IoT Security vs. Traditional Security

IoT security and conventional security are two domains that address security concerns in distinct environments. Below, we list six key issues of IoT security and threats.

1. Scope: while traditional security primarily focuses on securing physical assets, IoT security extends beyond physical assets to encompass a vast network of interconnected devices, sensors, and data transmission.
2. Attack Surface: IoT security faces threats similar to traditional security, such as unauthorized access, theft, vandalism, and physical breaches, but also deals with unique challenges such as data interception, device manipulation, firmware vulnerabilities, and distributed denial-of-service (DDoS) attacks.
3. Connectivity: IoT devices are designed to be connected to the Internet, enabling data exchange and remote control. This connectivity introduces additional vulnerabilities, as IoT devices can be accessed from anywhere, increasing the potential attack surface.
4. Scale and Complexity: IoT deployments often involve many devices, ranging from small sensors to complex systems. The scale and complexity of IoT networks make security management more challenging than conventional security setups.
6
C. Pham-Quoc
Users
Admin
Users/Administrators Applications
- DDoS - Phishing - Side channel - Virus/Trojan
Storage and processing (Cloud, Servers,...)
- Encrypted data - Authentication - Authorization - DDoS - Devices cloned
Platform layers (sensors, FPGA boards, MCU boards,...)
- Encrypted data - Authentication - Authorization - DDoS - Hardware trojan
IoT architecture layers
Security & threats
Fig. 1. The IoT architecture layers and related security and threats
5. Authentication and Authorization: IoT security requires more advanced authentication mechanisms than traditional security, including digital certificates, secure protocols, and multi-factor authentication, to establish trust and secure communications between devices.
6. Data Privacy: Both traditional and IoT security address data privacy concerns. However, IoT security faces additional privacy challenges due to the confidentiality and integrity requirements of the massive amounts of sensitive data collected by interconnected devices.
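To make item 5 concrete, the sketch below shows a minimal HMAC-based challenge-response authentication between a server and an IoT device, using only the Python standard library. It is an illustrative toy rather than a protocol from any of the surveyed papers: the key provisioning, nonce size, and function names are all assumptions, and a real deployment would combine such a scheme with certificates and an encrypted transport.

```python
import hashlib
import hmac
import secrets

# Pre-shared device key; in practice this would be provisioned securely
# at manufacturing time, not generated at runtime (illustrative only).
DEVICE_KEY = secrets.token_bytes(32)

def server_challenge() -> bytes:
    """Server issues a fresh random nonce so old responses cannot be replayed."""
    return secrets.token_bytes(16)

def device_response(key: bytes, challenge: bytes) -> bytes:
    """Device proves possession of the key without ever transmitting it."""
    return hmac.new(key, challenge, hashlib.sha256).digest()

def server_verify(key: bytes, challenge: bytes, response: bytes) -> bool:
    expected = hmac.new(key, challenge, hashlib.sha256).digest()
    # compare_digest avoids timing side channels during comparison.
    return hmac.compare_digest(expected, response)

challenge = server_challenge()
response = device_response(DEVICE_KEY, challenge)
assert server_verify(DEVICE_KEY, challenge, response)
```

The constant-time comparison matters because edge devices are physically exposed, making timing side channels (discussed under the platform-layer threats above) easier to exploit than in server environments.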
3
FPGA-Based Security for Edge Devices
In this section, we survey published literature work that proposes security solutions at the platform layer targeting FPGA devices. Samir et al. [31] present an implementation of eight different data encryption algorithms on FPGA. The work targets a lightweight hardware-secured IoT computing note. The implementation is deployed in the Xilinx Zynq-7000 FPGA devices that can function at only 10 MHz. Z. Chen et al. [8] present a clocktree-based approach for detecting hardware trojans by extracting mathematical
FPGA/AI-Powered Security IoTs
7
features. The trojans will then be isolated using a neural network. Experimental results on a Xilinx Virtex 7 board show that a detection rate of 100% is obtained. Meenakshi et al. [22] fill up all unutilized logic so the devices are trojan-free without any power consumption overhead and critical paths delay. Various security techniques are implemented, and the dynamic partial reconfiguration (DPR) approach is used for randomly switching the techniques in [36]. Experimental results with a Xilinx Zynq-7000 board show that up to 80% power consumption is reduced at 10 MHz frequency. Bhoyar D. et al. [3] implement a 128-bit AES with VHDL for the security of IoT data. The implementation is simulated with ISIM. Parikibandla S. et al. [28] build the Lorenz Chaotic Circuit with Dual-port Read Only Memory-based PRESENT Algorithm on FPGA Virtex-6 board for IoT sensor nodes. However, the paper does not report any synthesis and experimental results. Sekar et al. [32] introduce an FPGA-based Elliptic Curve Cryptography (ECC) implementation for multi-factor authentication in IoT applications. Experimental results with Verilog-HDL on the Zynq FPGA board show that the proposed system can prevent multi-attacks. CanoQuiveu et al. [6] use SystemVerilog with the Nexys4DDR XC7Z020 FPGA board for building the embedded LUKS for IoT security. The article also reports better results than other related work. Lin et al. [20] introduce an FPGA-based implementation of a secure edge computing device targeting data confidentiality. The system is tested with the Altera Cyclone II DE2-70 board with a 50 MHz working frequency. Gomes et al. [13] present a FAC-V coprocessor to accelerate the AES algorithm for RISC-V processors targeting IoT low-end devices. The proposed system is developed with a Xilinx XC7A100 device resulting in a 65 MHz working frequency. Siva and Murugan show their work with low-area FPGA-based AES implementation for IoT applications. 
One of the main contributions of this work is the Efficient Pseudo Random Number Generator (EPRNG) used to generate keys. The experimental system on a Xilinx Virtex 6 device offers a working frequency of 335.45 MHz. Rajput et al. [30] implement VLSI architectures of WiMax/IoT MAES security approaches for lightweight cryptography with reduced complexity. Simulation results show that the system can work at 23 MHz. Damodharan et al. [10] propose a reliable, lightweight implementation of the PRESENT encryption algorithm for medical IoT applications. Results with the Zynq-7000 FPGA board show an 85.54% throughput improvement at a frequency of 13.56 MHz. A lightweight IoT edge device with ECC consolidated on FPGA is introduced in [19]. The secured device aims at a combination of performance and resource efficiency. Experimental results with a Xilinx Virtex 6 board offer a 117 MHz working frequency. Table 1 summarizes the FPGA-based security proposals for edge devices in the literature. The table shows that most studies use Xilinx FPGA devices for their prototypes. The working frequencies of these systems are quite low, except for the work in [35].
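Two of the surveyed designs ([10, 28]) implement the PRESENT lightweight block cipher in hardware. As a software reference for the algorithm's structure (round-key addition, S-box layer, bit-permutation layer, and 80-bit key schedule), the following pure-Python sketch reimplements PRESENT-80 from the published specification; it is illustrative only and is not code from any cited system.

```python
# Illustrative pure-Python sketch of PRESENT-80 (the cipher used in [10, 28]).
# In hardware, the three per-round layers form one combinational stage.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def round_keys_80(key):
    """Derive the 32 round keys from an 80-bit key (given as an int)."""
    keys = []
    for i in range(1, 33):
        keys.append(key >> 16)                                    # leftmost 64 bits
        key = ((key << 61) | (key >> 19)) & ((1 << 80) - 1)       # rotate left by 61
        key = (SBOX[key >> 76] << 76) | (key & ((1 << 76) - 1))   # S-box on top nibble
        key ^= i << 15                                            # XOR counter into bits 19..15
    return keys

def present80_encrypt(block, key):
    """Encrypt one 64-bit block: 31 full rounds plus a final key addition."""
    keys = round_keys_80(key)
    state = block
    for r in range(31):
        state ^= keys[r]                       # addRoundKey
        s = 0
        for j in range(16):                    # sBoxLayer on 16 nibbles
            s |= SBOX[(state >> (4 * j)) & 0xF] << (4 * j)
        state = 0
        for j in range(64):                    # pLayer bit permutation
            pos = 63 if j == 63 else (16 * j) % 63
            state |= ((s >> j) & 1) << pos
    return state ^ keys[31]

# Known test vector from the PRESENT specification:
print(hex(present80_encrypt(0, 0)))  # 0x5579c1387b228445
```

The bit-level S-box and permutation layers are exactly the operations that map cheaply to FPGA LUTs and wiring, which is why PRESENT is a popular choice for the low-frequency, low-area designs in Table 1.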
C. Pham-Quoc

Table 1. Comparison of the FPGA-based security for edge devices proposals

| Work | Approaches       | FPGA platform     | Frequency  | Year |
|------|------------------|-------------------|------------|------|
| [31] | Various          | Zynq-7000         | 10 MHz     | 2019 |
| [8]  | Clock-tree       | Xilinx Artix-7    | N/A        | 2019 |
| [22] | Fill up LUTs     | Artix-7 FPGA      | N/A        | 2019 |
| [36] | Various + DPR    | Zynq-7000         | 10 MHz     | 2019 |
| [3]  | AES 128 bit      | Simulation        | N/A        | 2020 |
| [28] | PRESENT          | MATLAB simulation | N/A        | 2021 |
| [6]  | LUKS             | Nexys4DDR XC7Z020 | N/A        | 2021 |
| [32] | ECC              | Zynq-7000         | N/A        | 2021 |
| [20] | XOR scheme-based | Altera DE2-70     | 50 MHz     | 2021 |
| [13] | AES + RISC-V     | Xilinx XC7A100    | 65 MHz     | 2022 |
| [35] | AES              | Xilinx Virtex 6   | 335.45 MHz | 2022 |
| [30] | WiMax/IoT MAES   | Simulation        | 23 MHz     | 2023 |
| [10] | PRESENT          | Zynq-7000         | 13.56 MHz  | 2023 |
| [19] | ECC              | Virtex 6          | 117 MHz    | 2023 |
4 AI-Based Security for Edge Devices

In this section, we survey AI-based security for edge devices, where we classify the proposals into processor-based systems and FPGA-based implementations.

4.1 Processor-Based AI Approaches
Abebe et al. [11] design a distributed attack detection scheme with deep learning techniques for IoT security. The proposed system is tested with the NSL-KDD dataset and achieves up to 99.20% accuracy. The introduction of federated self-learning anomaly detection in IoT networks using a self-generated dataset is discussed in [27]. An analysis of effective machine learning models with the new BoT-IoT dataset is introduced in [34] for IoT attack prevention. According to the comparisons, random forest, C4.5, and random tree achieve the best results in terms of accuracy. In [39], researchers test various machine learning algorithms on a generated dataset called MQTTset to detect attacks in IoT networks, achieving an accuracy of 98%. Likewise, in [23], a decentralized, federated learning approach with an ensemble is proposed, combining long short-term memory (LSTM) and gated recurrent units to enable anomaly detection. Elsayed et al. [12] present a Secured Automatic Two-level Intrusion Detection System (SATIDS) using the long short-term memory approach with a newly proposed dataset called ToN-IoT. Deep learning algorithms have also been utilized to create detection models from different IoT datasets. For example, the authors in [21] construct deep belief network models from the CICIDS 2017 dataset to classify regular records and six
attack types, achieving an average accuracy of 97.46%. The Yahoo Webscope S5 dataset is employed in [41] for convolutional neural network (CNN) and recurrent autoencoder algorithms. Lightweight detection models based on a deep autoencoder are generated from the BoT-IoT dataset by the authors in [5], achieving a best-setup F1-score of 97.61%. However, the hardware platform used for experimentation is not mentioned. The dataset used in [29] is self-generated and utilized in a graph neural network, resulting in a reported accuracy of up to 97%. Table 2 summarizes all the above proposals.

Table 2. Comparison of the processor-based AI approaches for security of edge devices

| Work | Dataset           | Accuracy | Platform     | Year |
|------|-------------------|----------|--------------|------|
| [11] | NSL-KDD           | 99.20%   | N/A          | 2018 |
| [27] | Self-generated    | 95.6%    | GPU          | 2019 |
| [34] | BoT-IoT           | 99.99%   | N/A          | 2020 |
| [39] | MQTTset           | 98.0%    | Intel        | 2020 |
| [21] | CICIDS 2017       | 99.4%    | Intel        | 2020 |
| [41] | Yahoo Webscope S5 | 99.6%    | Google Colab | 2020 |
| [23] | Modbus            | 99.5%    | GPU          | 2021 |
| [5]  | BoT-IoT           | 99.0%    | N/A          | 2021 |
| [29] | Self-generated    | 97.0%    | N/A          | 2021 |
| [12] | ToN-IoT           | 99.73%   | Intel        | 2023 |
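Several of the approaches above ([27], [5], [41]) follow the same anomaly-detection pattern: learn a statistical profile of benign traffic, then flag records that deviate from it. The following stdlib-only sketch illustrates that pattern with a simple per-feature z-score test; the feature vectors and threshold are invented for illustration and are not taken from any cited system.

```python
# Toy z-score anomaly detector: learn per-feature statistics from benign
# IoT traffic records, then flag records that deviate too far.
from statistics import mean, stdev

def fit_profile(normal_records):
    """Per-feature (mean, std) learned from benign traffic records."""
    columns = list(zip(*normal_records))
    return [(mean(c), stdev(c)) for c in columns]

def is_anomalous(record, profile, threshold=3.0):
    """Flag a record if any feature deviates more than `threshold` sigmas."""
    return any(abs(x - m) / (s or 1e-9) > threshold
               for x, (m, s) in zip(record, profile))

# Invented 'traffic' features: (packets/s, mean packet size in bytes)
benign = [(10, 500), (12, 520), (11, 480), (9, 510), (10, 495)]
profile = fit_profile(benign)
print(is_anomalous((11, 505), profile))   # normal-looking record -> False
print(is_anomalous((400, 60), profile))   # flood-like record     -> True
```

Production systems replace the z-score with a learned model (autoencoder, LSTM, GNN), but the train-on-benign / flag-outliers structure is the same.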
4.2 FPGA-Based AI Approaches
A neural network implemented on an FPGA SoC for network intrusion detection targeting IoT gateways is presented in [17]. The work uses the NSL-KDD dataset for training and testing and offers a 76 MHz working frequency on the Xilinx Zynq Z-7020 device. The system achieves 80.52% accuracy. A neural network model trained with a GPU is implemented in [26]. The system is built on a Xilinx PYNQ-Z2 board for the inference phase using the high-level synthesis approach. Experimental results with the IoT-23 dataset show that the system achieves a 104 MHz working frequency and 99.43% accuracy. Ngo et al. [25] improve their previous work and present an updated system with an implementation of an intrusion detection system on FPGA based on the IoT-23 dataset. The proposed work is built on a Xilinx PYNQ-Z2 board and offers a 102 MHz working frequency with up to 99.66% accuracy. Table 3 summarizes the FPGA/AI-based proposed systems for IoT security in the literature. However, as the table shows, few studies exist on this topic because AI approaches usually require substantial computational resources as well as deep knowledge of hardware architecture.
Table 3. Comparison of the FPGA-based AI approaches for security of edge devices

| Work | Dataset | Accuracy | FPGA platform  | Frequency | Year |
|------|---------|----------|----------------|-----------|------|
| [17] | NSL-KDD | 80.52%   | Xilinx Z-7020  | 76 MHz    | 2019 |
| [26] | IoT-23  | 99.43%   | Xilinx PYNQ-Z2 | 104 MHz   | 2021 |
| [25] | IoT-23  | 99.66%   | Xilinx PYNQ-Z2 | 102 MHz   | 2023 |
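One reason the FPGA deployments summarized in Table 3 demand hardware expertise is that trained floating-point models are typically converted to narrow fixed-point arithmetic before synthesis. The sketch below illustrates uniform fixed-point weight quantization with saturation; the bit-widths and weight values are invented for illustration and do not come from the cited systems.

```python
# Toy illustration of fixed-point weight quantization for FPGA inference:
# round a float weight to a signed Q-format value, saturating at the range
# limits, and return the value the fixed-point hardware effectively computes.

def quantize(w, int_bits=2, frac_bits=6):
    """Quantize w to signed fixed point with `int_bits + frac_bits` total bits."""
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits - 1))        # most negative code
    hi = (1 << (int_bits + frac_bits - 1)) - 1     # most positive code
    code = max(lo, min(hi, round(w * scale)))      # round, then saturate
    return code / scale                            # de-quantized value

weights = [0.731, -1.204, 0.058, 1.999]            # invented trained weights
print([quantize(w) for w in weights])              # [0.734375, -1.203125, 0.0625, 1.984375]
```

Narrower formats shrink multipliers and on-chip memory at the cost of accuracy, which is the trade-off high-level synthesis flows such as the one in [26] have to balance.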
5 FPGA/AI-Powered Security for Edge Devices: Open Issues
FPGA/AI-powered security for edge devices presents several open issues and challenges. Here are some of the key ones:

1. Performance and Resource Constraints: Edge devices, such as IoT devices or embedded systems, often have limited computational resources and power constraints. Implementing FPGA-based security solutions while ensuring minimal impact on device performance and energy consumption is a significant challenge. Optimizing algorithms and hardware designs to strike a balance between security requirements and resource limitations is crucial.
2. Design Complexity and Development Time: Designing and developing FPGA-based security solutions requires specialized skills and expertise. Creating efficient hardware architectures, designing algorithms, and implementing AI-based models on FPGAs can be complex and time-consuming. The challenge lies in reducing development time and complexity while maintaining robust security measures.
3. Hardware Security Assurance: Ensuring the security of the underlying FPGA hardware is critical for FPGA/AI-powered security solutions. However, FPGAs can be vulnerable to attacks such as reverse engineering, side-channel attacks, and tampering. Protecting the integrity and confidentiality of the FPGA configuration, as well as implementing secure boot mechanisms, are important challenges in this context.
4. Adaptability and Flexibility: Edge devices often operate in dynamic and diverse environments, requiring adaptable and flexible security solutions. FPGA-based security approaches should be capable of accommodating changes in device configurations, network conditions, and security requirements. Ensuring the ability to update FPGA configurations or AI models on the fly to address emerging threats is a challenge.
5. Model Robustness and Reliability: AI-powered security solutions rely heavily on machine learning models for tasks such as anomaly detection, intrusion detection, or malware classification. Ensuring the robustness and reliability of these models is crucial, as they need to be resistant to adversarial attacks and capable of handling real-world variations, noise, and evolving attack techniques.
6. Scalability and Compatibility: Deploying FPGA/AI-powered security solutions across a large number of diverse edge devices requires scalability and compatibility considerations. Ensuring that the solutions can be easily integrated with different hardware architectures, operating systems, and communication protocols is a challenge. Furthermore, accommodating the varying computational capabilities and FPGA resources of different edge devices adds complexity to the deployment process.
7. Interoperability and Standardization: Establishing interoperability standards and frameworks for FPGA-based security solutions can simplify integration, collaboration, and compatibility among different vendors and stakeholders. However, achieving consensus on such standards and promoting their adoption across the industry is an ongoing challenge.
8. Trust and Verification: Building trust in FPGA/AI-powered security solutions is crucial, especially when deploying them in critical or sensitive applications. Ensuring the transparency, verifiability, and auditability of the implemented security mechanisms and AI algorithms is an open issue. Developing techniques for independent verification and validation of FPGA-based security solutions can help establish trust among users and stakeholders.

Addressing these open issues requires collaborative efforts from FPGA manufacturers, AI researchers, security experts, and industry standardization organizations. Continuous research, innovation, and the development of best practices are necessary to enhance the security, performance, and usability of FPGA/AI-powered security solutions for edge devices.
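The configuration-integrity concern raised in issue 3 boils down to authenticating the bitstream before it is loaded. The stdlib-only sketch below shows the software-level idea with an HMAC over the bitstream bytes; the key handling, byte strings, and names here are hypothetical, and real secure-boot flows rely on vendor-specific on-chip mechanisms rather than host-side Python.

```python
# Sketch of bitstream integrity verification (issue 3): check an HMAC-SHA256
# tag over the configuration data before allowing it to be loaded.
import hashlib
import hmac

def bitstream_ok(bitstream: bytes, key: bytes, expected_tag: bytes) -> bool:
    """Return True only if the bitstream's HMAC matches the expected tag."""
    tag = hmac.new(key, bitstream, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected_tag)  # constant-time comparison

key = b"device-provisioned-secret"   # hypothetical per-device key
bitstream = b"\x00\x09\x0f\xf0"      # stand-in bytes, not a real .bit file
good_tag = hmac.new(key, bitstream, hashlib.sha256).digest()

print(bitstream_ok(bitstream, key, good_tag))            # True
print(bitstream_ok(bitstream + b"\x01", key, good_tag))  # tampered -> False
```

The constant-time comparison matters because a naive byte-by-byte equality check can itself leak timing information, one of the side-channel risks mentioned above.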
6 Conclusion
In recent years, the Internet of Things (IoT) has found widespread application in various domains such as environmental monitoring, healthcare, and industry. Despite the efficient introduction of design approaches, technologies, and frameworks for IoT-based applications, there is still a need for extensive research on the security aspects of these systems, both in academia and industry. Field-Programmable Gate Arrays (FPGAs) have emerged as one of the most suitable technologies for IoT edge computing devices, offering numerous advantages over traditional processors. Additionally, the use of artificial intelligence (AI) for data processing in IoT systems has demonstrated increasing benefits. This paper begins by presenting the IoT security threats that have been addressed in numerous studies conducted in recent years. Subsequently, a survey of FPGA/AI-powered security proposals for IoT edge computing platforms is conducted. The studies in this area are classified into three categories for comparative analysis, namely FPGA-based security approaches, the utilization of AI for security in conjunction with traditional processors, and AI-based security solutions implemented on FPGA platforms. Drawing upon these proposals from the existing literature, we identify and discuss the open issues that warrant further investigation in this field of research.

Acknowledgement. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for supporting this study.
References

1. Abiodun, O.I., Abiodun, E.O., Alawida, M., Alkhawaldeh, R.S., Arshad, H.: A review on the security of the internet of things: challenges and solutions. Wireless Pers. Commun. 119, 2603–2637 (2021)
2. Alaba, F.A., Othman, M., Hashem, I.A.T., Alotaibi, F.: Internet of things security: a survey. J. Netw. Comput. Appl. 88, 10–28 (2017)
3. Bhoyar, D.B., Wankhede, S.R., Modod, S.K.: Design and implementation of AES on FPGA for security of IoT data. In: Nain, N., Vipparthi, S.K. (eds.) ICIoTCT 2019. AISC, vol. 1122, pp. 376–383. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39875-0_40
4. Biookaghazadeh, S., Zhao, M., Ren, F.: Are FPGAs suitable for edge computing? In: USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18). USENIX Association, Boston, MA, July 2018. https://www.usenix.org/conference/hotedge18/presentation/biookaghazadeh
5. Bovenzi, G., Aceto, G., Ciuonzo, D., Persico, V., Pescapé, A.: A hierarchical hybrid intrusion detection approach in IoT scenarios. In: GLOBECOM 2020 – 2020 IEEE Global Communications Conference, pp. 1–7. IEEE (2020)
6. Cano-Quiveu, G., et al.: Embedded LUKS (E-LUKS): a hardware solution to IoT security. Electronics 10(23), 3036 (2021)
7. Chaudhary, S., Johari, R., Bhatia, R., Gupta, K., Bhatnagar, A.: CRAIoT: concept, review and application(s) of IoT. In: 2019 4th International Conference on Internet of Things: Smart Innovation and Usages (IoT-SIU), pp. 1–4 (2019). https://doi.org/10.1109/IoT-SIU.2019.8777467
8. Chen, Z., Guo, S., Wang, J., Li, Y., Lu, Z.: Toward FPGA security in IoT: a new detection technique for hardware trojans. IEEE Internet Things J. 6(4), 7061–7068 (2019)
9. Da Costa, K.A., Papa, J.P., Lisboa, C.O., Munoz, R., de Albuquerque, V.H.C.: Internet of things: a survey on machine learning-based intrusion detection approaches. Comput. Netw. 151, 147–157 (2019)
10. Damodharan, J., Susai Michael, E.R., Shaikh-Husin, N.: High throughput PRESENT cipher hardware architecture for the medical IoT applications. Cryptography 7(1), 6 (2023)
11. Diro, A.A., Chilamkurti, N.: Distributed attack detection scheme using deep learning approach for internet of things. Futur. Gener. Comput. Syst. 82, 761–768 (2018)
12. Elsayed, R.A., Hamada, R.A., Abdalla, M.I., Elsaid, S.A.: Securing IoT and SDN systems using deep-learning based automatic intrusion detection. Ain Shams Eng. J., 102211 (2023)
13. Gomes, T., Sousa, P., Silva, M., Ekpanyapong, M., Pinto, S.: FAC-V: an FPGA-based AES coprocessor for RISC-V. J. Low Power Electron. Appl. 12(4), 50 (2022)
14. Hasan, M.: IoT in healthcare: 20 examples that'll make you feel better, 2 April 2020. https://www.ubuntupit.com/iot-in-healthcare-20-examples-thatllmake-you-feel-better. Accessed 22 May 2023
15. Hassija, V., Chamola, V., Saxena, V., Jain, D., Goyal, P., Sikdar, B.: A survey on IoT security: application areas, security threats, and solution architectures. IEEE Access 7, 82721–82743 (2019)
16. Hossain, M.M., Fotouhi, M., Hasan, R.: Towards an analysis of security issues, challenges, and open problems in the internet of things. In: 2015 IEEE World Congress on Services, pp. 21–28. IEEE (2015)
17. Ioannou, L., Fahmy, S.A.: Network intrusion detection using neural networks on FPGA SoCs. In: 2019 29th International Conference on Field Programmable Logic and Applications (FPL), pp. 232–238. IEEE (2019)
18. Lanner: Examples of IoT devices in your next smart home, 10 September 2018. https://www.lanner-america.com/blog/5-examples-iotdevices-next-smarthome. Accessed 22 May 2023
19. Lin, J.L., Zheng, P.Y., Chao, P.C.P.: A new ECC implemented by FPGA with favorable combined performance of speed and area for lightweight IoT edge devices. Microsyst. Technol., 1–10 (2023)
20. Lin, W.C., Huang, P.K., Pan, C.L., Huang, Y.J.: FPGA implementation of mutual authentication protocol for medication security system. J. Low Power Electron. Appl. 11(4), 48 (2021)
21. Manimurugan, S., Al-Mutairi, S., Aborokbah, M.M., Chilamkurti, N., Ganesan, S., Patan, R.: Effective attack detection in internet of medical things smart environment using a deep belief neural network. IEEE Access 8, 77396–77404 (2020)
22. Meenakshi, S., Nirmala Devi, M.: Configuration security of FPGA in IoT using logic resource protection. In: Sengodan, T., Murugappan, M., Misra, S. (eds.) Advances in Electrical and Computer Technologies: Select Proceedings of ICAECT 2021, pp. 625–633. Springer, Singapore (2022). https://doi.org/10.1007/978-981-19-1111-8_47
23. Mothukuri, V., Khare, P., Parizi, R.M., Pouriyeh, S., Dehghantanha, A., Srivastava, G.: Federated-learning-based anomaly detection for IoT security attacks. IEEE Internet Things J. 9(4), 2545–2554 (2021)
24. Najmi, K.Y., AlZain, M.A., Masud, M., Jhanjhi, N., Al-Amri, J., Baz, M.: A survey on security threats and countermeasures in IoT to achieve users confidentiality and reliability. Mater. Today Proc. (2021)
25. Ngo, D.M., et al.: HH-NIDS: heterogeneous hardware-based network intrusion detection framework for IoT security. Future Internet 15(1), 9 (2023)
26. Ngo, D.M., Temko, A., Murphy, C.C., Popovici, E.: FPGA hardware acceleration framework for anomaly-based intrusion detection system in IoT. In: 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), pp. 69–75. IEEE (2021)
27. Nguyen, T.D., Marchal, S., Miettinen, M., Fereidooni, H., Asokan, N., Sadeghi, A.R.: DÏoT: a federated self-learning anomaly detection system for IoT. In: 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 756–767. IEEE (2019)
28. Parikibandla, S., Sreenivas, A.: FPGA performance evaluation of PRESENT cipher using LCC key generation for IoT sensor nodes. In: Chowdary, P.S.R., Chakravarthy, V.V.S.S.S., Anguera, J., Satapathy, S.C., Bhateja, V. (eds.) Microelectronics, Electromagnetics and Telecommunications. LNEE, vol. 655, pp. 371–379. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-3828-5_39
29. Protogerou, A., Papadopoulos, S., Drosou, A., Tzovaras, D., Refanidis, I.: A graph neural network method for distributed anomaly detection in IoT. Evol. Syst. 12, 19–36 (2021)
30. Rajput, G.S., Thakur, R., Tiwari, R.: VLSI implementation of lightweight cryptography technique for FPGA-IoT application. Mater. Today Proc. (2023)
31. Samir, N., et al.: ASIC and FPGA comparative study for IoT lightweight hardware security algorithms. J. Circuits Syst. Comput. 28(12), 1930009 (2019)
32. Sekar, S.R., Elango, S., Philip, S.P., Raj, A.D.: FPGA implementation of ECC enabled multi-factor authentication (E-MFA) protocol for IoT based applications. In: Arunachalam, V., Sivasankaran, K. (eds.) ICMDCS 2021. CCIS, vol. 1392, pp. 430–442. Springer, Singapore (2021). https://doi.org/10.1007/978-981-16-5048-2_34
33. Sethi, P., Sarangi, S.R.: Internet of things: architectures, protocols, and applications. J. Electric. Comput. Eng. 2017 (2017)
34. Shafiq, M., Tian, Z., Sun, Y., Du, X., Guizani, M.: Selection of effective machine learning algorithm and Bot-IoT attacks traffic identification for internet of things in smart city. Futur. Gener. Comput. Syst. 107, 433–442 (2020)
35. Siva Balan, N., Murugan, B.: Low area FPGA implementation of AES architecture with EPRNG for IoT application. J. Electron. Test. 38(2), 181–193 (2022)
36. Soliman, S., et al.: FPGA implementation of dynamically reconfigurable IoT security module using algorithm hopping. Integration 68, 108–121 (2019)
37. Statista Research Department: Internet of things - number of connected devices worldwide 2015–2025 (2016). https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/. Accessed 1 Apr 2023
38. Swessi, D., Idoudi, H.: A survey on internet-of-things security: threats and emerging countermeasures. Wireless Pers. Commun. 124(2), 1557–1592 (2022)
39. Vaccari, I., Chiola, G., Aiello, M., Mongelli, M., Cambiaso, E.: MQTTset, a new dataset for machine learning techniques on MQTT. Sensors 20(22), 6578 (2020)
40. Williams, P., Dutta, I.K., Daoud, H., Bayoumi, M.: A survey on security in internet of things with a focus on the impact of emerging technologies. Internet Things 19, 100564 (2022)
41. Yin, C., Zhang, S., Wang, J., Xiong, N.N.: Anomaly detection based on convolutional recurrent autoencoder for IoT time series. IEEE Trans. Syst. Man Cybern. Syst. 52(1), 112–122 (2020)
A Review in Deep Learning-Based Thyroid Cancer Detection Techniques Using Ultrasound Images

Le Chieu Long¹, Y. Bui Hoang¹, Nguyen Luong Trung¹, Bui Tuan Dung¹, Thi-Thao Ha², and Luong Vuong Nguyen²

¹ FPT University, Danang, Vietnam
{longlcde160374,ybhde160208,trungnlde170311,dungbtde160632}@fpt.edu.vn
² Department of Artificial Intelligence, FPT University, Danang, Vietnam
{thaoht32,vuongnl3}@fe.edu.vn
Abstract. Early detection of thyroid cancer nodules leads to the most specific and effective treatments, significantly reducing morbidity and mortality. Traditional ultrasound imaging has been widely used in the early detection of thyroid nodules. However, applying traditional methods is time-consuming, costly, and sometimes ineffective because of the direct intervention of machines in the human body. Therefore, the development and application of deep learning methods in the diagnostic process are of great significance, and such methods have improved the objectivity and quality of diagnosis. This review evaluates many aspects of deep learning methods, including CNNs, GANs, and ThyNet. The results show that the best-performing method, CascadeMaskR-CNN, achieves 94% accuracy, 93% sensitivity, and up to 95% specificity. In addition, the other methods reviewed report relatively good metrics and strong learning capability.

Keywords: deep learning · thyroid cancer detection · image processing
1 Introduction
The thyroid is known to be the body's largest endocrine gland. It is located in the anterior neck region, in front of the trachea, consisting of the left and right lobes connected by an isthmus, forming a butterfly shape [1]. The thyroid plays a vital role in producing hormones [11], which help regulate the functioning of the body's cells and tissues. Abnormal growth of thyroid cells is one of the hallmarks of thyroid cancer [23], which forms malignant tumors in the thyroid region. This type of cancer is more common in women than in men [36]. The most recent estimate for thyroid cancer cases and deaths reported by the American Cancer Society in 2022 is approximately 43,800 new cases, nearly 20 times the number of deaths [34].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 15–25, 2023. https://doi.org/10.1007/978-3-031-46573-4_2
In recent years, it has become clear that treating malignant thyroid nodules detected early, before the cancer cells have spread, can make treatment more effective and less harmful [24]. There are two primary methods used to detect malignant thyroid nodules: (1) palpation of the neck during the physical examination and (2) ultrasonography, which can detect palpable and non-palpable nodules, especially those less than 1 cm in diameter [17]. Of these, ultrasound is the primary method for detecting the characteristics of thyroid nodules, with a standardized procedure that requires medical imaging. Computerized tomography (CT), magnetic resonance imaging (MRI), radioiodine scintigraphy, and positron emission tomography (PET) are widely used and universal diagnostic tools [21]. However, these tools cannot adequately differentiate between thyroid nodules, so fine-needle aspiration (FNA) is performed [8,35]. FNA is safe, low-cost, minimally invasive, and has high diagnostic accuracy for nodules [4]. However, the accuracy of FNA diagnosis is highly dependent on the experience and expertise of the physician and pathologist: about 30% of FNA results are undetermined or misdiagnosed due to inexperienced doctors and pathologists [25]. Therefore, thyroid cancer diagnosis using deep learning has emerged. Specifically, diagnosing thyroid nodules through deep learning-based image recognition to distinguish benign from malignant nodules has reduced the burden on physicians and pathologists and avoided unnecessary FNA procedures. In the past decade, to improve diagnosis rates and reduce losses due to misdiagnosis, the diagnosis of human thyroid cancer has resorted to computer-aided diagnosis (CAD) [16,33]. The limitations of early CAD systems have been overcome and improved by developments in machine learning and artificial intelligence.
CAD systems use AI-based deep learning and machine learning techniques to diagnose thyroid nodules automatically, more accurately, and more intelligently [10]. To develop CAD systems, experts used thyroid nodule images to extract many features, such as histogram parameters, fractal dimension, and mean luminance values in different grayscale bands [29]. Many systems based on deep learning and machine learning algorithms have been studied for automatic thyroid detection in recent years, such as support vector machines (SVMs) [29] and deep convolutional neural networks [12]. In this article, we aim to discuss deep learning techniques and their applications in the early diagnosis of thyroid cancer. We will not go too deeply into the algorithms or their implementation; instead, we will focus on each algorithm's definition, impact, and advantages. We will then make objective comparisons through a statistical table of the essential parameters of the methods. In addition, we will highlight the drawbacks, challenges, and difficulties in applying these methods in practice. Finally, we will summarize the tasks that need to be researched and implemented to improve the quality and effectiveness of this line of research.
2 Deep Learning-Based Thyroid Cancer Detection Using Ultrasound Image

2.1 Convolutional Neural Networks - CascadeMaskR-CNN
Convolutional neural networks (CNNs) are state-of-the-art in computer vision and excel at many tasks on par with or beyond human performance [14]. It has been shown that CNNs, in particular, offer great potential for solving fine-grained categorization problems [28,32]. Neurons in a CNN are self-optimizing, similar to those in conventional artificial neural networks. Neurons are the building blocks of artificial neural networks, each receiving data and carrying out an operation. CNNs are commonly used in image recognition, medical image analysis, image segmentation, and many other applications because their design naturally handles 2D image data [3,13,40]. They can automatically identify essential features in the input without human involvement, making them more effective than a typical network. A CNN usually consists of three layer types: convolutional, pooling, and fully connected. Each layer conducts a different task on its input data. The input image of a CNN is a matrix of pixel values, each representing a pixel in the image; the brightness and color of each pixel are specified by its pixel value [3,13,40]. For example, a photo might be 300 × 300 pixels, i.e., 90,000 pixel values. To process this efficiently, a CNN uses filtering to extract the features of the input image. The convolutional layer plays an essential role in the overall structure of the CNN. This layer applies a collection of filters, or kernels, to the input data. The convolutional layer has two hyperparameters: kernel size and number of filters [37]. The layer splits the input image into a series of local patches of a predefined size, the same for all filters of the layer (determined by the kernel size) [15]. The filters of a convolutional layer have the same number of channels as the current input [15]; the filter and patch elements are multiplied in pairs and added to produce a single value.
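The convolution step just described is a sliding multiply-accumulate, and the pooling step discussed next reduces each local region to a summary value. The stdlib-only sketch below shows both operations on a toy single-channel input; the image and kernel values are invented for illustration and have nothing to do with thyroid ultrasound data.

```python
# Minimal 2D convolution (valid padding, stride 1) and 2x2 max pooling
# (stride 2) on a toy grayscale image, to make the text's description concrete.

def conv2d(image, kernel):
    """Slide the kernel over the image; each output is a multiply-accumulate."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool2x2(fmap):
    """Replace each 2x2 region of the feature map by its maximum value."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

image = [[1, 2, 0, 1, 3],
         [4, 1, 1, 0, 1],
         [0, 2, 3, 1, 0],
         [1, 0, 1, 2, 2],
         [2, 1, 0, 1, 1]]
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]            # toy vertical-edge detector

fmap = conv2d(image, edge_kernel)     # 3x3 feature map
print(max_pool2x2(fmap))              # [[3]]
```

Real CNN layers apply many such kernels in parallel (one feature map per filter) and learn the kernel values by backpropagation rather than fixing them by hand.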
This value is then passed through the network's activation function. In this way, the network extracts essential features from the original image. The convolutional layer therefore generates several feature maps, which are fed into a subsequent convolutional or pooling layer. Pooling is used to reduce the size of the feature map while keeping the essential data. In the pooling layer, a filter (max, min, or average) is slid over the feature map [37]. The most frequently used variant is max pooling, which reduces the complexity of the upper layers. It lowers the resolution of the feature map, but the number of filters is not affected by pooling [37]. Max pooling summarizes each region of pixel values by its maximum value. This process reduces model complexity and helps prevent overfitting, at the cost of some loss of information. Various methods have been used in ultrasound imaging to classify and detect thyroid cancer. Artificial neural networks (ANNs) and convolutional neural networks (CNNs) are the most popular and effective deep learning models for categorizing thyroid nodules. ANNs have shown an accuracy rate of approximately 82%
in differentiating between benign and malignant thyroid nodules [42,44]. However, recent studies have demonstrated higher accuracy with CNNs compared to previous research [2]. A combined SVM and CNN approach achieved an accuracy of 92.5%, a sensitivity of 96.4%, and a specificity of 83.1% [2]. Another variant, CascadeMaskR-CNN, has been successfully applied to distinguish benign tumors from melanoma with 94% accuracy, 93% sensitivity, and 95% specificity [20]. CascadeMaskR-CNN, based on the original Mask R-CNN model, improves diagnostic accuracy by 3.2% and demonstrates superior performance in diagnosing thyroid nodules compared to the original model [20]. The cascade approach focuses on confident detections in the initial stage and progressively refines the results, leading to better performance, particularly for small objects and low-contrast instances. CascadeMaskR-CNN has achieved state-of-the-art performance in object detection tasks.

2.2 VGG16, VGG19, and Inception v3
A convolutional neural network (CNN) using the VGG-16 model as a backbone has been proposed to process ultrasound images [34]. VGG-16 is a deep convolutional neural network consisting of 16 layers that combine repeated 3 × 3 convolutional layers with 2 × 2 pooling layers, and it can extract notable features to achieve better results in image classification. Each nodule image is first converted into a hierarchical tile-based data structure and processed to produce the proposed CNN's nodule segmentation results. The proposed method is compared to the U-Net model [7] using the Thyroid Digital Image Database (TDID) [26]. TDID consists of 400 ultrasound images from 298 patients, each 560 × 360 pixels in size with accompanying diagnostic descriptions. The images were scored using the TI-RADS system [38], indicating the risk of malignant nodules. VGG16 outperforms U-Net in accuracy, achieving 99% overall accuracy and 98% sensitivity, while U-Net achieves 96% overall accuracy and 95.2% sensitivity [18]. The Inception-v3 network was pre-trained on the ImageNet database and fine-tuned for thyroid nodule analysis. The network structure of Inception-v3 consists of three types of Inception modules, all of which contain several small convolutional and pooling layers [9]. Some studies have attempted to develop deep learning networks to differentiate thyroid nodules in US images; however, obtaining many images from a single institution is difficult. Fudan University Cancer Center is a university-affiliated hospital that treats thousands of thyroid cancer patients annually; the hospital has therefore provided an extensive dataset, including 2,836 images from 2,235 patients. Most importantly, all PTC nodules were surgically removed and confirmed by pathology.
Based on this dataset, the study provided strong evidence that Inception-v3 has accuracy similar to that of experienced radiologists in differentiating PTC from benign nodules, confirming the potential of Inception-v3 to provide a second opinion, especially when radiologists are inexperienced [9]. In an experiment with 399 images, the sensitivity and specificity of Inception-v3 were 93.3% (195/209) and 87.4% (166/190), respectively; for comparison, the sensitivity and specificity of the radiologists were 84.7% (177/209) and 97.9% (186/190), respectively. Although overall sensitivity and
A Review in DL-Based Thyroid Cancer Detection
19
specificity are close, Inception-v3 is more accurate in diagnosing PTC but less accurate in diagnosing benign nodules than experienced radiologists, indicating that the features of PTC can be readily captured by Inception-v3 [9]. The final convolutional neural network is VGG-19, a transfer learning model based on ImageNet; its architecture extends VGG-16, with 16 convolutional layers and three fully connected layers. Here we summarize, for clinical practitioners, the practical settings and parameters verified for the VGG-19 and Inception-v3 models. First, the ImageNet project is a tool to promote computer vision and deep learning research. ImageNet provides image databases based on the WordNet classification system, and the data are provided free of charge to researchers for non-commercial applications [31]. The database has been manually annotated with over 14 million images. From 2010 to 2017, ImageNet organized an annual competition (abbreviated ILSVRC) to evaluate algorithms for object detection and image classification. The CNNs used in this study for transfer learning achieved the highest classification accuracy (Inception) and the highest localization results (VGG) in 2014. Since 2015, deep learning image classification accuracy has exceeded 95%, surpassing human performance. In recent years, newer CNNs (e.g., SENet) have achieved even higher accuracy; however, the difference is insignificant. Most previous thyroid cancer imaging studies using Inception, ResNet, and VGG have achieved acceptable accuracy and are considered suitable for transfer learning in current research. We found that the less complex CNN (Inception) was slightly faster than VGG in training and classification; however, the overall classification accuracy was almost identical. Furthermore, patients with benign thyroid nodules confirmed by surgery between January 2016 and July 2020 were also enrolled [5]. 
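The sensitivities and specificities quoted above follow directly from the raw counts reported in parentheses; a quick check with our own toy helper functions:

```python
# Reproducing the reported results from the raw counts quoted above
# (e.g., Inception-v3: 195 of 209 malignant and 166 of 190 benign
# nodules classified correctly).

def sensitivity(tp, fn):
    return tp / (tp + fn)      # true-positive rate

def specificity(tn, fp):
    return tn / (tn + fp)      # true-negative rate

sens = sensitivity(195, 209 - 195)
spec = specificity(166, 190 - 166)
print(round(100 * sens, 1), round(100 * spec, 1))   # 93.3 87.4
```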
The classification performance of the retrained CNNs was much better than that of the participating doctors, especially for the malignant groups. The doctors' poor diagnostic performance on malignant tumors leads to poor sensitivity. In clinical practice, endocrinologists or radiologists usually consider malignant features presented by the entire image, not just the low-resolution area cropped around the tumor. However, with the help of fine-needle aspiration (FNA) and cytology analysis, the sensitivity of doctors can be comparable to CNNs trained only on ultrasound images. The significantly low accuracy in classifying malignant tumors by doctors also indicates difficulties in clinical diagnosis. Diagnostic accuracy was as follows: InceptionV3 (76.5%), VGG19 (76.1%), Endocrinologist 1 (58.8%), and Endocrinologist 2 (62%). Sensitivity was as follows: InceptionV3 (83.7%), VGG19 (66.2%), Endocrinologist 1 (38.7%), and Endocrinologist 2 (35.3%). These accuracy and sensitivity figures demonstrate the superior diagnostic performance of CNNs over doctors [5].

2.3 ThyNet
In recent years, many deep learning AI models have been developed to improve the diagnosis and treatment of thyroid cancer, including ThyNet. ThyNet was designed to distinguish malignant from benign tumors, helping to
20
L. C. Long et al.
diagnose thyroid cancer more effectively. In addition, applying ThyNet to the reading of ultrasound images reduces the need for doctors and pathologists to use traditional invasive methods such as aspiration. Several large-scale studies have shown that ThyNet partly improves radiologists' performance after reviewing ultrasound images. Peng et al. conducted an experiment using the ThyNet model to distinguish malignant from benign tumors. The investigation aimed to answer whether, with the help of ThyNet, doctors can improve the diagnosis of thyroid cancer by reading ultrasound images and videos. As expected, ThyNet performed outstandingly as a powerful tool assisting physicians in diagnosis with minimal use of aspiration devices: the rate of fine-needle aspiration cytology (FNAC) use, nearly 62%, was reduced to almost half in this study with the help of ThyNet [27].

2.4 Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs) [22] have had massive success since they were introduced in 2014 by Ian J. Goodfellow and co-authors in the article "Generative Adversarial Nets", and they have gained much attention among artificial intelligence researchers [30]. GANs are inspired by the idea of two-player zero-sum games. They are used in many applications, including image synthesis, semantic image editing, style transfer, image super-resolution, and classification [6]. The high expense of collecting many real medical images has led to the use of GANs to synthesize medical images for training medical image recognition models [46]. To synthesize medical images, deep convolutional GAN (DCGAN), Wasserstein GAN (WGAN), and boundary equilibrium GAN (BEGAN) have been deployed and compared [43]. Moreover, to capture feature representations of medical images with high semantic information content, GAN models have applied convolutional neural networks (CNNs) [48]. In a GAN, synthesized images are created by a generator network that maps random noise to images [43], and a discriminator network then tries to distinguish the synthesized images from real images [41].
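The two-player zero-sum game mentioned above can be written out concretely. The following is a minimal sketch of the standard GAN objectives (our own illustration, not the cited medical-imaging GANs):

```python
import math

# Minimal sketch of the standard GAN objectives (our own illustration,
# not the cited medical-imaging GANs). The discriminator D outputs a
# probability that its input is real; the generator G is trained to
# push D(G(z)) toward 1.

def discriminator_loss(d_real, d_fake):
    # -[log D(x) + log(1 - D(G(z)))], minimized by the discriminator
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    # non-saturating generator objective -log D(G(z)), minimized by G
    return -math.log(d_fake)

# At the zero-sum game's equilibrium the discriminator cannot tell real
# from synthesized samples, so D outputs 0.5 and its loss is 2*log(2).
print(discriminator_loss(0.5, 0.5))                # 2 * log(2), about 1.386
print(generator_loss(0.5) > generator_loss(0.9))   # True: fooling D lowers G's loss
```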
3 Discussion
Table 1 shows that the SVM+CNN, CascadeMaskR-CNN, Inception-v3, VGG-19, VGG-16, ThyNet, and GAN models were all used to analyze ultrasound images for thyroid cancer diagnosis. However, the models have different performance parameters, and based on these parameters we identify the model believed to be best for thyroid cancer diagnosis. The SVM+CNN, CascadeMaskR-CNN, Inception-v3, VGG-19, and ThyNet models all achieved high accuracy, with CascadeMaskR-CNN achieving the highest accuracy at 94% and being considered the preferred model, since it also offers the best balance of sensitivity and specificity, at 93% and 95%, respectively. These results show that the SVM+CNN,
Table 1. The experimental results of existing recent models that use ultrasound image datasets to detect thyroid cancer

Method                   Sensitivity   Specificity   Accuracy
SVM+CNN [2]              92.5%         –             –
CascadeMaskR-CNN [20]    93.0%         95.0%         94.0%
VGG-16 [39]              63.1%         74.7%         –
Inception-v3 [9]         93.3%         87.4%         –
VGG-19 [5]               66.2%         –             76.1%
Inception-v3 [5]         83.7%         –             76.5%
ThyNet [19]              94.0%         81.0%         89.0%
GAN [41]                 95.0%         –             98.8%
CascadeMaskR-CNN, Inception-v3, and ThyNet models can diagnose thyroid cancer well. However, CascadeMaskR-CNN is the preferred model due to its highest accuracy and best sensitivity and specificity parameters. With increased diagnostic accuracy and efficiency, patients can be relieved from the mental and financial pressures of current clinical diagnostic procedures, while clinicians can make more effective decisions about thyroidectomy. The use of ultrasound images for early diagnosis of thyroid cancer has become one of the primary modern technologies pioneering the use of deep learning to improve the accuracy of the diagnostic process. These deep learning methods are largely safe because they do not use body-invasive tools and procedures like the traditional methods used before. Traditional representations, such as plain vectors or matrices, may be suboptimal because they cannot accurately capture disease development patterns, which depend heavily on how information is synthesized over time. Moreover, applying deep learning techniques to the diagnostic process also helps reduce time and cost. Thanks to computers and modern ultrasound technologies, image acquisition is quicker and more economical. Although there are many advantages to applying deep learning methods in the early diagnosis of thyroid cancer, our review has identified several challenges and difficulties that need to be overcome in the future.

– Data volume: To build the most accurate, up-to-date, and complete deep learning model, the prerequisite is a sufficiently large, even huge, amount of data. However, this is challenging to achieve because access to healthcare varies from place to place, and many people do not have access to standard medical services.
– Data quality: Unlike other fields, such as education and technology, medicine is one of the most challenging fields in which to obtain reliable, highly accurate, and continuously updated data sources. The data obtained are often ambiguous, flawed, and sometimes inconsistent.
– Training problem: Because the volume of data collected from the medical field is not large enough, although the application of deep learning to diagnosis is very promising, training remains very limited.
– Timeliness: With existing deep learning methods, we have not yet achieved timeliness of the research data, making it challenging to find appropriate and modern disease treatments.
– Interpretability: In healthcare, it is not just the performance of algorithms that matters but also why and how they work. To achieve high confidence, interpreting the algorithm's decision process is very important in convincing experts and doctors to trust the results of early prediction of thyroid cancer.

Given the challenges and difficulties in applying deep learning to the diagnosis of thyroid cancer outlined above, we hope that in the future these shortcomings can be overcome and the challenges minimized, in terms of the following aspects.

– Expanding the dataset: A good and sufficiently large dataset will help promote the smooth running of the research process and the accuracy of research methods. Therefore, to achieve a large dataset, health services must be universalized and made more widely accessible in the future.
– Improving data accuracy: In addition to expanding the dataset, the accuracy of the dataset also needs attention. It is therefore necessary to improve the quality of the data collected (e.g., sensitivity parameters).
– Improving the interpretability of deep learning algorithms: The performance of research methods is extremely important, but clarifying the steps leading to algorithmic results is also necessary. So, instead of only studying an algorithm's performance, we should also explore the essential factors behind it.
– Integration of specialized knowledge: Integrating specialized knowledge and skills into the research process is urgently needed to increase reliability and guide the research in the right direction without deviating and wasting time and cost.
4 Conclusion
Controlling the division and rapid spread of cancer into surrounding cells, specifically thyroid cancer, is an urgent necessity today. In particular, early detection of cancerous nodules helps manage the disease better and reduce the number of deaths. The development of deep learning-based models has contributed to a relative improvement in the diagnosis of thyroid cancer nodules by physicians and medical professionals. Accuracy, sensitivity, and specificity results have demonstrated that thyroid cancer diagnosis benefits significantly from these methods. Through research on and comparison of the collected datasets for these methods, we have concluded that CNN-based approaches currently offer higher accuracy and sensitivity than other methods. However, this does not prove that other approaches are weaker. Therefore, our primary goal is to expand the dataset so that it is large enough to lay the foundation for objective, accurate, and complete conclusions and to act as a metric for evaluating methods, in the hope of optimizing the diagnostic process of thyroid cancer.
References

1. Abbad Ur Rehman, H., Lin, C.Y., Mushtaq, Z., Su, S.F.: Performance analysis of machine learning algorithms for thyroid disease. Arab. J. Sci. Eng., 1–13 (2021)
2. Anari, S., Tataei Sarshar, N., Mahjoori, N., Dorosti, S., Rezaie, A.: Review of deep learning approaches for thyroid cancer diagnosis. Math. Probl. Eng. 2022 (2022)
3. Bačanin Džakula, N., et al.: Convolutional neural network layers and architectures. In: Sinteza 2019 - International Scientific Conference on Information Technology and Data Related Research, pp. 445–451. Singidunum University (2019)
4. Baloch, Z.W., Fleisher, S., LiVolsi, V.A., Gupta, P.K.: Diagnosis of "follicular neoplasm": a gray zone in thyroid fine-needle aspiration cytology. Diagn. Cytopathol. 26(1), 41–44 (2002)
5. Chan, W.K., et al.: Using deep convolutional neural networks for enhanced ultrasonographic image diagnosis of differentiated thyroid cancer. Biomedicines 9(12), 1771 (2021)
6. Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., Bharath, A.A.: Generative adversarial networks: an overview. IEEE Signal Process. Mag. 35(1), 53–65 (2018)
7. Falk, T., et al.: U-Net: deep learning for cell counting, detection, and morphometry. Nat. Methods 16(1), 67–70 (2019)
8. Friedrich-Rust, M., et al.: Interobserver agreement of thyroid imaging reporting and data system (TIRADS) and strain elastography for the assessment of thyroid nodules. PLoS ONE 8(10), e77927 (2013)
9. Guan, Q., et al.: Deep learning based classification of ultrasound images for thyroid nodules: a large scale of pilot study. Ann. Transl. Med. 7(7) (2019)
10. Jin, Z., et al.: Ultrasound computer-aided diagnosis (CAD) based on the thyroid imaging reporting and data system (TI-RADS) to distinguish benign from malignant thyroid nodules and the diagnostic performance of radiologists with different diagnostic experience. Med. Sci. Monit. Int. Med. J. Exp. Clin. Res. 26, e918452-1 (2020)
11. Kenigsberg, J.: Thyroid cancer associated with the Chernobyl accident. In: Encyclopedia of Environmental Health, pp. 55–64 (2011)
12. Ko, S.Y., et al.: Deep convolutional neural network for the diagnosis of thyroid nodules on ultrasound. Head Neck 41(4), 885–891 (2019)
13. Koushik, J.: Understanding convolutional neural networks. arXiv preprint arXiv:1605.09081 (2016)
14. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
15. Lee, H., Song, J.: Introduction to convolutional neural network using Keras; an understanding from a statistician. Commun. Stat. Appl. Methods 26(6), 591–610 (2019)
16. Li, X., et al.: Diagnosis of thyroid cancer using deep convolutional neural network models applied to sonographic images: a retrospective, multicohort, diagnostic study. Lancet Oncol. 20(2), 193–201 (2019)
17. Lin, J.S., Bowles, E.J.A., Williams, S.B., Morrison, C.C.: Screening for thyroid cancer: updated evidence report and systematic review for the US Preventive Services Task Force. JAMA 317(18), 1888–1903 (2017)
18. Lin, Y.J., et al.: Deep learning fast screening approach on cytological whole slides for thyroid cancer diagnosis. Cancers 13(15), 3891 (2021)
19. Liu, Y., Liang, J., Peng, S., Wang, W., Xiao, H.: A deep-learning model to assist thyroid nodule diagnosis and management - authors' reply. Lancet Digit. Health 3(7), e411–e412 (2021)
20. Lu, Y., Yang, Y., Chen, W.: Application of deep learning in the prediction of benign and malignant thyroid nodules on ultrasound images. IEEE Access 8, 221468–221480 (2020)
21. Meier, C.A.: Role of imaging in thyroid disease. In: Diseases of the Brain, Head & Neck, Spine: Diagnostic Imaging and Interventional Techniques, 40th International Diagnostic Course in Davos (IDKD), Davos, 30 March–4 April 2008, pp. 243–250 (2008)
22. Nguyen, L.V., Vo, N.D., Jung, J.J.: DaGzang: a synthetic data generator for cross-domain recommendation services. PeerJ Comput. Sci. 9, e1360 (2023)
23. Noone, A.M., et al.: Cancer incidence and survival trends by subtype using data from the Surveillance, Epidemiology, and End Results program, 1992–2013. Cancer Epidemiol. Biomark. Prev. 26(4), 632–641 (2017)
24. Olson, E., Wintheiser, G., Wolfe, K.M., Droessler, J., Silberstein, P.T.: Epidemiology of thyroid cancer: a review of the National Cancer Database, 2000–2013. Cureus 11(2) (2019)
25. Ouyang, F.S., et al.: Comparison between linear and nonlinear machine-learning algorithms for the classification of thyroid nodules. Eur. J. Radiol. 113, 251–257 (2019)
26. Pedraza, L., Vargas, C., Narváez, F., Durán, O., Muñoz, E., Romero, E.: An open access thyroid ultrasound image database. In: 10th International Symposium on Medical Information Processing and Analysis, vol. 9287, pp. 188–193. SPIE (2015)
27. Peng, S., et al.: Deep learning-based artificial intelligence model to assist thyroid nodule diagnosis and management: a multicentre diagnostic study. Lancet Digit. Health 3(4), e250–e259 (2021)
28. Polap, D.: Analysis of skin marks through the use of intelligent things. IEEE Access 7, 149355–149363 (2019)
29. Prochazka, A., Gulati, S., Holinka, S., Smutek, D.: Classification of thyroid nodules in ultrasound images using direction-independent features extracted by two-threshold binary decomposition. Technol. Cancer Res. Treat. 18, 1533033819830748 (2019)
30. Rocca, J.: Understanding generative adversarial networks (GANs). Medium 7, 20 (2019)
31. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)
32. Seeland, M., Rzanny, M., Boho, D., Wäldchen, J., Mäder, P.: Image-based classification of plant genus and family for trained and untrained plant species. BMC Bioinformatics 20(1), 1–13 (2019)
33. Shin, J.H., et al.: Ultrasonography diagnosis and imaging-based management of thyroid nodules: revised Korean Society of Thyroid Radiology consensus statement and recommendations. Korean J. Radiol. 17(3), 370–395 (2016)
34. Siegel, R.L., Miller, K.D., Fuchs, H.E., Jemal, A.: Cancer statistics, 2022. CA Cancer J. Clin. 72(1), 7–33 (2022)
35. Stewart, R., Leang, Y.J., Bhatt, C.R., Grodski, S., Serpell, J., Lee, J.C.: Quantifying the differences in surgical management of patients with definitive and indeterminate thyroid nodule cytology. Eur. J. Surg. Oncol. 46(2), 252–257 (2020)
36. Suteau, V., Munier, M., Briet, C., Rodien, P.: Sex bias in differentiated thyroid cancer. Int. J. Mol. Sci. 22(23), 12992 (2021)
37. Taye, M.M.: Theoretical understanding of convolutional neural network: concepts, architectures, applications, future directions. Computation 11(3), 52 (2023)
38. Tessler, F.N., Middleton, W.D., Grant, E.G.: Thyroid imaging reporting and data system (TI-RADS): a user's guide. Radiology 287(1), 29–36 (2018)
39. Wang, Y., et al.: Comparison study of radiomics and deep learning-based methods for thyroid nodules classification using ultrasound images. IEEE Access 8, 52010–52017 (2020)
40. Wu, J.: Introduction to convolutional neural networks. National Key Lab for Novel Software Technology, Nanjing University, China 5(23), 495 (2017)
41. Yang, W., et al.: DScGANS: integrate domain knowledge in training dual-path semi-supervised conditional generative adversarial networks and S3VM for ultrasonography thyroid nodules classification. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11767, pp. 558–566. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32251-9_61
42. Zhang, G., Berardi, V.L.: An investigation of neural networks in thyroid function diagnosis. Health Care Manag. Sci. 1, 29–37 (1998)
43. Zhang, Q., Wang, H., Lu, H., Won, D., Yoon, S.W.: Medical image synthesis with generative adversarial networks for tissue recognition. In: 2018 IEEE International Conference on Healthcare Informatics (ICHI), pp. 199–207. IEEE (2018)
44. Zhu, L.C., et al.: A model to discriminate malignant from benign thyroid nodules using artificial neural network. PLoS ONE 8(12), e82211 (2013)
Bio-Inspired Clustering: An Ensemble Method for User-Based Collaborative Filtering Luong Vuong Nguyen1 , Tri-Hai Nguyen2 , Ho-Trong-Nguyen Pham1 , Quoc-Trinh Vo1 , Huu-Thanh Duong3 , and Tram-Anh Nguyen-Thi4(B) 1
Department of Artificial Intelligence, FPT University, Danang 550000, Vietnam
{vuongnl3,nguyenpht,trinhvq}@fe.edu.vn
2 Department of Computer Science and Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
[email protected]
3 Faculty of Information Technology, Ho Chi Minh City Open University, Ho Chi Minh City 700000, Vietnam
[email protected]
4 Faculty of Fundamental Studies, Ho Chi Minh City Open University, Ho Chi Minh City 700000, Vietnam
[email protected]

Abstract. Clustering techniques are used to group users to enhance the recommendation-generating process of collaborative filtering systems. Collaborative filtering-based approaches are commonly used for generating similarity-based recommendations. While traditional clustering methods are often used for user clustering, there is a need to explore bio-inspired clustering techniques to improve recommendation generation. This article proposes a bio-inspired clustering collaborative filtering (BICCF) approach as an ensemble method for recommendation systems. The approach utilizes swarm intelligence to improve the accuracy of recommendations for user-based collaborative filtering. Real-world datasets from MovieLens are used to conduct experiments that evaluate the effectiveness of the proposed method. Results show significant improvements in accuracy and efficiency based on Recall, Precision, and MAE metrics compared to other baseline methods.

Keywords: Recommendation System · Collaborative Filtering · Swarm Intelligence · User Clustering

1 Introduction
Recommendation systems are widely used as a decision-support technique to address the information overload that users suffer due to the rapid growth of internet technologies in recent years [9,11,13]. To generate similarity-based suggestions, collaborative filtering-based algorithms are frequently used among the numerous recommendation systems [8,10,12].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 26–35, 2023. https://doi.org/10.1007/978-3-031-46573-4_3

Collaborative filtering is famous for generating personalized recommendations in various domains,
including e-commerce, social media, and entertainment. Collaborative filtering systems rely on user interactions with items, such as ratings, reviews, and purchases, to identify user similarities and recommend items based on similar users' preferences [15]. User-based collaborative filtering is a common technique that groups users with similar preferences into clusters to generate more accurate recommendations for each user. In this context, clustering algorithms have been widely used to improve the accuracy and efficiency of user-based collaborative filtering. Clustering is the process of grouping similar objects or entities based on a similarity measure. In user-based collaborative filtering, clustering algorithms group users with similar preferences based on their interactions with items. By identifying groups of users with similar preferences, clustering can improve the accuracy of recommendations by generating personalized recommendations based on the preferences of users in the same cluster [5]. Despite these benefits, clustering in user-based collaborative filtering has several limitations. One limitation is the choice of an appropriate clustering algorithm and the determination of optimal parameters. Many clustering algorithms are available, and the choice of the most appropriate one depends on the data's characteristics and the recommendation system's specific requirements. Additionally, determining optimal parameters, such as the number of clusters, can be challenging and may require trial and error. Another limitation of clustering in user-based collaborative filtering is data sparsity. In many real-world recommendation systems, the number of interactions between users and items is limited, which can result in sparse data. Sparse data can make it difficult to accurately group users with similar preferences, leading to inaccurate or biased cluster assignments [1]. 
This limitation can be particularly challenging for user-based collaborative filtering, as the accuracy of the recommendations depends on the accuracy of the cluster assignments. In this study, we propose a new clustering approach called Bio-Inspired Clustering Collaborative Filtering (BICCF) for recommendation services. Our proposed model consists of three functions: clustering users, predicting user preferences, and generating recommendations. Specifically, the proposed model combines bio-inspired methods to cluster the users in the individual dataset with a statistical ensemble model to obtain the results. Then, BICCF determines the neighborhood for the active target user by embedding the active user in the most similar cluster. Based on the current neighbors in the detected cluster, BICCF predicts ratings to develop the top-N recommendations for the target user. In this way, the BICCF model improves the quality of recommendations by generating more accurate and diverse suggestions. Finally, we evaluated the effectiveness of the proposed method by comparing it with other state-of-the-art clustering methods, such as GAKM, K-Means, SOM, PCA-G, PCA-K, PCA-S, and UPCC, in terms of accuracy and efficiency, using MovieLens datasets. The remainder of this manuscript is organized as follows. Section 2 overviews the related work. Section 3 describes a bio-inspired clustering model for user-based CF recommendation services. The experimental evaluation and discussion are detailed in Sect. 4. Finally, we conclude the work in Sect. 5.
2 Related Work
User-based collaborative filtering is a widely used technique in recommender systems, which recommends items to users based on the preferences of similar users [14]. One of the critical challenges in user-based collaborative filtering is to improve the accuracy of user clustering, which is the process of grouping users into clusters based on their similarities. Several recent studies have proposed various methods to improve user clustering in user-based collaborative filtering. This section reviews recent studies that improve user-based CF by applying bio-inspired algorithms. Bio-inspired clustering is a promising technique in user-based collaborative filtering that has gained attention recently. The technique is based on the observation that natural systems often exhibit efficient and effective clustering mechanisms that can be applied to various clustering problems, including recommendation systems [20]. One of the critical benefits of bio-inspired clustering is its ability to handle large and complex datasets by reducing the number of data points while preserving the underlying structure of the data. Several studies have proposed bio-inspired clustering algorithms in user-based collaborative filtering, showing promising results in improving recommendation accuracy, scalability, and robustness. Among the first such studies, Liu et al. [7] introduced a hybrid particle swarm optimization (PSO) and K-means clustering algorithm to improve recommendation accuracy. The algorithm used PSO to optimize the initial cluster centers and then applied K-means clustering to refine the clustering results. The experimental results showed that the proposed algorithm outperformed traditional clustering algorithms in terms of recommendation accuracy. Another bio-inspired clustering algorithm proposed in user-based collaborative filtering is the bat algorithm (BA) introduced by Vellaichamy et al. [19]. 
The algorithm is inspired by the echolocation behavior of bats and uses the concept of frequency tuning to guide the clustering process. The experimental results showed that the BA outperformed traditional clustering algorithms in terms of recommendation accuracy and convergence speed. In addition to PSO and BA, several other bio-inspired algorithms have been proposed for user-based collaborative filtering, including ant colony optimization (ACO) [6], artificial bee colony (ABC) [4], and grey wolf optimization (GWO) [18]. These algorithms have shown promising results in improving recommendation accuracy and scalability. Furthermore, some studies have combined bio-inspired clustering algorithms with other techniques, such as feature selection and dimensionality reduction, to further improve recommendation accuracy and scalability. For example, Sadeghi et al. [17] introduced a hybrid algorithm combining PSO, K-means clustering, and principal component analysis (PCA). The algorithm used PSO to optimize the initial cluster centers, K-means clustering to refine the clustering results, and PCA to reduce the dimensionality of the data. The experimental results showed that the proposed algorithm outperformed traditional clustering algorithms in terms of recommendation accuracy, scalability, and robustness.
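As a concrete illustration of the swarm-update rule behind these PSO-based hybrids, the sketch below (our own toy example, not the cited implementations) minimizes a one-dimensional objective that stands in for a clustering fitness function:

```python
import random

# Toy PSO sketch: each particle searches for the minimum of a simple
# one-dimensional objective, standing in for a clustering fitness
# function. All parameter values here are illustrative assumptions.
random.seed(1)

def objective(x):
    return (x - 3.0) ** 2       # minimum at x = 3

n = 10
pos = [random.uniform(-10, 10) for _ in range(n)]
vel = [0.0] * n
pbest = pos[:]                                  # personal best positions
gbest = min(pos, key=objective)                 # global best position

w, c1, c2 = 0.7, 1.5, 1.5                       # inertia / acceleration weights
for _ in range(100):
    for i in range(n):
        r1, r2 = random.random(), random.random()
        vel[i] = (w * vel[i]
                  + c1 * r1 * (pbest[i] - pos[i])   # pull toward own best
                  + c2 * r2 * (gbest - pos[i]))     # pull toward swarm best
        pos[i] += vel[i]
        if objective(pos[i]) < objective(pbest[i]):
            pbest[i] = pos[i]
        if objective(pos[i]) < objective(gbest):
            gbest = pos[i]

print(round(gbest, 2))   # the swarm settles near the optimum x = 3
```

In the hybrid methods above, the particle position would be a full set of cluster centers and the objective a clustering quality measure; the update rule is the same.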
3 Bio-Inspired Clustering Model for User-Based Collaborative Filtering (BICCF)
Algorithm 1. The Bio-Inspired Clustering Collaborative Filtering (BICCF)

Input: X, k, Max_i, Con_thres, C_rate, M_rate
Output: Cluster assignments for each user

Cen ← random selection of k users from X
Best(f) ← 0; Best(Clu) ← None
for i in range(Max_i) do
    Clu ← assign-clusters(X, Cen)
    Cen ← update-centroids(X, Clu)
    f ← calculate-fitness(X, Clu)
    if max(f) > Best(f) then
        Best(f) ← max(f); Best(Clu) ← Clu
    end
    Cen_parent ← the k/2 best centroids sorted by fitness
    Cen_new ← []
    for j in range(k/2) do
        Cen1, Cen2 ← random selection of 2 centroids from Cen_parent
        Cen1_new, Cen2_new ← crossover(Cen1, Cen2, C_rate)
        Cen1_new ← mutate(Cen1_new, M_rate)
        Cen2_new ← mutate(Cen2_new, M_rate)
        Cen_new.append(Cen1_new); Cen_new.append(Cen2_new)
    end
    Cen_new ← np.array(Cen_new)
    f_new ← calculate-fitness(Cen_new, Clu_assign(X, Cen_new))
    if max(f_new) > Best(f) then
        Best(f) ← max(f_new)
        Best(Clu) ← Clu_assign(X, Cen_new[np.argmax(f_new)])
    end
    if max(f_new) - Best(f) < Con_thres then
        return Best(Clu)
    end
end
return Best(Clu)
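Algorithm 1 can be sketched as a small runnable program. The following is our own simplified reading of the pseudocode, not the authors' implementation: the fitness is taken to be the negative within-cluster scatter, and the genetic step is reduced to a single crossover-and-mutation trial per iteration, applied to a tiny hand-made rating matrix:

```python
import random

# Simplified, runnable reading of Algorithm 1: k-means assignment and
# centroid updates, plus one crossover-and-mutation trial per iteration.
# Fitness is the negative within-cluster scatter (higher is better).

def assign_clusters(X, cens):
    def d2(u, c):
        return sum((a - b) ** 2 for a, b in zip(u, c))
    return [min(range(len(cens)), key=lambda j: d2(u, cens[j])) for u in X]

def update_centroids(X, clu, k):
    cens = []
    for j in range(k):
        members = [X[i] for i in range(len(X)) if clu[i] == j]
        if not members:                      # re-seed an empty cluster
            cens.append(list(random.choice(X)))
        else:
            cens.append([sum(col) / len(members) for col in zip(*members)])
    return cens

def fitness(X, clu, cens):
    return -sum(sum((a - b) ** 2 for a, b in zip(X[i], cens[clu[i]]))
                for i in range(len(X)))

def biccf_cluster(X, k=2, max_iter=20, c_rate=0.5, m_rate=0.1):
    cens = [list(u) for u in random.sample(X, k)]
    best_f, best_clu = float("-inf"), None
    for _ in range(max_iter):
        clu = assign_clusters(X, cens)
        cens = update_centroids(X, clu, k)
        f = fitness(X, clu, cens)
        if f > best_f:
            best_f, best_clu = f, clu
        # genetic step: uniform crossover of two centroids, then mutation
        p1, p2 = random.sample(cens, 2)
        child = [a if random.random() < c_rate else b for a, b in zip(p1, p2)]
        child = [v + random.gauss(0, m_rate) for v in child]
        trial = cens[:-1] + [child]
        if fitness(X, assign_clusters(X, trial), trial) > best_f:
            cens = trial                     # keep the improved centroids
    return best_clu

random.seed(42)
# toy rating matrix: two obvious taste groups over three items
X = [[5, 4, 1], [4, 5, 1], [1, 1, 5], [2, 1, 4]]
labels = biccf_cluster(X, k=2)
print(labels)   # users 0-1 share one cluster, users 2-3 the other
```

On this toy matrix the k-means core alone recovers the two taste groups; the genetic trial only replaces the centroids when it strictly improves the best fitness seen so far, mirroring the acceptance test in Algorithm 1.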
This section introduces a new approach called Bio-Inspired Clustering Collaborative Filtering (BICCF) for recommendation systems. The BICCF technique applies bio-inspired methods to cluster the users in the dataset individually and then combines the results using a statistical ensemble model. Afterward, the BICCF recommendation system performs neighborhood searches for the active target user and places them in the most similar cluster. Using the current neighbors of the active user in the cluster, the BICCF model estimates ratings and
generates a list of top-N recommendations for the active user. The BICCF model comprises three main stages: user clustering, preference prediction, and recommendation generation, which are explained in detail as follows (Table 1).

Table 1. Notations

Notation     Description
X            User-item rating matrix extracted from the dataset
k            Number of clusters
Max_i        Maximum number of iterations
Con_thres    Convergence threshold
C_rate       Crossover rate
M_rate       Mutation rate
Cen          Centroid
Clu          Cluster
f            Fitness
Best(f)      Best fitness
Best(Clu)    Best clusters
– Clustering: This stage aims to produce user clusters based on similarity using the provided data. User clustering in BICCF involves K-PSO [16], FCM-PSO [2], and K-MWO [3]. Clustering is performed by grouping users so as to minimize dissimilarity among elements. The BICCF method is designed to execute the clustering algorithms simultaneously and individually to create a similarity matrix.
– Prediction: After generating user clusters, the next step is to predict suitable neighbors for the active target user by computing similarity. A similarity computation function extracts user correlation information to determine the neighborhood for the prediction phase.
– Recommendation: Once the unknown ratings for the active target user have been predicted, the BICCF method generates a personalized list of top-N recommendations. Organizing the relevant items in the final list of recommendations improves the BICCF model's performance, enhancing user satisfaction.

The proposed BICCF is depicted in Algorithm 1, which aims to cluster users with similar preferences based on their rating patterns over a set of items. The algorithm takes as input a user-item rating matrix X, which contains the rating that each user has given to each item, as well as several hyperparameters: the number of clusters k, the maximum number of iterations Max_i, the convergence threshold Con_thres, the crossover rate C_rate, and the mutation rate M_rate. First, k users are randomly selected from the matrix X to serve as the initial centroids of the k clusters. The algorithm then assigns each user to the nearest centroid Cen based on their rating pattern. The centroids Cen are updated for each cluster
Clu by calculating the mean rating pattern of each Clu. Next, the algorithm calculates the fitness f of each Clu based on how well it captures the rating patterns of the users assigned to it. Then, the k/2 best centroids Cen are selected based on their f. The chosen centroids are paired randomly and undergo a crossover process to generate two new centroids, Cen1_new and Cen2_new, each containing genetic information from both parents. The new centroids Cen_new are subjected to a mutation process with probability M_rate, which introduces random changes to the genetic information. If the difference between the Best(f) score in the new population and the current best solution is less than the convergence threshold Con_thres, the algorithm terminates and returns the best clusters found so far.
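The clustering stage described above can be sketched as follows. This is a simplified, hypothetical rendering of the GA-style loop — the Euclidean distance, uniform crossover, and Gaussian mutation choices are our assumptions, not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def biccf_cluster(X, k=4, max_iter=50, con_thres=1e-4, c_rate=0.9, m_rate=0.1):
    """Sketch of the GA-style clustering loop.
    X: user-item rating matrix (users x items)."""
    n_users = X.shape[0]
    # Randomly select k users from X as the initial centroids.
    cen = X[rng.choice(n_users, size=k, replace=False)].astype(float)
    best_f = -np.inf
    for _ in range(max_iter):
        # Assign each user to the nearest centroid by rating pattern.
        dist = np.linalg.norm(X[:, None, :] - cen[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update each centroid as the mean rating pattern of its cluster.
        for c in range(k):
            members = X[labels == c]
            if len(members):
                cen[c] = members.mean(axis=0)
        # Fitness: how tightly a cluster captures its users' rating patterns.
        f = np.array([-dist[labels == c, c].sum() for c in range(k)])
        # Keep the k/2 fittest centroids as parents.
        parents = cen[np.argsort(f)[-(k // 2):]]
        children = []
        for _ in range(k - len(parents)):
            p1, p2 = parents[rng.choice(len(parents), 2, replace=True)]
            # Uniform crossover with probability c_rate, then random mutation.
            mask = rng.random(p1.shape) < c_rate
            child = np.where(mask, p1, p2)
            child += (rng.random(child.shape) < m_rate) * rng.normal(0, 0.1, child.shape)
            children.append(child)
        cen = np.vstack([parents] + children) if children else parents
        # Terminate when the best fitness stops improving.
        if abs(f.max() - best_f) < con_thres:
            break
        best_f = f.max()
    return labels, cen
```

In BICCF the loop would be run once per swarm variant (K-PSO, FCM-PSO, K-MWO) and the resulting partitions combined into a consensus similarity matrix.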
4 Experiments and Results

4.1 Setting
We deployed the experiments on the Movielens 100k¹ dataset, a widely used benchmark dataset for collaborative filtering and recommendation systems. This dataset was collected by GroupLens and contains approximately 100,000 movie ratings from about 1,000 users on 1,700 movies. The proposed bio-inspired clustering method is compared with other methods, namely GAKM, K-Means, SOM, PCA-G, PCA-K, PCA-S, and UPCC [20]. All baselines in our experiments are as follows.
– GAKM: a clustering algorithm that uses Gaussian Adaptive K-Means (GAKM) to partition data points into clusters based on their similarity.
– K-Means: a popular clustering algorithm that partitions data points into k clusters based on their distance to the centroid of each cluster.
– SOM: a clustering algorithm that uses Self-Organizing Maps (SOM) to generate initial clusters and applies hierarchical clustering to refine the clusters.
– PCA-G: a clustering algorithm that combines Principal Component Analysis (PCA) and GAKM to improve clustering performance.
– PCA-K: a clustering algorithm that combines PCA and K-Means to reduce the dimensionality of the data before clustering.
– PCA-S: a clustering algorithm that combines PCA and SOM to cluster high-dimensional data.
– UPCC: a clustering algorithm that uses User Profile Clustering to group similar users based on their behavior or preferences.

4.2 Evaluation
We use the Precision and Recall metrics. Precision measures the proportion of correctly predicted items among all predicted results, while, in the context of clustering, Recall measures the completeness of the clustering algorithm in identifying all similar data points.

¹ https://grouplens.org/datasets/movielens.
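As a concrete illustration of these two metrics in a top-N recommendation setting, the following sketch computes per-user Precision and Recall. The function name and the relevance convention (held-out items the user actually liked) are our own assumptions, not taken from the paper:

```python
def precision_recall_at_n(recommended, relevant):
    """recommended: list of item ids returned for a user (the top-N list);
    relevant: set of item ids the user actually liked in the test split."""
    hits = len(set(recommended) & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 2 of 5 recommended items are relevant, out of 4 relevant items.
p, r = precision_recall_at_n([10, 11, 12, 13, 14], {11, 14, 20, 21})
# p = 2/5 = 0.4, r = 2/4 = 0.5
```

Averaging these per-user values over all test users yields the table entries reported below.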
Table 2. Precision and Recall for clustering methods with various k-clusters

Method               k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9
GAKM     Precision  0.341  0.322  0.312  0.341  0.335  0.331  0.322  0.325
         Recall     0.111  0.181  0.252  0.291  0.352  0.452  0.521  0.632
K-Means  Precision  0.111  0.141  0.122  0.111  0.121  0.112  0.115  0.112
         Recall     0.041  0.051  0.062  0.072  0.081  0.085  0.091  0.095
SOM      Precision  0.349  0.341  0.325  0.331  0.312  0.325  0.321  0.312
         Recall     0.101  0.151  0.202  0.252  0.301  0.402  0.502  0.601
PCA-G    Precision  0.401  0.391  0.382  0.379  0.371  0.362  0.359  0.355
         Recall     0.111  0.131  0.201  0.302  0.401  0.502  0.601  0.671
PCA-K    Precision  0.181  0.162  0.155  0.152  0.142  0.135  0.131  0.122
         Recall     0.061  0.071  0.081  0.141  0.151  0.171  0.181  0.201
PCA-S    Precision  0.382  0.392  0.382  0.361  0.351  0.348  0.331  0.326
         Recall     0.111  0.132  0.251  0.302  0.323  0.402  0.451  0.651
UPCC     Precision  0.312  0.302  0.295  0.291  0.285  0.281  0.271  0.251
         Recall     0.101  0.131  0.181  0.201  0.303  0.403  0.453  0.482
BICCF    Precision  0.422  0.411  0.401  0.411  0.408  0.407  0.405  0.409
         Recall     0.201  0.252  0.299  0.351  0.392  0.442  0.611  0.641
Fig. 1. The evaluation results between the proposed BICCF with different methods for diverse values of k-clusters in terms of P recision and Recall metrics
To calculate the Precision and Recall, the neighbor size was fixed at 30, and the number of clusters was varied over {2, 3, 4, 5, 6, 7, 8, 9}. The experimental results are shown in Table 2, in which a higher value indicates that the clustering method performs better. Figure 1 presents the comparison between the proposed BICCF method and the baselines in terms of Precision and Recall, respectively. As shown in these figures, Precision reaches its highest value of 0.422 when the number of clusters is 2, while Recall reaches its highest value of 0.641 when the number of clusters is 9. This shows that Precision and Recall conflict: Precision decreased while Recall increased as the number of clusters grew.

Table 3. MAE for distinct clustering methods with various standards of k neighbors

#neighbors  GAKM   K-Means  SOM    PCA-G  PCA-K  PCA-S  UPCC   BICCF
05          0.815  0.825    0.819  0.791  0.851  0.821  0.825  0.773
10          0.805  0.821    0.811  0.771  0.845  0.791  0.825  0.764
15          0.804  0.818    0.811  0.771  0.841  0.791  0.824  0.764
20          0.804  0.815    0.811  0.781  0.842  0.791  0.828  0.771
25          0.804  0.815    0.812  0.781  0.841  0.801  0.824  0.781
30          0.804  0.814    0.805  0.785  0.841  0.805  0.824  0.784
35          0.803  0.813    0.811  0.786  0.841  0.806  0.825  0.782
40          0.803  0.812    0.811  0.788  0.841  0.807  0.825  0.787
Fig. 2. The evaluation results of the proposed BICCF and baseline methods for each number of neighbors (k) in terms of the MAE metric.
Mean Absolute Error (MAE) is a metric used to evaluate prediction accuracy. Hence, to evaluate the performance of the proposed BICCF, the MAE was calculated with different numbers of neighbors. Table 3 shows the MAE of all methods for neighbor counts in the set {5, 10, 15, 20, 25, 30, 35, 40}. In this table, a lower MAE value indicates that the method generates more accurate recommendations. A comparison of the baseline methods with BICCF is presented in Fig. 2. BICCF has the lowest MAE of 0.764 when the number of neighbors is 10 or 15, while all baselines obtain MAE values above 0.771. These experimental results demonstrate that our proposed method consistently improves upon the baselines.
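For reference, MAE over a set of predicted and observed ratings is simply the mean of the absolute errors; this small helper (our own illustrative sketch, not the authors' code) shows the computation:

```python
def mae(predicted, actual):
    """Mean Absolute Error between predicted and observed ratings."""
    assert len(predicted) == len(actual) and len(predicted) > 0
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Example: average absolute rating error over four predictions.
err = mae([3.5, 4.0, 2.0, 5.0], [4.0, 4.0, 3.0, 4.0])  # (0.5 + 0 + 1 + 1) / 4 = 0.625
```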
5 Conclusions
In this paper, we presented a model-based collaborative filtering approach inspired by swarm intelligence algorithms. The proposed BICCF model executes the clustering algorithms FCM-PSO, K-PSO, and K-MWO to create similarity matrices. These matrices are then combined to construct a consensus similarity matrix. Based on this consensus matrix, the prediction step in BICCF determines the neighborhood by extracting user correlation information to obtain the suitable cluster for the active user. Finally, BICCF generates the predicted ratings for items by relying on the neighborhood users' ratings. We evaluated the performance of BICCF by running the model on the MovieLens dataset and comparing it with other baselines. The experimental results on Precision, Recall, and MAE demonstrate that BICCF outperforms the other clustering algorithms. For future work, we plan to extend the clustering task beyond grouping similar users by incorporating additional features, such as contextual, cognitive, and localized information, to obtain more accurate user clusters.
References
1. Chen, J., Zhao, C., Uliji, Chen, L.: Collaborative filtering recommendation algorithm based on user correlation and evolutionary clustering. Complex Intell. Syst. 6(1), 147–156 (2019). https://doi.org/10.1007/s40747-019-00123-5
2. Izakian, H., Abraham, A.: Fuzzy C-means and fuzzy swarm for fuzzy clustering problem. Exp. Syst. Appl. 38(3), 1835–1838 (2011). https://doi.org/10.1016/j.eswa.2010.07.112
3. Kang, Q., Liu, S., Zhou, M., Li, S.: A weight-incorporated similarity-based clustering ensemble method based on swarm intelligence. Knowl.-Based Syst. 104, 156–164 (2016). https://doi.org/10.1016/j.knosys.2016.04.021
4. Katarya, R.: Movie recommender system with metaheuristic artificial bee. Neural Comput. Appl. 30(6), 1983–1990 (2018). https://doi.org/10.1007/s00521-017-3338-4
5. Ko, H., Lee, S., Park, Y., Choi, A.: A survey of recommendation systems: recommendation models, techniques, and application fields. Electronics 11(1), 141 (2022). https://doi.org/10.3390/electronics11010141
6. Kumar, M.S., Prabhu, J.: A hybrid model collaborative movie recommendation system using k-means clustering with ant colony optimisation. Int. J. Internet Technol. Secured Trans. 10(3), 337 (2020). https://doi.org/10.1504/ijitst.2020.107079
7. Liu, S., Zou, Y.: An improved hybrid clustering algorithm based on particle swarm optimization and K-means. IOP Conf. Ser. Mater. Sci. Eng. 750, 012152 (2020). https://doi.org/10.1088/1757-899x/750/1/012152
8. Nguyen, L.V., Hong, M.S., Jung, J.J., Sohn, B.S.: Cognitive similarity-based collaborative filtering recommendation system. Appl. Sci. 10(12), 4183 (2020). https://doi.org/10.3390/app10124183
9. Nguyen, L.V., Jung, J.J.: Crowdsourcing platform for collecting cognitive feedbacks from users: a case study on movie recommender system. In: Pham, H. (ed.) Reliability and Statistical Computing. SSRE, pp. 139–150. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-43412-0_9
10. Nguyen, L.V., Jung, J.J.: SABRE: cross-domain crowdsourcing platform for recommendation services. In: Braubach, L., Jander, K., Bădică, C. (eds.) Intelligent Distributed Computing XV, pp. 213–223. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-29104-3_24
11. Nguyen, L.V., Jung, J.J., Hwang, M.: OurPlaces: cross-cultural crowdsourcing platform for location recommendation services. ISPRS Int. J. Geo-Inf. 9(12), 711 (2020). https://doi.org/10.3390/ijgi9120711
12. Nguyen, L.V., Nguyen, T.H., Jung, J.J.: Content-based collaborative filtering using word embedding. In: Proceedings of the International Conference on Research in Adaptive and Convergent Systems, pp. 96–100. ACM, October 2020. https://doi.org/10.1145/3400286.3418253
13. Nguyen, L.V., Nguyen, T.H., Jung, J.J.: Tourism recommender system based on cognitive similarity between cross-cultural users. In: Intelligent Environments 2021. Ambient Intelligence and Smart Environments, vol. 29, pp. 225–232. IOS Press, June 2021. https://doi.org/10.3233/aise210101
14. Nguyen, L.V., Nguyen, T.H., Jung, J.J., Camacho, D.: Extending collaborative filtering recommendation using word embedding: a hybrid approach. Concurrency Comput. Pract. Exp. 35(16), e6232 (2023). https://doi.org/10.1002/cpe.6232
15. Nguyen, L.V., Vo, Q.T., Nguyen, T.H.: Adaptive KNN-based extended collaborative filtering recommendation services. Big Data Cogn. Comput. 7(2), 106 (2023). https://doi.org/10.3390/bdcc7020106
16. Pei, Z., Hua, X., Han, J.: The clustering algorithm based on particle swarm optimization algorithm. In: 2008 International Conference on Intelligent Computation Technology and Automation (ICICTA), vol. 1, pp. 148–151. IEEE, October 2008. https://doi.org/10.1109/icicta.2008.421
17. Sadeghi, M., Dehkordi, M.N., Barekatain, B., Khani, N.: Improve customer churn prediction through the proposed PCA-PSO-K means algorithm in the communication industry. J. Supercomput. 79(6), 6871–6888 (2022). https://doi.org/10.1007/s11227-022-04907-4
18. Sivaramakrishnan, N., Subramaniyaswamy, V., Ravi, L., Vijayakumar, V., Gao, X.Z., Sri, S.R.: An effective user clustering-based collaborative filtering recommender system with grey wolf optimisation. Int. J. Bio-Inspired Comput. 16(1), 44 (2020). https://doi.org/10.1504/ijbic.2020.108999
19. Vellaichamy, V., Kalimuthu, V.: Hybrid collaborative movie recommender system using clustering and bat optimization. Int. J. Intell. Eng. Syst. 10(5), 38–47 (2017). https://doi.org/10.22266/ijies2017.1031.05
20. Wang, Z., Yu, X., Feng, N., Wang, Z.: An improved collaborative movie recommendation system using computational intelligence. J. Vis. Lang. Comput. 25(6), 667–675 (2014). https://doi.org/10.1016/j.jvlc.2014.09.011
Deep Reinforcement Learning-Based Sum-Rate Maximization for Uplink Multi-user SIMO-RSMA Systems

Thanh Phung Truong1, Tri-Hai Nguyen2, Anh-Tien Tran1, Si Van-Tien Tran3, Van Dat Tuong1, Luong Vuong Nguyen4, Woongsoo Na5, Laihyuk Park2, and Sungrae Cho1(B)

1 School of Computer Science and Engineering, Chung-Ang University, Seoul, Korea
{tptruong,attran,vdtuong}@uclab.re.kr, [email protected]
2 Department of Computer Science and Engineering, Seoul National University of Science and Technology, Seoul, Korea
{haint93,lhpark}@seoultech.ac.kr
3 School of Architecture Engineering, Chung-Ang University, Seoul, Korea
[email protected]
4 Department of Artificial Intelligence, FPT University, Danang, Vietnam
[email protected]
5 Department of Computer Science and Engineering, Kongju National University, Seoul, Korea
[email protected]
Abstract. This research aims to investigate a sum-rate maximization problem in uplink multi-user single-input multiple-output (SIMO) rate splitting multiple access (RSMA) systems. In these systems, Internet of Things devices (IoTDs) act as single-antenna nodes transmitting data to the multiple-antenna base station (BS) utilizing the RSMA technique. The optimization process includes determining the transmit powers of the IoTDs, the decoding order, and the receive beamforming vector at the BS. To achieve this goal, the problem is transformed into a deep reinforcement learning (DRL) framework, where a post-actor processing stage is proposed and a deep deterministic policy gradient (DDPG)-based approach is applied to tackle the issue. Via simulation results, we show that the proposed approach outperforms some benchmark schemes.

Keywords: SIMO-RSMA · DRL · uplink · sum-rate maximization

1 Introduction
The upcoming high-data-rate and ultra-low-latency applications in beyond-fifth-generation (B5G) and sixth-generation (6G) communications, such as augmented reality, connected vehicles, and various Internet of Things (IoT) applications, require even higher throughput with more stringent latency constraints compared to fifth-generation (5G) networks. To meet these demands, rate-splitting multiple access (RSMA), which is a generalization of non-orthogonal
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 36–45, 2023. https://doi.org/10.1007/978-3-031-46573-4_4
multiple access and space-division multiple access [1], has emerged as a cutting-edge, comprehensive, and potent framework for designing multiple access control in future wireless networks [2]. Moreover, multi-antenna systems are considered crucial for improving the spectrum efficiency of wireless communication [3]. In particular, single-input multiple-output (SIMO) technology plays a significant role in wireless communication, as it utilizes multiple antennas at the receiver to enhance data throughput considerably. Thus, combining multiple access techniques with multi-antenna systems presents attractive research opportunities. Several recent studies have investigated the downlink transmission of RSMA in conjunction with multiple-input multiple-output (MIMO) systems [4–6]. For instance, to mitigate the negative effects of residual errors and polarization interference from imperfect successive interference cancellation (SIC) in a massive MIMO-RSMA network, the work [4] proposed three dual-polarized downlink transmission schemes. In another study, novel precoding and decoding schemes were proposed to enhance the downlink MIMO-RSMA system using successive null-space precoding with a non-convex weighted sum-rate optimization problem [5]. The power allocation optimization problem for RSMA was considered in [6] for the performance evaluation of RSMA in massive MIMO networks. Optimal power allocation for RSMA was also examined in [7], where the authors utilized a deep reinforcement learning (DRL) algorithm to solve the problem. DRL is a sub-field of machine learning that involves training artificial agents to make decisions based on rewards and punishments received from their environment. DRL has been applied in various ways to optimize wireless communication system performance and efficiency [8–10]. However, uplink transmission in an integrated system of RSMA with SIMO technology remains a challenging research issue.
Therefore, in this investigation, we aim to study an uplink multi-user SIMO-RSMA system and maximize its achievable sum-rate by optimizing the IoTDs' transmit powers, decoding orders, and the receive beamforming vector at the base station (BS). The major contributions of this paper are as follows:
– We examine a sum-rate maximization problem for uplink transmission in the multi-user SIMO-RSMA system, where many single-antenna IoTDs simultaneously transmit data to a multiple-antenna BS with the assistance of the RSMA technique.
– We propose a DRL framework to solve the problem. Here, we define a post-actor processing stage to guarantee the variable constraints and apply the DDPG algorithm to resolve them.
– We evaluate the system's performance via simulations. The results demonstrate the proposed approach's effectiveness, as it outperforms other benchmark schemes.
Fig. 1. Uplink multi-user SIMO-RSMA communication.
2 DRL-Based Sum-Rate Maximization for Uplink Multi-user SIMO-RSMA Framework

2.1 System Model and Problem Formulation
We consider the uplink multi-user SIMO-RSMA communication model illustrated in Fig. 1, where a BS equipped with L antennas serves K single-antenna IoT devices (IoTDs). The transmission from the IoTDs to the BS is enhanced by applying the RSMA technique, where each transmitted signal is split into two sub-signals [11–13]. Then, the k-th IoTD's transmitted signal s_k is denoted as

s_k = \sum_{i=1}^{2} \sqrt{p_{k,i}} s_{k,i},   (1)
where s_{k,i} denotes the sub-signal i, i ∈ {1, 2}, of the signal s_k, and p_{k,i} ≥ 0 is the corresponding transmit power. Accordingly, the received signal vector at the BS is given by

y = \sum_{k=1}^{K} h_k \sum_{i=1}^{2} \sqrt{p_{k,i}} s_{k,i} + N,   (2)

where h_k ∈ C^{L×1} is the channel vector from the k-th IoTD to the BS, and N ∈ C^{L×1} represents the vector of additive white Gaussian noise (AWGN). Then, the soft-estimated received sub-signal ŝ_{k,i} can be obtained after applying a receive beamforming vector and decoding using SIC according to a decoding order, which can be represented as

\hat{s}_{k,i} = w^H y = \sum_{k=1}^{K} w^H h_k \sum_{i=1}^{2} \sqrt{p_{k,i}} s_{k,i} + n_0,   (3)
where w^H is the conjugate transpose of the receive beamforming vector w = [w_1, w_2, ..., w_L]^T ∈ C^{L×1} with unit norm, i.e., \|w\|^2 = 1, and n_0 is the AWGN value. Consequently, denoting π_{k,i} as the decoding order of sub-signal s_{k,i}, the achievable rate of s_{k,i} is calculated by

r_{k,i} = B \log_2 \left( 1 + \frac{|w^H h_k|^2 p_{k,i}}{\sum_{\pi_{l,j} > \pi_{k,i}} |w^H h_l|^2 p_{l,j} + \sigma^2} \right),   (4)

where the power spectral density of the Gaussian noise is denoted by σ², π_{l,j} denotes the decoding order of sub-signal j of IoTD l, and π_{l,j} > π_{k,i} expresses that sub-signal s_{l,j} is decoded after sub-signal s_{k,i}. Therefore, we can calculate the achievable rate of IoTD k as

r_k = \sum_{i=1}^{2} r_{k,i}.   (5)
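For illustration, the rates in Eqs. (4)–(5) can be computed numerically as in the sketch below. The function and its default bandwidth and noise values are our own assumptions, not the paper's simulation settings:

```python
import numpy as np

def achievable_rates(h, w, p, order, B=1.0, sigma2=1e-3):
    """Per-sub-signal rates of Eq. (4), summed per IoTD as in Eq. (5).
    h: (K, L) channel vectors; w: (L,) unit-norm receive beamformer;
    p: (K, 2) sub-signal transmit powers; order: (K, 2) decoding order
    (a larger value means decoded later, hence less residual interference)."""
    K = h.shape[0]
    gains = np.abs(h @ w.conj()) ** 2          # |w^H h_k|^2 for each IoTD
    r = np.zeros((K, 2))
    for k in range(K):
        for i in range(2):
            # Interference from sub-signals decoded after (k, i) under SIC.
            interf = sum(gains[l] * p[l, j]
                         for l in range(K) for j in range(2)
                         if order[l, j] > order[k, i])
            r[k, i] = B * np.log2(1 + gains[k] * p[k, i] / (interf + sigma2))
    return r.sum(axis=1)                       # r_k = sum_i r_{k,i}
```

Summing the returned vector gives the system sum-rate used as the optimization objective below.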
The goal of our research is to maximize the achievable system sum-rate by optimizing the receive beamforming vector, the IoTDs' transmit powers, and the decoding order of the sub-signals. Denoting π = {π_{k,1}, π_{k,2}, k ∈ {1, 2, ..., K}} and p = {p_{k,1}, p_{k,2}, k ∈ {1, 2, ..., K}} as the decoding order and transmit power sets of all sub-signals, respectively, the optimization problem is formulated as

(P1)  \max_{w, p, π} \sum_{k=1}^{K} r_k   (6a)
s.t.  \|w\|^2 = 1,   (6b)
      p_{k,i} ≥ 0, k ∈ {1, 2, ..., K}, i ∈ {1, 2},   (6c)
      \sum_{i=1}^{2} p_{k,i} ≤ P_k^max,   (6d)

where P_k^max denotes the maximum transmit power of IoTD k. The mixture of the discrete decoding-order variables and the continuous variables in problem (6) makes it challenging to solve using traditional optimization techniques. Therefore, we transform the optimization problem into a DRL framework, which can be solved by applying an actor-critic algorithm, i.e., DDPG.

2.2 Proposed Deep Reinforcement Learning Framework
First, we transform the proposed problem into a DRL framework, where the BS plays the role of the agent and the environment is the whole network. According to the observed state, the agent decides the action to take in response to the environment and receives a reward at each time step. The state, action, and reward at time slot t are defined as follows.
– State space: The environment state at each time slot is the current set of channel vectors between the IoTDs and the BS, i.e., s(t) = {h_1(t), h_2(t), ..., h_K(t)}.
– Action space: The action is determined according to the optimization variables, including the receive beamforming vector, transmit powers, and decoding order, expressed as a(t) = {w(t), p(t), π(t)}.
– Reward: With the aim of maximizing the achievable system sum-rate, we form the reward as the sum of the current achievable rates of all IoTDs, given as r(t) = \sum_{k=1}^{K} r_k.
However, the designed actions have to satisfy the constraints in (6), which makes it challenging to solve the problem by applying DRL. Therefore, we propose a post-actor processing stage to ensure the action constraints. We apply the sigmoid function to the actions to normalize the action values to the range [0, 1], which satisfies (6c). The sigmoid function is calculated as

sigmoid(x) = \frac{1}{1 + e^{-x}}.   (7)
Then, the actions after applying the sigmoid function can be represented as

a'(t) = {w'(t), p'(t), π'(t)} = {w'_1(t), ..., w'_L(t), p'_{k,i}(t), π'_{k,i}(t)}, i ∈ {1, 2}, k ∈ {1, 2, ..., K}.   (8)
Proposition 1. Assuming that p*(t) = {p*_{k,1}(t), p*_{k,2}(t)} is the optimal transmit power of the k-th IoTD at time slot t, the constraint in (6d) can be satisfied by the following mapping function:

p*_{k,1}(t) = p'_{k,1}(t) P_k^max,   p*_{k,2}(t) = p'_{k,2}(t) (P_k^max − p*_{k,1}(t)).   (9)

Proof. The sum of the transmit powers of the k-th IoTD at time slot t is calculated as

p*_{k,1}(t) + p*_{k,2}(t) = p'_{k,1}(t) P_k^max + p'_{k,2}(t) (P_k^max − p*_{k,1}(t))
                         = p'_{k,1}(t) P_k^max + p'_{k,2}(t) P_k^max (1 − p'_{k,1}(t))
                         = P_k^max (p'_{k,1}(t) + p'_{k,2}(t) (1 − p'_{k,1}(t))).   (10)

With p'_{k,1}(t), p'_{k,2}(t) ∈ [0, 1] from the sigmoid function in (7), p'_{k,1}(t) + p'_{k,2}(t)(1 − p'_{k,1}(t)) ≤ 1. Then P_k^max (p'_{k,1}(t) + p'_{k,2}(t)(1 − p'_{k,1}(t))) ≤ P_k^max, which proves Proposition 1.

Proposition 2. Assuming that w*(t) = [w*_1(t), w*_2(t), ..., w*_l(t), ..., w*_L(t)]^T is the optimal receive beamforming vector, the constraint in (6b) can be satisfied by the following mapping function:

w*_l(t) = \frac{w'_l(t)}{\sqrt{\sum_{j \in \{1,2,...,L\}} |w'_j(t)|^2}}.   (11)
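Taken together, the sigmoid step in (7) and the mappings of Propositions 1 and 2 amount to a short post-actor processing routine. The sketch below uses our own illustrative names and real-valued actor outputs (a complex beamformer would need separate real and imaginary outputs), so it is a simplified rendering rather than the authors' implementation:

```python
import numpy as np

def post_actor_processing(raw_w, raw_p, p_max):
    """Post-actor stage sketch: sigmoid squashing, then the mappings
    of Propositions 1 and 2.
    raw_w: (L,) raw actor outputs for the beamformer;
    raw_p: (K, 2) raw actor outputs for the sub-signal powers;
    p_max: (K,) per-IoTD power budgets."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))   # Eq. (7)
    w_prime, p_prime = sig(raw_w), sig(raw_p)
    # Proposition 1: p*_{k,1} = p'_{k,1} P_max; p*_{k,2} = p'_{k,2}(P_max - p*_{k,1}),
    # which guarantees p*_{k,1} + p*_{k,2} <= P_max (constraint 6d).
    p1 = p_prime[:, 0] * p_max
    p2 = p_prime[:, 1] * (p_max - p1)
    p_star = np.stack([p1, p2], axis=1)
    # Proposition 2: normalize so that ||w*||^2 = 1 (constraint 6b).
    w_star = w_prime / np.sqrt(np.sum(np.abs(w_prime) ** 2))
    return w_star, p_star
```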
Fig. 2. Proposed Deep Reinforcement Learning Framework.
Proof. The squared norm of the receive beamforming vector is calculated as

\|w*(t)\|^2 = \sum_{l=1}^{L} |w*_l(t)|^2 = \sum_{l=1}^{L} \frac{|w'_l(t)|^2}{\sum_{j=1}^{L} |w'_j(t)|^2} = \frac{\sum_{l=1}^{L} |w'_l(t)|^2}{\sum_{j=1}^{L} |w'_j(t)|^2} = 1.   (12)

Then, Eq. (12) proves Proposition 2. As a result, by applying the sigmoid function, Proposition 1, and Proposition 2, the constraints in (6) are satisfied. Then, the DRL-based problem is formulated by maximizing the sum of discounted future rewards as

(P2)  \max_{a(t)} \sum_{i=t}^{T} \gamma^{i-t} r(i),   (13)
where γ ∈ [0, 1] is the discount factor and T is the number of time slots. We illustrate the whole framework in Fig. 2 and the proposed algorithm in Algorithm 1. The training process is executed over E episodes, each with T time steps. At each time step, the DDPG agent observes the environment's state s(t) and takes an action a(t) via the main actor network. The action is then passed through the sigmoid function and the mapping stage (defined by Proposition 1 and Proposition 2) and performed in the environment to obtain the next state s(t + 1) and the reward r(t). Subsequently, an experience tuple, consisting of
Algorithm 1. Proposed DRL-based algorithm.
1: Set up parameters of algorithm and environment.
2: for e = 1, 2, ..., E do
3:   Get episode initial state s(1).
4:   for t = 1, 2, ..., T do
5:     Decide action a(t) = μ(s(t)|θ^μ) + N(t).
6:     Apply sigmoid function and mapping stage, get π'(t), p*(t), w*(t).
7:     Perform π'(t), p*(t), w*(t) in the environment, obtain s(t + 1), r(t).
8:     Keep sample {s(t), a(t), r(t), s(t + 1)} in experience buffer.
9:     s(t) ← s(t + 1).
10:    Pick a batch of samples randomly from the experience buffer.
11:    Update actor network parameter θ^μ by (14).
12:    Update critic network parameter θ^Q by (15).
13:    Update target networks as (16).
14:  end for
15: end for
16: return the optimal actor network θ^{μ*}.
s(t), a(t), r(t), and s(t + 1), is pushed into the experience buffer to train the neural networks. At each update step, a batch of experience samples is selected randomly to update the parameters via the DDPG algorithm, which is an off-policy reinforcement learning algorithm that combines deep neural networks with actor-critic methods to learn policies for continuous control tasks [14]. It maintains two actor networks, a main network μ(s|θ^μ) and a target network μ'(s|θ^{μ'}), and two critic networks, a main network Q(s, a|θ^Q) and a target network Q'(s, a|θ^{Q'}), where θ^μ, θ^{μ'}, θ^Q, and θ^{Q'} are the neural network parameters. The actor network is used to select actions, and the critic network evaluates the chosen actions by estimating the expected cumulative reward. The main actor network is updated by computing the gradient of the expected return with respect to the action, given as

\nabla_{θ^μ} J = \frac{1}{B} \sum_{i=1}^{B} \nabla_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} \nabla_{θ^μ} μ(s_i|θ^μ),   (14)
where B represents the size of the mini-batch. The critic network is trained by minimizing the difference between the estimated and observed returns, given as

L = \frac{1}{B} \sum_{i=1}^{B} \left( Q(s_i, a_i|θ^Q) − y_i \right)^2,   (15)

where y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^{μ'})|θ^{Q'}) denotes the estimated target value. Then, a soft update with coefficient τ is applied to update the target networks' parameters, given as

θ^{μ'} ← τ θ^μ + (1 − τ) θ^{μ'},   θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}.   (16)
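The target value in (15) and the soft update in (16) each reduce to a few lines. This NumPy sketch treats network parameters as plain arrays and uses illustrative γ and τ values, not the paper's training settings:

```python
import numpy as np

def td_target(r, q_next, gamma=0.99):
    """Critic target of Eq. (15): y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    return r + gamma * q_next

def soft_update(main_params, target_params, tau=0.005):
    """Polyak soft update of Eq. (16): theta' <- tau * theta + (1 - tau) * theta'.
    main_params / target_params: lists of parameter arrays, pairwise aligned."""
    return [tau * m + (1.0 - tau) * t for m, t in zip(main_params, target_params)]
```

With a small τ, the target networks track the main networks slowly, which stabilizes the bootstrapped critic targets.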
Fig. 3. Model convergence with respect to the learning rate.
To generate training data with exploration, the actor policy is modified by adding noise to the produced action. The action a(t) is then expressed as a(t) = μ(s(t)|θ^μ) + N(t), where the Ornstein-Uhlenbeck process is used to generate N(t).
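A minimal Ornstein-Uhlenbeck noise generator might look as follows; the θ, σ, and dt values are common defaults for DDPG exploration, not values taken from the paper:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise N(t) added to the actor output.
    The process is mean-reverting, producing temporally correlated noise."""
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(dim, mu, dtype=float)

    def sample(self):
        # dx = theta * (mu - x) dt + sigma * sqrt(dt) * dW
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * self.rng.normal(size=self.x.shape)
        self.x = self.x + dx
        return self.x
```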
3 Evaluation
We experiment on a scenario with 10 IoTDs, where the number of BS antennas ranges from 4 to 8, the system bandwidth ranges from 4 to 16 MHz, and the maximum transmit power of the IoTDs is set from 4 to 10 dBm. We select h_k ∼ CN(0, I) and σ² = −170 dBm/Hz. To evaluate the performance of the proposed framework (Proposed), we compare the achievable system sum-rate against the following benchmark schemes:
– Non-Beamforming scheme (DDPG-NB): We define this scheme to demonstrate the impact of optimizing the receive beamforming vector. Here, the receive beamforming vector is fixed with zero phase, and the value of each beamforming element is w_l = 1/√L, i.e., \|w\|^2 = 1.
– Greedy scheme (GR): In this scheme, we quantize the actions into discrete spaces, then apply a greedy search algorithm to find the action with the best return at each time step.
– Random scheme (RA): The actions in this scheme are randomly selected in the range [0, 1] and performed in the environment at each time step.
First, using various learning rate values, we evaluate the proposed training algorithm's convergence, as shown in Fig. 3. Accordingly, the algorithm performs best with both the actor learning rate (lr_a) and the critic learning rate (lr_c) of
Fig. 4. Performance evaluation.
1e−3, gets stuck at a low convergence value in the large learning rate case (5e−3), and converges more slowly in the small learning rate case (1e−4), after about 3000 episodes as opposed to 1500 episodes in the best case. Second, we evaluate the proposed framework's performance compared to the other schemes based on variations in system bandwidth, as illustrated in Fig. 4a. It demonstrates how the achievable sum-rate rises as the bandwidth increases, being around four times higher from B = 4 to B = 16 MHz. Specifically, our proposed approach gives the best performance, performing about 33.3%, 44.1%, and 85.1% better than DDPG-NB, GR, and RA, respectively. Finally, to evaluate the effect of optimizing the receive beamforming vector, we compare the proposed scheme with the DDPG-NB approach by analyzing changes in the maximum transmit power and the number of antennas. According to Fig. 4b, the achievable sum-rate rises as power increases; it rises approximately 65.8% from P_k^max = 2 to P_k^max = 10 dBm. In addition, optimizing the receive beamforming vector improves performance when expanding the number of BS antennas: in the proposed scheme, the case of 8 antennas performs 33.5% better than the case of 4 antennas, compared to 1.4% in the DDPG-NB approach.
4 Conclusion
In this study, we investigated a sum-rate maximization problem in an uplink multi-user SIMO-RSMA system, in which we optimized the IoTDs' transmit powers, the receive beamforming vector, and the decoding order at the BS. We transformed the problem into a DRL framework with a post-actor processing stage and applied the DDPG algorithm to resolve it. In the evaluation, we demonstrated the efficacy of the proposed approach compared to several benchmark approaches. This study also leaves some open issues worth researching, such as considering MIMO systems, the fairness of IoTDs' achievable rates, imperfect channel gain, and integration with MEC systems.
Acknowledgment. This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2023-RS-2022-00156353) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation). It was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1F1A1062881).
References
1. Mao, Y., Clerckx, B., Li, V.O.K.: Rate-splitting multiple access for downlink communication systems: bridging, generalizing, and outperforming SDMA and NOMA. EURASIP J. Wirel. Commun. Netw. 2018(1), 1–54 (2018). https://doi.org/10.1186/s13638-018-1104-7
2. Mao, Y., Dizdar, O., Clerckx, B., Schober, R., Popovski, P., Poor, H.V.: Rate-splitting multiple access: fundamentals, survey, and future research trends. IEEE Commun. Surv. Tutorials 24(4), 2073–2126 (2022)
3. Zheng, G., Wong, K.-K., Ng, T.-S.: Energy-efficient multiuser SIMO: achieving probabilistic robustness with Gaussian channel uncertainty. IEEE Trans. Commun. 57(6), 1866–1878 (2009)
4. De Sena, A.S., et al.: Dual-polarized massive MIMO-RSMA networks: tackling imperfect SIC. IEEE Trans. Wireless Commun. 22, 3194–3215 (2022)
5. Krishnamoorthy, A., Schober, R.: Downlink MIMO-RSMA with successive null-space precoding. IEEE Trans. Wireless Commun. 21(11), 9170–9185 (2022)
6. Dizdar, O., Mao, Y., Clerckx, B.: Rate-splitting multiple access to mitigate the curse of mobility in (massive) MIMO networks. IEEE Trans. Commun. 69(10), 6765–6780 (2021)
7. Hieu, N.Q., Hoang, D.T., Niyato, D., Kim, D.I.: Optimal power allocation for rate splitting communications with deep reinforcement learning. IEEE Wireless Commun. Lett. 10(12), 2820–2823 (2021)
8. Nguyen, T.H., Truong, T.P., Dao, N.N., Na, W., Park, H., Park, L.: Deep reinforcement learning-based partial task offloading in high altitude platform-aided vehicular networks. In: 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), pp. 1341–1346. IEEE (2022)
9. Nguyen, T.H., Park, L.: A survey on deep reinforcement learning-driven task offloading in aerial access networks. In: 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), pp. 822–827. IEEE (2022)
10. Lakew, D.S., Tran, A.T., Dao, N.N., Cho, S.: Intelligent offloading and resource allocation in heterogeneous aerial access IoT networks. IEEE Internet Things J. 10(7), 5704–5718 (2023)
11. Truong, T.P., Dao, N.N., Cho, S.: HAMEC-RSMA: enhanced aerial computing systems with rate splitting multiple access. IEEE Access 10, 52398–52409 (2022)
12. Yang, Z., Chen, M., Saad, W., Wei, X., Shikh-Bahaei, M.: Sum-rate maximization of uplink rate splitting multiple access (RSMA) communication. IEEE Trans. Mob. Comput. 21(7), 2596–2609 (2022)
13. Katwe, M., Singh, K., Clerckx, B., Li, C.-P.: Rate splitting multiple access for sum-rate maximization in IRS aided uplink communications. IEEE Trans. Wireless Commun. 22(4), 2246–2261 (2023)
14. Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. In: ICLR (Poster) (2016)
Multiobjective Logistics Optimization for Automated ATM Cash Replenishment Process

Bui Tien Thanh1, Dinh Van Tuan1, Tuan Anh Chi1, Nguyen Van Dai1, Nguyen Tai Quang Dinh1, Nguyen Thu Thuy1, and Nguyen Thi Xuan Hoa2,3(B)

1 School of Applied Mathematics and Informatics, Hanoi, Vietnam
2 School of Economics and Management, Hanoi University of Science and Technology, Hanoi, Vietnam
3 Center for Digital Technology and Economy, Hanoi University of Science and Technology, Hanoi, Vietnam
[email protected]
Abstract. In the digital transformation era, integrating digital technology into every aspect of banking operations improves process automation, cost efficiency, and service levels. Although logistics for Automated Teller Machine (ATM) cash is a crucial task that impacts operating costs and consumer satisfaction, there has been little effort to enhance it. In Vietnam specifically, with a market of more than 20,000 ATMs nationwide, research and technological solutions for this issue remain scarce. In this paper, we generalize the vehicle routing problem for ATM cash replenishment, suggest a mathematical model, and then offer a tool to evaluate various situations. When evaluated on a simulated dataset, our proposed model and method produced encouraging results, with the benefit of cutting ATM cash operating costs.

Keywords: ATM cash replenishment · vehicle routing problem · optimization · banking system · logistics scheduling

1 Introduction
An ATM is a physical interaction point and can be seen as an extended arm of a bank reaching customers in many areas. Therefore, storing and replenishing cash for ATMs affects customer satisfaction with banking services. There are also costs associated with personnel, transportation, and the interest rate on money deposited at ATMs; the latter is relatively high, since the rate can reach 10% [11]. From the supply chain management perspective, this is a classic trade-off between transportation and inventory costs. Many cost-optimizing models have been devised to solve this problem, especially in manufacturing industries [6].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 46–56, 2023. https://doi.org/10.1007/978-3-031-46573-4_5

For the banking sector, this can be considered a money supply chain problem
and a specific commodity that must adhere to the State's regulations and risk management in logistics. According to the International Monetary Fund (IMF), Vietnam has 20,404 ATMs, with 2,644 (≈13%) belonging to the Vietnam Bank for Agriculture and Rural Development (Agribank). With an extensive network covering many regions and continuous cash replenishment, banks that improve the logistics of the ATM cash replenishment process can realize savings in cost and administration. The ATM cash replenishment problem aims to provide a plan that meets the demand for ATM withdrawals while optimizing the costs related to money transportation. Many mathematical models [2,5] have been proposed to solve the transportation optimization problem, also known as the Vehicle Routing Problem (VRP). The problem is commonly expressed as a single-objective optimization problem that minimizes the total transportation expenses of the vehicles, comprising fixed costs and costs proportional to the total distance travelled. We also consider financial cost an important factor in our objective function. To address the VRP, we formulate the problem as a Multiobjective Optimization Problem (MOP). An MOP involves multiple objectives that may conflict with one another, making it impossible to achieve a solution that is optimal with respect to every objective simultaneously. Thus, the aim is to find solutions that best balance the different objective functions. Some methods to solve MOPs are described in [9,10]. In Vietnam, there are few research and technological solutions for this problem, especially as the Vietnamese banking sector undergoes digital transformation: many internal business processes are being digitized but are not yet integrated with optimization technologies. Therefore, an easy-to-apply model, close to logistics practice in the Vietnamese market, that improves ATM operations can benefit domestic businesses.
In this study, we generalize the replenishment cash process for ATMs based on a survey in the domestic banking market to better define the problem and the desired goal. Then, we model the problem, propose a tool to solve the problem and perform tests on the simulated dataset to evaluate the results. We expect that our recommendations would benefit the banking industry by: (i) Reducing interest costs on money deposited for a long time at ATMs; (ii) Saving transportation costs and time since distances and trips are optimized; (iii) Minimizing planning time and errors; (iv) Utilizing a system and related toolkits to standardize the procedure and reduce the personnel-dependent risk; (v) Providing a long-term transportation plan and an overview to assist the management department in making decisions regarding cash flow coordination. The rest of this paper is organized as follows. Section 2 presents the problem and business requirements. Section 3 models the problem. Our proposed solution is presented in Sect. 4. Section 5 tests and evaluates the results. Finally, Sect. 6 concludes our main contributions.
2 Research Problem
Fig. 1. ATM cash replenishment process
Figure 1 shows a basic ATM replenishment process based on our survey [7]. As a first step, staff members plan the top-ups: which ATMs will be recharged, when, and with how much money. Once the plan is done, the bank's staff coordinate with the shipping department or a third party to transfer the money to the ATMs. Figure 1 only records those directly and full-time involved in replenishment; in practice, other management and support departments are also involved, but less often. Generally, the cash replenishment process for ATMs follows some common regulations:

– Money transportation must be done during office hours.
– An ATM is usually supplied with cash by a fixed depot.
– If a bank has several cash depots, the depots are usually in charge of separate areas to meet the shortest-distance criterion. This is also convenient for grouping personnel and managing cash flow by region.

In the ATM cash replenishment process, the planning procedure plays an important role because it drives the operation's incurred costs, which fall into two main categories:

– Firstly, transportation costs: the costs of renting a specialized vehicle to transport money, hiring a driver, and the salary and allowances paid to the staff who plan, count money, and supervise the transport vehicle.
– Secondly, financial costs: the loan interest on money kept in ATMs to serve customers' withdrawal needs.

Table 1. Planning for the ATM cash replenishment process

Current: Transportation planning is usually made over a short horizon (1–3 days): staff typically plan one day in advance and combine transports over the following 2–3 days.
Improvement: Create a longer-term transportation plan (7–14 days) to find better route combinations and to give management and other departments an overall view of the cash plan.

Current: The amount deposited at each ATM normally follows a fixed rate for a period of time (monthly or quarterly).
Improvement: Forecast actual daily withdrawals at each ATM to set more reasonable deposit amounts.

Current: Plans are often made in Microsoft Excel with manual calculations and staff experience.
Improvement: Use proven mathematical theories and models to create cost-optimized plans; use tools and software to calculate faster and more accurately, saving time and effort.
However, according to our survey of a segment of domestic banks in Vietnam, this process can be improved as shown in Table 1. Withdrawal demand forecasts are used to plan the cash deposited into ATMs. Currently in Vietnam, depending on the location, an ATM is topped up with a certain amount of money when its balance falls below a safe threshold. This amount is fixed monthly or quarterly and reassessed the following month or quarter. Although this approach is simple, it has many disadvantages. Since withdrawal demand can fluctuate over a short period of time, a fixed deposit limit at an ATM can incur large deposit-interest costs when actual withdrawals differ significantly from the limit. On the other hand, if withdrawal demand rises sharply, the bank must replenish more frequently. We therefore see an opportunity to combine transportation routes by planning flexibly according to demand fluctuations rather than depositing a fixed amount of money.
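To make this trade-off concrete, the following toy calculation (all figures are invented for illustration and are not survey data) compares the deposit-interest cost of one large up-front deposit with forecast-based split deposits over a week:

```python
# Toy comparison: interest cost of one large deposit vs. splitting it
# across the week to follow forecast withdrawals. All figures invented.
IR_DAILY = 0.05 / 365                              # 5%/year deposit rate
withdrawals = [100, 120, 90, 110, 80, 130, 70]     # million VND per day

def interest_cost(deposits):
    """Interest paid on cash left idle in the ATM at the end of each day."""
    balance, cost = 0.0, 0.0
    for deposit, withdrawn in zip(deposits, withdrawals):
        balance += deposit - withdrawn
        cost += IR_DAILY * balance
    return cost

one_shot = [sum(withdrawals), 0, 0, 0, 0, 0, 0]    # fixed large deposit
split = [220, 0, 200, 0, 210, 0, 70]               # two-day forecast batches

print(interest_cost(one_shot), interest_cost(split))
```

Splitting leaves far less idle cash in the machine, so the interest cost falls sharply; the price is more trips, which is exactly the transportation-versus-financial-cost balance discussed above.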
3 Mathematical Model

3.1 Problem Statement
The problem is defined on a complete directed graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, 2, \ldots, A, 0_1, 0_2, \ldots, 0_D\}$ is the set of vertices and $\mathcal{E} = \{(i, j) \mid i, j \in \mathcal{V}, i \neq j\}$ is the set of edges, with $D$ and $A$ positive integers. $\mathcal{A} = \{1, 2, \ldots, A\}$ is the set of ATMs and $\mathcal{D} = \{0_1, 0_2, \ldots, 0_D\}$ is the set of cash depots. $\mathcal{H} = \{1, 2, \ldots, H\}$ is the set of vehicles and $\mathcal{T} = \{1, 2, \ldots, p\}$ is the set of transportation periods. Figure 2 illustrates an example with 3 depots and 16 ATMs. ATMs 1, 2, 3, 4, 5, 6, 11 are replenished from depot $0_1$; ATMs 8, 12, 16 from depot $0_2$; and ATMs 7, 9, 10, 13, 14, 15 from depot $0_3$. The optimal route for vehicle 1 is $0_1 \to 1 \to 5 \to 2 \to 3 \to 0_1$, and the optimal route for vehicle 4 is $0_3 \to 7 \to 9 \to 10 \to 0_3$. Vehicle 1 is used in period 1 ($z_{11} = 1$) and leaves the depot at 9:30 ($u_{11} = 9{:}30$). ATM 2 is replenished by depot $0_1$ in period 1 ($y_{0_1 2 1} = 1$). Road $(5, 2)$ is traversed by vehicle 1 in period 1 ($x_{5211} = 1$). The amount to be replenished at ATM 2 in period 1 is 100 million VND ($d_{21} = 100$), and the amount withdrawn at ATM 2 in period 1 is 50 million VND ($m_{21} = 50$). Vehicle 1 arrives at ATM 2 in period 1 at 10:00 ($r_{211} = 10{:}00$) and starts replenishing at 10:05 ($w_{211} = 10{:}05$).
Fig. 2. A scenario with 3 depots and 16 ATMs
3.2 Constraints
The model must satisfy the following constraints:

(1) Each vehicle departs from and returns to the same depot.
(2) In each period, each ATM can only be served by one depot.
(3) In each period, each ATM can only be served by one vehicle.
(4) The vehicles are heterogeneous (they have different capacities and operating costs).
(5) The start time of cash replenishment at each ATM must be within the allowable range.
(6) Vehicles can only operate within specified hours.
(7) The actual time taken at each ATM must be considered (to perform cash counting, cash loading, machine shutdown, et cetera).
(8) The amount of cash replenished at each ATM in each period must not exceed the capacity of each vehicle.
(9) The amount of cash to be transferred to each ATM is known in advance.
(10) Only the total amount of cash replenished at ATMs is considered, without the details of the denominations.
(11) The location and coordinates of each ATM are known in advance.
(12) ATMs function normally without breaking down or needing repair.

3.3 Mathematical Model
In this article, the following parameters are used to model the problem:

– $t_{ijh}$: the time for vehicle $h$ to travel from node $i$ to node $j$;
– $C$: the maximum travel distance allowed for each vehicle;
– $q_h$: the capacity of vehicle $h$;
– $IR$: the annual interest rate (%);
– $I_{0j}$: the initial amount of money at ATM $j$;
– $c_{ij}$: the distance between node $i$ and node $j$;
– $e_j$: the earliest time to start replenishing money at ATM $j$;
– $l_j$: the latest time to start replenishing money at ATM $j$;
– $s_j$: the time needed to replenish money at ATM $j$ (measured from the time the money is taken out of the vehicle);
– $e_0$: the earliest time for a vehicle to start from the depot;
– $l_0$: the latest time for a vehicle to start from the depot;
– $a_h$: the driving cost of vehicle $h$ per unit distance;
– $V$: the number of vertices in the graph ($V = |\mathcal{V}|$);
– $t_{\max}$: the maximum travel time allowed for a vehicle within a period;
– $p$: the maximum number of periods;
– $m_{jt}$: the amount of money withdrawn from ATM $j$ during period $t$.

In addition, the following decision variables are used:

– $x_{ijht} = 1$ if vehicle $h$ travels on road $(i, j)$ during period $t$, and $0$ otherwise;
– $z_{ht} = 1$ if vehicle $h$ is used during period $t$, and $0$ otherwise;
– $y_{ijt} = 1$ if ATM $j$ is replenished by depot $i$ during period $t$, and $0$ otherwise;
– $d_{jt}$: the amount of money needed to replenish ATM $j$ in period $t$;
– $r_{jht}$: the time when vehicle $h$ arrives at ATM $j$ during period $t$;
– $w_{jht}$: the time when vehicle $h$ starts replenishing money at ATM $j$ during period $t$ (measured from the time the money is taken out of the vehicle);
– $u_{ht}$: the time when vehicle $h$ starts from the depot during period $t$.

Based on the situation discussed earlier, we formulate a constrained optimization problem whose objective, with $V_{\min}$ denoting the minimum vector, is $\min f = V_{\min}(f_1, f_2)$, where

$$f_1 = \sum_{t \in \mathcal{T}} \sum_{h \in \mathcal{H}} \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{V},\, j \neq i} a_h \, c_{ij} \, x_{ijht}, \tag{1}$$

$$f_2 = \frac{IR}{365} \sum_{j \in \mathcal{A}} \Big[ p \, (I_{0j} + d_{j1} - m_{j1}) + (p-1)(d_{j2} - m_{j2}) + \cdots + 2 \, (d_{j,p-1} - m_{j,p-1}) + (d_{jp} - m_{jp}) \Big], \tag{2}$$

subject to

$$\sum_{i \in \mathcal{V}} \sum_{h \in \mathcal{H}} x_{ijht} = 1, \quad \forall j \in \mathcal{A},\ t \in \mathcal{T}, \tag{3}$$

$$\sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{A}} d_{jt} \, x_{ijht} \le q_h, \quad \forall h \in \mathcal{H},\ t \in \mathcal{T}, \tag{4}$$

$$\sum_{i \in \mathcal{D}} \sum_{j \in \mathcal{A}} x_{ijht} = \sum_{i \in \mathcal{D}} \sum_{j \in \mathcal{A}} x_{jiht} = z_{ht}, \quad \forall h \in \mathcal{H},\ t \in \mathcal{T}, \tag{5}$$

$$0 \le r_{jht} \le w_{jht}, \quad \forall j \in \mathcal{A},\ h \in \mathcal{H},\ t \in \mathcal{T}, \tag{6}$$

$$e_j \le w_{jht} \le l_j, \quad \forall j \in \mathcal{A},\ h \in \mathcal{H},\ t \in \mathcal{T}, \tag{7}$$

$$\sum_{i \in S} \sum_{j \in S} x_{ijht} \le |S| - 1, \quad \forall S \subseteq \mathcal{A},\ 2 \le |S| \le A,\ h \in \mathcal{H},\ t \in \mathcal{T}, \tag{8}$$

$$\sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{V},\, j \neq i} t_{ijh} \, x_{ijht} \le t_{\max}, \quad \forall h \in \mathcal{H},\ t \in \mathcal{T}, \tag{9}$$

$$\sum_{k \in \mathcal{V}} (x_{ikht} + x_{kjht}) \le 1 + y_{ijt}, \quad \forall i \in \mathcal{D},\ j \in \mathcal{A},\ h \in \mathcal{H},\ t \in \mathcal{T}, \tag{10}$$

$$\sum_{i \in \mathcal{D}} y_{ijt} \le 1, \quad \forall j \in \mathcal{A},\ t \in \mathcal{T}, \tag{11}$$

$$\sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{V},\, j \neq i} \sum_{t \in \mathcal{T}} c_{ij} \, x_{ijht} \le C, \quad \forall h \in \mathcal{H}, \tag{12}$$

$$(w_{iht} + s_i + t_{ijh} - r_{jht}) \, x_{ijht} = 0, \quad \forall i \in \mathcal{V},\ j \in \mathcal{A},\ h \in \mathcal{H},\ t \in \mathcal{T}, \tag{13}$$

$$m_{jt} \le t \, (I_{0j} + d_{j1} - m_{j1}) + (t-1)(d_{j2} - m_{j2}) + \cdots + 2 \, (d_{j,t-1} - m_{j,t-1}) + (d_{jt} - m_{jt}), \quad \forall j \in \mathcal{A},\ t \in \mathcal{T}. \tag{14}$$

Let $\tilde{T}_{ht} = \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{V},\, j \neq i} t_{ijh} \, x_{ijht} + \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{A},\, j \neq i} s_j \, x_{ijht}$ denote the total operating time of vehicle $h$ in period $t$. Then

$$u_{ht} + \tilde{T}_{ht} \le l_0, \quad \forall h \in \mathcal{H},\ t \in \mathcal{T}, \tag{15}$$

$$\tilde{T}_{ht} \le l_0 - e_0, \quad \forall h \in \mathcal{H},\ t \in \mathcal{T}, \tag{16}$$

$$x_{ijht},\ y_{ijt},\ z_{ht} \in \{0, 1\}, \quad \forall i, j \in \mathcal{V},\ h \in \mathcal{H},\ t \in \mathcal{T}, \tag{17}$$

$$d_{jt} \ge 0, \quad \forall j \in \mathcal{A},\ t \in \mathcal{T}. \tag{18}$$
The objective function (1) minimizes transportation costs, while (2) minimizes financial costs. Constraint (3) ensures that each ATM is visited by exactly one vehicle per period. Constraint (4) ensures that the amount of money transferred by each vehicle per period does not exceed its capacity. Constraint (5) ensures that a vehicle must depart from its corresponding depot when used. Constraint (6) ensures that the time a vehicle starts unloading money at ATM $j$ in period $t$ is no earlier than its arrival time. Constraint (7) ensures that the time a vehicle starts unloading money at ATM $j$ in period $t$ is within the working hours of ATM $j$. Constraint (8) eliminates sub-tours. Constraint (9) ensures that the travel time of each vehicle in each period is within the threshold. Constraint (10) allows the replenishment of an ATM by a depot only if there is a route from the depot to the ATM. Constraint (11) ensures that each ATM is replenished by at most one depot per period. Constraint (12) ensures that the distance traveled by each vehicle does not exceed the threshold. Constraint (13) relates the order of visits: if vehicle $h$ travels directly from vertex $i$ to $j$, the arrival time $r_{jht}$ at vertex $j$ must equal $w_{iht} + s_i + t_{ijh}$. Constraint (14) ensures that the amount of money withdrawn at ATM $j$ in period $t$ does not exceed the amount currently available at ATM $j$. Constraints (15) and (16) ensure that a vehicle's operating time is within the designed working hours $[e_0, l_0]$. Constraint (17) indicates that the binary decision variables only take the values 0 or 1. Constraint (18) ensures that the replenishment amounts are non-negative. Pareto-based multiobjective learning algorithms can generate multiple Pareto-optimal solutions, allowing users to gain insights about the problem and make better decisions. To select the best Pareto-optimal solution from the Pareto front, sub-criteria might be employed and optimized accordingly.
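As a small numeric illustration of the financial-cost term (2) restricted to a single ATM (all amounts below are invented), each period's net balance change $d_{jt} - m_{jt}$ is weighted by the number of days it remains idle in the machine:

```python
# Financial-cost term of Eq. (2) for one ATM over p periods.
# Period 1's net balance accrues interest for p days, period 2's for
# p - 1 days, and so on. All numbers are invented for illustration.
def financial_cost(IR, I0, d, m):
    """IR: annual interest rate; I0: initial cash at the ATM;
    d, m: per-period deposits and withdrawals (0-indexed lists)."""
    p = len(d)
    total = p * (I0 + d[0] - m[0])
    for t in range(1, p):
        total += (p - t) * (d[t] - m[t])
    return IR / 365 * total

# Example: 5%/year interest, 100 million VND initially, p = 3 periods.
cost = financial_cost(0.05, 100, d=[50, 0, 80], m=[60, 70, 40])
print(cost)
```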
The mathematical model associated with this problem is the optimization problem on the efficient set, which has been studied in prior works [8]. In future research, we plan to build on these previous works to further explore and enhance the effectiveness of the optimization problem on the efficient set.
4 Methodology
From our survey, the VRP is a long-standing problem with many studies conducted. Solution approaches are usually divided into two categories: exact methods, typically branch-and-bound applied to a mixed integer programming model [1,3], and metaheuristic methods such as evolutionary algorithms and local search [4]. In practical applications, transportation problems are often reduced to planning problems and solved with solver software such as CPLEX, MATLAB, or AMPL. In this article, we use the open-source software OR-Tools. The most significant difference between OR-Tools and other software is that it provides prepackaged components and constraints for some familiar optimization problems, with popular functions that users can call directly. This makes the model simpler and easier to visualize when setting constraints, thereby saving research and development time.
Some problems that OR-Tools supports are assignment, scheduling, vehicle routing, and network flow problems. This study uses its vehicle routing tooling. The main steps in solving the problem with OR-Tools are:

– Initialize the data for the problem: the distance matrix between points, the operating hours of depots and ATMs, the amounts to be transferred to each point, and information about the vehicles used.
– Balance the financial cost (interest on deposits in the ATM) against the transportation cost (travel cost of the vehicles). We perform a simple test by dividing a large amount of money into several smaller amounts and estimating the financial cost in proportion to the number of scheduled days. Combining financial and transportation costs yields an aggregate cost matrix, on which the problem is optimized.
– Build the model and load the data using the functions available in OR-Tools.
– Set the problem's constraints in two groups: for vehicles and for ATMs.
– Set the algorithm and the maximum running time, then run.

After these steps, OR-Tools returns a transportation plan for the specified period, detailing which points each trip passes in a day, the order of travel, and how much cash is transported.
5 Testing and Evaluation
To perform the test, we use data from actual ATM locations of a bank in Hanoi (Vietnam). The distance matrix between ATMs is collected from Google Maps to be as close to reality as possible, accounting for available routes, prohibited roads, and so on. Operating hours, stopping time at each point, and the specialized money-transport vehicles are parameterized based on an actual business survey at a domestic bank. For cost data, we refer to staff salaries, petrol, vehicle rental prices, and average deposit interest rates in the market. We compute the travel cost in proportion to the distance, and the financial cost depending on the amount and the estimated backlog days. The key issue is to balance the cost of money deposited at the ATM against the costs incurred from transporting money. To evaluate effectiveness, we consider two cases: (i) Case 1: set a fixed deposit amount for each ATM and replenish only when this amount is nearly depleted; (ii) Case 2: divide ATM deposits into smaller amounts based on forecasted changes in withdrawals over a period of time. We experiment with two samples, drawn from 2 cash depots in different areas over a 7-day period. The key issue to be addressed is to balance financial costs with transportation costs and fixed vehicle costs. We split large amounts of money into smaller amounts, simply called splitting orders. The idea is to determine lower and upper bounds for the small orders, compute from them the number of split orders that can be made from the initial large order, and let the algorithm choose one of
the calculated ways of splitting orders to optimize as much as possible. The experimental parameters and corresponding results are presented in Table 2.

The test results in Table 2 for both scenarios show the trade-off between transportation cost and financial cost. As the number of scheduled vehicle trips increases, vehicles arrive at the ATMs more frequently and the overall distance covered rises proportionally; in return, the financial cost decreases because less monetary inventory is held within the ATMs. Table 2 shows that by splitting the replenishments according to the forecast, the total cost, including transportation and financial costs, improves by 26–38% on the tested data. Although splitting a single transport into multiple ones with smaller amounts of money increases vehicle working time, total distance, and transportation costs, the reduced cost of deposits at ATMs more than offsets the total costs. Accordingly, there is potential for using optimized models and tools to lower the bank's operating costs.

Table 2. Test Results

Input (both tests): 2 warehouses; 2 vehicles per warehouse; 7 days; estimated total amount from 2.5 to 3.5 billion VND per ATM; rough input per period from 1.0 to 1.4 billion VND per ATM per replenishment; bank interest rate of 5% per year.
Number of ATMs: 28 (Test 1); 58 (Test 2).
Total amount offered: 84,412,000,000 VND (Test 1); 176,999,000,000 VND (Test 2).

Results when running the data with OR-Tools:

                         Test 1                      Test 2
                         No split     Single split   No split     Single split
Total number of trips    5            8              7            12
Total distance (km)      326.8        621.9          520.4        976.9
Shipping cost (VND)      5,707,054    16,181,443     10,878,837   31,078,885
Financial cost (VND)     36,317,446   14,748,057     72,733,163   20,986,315
Total cost (VND)         42,024,500   30,929,500     83,612,000   52,065,200
Cost improvement (%)     -            26.4           -            37.7

6 Conclusion
This article provides a comprehensive examination of the cash replenishment logistics problem concerning bank ATMs. It introduces a model and tool aimed
at supporting cash replenishment planning. The outcomes reveal a noteworthy reduction of total costs by 26–38%, demonstrating the potential for optimizing operating expenses through multiobjective optimization in the ATM cash replenishment process. In contrast to the conventional approach of periodic review policies for cash inventory at the ATM, the replenishment method proposed in this study yields dual benefits: it minimizes the cash stock held within the ATM and reduces the overall travel distance of cash transport vehicles from banks to ATMs. The optimized travel distance not only lowers transportation costs but also contributes to environmental sustainability by mitigating pollution. Consequently, this research simultaneously optimizes the bank's cash circulation by reducing the cash inventory at ATMs and optimizing the cash transport routes, thereby offering valuable insights for efficient cash management in the banking sector. The article acknowledges certain limitations. It does not account for some actual operational costs, such as indirect personnel expenses. Additionally, special scenarios, such as planning for holidays with sudden increases in demand, or dealing with ATM breakdowns and other technical issues, have not been explicitly addressed. Despite these limitations, the study anticipates that the proposed recommendations will yield substantial benefits in procedure automation, enhanced management practices, and reduced operating costs for banks in Vietnam. Further research could address the mentioned constraints and delve into these specific cases; the present findings nonetheless offer valuable insights with the potential to positively impact the banking sector in the country.
References

1. Aggarwal, D., Kumar, V.: Mixed integer programming for vehicle routing problem with time windows. Int. J. Intell. Syst. Technol. Appl. 18(1–2), 4–19 (2019)
2. Ekinci, Y., Lu, J.C., Duman, E.: Optimization of ATM cash replenishment with group-demand forecasts. Expert Syst. Appl. 42(7), 3480–3490 (2015)
3. Golden, B., Wang, X., Wasil, E.: The evolution of the vehicle routing problem: a survey of VRP research and practice from 2005 to 2022. In: The Evolution of the Vehicle Routing Problem. Synthesis Lectures on Operations Research and Applications, pp. 1–64. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-18716-2_1
4. Golden, B.L., Raghavan, S., Wasil, E.A., et al.: The Vehicle Routing Problem: Latest Advances and New Challenges, vol. 43. Springer, New York (2008). https://doi.org/10.1007/978-0-387-77778-8
5. Kiyaei, M., Kiaee, F.: Optimal ATM cash replenishment planning in a smart city using deep Q-network. In: 2021 26th International Computer Conference, Computer Society of Iran (CSICC), pp. 1–5 (2021)
6. Peres, I.T., Repolho, H.M., Martinelli, R., Monteiro, N.J.: Optimization in inventory-routing problem with planned transshipment: a case study in the retail industry. Int. J. Prod. Econ. 193, 748–756 (2017)
7. Serengil, S.I., Ozpinar, A.: ATM cash flow prediction and replenishment optimization with ANN. Int. J. Eng. Res. Dev. 11(1), 402–408 (2019)
8. Thang, T.N.: Outcome-based branch and bound algorithm for optimization over the efficient set and its application. In: Dang, Q.A., Nguyen, X.H., Le, H.B., Nguyen, V.H., Bao, V.N.Q. (eds.) Some Current Advanced Researches on Information and Computer Science in Vietnam. AISC, vol. 341, pp. 31–47. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-14633-1_3
9. Thang, T.N., Luc, D.T., Kim, N.T.B.: Solving generalized convex multiobjective programming problems by a normal direction method. Optimization 65(12), 2269–2292 (2016)
10. Thang, T.N., Solanki, V.K., Dao, T.A., Thi Ngoc Anh, N., Van Hai, P.: A monotonic optimization approach for solving strictly quasiconvex multiobjective programming problems. J. Intell. Fuzzy Syst. 38(5), 6053–6063 (2020)
11. World Bank: Real interest rates in Vietnam (2023). https://data.worldbank.org/indicator/FR.INR.RINR?end=2022&locations=VN&start=1993
Adaptive Conflict-Averse Multi-gradient Descent for Multi-objective Learning

Dinh Van Tuan, Tran Anh Tuan, Nguyen Duc Anh, Bui Khuong Duy, and Tran Ngoc Thang(B)

School of Applied Mathematics and Informatics, Hanoi University of Science and Technology, Hanoi, Vietnam
{tuan.dv211093m,duy.bk195864}@sis.hust.edu.vn, [email protected]
Abstract. One of the most significant challenges of multi-task learning is that different tasks may have conflicting gradients during joint optimization. Previous approaches have attempted to address this problem by modifying the gradients based on specific criteria. However, most of these methods do not guarantee convergence or may converge to an arbitrary Pareto-stationary point. A recent method, Conflict-Averse Gradient Descent (CAGrad), was developed to handle this issue, but it faces the challenge of accurately estimating the required Lipschitz constant; choosing an excessively small learning rate to sidestep this problem can result in slow convergence. In response to these challenges, we introduce the Adaptive Conflict-Averse Multi-Gradient Descent (AdaCAGrad) algorithm. A unique feature of our approach is that it adjusts the learning rate adaptively based on a specific condition, which sets it apart from other existing methods.

Keywords: adaptive learning rate · multi-objective optimization · multi-objective learning · multi-task learning · multi-task gradient descent
1 Introduction
Multi-task learning (MTL) is an approach in machine learning where models are trained on data from multiple correlated tasks at the same time. The objective is to enhance the performance of each task by transferring shared information between them through an inductive mechanism. Deep neural network architectures based on MTL have become state-of-the-art models in numerous fields, including computer vision [8], natural language processing [10], and multi-modal learning [18]. Multi-objective optimization (MOO) deals with optimizing a collection of potentially conflicting objectives. MOO approaches have been suggested for multi-task problems, where the goal is to find Pareto optimal solutions that represent different trade-offs between objectives [12]. Under this approach, the MTL problem is called a multi-objective learning (MOL) problem, where each objective is one of the loss functions of MTL.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 57–67, 2023. https://doi.org/10.1007/978-3-031-46573-4_6

Some studies [1,14,15]
have focused on approximating the entire Pareto front, which is the set of optimal solutions for MOO problems. MOL faces a significant challenge known as negative transfer [11], where the performance of a particular task is negatively impacted due to the learning of other tasks. Negative transfer can lead to an overall decrease in performance compared to learning each task independently. This occurs because the individual tasks can compete, and unrelated information from each task may hinder the learning of shared structures. Conflicting gradients are one cause of negative transfer. Conflicting gradients occur when the gradients of two tasks point in opposite directions, causing the update of one task to hurt the other. Prior works have attempted to tackle conflicts between tasks and gradients by employing methods such as balancing the tasks through task reweighting or manipulating the gradients. Task reweighting methods adjust the loss functions by balancing the pace of task learning [2]. Gradient manipulation methods directly alter the gradients by reducing the impact of conflicting gradients through different criteria [6,12,20]. However, these methods cannot explicitly control which specific point in the Pareto front they will converge to, except for the method proposed in [6]. While the approach in [6] guarantees convergence to a minimum point of average loss, it has a limitation in requiring an appropriate learning rate to meet a particular condition. Several adaptive learning rate methods have been proposed in recent years. The learning rate scheme known as AdaDelta, introduced in [21], utilizes Hessian computation of the loss function, which can demand substantial computational resources. WNGrad, presented in [16], employs weight normalization for adaptive learning rate adjustment. AdaBound and AMSBound, proposed in [9], estimate moments to determine adaptive learning rates. 
The method described in [7] requires knowledge of the optimal objective function value to compute the Polyak learning rate. The approach described in [19] computes the adaptive learning rate based on the Lipschitz constant of the loss function. The self-adaptive strategy introduced in [13] does not rely on Hessian computation, moments, or pre-defined Lipschitz constants. Our proposed method, AdaCAGrad, is an approach that builds upon the CAGrad algorithm introduced in [6]. The main feature of AdaCAGrad is the adaptive variation of the learning rate based on an inequality check, distinguishing it from the original CAGrad algorithm. In our AdaCAGrad approach, we adopt the method in [13] to compute an adaptive learning rate.
2 Conflict-Averse Methods for MOL

2.1 Multi-objective Learning Problems
The multi-objective learning (MOL) problem is defined as follows:

$$\min_{w} \ell(w) = \min \left\{ \left( \ell_1(w), \ldots, \ell_p(w) \right) \;\middle|\; w \in \mathbb{R}^n \right\}, \tag{MOL}$$

where $\ell_i(w) = \mathbb{E}_x[\ell_i(w, x)]$ for $i \in [p]$, and $x$ ranges over the data points in the dataset. Here we denote $[p] = \{1, 2, \ldots, p\}$. Our goal is to identify the points that cannot be
enhanced simultaneously for all objective functions. This leads to the concept of Pareto optimality.

Definition 1 (Pareto Dominance). Let $\ell(w)$ be a set of differentiable loss functions from $\mathbb{R}^n$ to $\mathbb{R}$, defined as $\ell(w) = \{\ell_i(w) : i \in [p]\}$. Consider two points $w, w' \in \mathbb{R}^n$. We say that $w$ dominates $w'$ if $\ell_i(w) \le \ell_i(w')$ for all $i \in [p]$ and $\ell(w) \ne \ell(w')$.

Definition 2 (Pareto Optimality). If a point $w \in \mathbb{R}^n$ is not dominated by any $w' \in \mathbb{R}^n$, we say $w$ is Pareto optimal. The set of Pareto optimal points is referred to as the Pareto set. The collection of loss function values $\ell(w^*)$ for all Pareto optimal points $w^*$ is called the Pareto front.

Definition 3 (Pareto Stationarity). Denote the probability simplex $\Lambda = \{\lambda : \sum_{i=1}^{p} \lambda_i = 1,\ \lambda_i \ge 0,\ i \in [p]\}$. A point $w$ is called Pareto-stationary if there exists $\lambda \in \Lambda$ such that $\sum_{i=1}^{p} \lambda_i \nabla \ell_i(w) = 0$.

As in the single-objective case, Pareto stationarity is a necessary condition for Pareto optimality.

2.2 Conflicting Gradients
One can solve the MOL problem by minimizing a weighted multi-task learning objective:

$$w^* = \arg\min_{w} \sum_{i=1}^{p} \lambda_i \ell_i(w). \tag{1}$$

The weights $\lambda_i$ are assigned either beforehand or computed dynamically based on the nature of the different tasks. The most common choice is to assign equal weights to each task, resulting in the average loss $\ell_0(w) = \frac{1}{p} \sum_{i=1}^{p} \ell_i(w)$. Despite this, minimizing the multi-task objective is difficult, and conflicting gradients are a well-known factor contributing to its difficulty.

The notation $g_i = \nabla \ell_i(w)$ denotes the gradient of the loss function $\ell_i(w)$ with respect to the parameters $w$ for task $i$. To make a small adjustment to $w$ in the direction of $-g_i$, we use the update $w \leftarrow w - \alpha g_i$, where $\alpha$ is a step size chosen to be sufficiently small. The influence of this change on the performance of a different task $j$ is quantified by:

$$\Delta \ell_j = \ell_j(w - \alpha g_i) - \ell_j(w) = -\alpha \langle g_i, g_j \rangle + o(\alpha). \tag{2}$$

The notation $\langle \cdot, \cdot \rangle$ denotes the inner product on $\mathbb{R}^n$; the second equality follows from a first-order Taylor approximation. Similarly, when the model parameters $w$ are updated in the opposite direction of the gradient of task $j$ (i.e., $-g_j$) by a sufficiently small step size $\alpha$, the impact on the performance of task $i$ is measured by $\Delta \ell_i = -\alpha \langle g_i, g_j \rangle + o(\alpha)$. It is worth noting that if $\langle g_i, g_j \rangle < 0$, then the model update for task $i$ has a negative effect on task $j$, as it leads to an increase in the loss of task $j$, and vice versa. The definition of conflicting gradients adopted in this study is taken from [20] and is given below.
Definition 4 (Conflicting Gradients). Gradients $g_i$ and $g_j$ ($i \ne j$) are said to conflict if $\cos \varphi_{ij} < 0$, where $\varphi_{ij}$ is the angle between the two gradients.

As described in [20], gradient conflicts can create significant difficulties in optimizing the multi-task objective defined in Eq. (1). When performing gradient descent using the average gradient $g_0 = \nabla \ell_0(w) = \frac{1}{p} \sum_{i=1}^{p} g_i$, the performance of individual tasks may suffer, especially when there is a considerable difference in the magnitudes of the gradients.
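As an illustrative sketch of Definition 4 and the first-order estimate in Eq. (2), the following NumPy snippet checks whether two task gradients conflict; the gradient vectors are made-up values for illustration only.

```python
import numpy as np

def grads_conflict(g_i, g_j):
    """Definition 4: g_i and g_j conflict when the cosine of their angle is negative."""
    cos_phi = (g_i @ g_j) / (np.linalg.norm(g_i) * np.linalg.norm(g_j))
    return bool(cos_phi < 0)

def loss_change(g_i, g_j, alpha):
    """First-order estimate of task j's loss change after the step w <- w - alpha * g_i."""
    return -alpha * (g_i @ g_j)

g1 = np.array([1.0, 0.5])
g2 = np.array([-0.8, 0.3])            # points mostly opposite to g1
print(grads_conflict(g1, g2))         # True: the inner product is -0.65 < 0
print(loss_change(g1, g2, 0.1))       # positive: the step on task i increases task j's loss
```

Since the inner product of the two gradients is negative, a descent step for one task raises the loss of the other, which is exactly the conflict the paper sets out to control.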
Convergence and Learning Rate Issues
Several approaches have been developed to find a better gradient direction. MGDA [12] minimizes the norm of a weighted sum of task gradients to find the weights of the multi-task loss function, while GradNorm [2] dynamically tunes gradient magnitudes to learn task weights. PCGrad [20] reduces the impact of gradient conflicts by projecting each gradient onto the normal plane of another gradient and then averaging these projected gradients for the update. These methods have achieved some success, but they can only ensure convergence to some Pareto-stationary point, without specifying the exact location of convergence (see Fig. 1). The introduction of CAGrad in [6] was motivated by these shortcomings: CAGrad guarantees convergence to a minimum of the average loss over tasks. However, to apply CAGrad, a sufficiently small learning rate must be chosen to satisfy a condition that depends on the Lipschitz constant of the gradients of the loss functions. This requirement poses two issues. First, estimating the Lipschitz constant accurately and efficiently can be challenging [17]. Second, choosing an excessively low learning rate can result in slow convergence and can even cause the iterates to become trapped at unfavorable local minima. All of the methods mentioned use either a fixed learning rate or a learning rate decay schedule.

2.4 AdaCAGrad: Adaptive Conflict-Averse Multi-gradient Descent
An essential aspect of AdaCAGrad is the reduction of the learning rate until a specific condition is met. While convergence requires the gradient of the objective function to be Lipschitz continuous, the method does not depend on a pre-defined Lipschitz constant to determine a suitable learning rate for the initial step. We incorporate the CAGrad method proposed in [6] into our algorithm for computing a gradient update vector. Based on Eq. (2), when updating the parameter vector $w$ in the direction opposite to $g$ using $w \leftarrow w - \alpha g$, we can approximate the change in the loss function for task $i$ as $\Delta \ell_i \approx -\alpha \langle g_i, g \rangle$. We aim to find an update vector $g$ that minimizes both the average loss $\ell_0(w)$ and each of the individual losses. To this end, we examine the minimum reduction rate across all losses, given by

$$\zeta(w, g) = \frac{1}{\alpha} \max_{i \in [p]} \Delta \ell_i \approx -\min_{i \in [p]} \langle g_i, g \rangle.$$
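The conflict measure $\zeta$ can be sketched directly in NumPy; the gradient vectors below are toy values chosen only to illustrate its sign behavior.

```python
import numpy as np

def zeta(task_grads, g):
    """zeta(w, g) ~ -min_i <g_i, g>: a negative value means the step
    w <- w - alpha * g decreases every task loss to first order."""
    return -min(float(gi @ g) for gi in task_grads)

g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(zeta([g1, g2], g1 + g2))   # -1.0: the averaged direction improves both tasks
print(zeta([g1, g2], g1 - g2))   # 1.0: this direction conflicts with task 2
```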
Fig. 1. A toy example from [6] to illustrate the optimization difficulties encountered by gradient descent (GD), MGDA [4], PCGrad [20], and CAGrad [6]. These methods are employed along with the Adam optimizer. Three optimization runs are performed for each method, starting from distinct initial points (denoted by •). Each optimization trajectory is shown fading from red (start) to yellow (end). The pink bar represents the Pareto front, and the black ★ represents the point in the Pareto front with equal objective weights. Different learning rates α are used in the experiments.
Fig. 2. AdaCAGrad with learning rate changing for each iteration on a toy optimization example [6] with different initial parameters winit1 , winit2 , winit3 . (a) We set α0 = 0.001, κ = 0.99, and σ = 0.5. (b) We set α0 = 0.8, κ = 0.7, and σ = 0.5.
If $\zeta(w, g) < 0$, the update with the given $\alpha$ reduces all losses. Thus, $\zeta(w, g)$ can be considered a measure of conflict among objectives. Using this measure, we aim to minimize the conflict among objectives while simultaneously achieving convergence to an optimal solution of the primary objective $\ell_0(w)$, which is done by solving the following optimization problem:

$$\max_{g \in \mathbb{R}^n} \; \min_{i \in [p]} \langle g_i, g \rangle \quad \text{s.t.} \quad \| g - g_0 \| \le c \, \| g_0 \|. \tag{3}$$
We denote $\|\cdot\|$ as the Euclidean norm on $\mathbb{R}^n$. Here, $c \in [0, 1)$ is a pre-defined hyper-parameter that regulates the convergence rate. However, directly solving the optimization problem (3) for $g$ is impractical due to its high dimension, which equals the number of parameters in $w$ and can be extremely large in deep neural networks. Therefore, the authors of [6] derived an alternative formulation involving a much lower-dimensional variable $\lambda \in \mathbb{R}^p$, given below:

$$\lambda^* = \operatorname*{argmin}_{\lambda \in \Lambda} \; \langle g_\lambda, g_0 \rangle + \sqrt{\psi} \, \| g_\lambda \|,$$

where $g_\lambda = \sum_{i=1}^{p} \lambda_i g_i$ and $\psi = c^2 \| g_0 \|^2$. Finally, the gradient update vector is computed by

$$g = g_0 + \frac{\psi^{1/2}}{\| g_{\lambda^*} \|} \, g_{\lambda^*}.$$

After computing the gradient update vector, we employ an adaptive learning rate strategy from [13]. Let $C$ be a nonempty, closed, and convex set in $\mathbb{R}^n$, and let $f : \mathbb{R}^n \to \mathbb{R}$ be differentiable on an open set containing $C$. We use the notation $P_C(w)$ to represent the projection of a vector $w$ onto the set $C$:

$$P_C(w) = \operatorname*{argmin} \{ \| u - w \| : u \in C \}.$$

We present the GDA algorithm from [13] as Algorithm 1.

Algorithm 1. GDA
1: Input: initial parameter vector $w^0$, coefficients $\kappa, \sigma \in (0, 1)$, initial learning rate $\alpha^0 \in \mathbb{R}_+$, and $t = 0$.
2: repeat
3: $t = t + 1$
4: $w^t = P_C(w^{t-1} - \alpha^{t-1} \nabla f(w^{t-1}))$
5: if $f(w^t) \le f(w^{t-1}) - \sigma \langle \nabla f(w^{t-1}), w^{t-1} - w^t \rangle$ then
6: $\alpha^t = \alpha^{t-1}$
7: else
8: $\alpha^t = \kappa \alpha^{t-1}$
9: end if
10: until $w^t = w^{t-1}$
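A minimal sketch of Algorithm 1 in NumPy, taking $C = \mathbb{R}^n$ so that the projection is the identity; the quadratic objective, the starting point, and the deliberately oversized initial learning rate are made up for illustration. The sufficient-decrease test shrinks the rate until plain gradient descent converges.

```python
import numpy as np

def gda(f, grad_f, w0, alpha0, kappa=0.5, sigma=0.5, max_iter=200, tol=1e-10):
    """Sketch of Algorithm 1 (GDA) with C = R^n, so P_C is the identity."""
    w, alpha = np.asarray(w0, dtype=float), alpha0
    for _ in range(max_iter):
        g = grad_f(w)
        w_new = w - alpha * g                      # step 4 with P_C = identity
        # Steps 5-9: keep alpha on sufficient decrease, otherwise shrink by kappa.
        if f(w_new) > f(w) - sigma * (g @ (w - w_new)):
            alpha *= kappa
        if np.linalg.norm(w_new - w) < tol:        # step 10 stopping test
            return w_new, alpha
        w = w_new
    return w, alpha

f = lambda w: 0.5 * (w @ w)      # strongly convex quadratic, minimizer at the origin
grad_f = lambda w: w
w_star, alpha_final = gda(f, grad_f, w0=[3.0, -2.0], alpha0=10.0)
print(np.linalg.norm(w_star) < 1e-6, alpha_final)
```

Here the gradient has Lipschitz constant $H = 1$: the initial rate $\alpha^0 = 10$ fails the test a few times and is halved down to $0.625 < 2/H$, after which the iterates contract geometrically to the origin.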
We assume that each loss function $\ell_i : \mathbb{R}^n \to \mathbb{R}$ is differentiable and has a Lipschitz continuous gradient with constant $H \in (0, \infty)$, $i \in [p]$; that is, for any $w_1, w_2 \in \mathbb{R}^n$, it holds that $\| \nabla \ell_i(w_1) - \nabla \ell_i(w_2) \| \le H \| w_1 - w_2 \|$. We then have:

$$\sum_{i=1}^{p} \lambda_i^t \ell_i(w^t) \le \sum_{i=1}^{p} \lambda_i^t \ell_i(w^{t-1}) - \sigma \langle g, w^{t-1} - w^t \rangle - \left( \frac{1 - \sigma}{\alpha^{t-1}} - \frac{H}{2} \right) \| w^t - w^{t-1} \|^2, \tag{4}$$
where $\sigma \in (0, 1)$. We assert that the sequence $\alpha^{t-1}$ remains bounded away from zero, meaning that the learning rate undergoes only a finite number of changes. To prove this, assume the contrary, that $\alpha^{t-1} \to 0$. Then inequality (4) implies the existence of a specific $t_0 \in \mathbb{N}$ such that

$$\sum_{i=1}^{p} \lambda_i^t \ell_i(w^t) \le \sum_{i=1}^{p} \lambda_i^t \ell_i(w^{t-1}) - \sigma \langle g, w^{t-1} - w^t \rangle, \quad \forall t \ge t_0.$$

By the construction of $\alpha^{t-1}$, this inequality implies that $\alpha^{t-1} = \alpha^{t_0 - 1}$ holds for all $t \ge t_0$, which contradicts our assumption. Therefore, there exists $t_1 \in \mathbb{N}$ such that for all $t \ge t_1$ we have $\alpha^{t-1} = \alpha^{t_1 - 1}$, and the inequality becomes:

$$\sum_{i=1}^{p} \lambda_i^t \ell_i(w^t) \le \sum_{i=1}^{p} \lambda_i^t \ell_i(w^{t-1}) - \sigma \langle g, w^{t-1} - w^t \rangle.$$

Since $\langle g, w^{t-1} - w^t \rangle \ge 0$, we conclude that $\sum_{i=1}^{p} \lambda_i^t \ell_i(w^t) \le \sum_{i=1}^{p} \lambda_i^t \ell_i(w^{t-1})$. We describe our AdaCAGrad in Algorithm 2.
Algorithm 2. AdaCAGrad for MOL
1: Input: initial parameter vector $w^0$, differentiable loss functions $\{\ell_i\}_{i=1}^{p}$, coefficients $\kappa, \sigma \in (0, 1)$, a hyper-parameter $c \in [0, 1)$, and initial learning rate $\alpha^0 \in \mathbb{R}_+$.
2: repeat
3: At step $t$, define $g_0 = \frac{1}{p} \sum_{i=1}^{p} \nabla \ell_i(w^{t-1})$ and $\psi = c^2 \| g_0 \|^2$.
4: Solve $\lambda^t = \operatorname*{argmin}_{\lambda \in \Lambda} F(\lambda) = \langle g_\lambda, g_0 \rangle + \sqrt{\psi} \, \| g_\lambda \|$, where $g_\lambda = \sum_{i=1}^{p} \lambda_i \nabla \ell_i(w^{t-1})$.
5: Set $g = g_0 + \frac{\psi^{1/2}}{\| g_{\lambda^t} \|} g_{\lambda^t}$ and update $w^t = w^{t-1} - \alpha^{t-1} g$.
6: if $\sum_{i=1}^{p} \lambda_i^t \ell_i(w^t) \le \sum_{i=1}^{p} \lambda_i^t \ell_i(w^{t-1}) - \sigma \alpha^{t-1} \| g \|^2$ then
7: $\alpha^t = \alpha^{t-1}$
8: else
9: $\alpha^t = \kappa \alpha^{t-1}$
10: end if
11: until convergence
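The following NumPy sketch assembles Algorithm 2 for $p = 2$ tasks on two made-up conflicting quadratics. The inner simplex problem (step 4) is solved here by a simple grid search over $\lambda \in [0, 1]$ rather than a proper solver, so this illustrates the mechanics only and is not the authors' implementation.

```python
import numpy as np

def cagrad_direction(G, c=0.5):
    """Steps 3-5 for p = 2: grid-search lambda on [0, 1] and form the update vector."""
    g0 = G.mean(axis=0)
    psi = c**2 * (g0 @ g0)
    def obj(lam):
        g_lam = lam * G[0] + (1 - lam) * G[1]
        return g_lam @ g0 + np.sqrt(psi) * np.linalg.norm(g_lam)
    lam = min(np.linspace(0.0, 1.0, 201), key=obj)
    g_lam = lam * G[0] + (1 - lam) * G[1]
    g = g0 + np.sqrt(psi) / (np.linalg.norm(g_lam) + 1e-12) * g_lam
    return g, np.array([lam, 1.0 - lam]), g0

def adacagrad_step(losses, grads, w, alpha, c=0.5, kappa=0.8, sigma=0.5):
    """One AdaCAGrad iteration: CAGrad direction plus the adaptive step-size test."""
    G = np.stack([grad(w) for grad in grads])
    g, lam, _ = cagrad_direction(G, c)
    w_new = w - alpha * g
    weighted = lambda v: sum(l * f(v) for l, f in zip(lam, losses))
    # Steps 6-10: keep alpha on sufficient decrease, otherwise shrink it by kappa.
    if weighted(w_new) > weighted(w) - sigma * alpha * (g @ g):
        alpha *= kappa
    return w_new, alpha

# Two conflicting quadratics with minima at (1, 0) and (-1, 0);
# the average loss is minimized at the origin.
losses = [lambda w: (w[0] - 1)**2 + w[1]**2, lambda w: (w[0] + 1)**2 + w[1]**2]
grads = [lambda w: np.array([2 * (w[0] - 1), 2 * w[1]]),
         lambda w: np.array([2 * (w[0] + 1), 2 * w[1]])]
w, alpha = np.array([0.5, 0.5]), 0.2
for _ in range(100):
    w, alpha = adacagrad_step(losses, grads, w, alpha)
print(np.round(w, 3), alpha)   # the iterate moves toward the Pareto front; alpha never grows
```

Note that, by construction, the returned direction always satisfies the constraint of problem (3), $\|g - g_0\| \le c \|g_0\|$, whatever $\lambda$ the grid picks.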
For $\alpha$, we choose a value from the interval $(0, 2/H)$ to guarantee convergence. Alternatively, if a fixed $\alpha$ is preferred, we set it to $\alpha = 1/H$. As for $\kappa$, its purpose is to decrease the value of $\alpha$: when $\alpha \ge 2/H$, we utilize $\kappa$ to gradually reduce its value. Typically, we select $\alpha$ from the set $\{5/H, 10/H, \ldots\}$ and then pair it with a specific value of $\kappa$, such as $\kappa = 0.8$, $0.9$, or $0.95$. When setting $\sigma$, the common practice is to use $\sigma = 1/2$; for improved optimization, consider choosing a value of $\sigma$ greater than $1/2$.
3 Experiments

We validate our approach through experiments on a toy optimization example and compare it empirically with state-of-the-art MTL algorithms using a widely used real-world MTL benchmark for image classification. The source code is available at: https://github.com/TuanVDinh/AdaCAGrad.

3.1 Toy Optimization Example
We use a toy optimization example from [6]. Let $w = (w_1, w_2) \in \mathbb{R}^2$, and consider the following loss functions:

$$\ell_1(w) = \varphi_1(w) \mu_1(w) + \varphi_2(w) \eta_1(w), \qquad \ell_2(w) = \varphi_1(w) \mu_2(w) + \varphi_2(w) \eta_2(w),$$

where we define

$$\mu_1(w) = \log\left( \max\left( |0.5(-w_1 - 7) - \tanh(-w_2)|, 5 \times 10^{-6} \right) \right) + 6,$$
$$\mu_2(w) = \log\left( \max\left( |0.5(-w_1 + 3) - \tanh(-w_2) + 2|, 5 \times 10^{-6} \right) \right) + 6,$$
$$\eta_1(w) = \left( (-w_1 + 7)^2 + 0.1 (-w_2 - 8)^2 \right) / 10 - 20,$$
$$\eta_2(w) = \left( (-w_1 - 7)^2 + 0.1 (-w_2 - 8)^2 \right) / 10 - 20,$$
$$\varphi_1(w) = \max(\tanh(0.5 w_2), 0), \qquad \varphi_2(w) = \max(\tanh(-0.5 w_2), 0).$$

We initialize the algorithms with three points: $w_{\mathrm{init1}} = (-8.5, 7.5)$, $w_{\mathrm{init2}} = (-8.5, -5.0)$, and $w_{\mathrm{init3}} = (9.0, 9.0)$. In the second row of Fig. 1, all plots are generated using a small learning rate of $\alpha = 0.001$. These plots show that when using Adam, GD gets stuck on two of the initial points (see Fig. 1(d)), while MGDA and PCGrad stop optimizing as soon as they reach the Pareto set (see Fig. 1(e) and Fig. 1(f)). CAGrad and AdaCAGrad still converge to a minimum of the average loss (see Fig. 1(g) and Fig. 1(h)). However, this small learning rate can result in slow convergence. We then try a larger learning rate of $\alpha = 0.8$ and plot the optimization trajectories in the third row. Only GD, CAGrad, and AdaCAGrad converge to an optimal point of the average loss (see Fig. 1(i), Fig. 1(l), and Fig. 1(m)). However, as they approach the Pareto front, GD and CAGrad exhibit an oscillation phenomenon that can slow down their convergence. In contrast, AdaCAGrad dampens these oscillations, resulting in a smoother optimization trajectory. For AdaCAGrad, the learning rate at each iteration is plotted in Fig. 2.
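The toy objective above can be transcribed directly into NumPy; the constants follow the formulas as printed (with $5 \times 10^{-6}$ as the log floor), so treat this as an approximate reconstruction rather than the authors' exact code.

```python
import numpy as np

def toy_losses(w):
    """The two toy losses (l1, l2) from the example of [6], as written above."""
    w1, w2 = w
    mu1 = np.log(max(abs(0.5 * (-w1 - 7) - np.tanh(-w2)), 5e-6)) + 6
    mu2 = np.log(max(abs(0.5 * (-w1 + 3) - np.tanh(-w2) + 2), 5e-6)) + 6
    eta1 = ((-w1 + 7) ** 2 + 0.1 * (-w2 - 8) ** 2) / 10 - 20
    eta2 = ((-w1 - 7) ** 2 + 0.1 * (-w2 - 8) ** 2) / 10 - 20
    phi1 = max(np.tanh(0.5 * w2), 0.0)    # active in the upper half-plane
    phi2 = max(np.tanh(-0.5 * w2), 0.0)   # active in the lower half-plane
    return phi1 * mu1 + phi2 * eta1, phi1 * mu2 + phi2 * eta2

# Evaluate at the three initial points used in the experiments.
for w_init in [(-8.5, 7.5), (-8.5, -5.0), (9.0, 9.0)]:
    print(w_init, [round(float(l), 3) for l in toy_losses(w_init)])
```

Because $\varphi_1$ and $\varphi_2$ are never positive simultaneously, each half-plane sees only one of the two loss branches, which is what makes the landscape awkward for plain gradient descent.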
3.2 Image Classification
We conduct experiments on the Multi-Fashion+MNIST multi-task classification benchmark [5]. The benchmark contains images created by overlaying an image from the FashionMNIST dataset on top of another image from the MNIST dataset. We solve a two-objective MTL problem: classifying the top-left image (task 1) and the bottom-right image (task 2). To achieve this, we use the Multi-LeNet architecture [12] and train the model with the Adam optimizer, using a learning rate of 0.05, for 50 epochs with a batch size of 256. For both CAGrad and AdaCAGrad, we set the hyper-parameter $c = 0.5$. We compare our approach with several baselines, including the independent method, which trains individual tasks separately, as well as MGDA [12], PCGrad [20], GradDrop [3], and CAGrad [6]. The results are presented in Table 1 and Fig. 3. Our experiments show that AdaCAGrad outperforms all baselines in terms of both test accuracy and average training loss.

Table 1. Results on the Multi-Fashion+MNIST test dataset.

Method            | Task 1: Top-Left      | Task 2: Bottom-Right
                  | Acc. ↑   | Loss ↓     | Acc. ↑   | Loss ↓
Independent       | 33.98    | 1.85       | 55.58    | 1.19
MGDA [12]         | 32.26    | 1.88       | 49.56    | 1.32
PCGrad [20]       | 32.88    | 1.99       | 53.09    | 1.25
GradDrop [3]      | 34.39    | 1.84       | 50.54    | 1.27
CAGrad [6]        | 32.62    | 1.88       | 48.52    | 1.33
AdaCAGrad (ours)  | 62.75    | 1.01       | 66.52    | 0.91
Fig. 3. Results for training Multi-LeNet on Multi Fashion+MNIST dataset.
4 Conclusion
Previous gradient modification approaches lack convergence guarantees or may converge to any Pareto-stationary point. CAGrad tackles this issue, but accurately estimating the necessary Lipschitz constant is challenging. Using an excessively small learning rate to mitigate this problem can lead to slow convergence.
AdaCAGrad is introduced as a solution to these challenges, with the unique feature of adaptively adjusting the learning rate based on a specific condition. This sets AdaCAGrad apart from other existing methods and could improve the efficiency of MOL. Acknowledgments. This research is funded by Hanoi University of Science and Technology (HUST) under project number T2022-PC-061.
References
1. Anh, T.T., Long, H.P., Dung, L.D., Thang, T.N.: A framework for controllable Pareto front learning with completed scalarization functions and its applications. arXiv preprint arXiv:2302.12487 (2023)
2. Chen, Z., Badrinarayanan, V., Lee, C.Y., Rabinovich, A.: GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In: International Conference on Machine Learning, pp. 794–803. PMLR (2018)
3. Chen, Z., et al.: Just pick a sign: optimizing deep multitask models with gradient sign dropout. In: Advances in Neural Information Processing Systems, vol. 33, pp. 2039–2050 (2020)
4. Désidéri, J.A.: Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. C.R. Math. 350(5–6), 313–318 (2012)
5. Lin, X., Zhen, H.L., Li, Z., Zhang, Q.F., Kwong, S.: Pareto multi-task learning. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
6. Liu, B., Liu, X., Jin, X., Stone, P., Liu, Q.: Conflict-averse gradient descent for multi-task learning. In: Advances in Neural Information Processing Systems, vol. 34, pp. 18878–18890 (2021)
7. Loizou, N., Vaswani, S., Laradji, I.H., Lacoste-Julien, S.: Stochastic Polyak step-size for SGD: an adaptive learning rate for fast convergence. In: International Conference on Artificial Intelligence and Statistics, pp. 1306–1314. PMLR (2021)
8. Long, H.P., Dung, L.D., Anh, T.T., Thang, T.N.: Improving Pareto front learning via multi-sample hypernetworks. In: The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23) (2023)
9. Luo, L., Xiong, Y., Liu, Y., Sun, X.: Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843 (2019)
10. Majumder, N., Poria, S., Peng, H., Chhaya, N., Cambria, E., Gelbukh, A.: Sentiment and sarcasm classification with multitask learning. IEEE Intell. Syst. 34(3), 38–43 (2019)
11. Ruder, S.: An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017)
12. Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization. In: Advances in Neural Information Processing Systems, vol. 31 (2018)
13. Thang, T.N., Hai, T.N.: Self-adaptive algorithms for quasiconvex programming and applications to machine learning. arXiv preprint arXiv:2212.06379 (2022)
14. Thang, T.N., Luc, D.T., Kim, N.T.B.: Solving generalized convex multiobjective programming problems by a normal direction method. Optimization 65(12), 2269–2292 (2016)
15. Vuong, N.D., Thang, T.N.: Optimizing over Pareto set of semistrictly quasiconcave vector maximization and application to stochastic portfolio selection. J. Industr. Manag. Optim. 19(3), 1999–2019 (2023)
16. Wu, X., Ward, R., Bottou, L.: WNGrad: learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865 (2018)
17. Xue, A., Lindemann, L., Robey, A., Hassani, H., Pappas, G.J., Alur, R.: Chordal sparsity for Lipschitz constant estimation of deep neural networks. In: 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 3389–3396. IEEE (2022)
18. Yang, L., Ng, T.L.J., Smyth, B., Dong, R.: HTML: hierarchical transformer-based multi-task learning for volatility prediction. In: Proceedings of the Web Conference 2020, pp. 441–451 (2020)
19. Yedida, R., Saha, S., Prashanth, T.: LipschitzLR: using theoretically computed adaptive learning rates for fast convergence. Appl. Intell. 51, 1460–1478 (2021)
20. Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., Finn, C.: Gradient surgery for multi-task learning. In: Advances in Neural Information Processing Systems, vol. 33, pp. 5824–5836 (2020)
21. Zeiler, M.D.: ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)
Multicriteria Portfolio Selection with Intuitionistic Fuzzy Goals as a Pseudoconvex Vector Optimization

Vuong D. Nguyen1, Nguyen Kim Duyen2, Nguyen Minh Hai3, and Bui Khuong Duy4(B)

1 Department of Computer Science, University of Houston, Houston, TX, USA. [email protected]
2 Faculty of Banking and Finance, Foreign Trade University, Hanoi, Vietnam. [email protected]
3 Hanoi University of Science, Vietnam National University, Hanoi, Vietnam. nguyenminhhai [email protected]
4 School of Applied Mathematics and Informatics, Hanoi University of Science and Technology, Hanoi, Vietnam. [email protected]
Abstract. Portfolio selection involves simultaneously optimizing financial goals such as risk, return, and the Sharpe ratio, a problem of considerable importance in economics. However, little has been studied regarding the nonconvexity of the objectives. This paper proposes a novel generalized approach to the challenging portfolio selection problem in an intuitionistic fuzzy environment, where the objectives are soft pseudoconvex functions and the constraint set is convex. Specifically, we utilize intuitionistic fuzzy theory and flexible optimization to transform the fuzzy pseudoconvex multicriteria vector optimization problem into a pseudoconvex programming problem that can be solved by recent gradient descent methods. We demonstrate that our method can be applied broadly, without the special forms of membership and non-membership functions required in previous works. Computational experiments on real-world scenarios are reported to show the effectiveness of our method.

Keywords: Fuzzy Portfolio Selection · Multicriteria Pseudoconvex Programming · Flexible Optimization · Sharpe Ratio
1 Introduction
Portfolio selection is crucial for managing investment risks and optimizing returns. It involves allocating investment assets to achieve specific investment goals, such as maximizing returns or minimizing risks. This requires decisions on the allocation of weights across different investments, including stocks, bonds, cash, and other assets. Portfolio optimization tools have received attention and development to automate the asset allocation process.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 68–79, 2023. https://doi.org/10.1007/978-3-031-46573-4_7
Markowitz's portfolio optimization model, alternatively named the Mean-Variance model, is a highly prevalent portfolio optimization model. Investors want to maximize the expected total return while keeping the portfolio's volatility to a minimum or below a certain threshold, leading to a bi-criteria convex optimization problem. Later researchers widely extended Markowitz's model by taking into account the risk-aversion index [17], value-at-risk [8], or skewness [13]. In this paper, we further consider the Sharpe ratio because, as shown in [21], it possesses the properties required by the main problem proposed here. The Sharpe ratio is an important factor that measures the performance of a portfolio via the ratio of expected excess return to standard deviation. Traditional portfolio selection models based on crisp (i.e., deterministic) optimization techniques have several limitations, such as the inability to capture uncertainties in the decision-making process of investors. Later works approached Problem (IFMOP) by interactive programming, which requires frequent interaction with the decision makers and hard constraints on the activation of the fuzzy objectives. Furthermore, little work has been proposed to solve Problem (IFMOP). Thus, in this study, we propose to solve a more practical portfolio selection problem in an intuitionistic fuzzy environment with soft goals (IFPS). In previous works, researchers addressed the general intuitionistic fuzzy portfolio selection problem using two main approaches. The first approach involved formulating the problem using well-known structures such as quadratic programming [5,24] and a multi-objective nonlinear model for fuzzy portfolio selection [23]. The second approach utilized heuristic methods, including a hybrid intelligent algorithm [6,7] and a compromise-based GA [9], among others. In this research, however, we extend Markowitz's model by considering the Sharpe ratio.
Specifically, we first propose the intuitionistic fuzzy multicriteria portfolio selection problem. We then present the construction of our method for solving the equivalent Problem (IFMOP) in a generalized framework based on the favorable property of the Sharpe ratio function. We utilize flexible optimization to transform the original fuzzy multicriteria optimization problem into a pseudoconvex programming problem, and then exploit the algorithm in [15] to solve the equivalent problem. Unlike previous works, our method not only requires the least interaction with the decision makers but also works effectively without any hard assumption on activating the soft goals. This makes our method more flexible and robust, enabling it to handle imprecise and uncertain data, multiple conflicting objectives, and various constraints. We organize the remainder of the article as follows. In Sect. 2, we introduce the main portfolio selection problem. Section 3 presents the preliminaries and our methodology for solving the equivalent Problem (IFMOP). Section 4 demonstrates the effectiveness of our method via experiments on real-world portfolio selection problems. Section 5 presents our conclusion.
2 Multicriteria Portfolio Selection Problem
Consider a portfolio vector $x = (x_1, \ldots, x_n)$, where $x_k$, $k = 1, \ldots, n$, is the proportion invested in the $k$-th asset. In practical investment scenarios, there is a constraint set on $x$, which we denote as $X = \{x \in \mathbb{R}^n_+ \mid x_1 + \cdots + x_n = 1\}$.
Let there be $n$ assets with random returns represented by a random vector $R = (R_1, R_2, \ldots, R_n)^T$, and let the expected returns of those $n$ assets be denoted by the vector $L = (L_1, L_2, \ldots, L_n)^T$. Thus, the total random return of the $n$ assets is represented by $R^T x = \sum_{k=1}^{n} R_k x_k$, which is a linear stochastic function. In reality, investors usually consider the expected return of the $n$ asset classes as follows:

$$E(x) = \mathbb{E}(R^T x) = \sum_{k=1}^{n} L_k x_k. \tag{1}$$
Let $Q = (\sigma_{ik})_{n \times n}$ be the covariance matrix of the random vector $R$. Then the variance of returns, i.e., the risk of the portfolio, can be written as

$$V(x) = \operatorname{Var}(R^T x) = \sum_{i=1}^{n} \sum_{k=1}^{n} \sigma_{ik} x_i x_k, \tag{2}$$

where $\sigma_{ii} = \sigma_i^2$ represents the variance of $R_i$, and $\sigma_{ik}$ denotes the covariance between $R_i$ and $R_k$, $i, k = 1, 2, \ldots, n$.
Beyond return and risk, investors demand to understand the return of an investment relative to its risk, which is represented by the Sharpe ratio

$$Sr(x) = \frac{E(x) - p_{rf}}{\sqrt{V(x)}}, \tag{3}$$

where $Sr(x)$ denotes the Sharpe ratio and $p_{rf}$ denotes the return rate of a zero-risk portfolio. The excess return over this benchmark is divided by $\sqrt{V(x)}$, the standard deviation of the portfolio return. According to Markowitz's model, the investor wants to optimize two goals:

$$\text{Max } E(x), \quad \text{Min } V(x) \quad \text{s.t. } x \in X, \tag{MV}$$

where the objective $E(x)$ is a linear function and the objective $V(x)$ is a convex function. In this research, writing $E^*(x) = -E(x)$ and $Sr^*(x) = -Sr(x)$, we propose the following tri-criteria vector minimization problem:

$$\text{Min } (E^*(x), V(x), Sr^*(x))^T \quad \text{s.t. } x \in X, \tag{MVS}$$

where $X$ is a non-empty convex set. The property of $Sr^*(x)$ that plays a key role in the construction of our method will be presented in Sect. 3.
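To make the three objectives concrete, the snippet below evaluates $E(x)$, $V(x)$, and $Sr(x)$ for a small portfolio; the return vector L, covariance matrix Q, and zero-risk rate p_rf are hypothetical numbers, not data from the paper.

```python
import numpy as np

# Hypothetical three-asset data for illustration only.
L = np.array([0.08, 0.12, 0.05])      # expected returns
Q = np.array([[0.10, 0.02, 0.01],     # covariance matrix (symmetric PSD)
              [0.02, 0.20, 0.03],
              [0.01, 0.03, 0.05]])
p_rf = 0.02                           # zero-risk return rate

def objectives(x):
    E = L @ x                         # expected return, Eq. (1)
    V = x @ Q @ x                     # portfolio risk, Eq. (2)
    Sr = (E - p_rf) / np.sqrt(V)      # Sharpe ratio, Eq. (3)
    return E, V, Sr

x = np.array([0.3, 0.2, 0.5])         # feasible: nonnegative and summing to 1
E, V, Sr = objectives(x)
print(round(E, 4), round(V, 4), round(Sr, 4))
# The (MVS) objective vector at x is (-E, V, -Sr).
```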
3 Multicriteria Portfolio Selection with Intuitionistic Fuzzy Goals

3.1 Intuitionistic Fuzzy Goals

To amplify the uncertainty in the fuzzy portfolio selection problem, intuitionistic fuzzy goals allow decision makers to express ambiguity in their goals. Given a universal set $X$, an intuitionistic fuzzy set $\tilde{A}$ of $X$ is a generalization of a fuzzy set that allows a more nuanced and flexible representation of uncertainty. Consider the following two mappings: the membership mapping $\mu_{\tilde{A}} : X \to [0, 1]$ and the non-membership mapping $\nu_{\tilde{A}} : X \to [0, 1]$. We present the definition of the intuitionistic fuzzy set $\tilde{A}$ as follows.

Definition 1 [1]. Provided that $0 \le \mu_{\tilde{A}}(x) + \nu_{\tilde{A}}(x) \le 1$ for all $x \in X$,

$$\tilde{A} = \{ \langle x, \mu_{\tilde{A}}(x), \nu_{\tilde{A}}(x) \rangle \mid x \in X \}. \tag{4}$$
The numbers $\mu_{\tilde{A}}(x)$ and $\nu_{\tilde{A}}(x)$ represent the membership and non-membership of $x$ in $\tilde{A}$, respectively. Now, consider the general multi-objective vector optimization problem with the following formulation:

$$\text{Min } F(x) = (F_1(x), \ldots, F_k(x))^T \quad \text{s.t. } x \in X, \tag{MOP}$$

where $X$ is a nonempty compact convex set. General fuzzy optimization refers to the formulation of optimization problems using fuzzy sets, where the constraints and objectives are flexible, approximate, or uncertain. We consider the main problem in fuzzy form as

$$\widetilde{\text{Min}} \; F(x) = (F_1(x), \ldots, F_k(x))^T \quad \text{s.t. } x \in X, \tag{IFMOP}$$

where $\widetilde{\text{Min}}$ represents "to minimize as well as possible based on the demand of the decision makers". The approach to Problem (MOP) with fuzzy objectives is widely applied by decision makers in many real-world problems. In this study, we propose models based on an intuitionistic fuzzy set. Therefore, we use membership and non-membership functions to associate the input data; these functions are vital in intuitionistic fuzzy optimization. Consider a monotonically decreasing function $m_i(\cdot)$; the membership function $\mu_i$ with respect to $F_i$ has the form

$$\mu_i(F_i(x)) = \begin{cases} 0 & \text{if } F_i(x) \ge y_i^0, \\ m_i(F_i(x)) & \text{if } y_i^0 \ge F_i(x) \ge y_i^1, \\ 1 & \text{if } F_i(x) \le y_i^1, \end{cases} \tag{5}$$

where $y_i^0$ is the maximum value of $F_i$ if $\mu_i(F_i(x)) = 0$ and $y_i^1$ is the minimum value of $F_i$ if $\mu_i(F_i(x)) = 1$. On the other hand, consider a monotonically increasing function $n_i(\cdot)$; the non-membership function $\nu_i$ with respect to $F_i$ has the form

$$\nu_i(F_i(x)) = \begin{cases} 1 & \text{if } F_i(x) \ge y_i^0, \\ n_i(F_i(x)) & \text{if } y_i^0 \ge F_i(x) \ge y_i^1, \\ 0 & \text{if } F_i(x) \le y_i^1, \end{cases} \tag{6}$$
where $y_i^0$ is the minimum value of $F_i$ for which $\nu_i(F_i(x)) = 1$ and $y_i^1$ is the maximum value of $F_i$ for which $\nu_i(F_i(x)) = 0$. The mappings $\mu_i(F_i(x))$ and $\nu_i(F_i(x))$ are also called intuitionistic fuzzy mappings; they map $F_i(x)$ to an intuitionistic fuzzy number in the interval $[0, 1]$ satisfying $0 \le \mu_i(F_i(x)) + \nu_i(F_i(x)) \le 1$, in accordance with Definition 1. Several conditions on the characteristics of general fuzzy mappings have been proposed in [12]. Here, we state the following proposition about the key properties of intuitionistic fuzzy mappings, which will be utilized in transforming the problem model in the following section.

Proposition 1. Each $\mu_i$, $i = 1, \ldots, k$, is a monotonically nonincreasing function, and each $\nu_i$, $i = 1, \ldots, k$, is a monotonically nondecreasing function.

The proof of Proposition 1 follows easily from the definitions of the fuzzy mappings. Using Proposition 1, we will present our novel method to transform the intuitionistic fuzzy multicriteria decision problem into a deterministic problem that can be solved effectively by convex programming algorithms. The same scheme can be applied to picture fuzzy sets (see [19,20]) to build multicriteria portfolio selection with picture fuzzy goals; we will develop this extension in future work.

3.2 Transformation to Deterministic Model
Recall Problem (MVS). As mentioned above, this is not a multi-objective convex programming problem; it is a challenging problem that has received little attention. In this paper, we propose to solve Problem (MVS) by establishing a favorable property of the considered objective functions.

Definition 2 (Pseudoconvex function, see [10]). Given a non-empty convex set $X$ and a function $f : \mathbb{R}^n \to \mathbb{R}$ differentiable on $X$, we say that $f$ is pseudoconvex on $X$ if for all $x_1, x_2 \in X$ it holds that:

$$f(x_2) < f(x_1) \Rightarrow \langle \nabla f(x_1), x_2 - x_1 \rangle < 0. \tag{7}$$

If $f$ is a pseudoconvex function, then $-f$ is called a pseudoconcave function. We now verify the following proposition about the pseudoconcavity of the Sharpe ratio.

Proposition 2. $Sr(x)$ is a pseudoconcave function, and $Sr^*(x)$ is a pseudoconvex function.

Proof. Note that $E(x) - p_{rf}$ is positive and linear, as $p_{rf}$ is a constant, while $V(x)$ is convex. On the other hand, given two differentiable functions $\varphi_1$ and $\varphi_2$ on a set $X$, if $\varphi_1$ is positive and concave and $\varphi_2$ is positive and convex on $X$, then the fractional function $\varphi_1 / \varphi_2$ is pseudoconcave on $X$ (see [2]). Therefore, the function $Sr(x)$ is pseudoconcave, and $Sr^*(x) = -Sr(x)$ is pseudoconvex.
Multicriteria Portfolio Selection with Intuitionistic Fuzzy Goals
From Proposition 2, we have shown that Problem (MVS) is a multiobjective pseudoconvex programming problem. We consider this problem in an intuitionistic fuzzy environment, where it is formulated as

Min (E∗ (x), V(x), Sr∗ (x))T
s.t. x ∈ X.  (IFMVS)
Associating intuitionistic fuzzy goals with the input data requires membership and non-membership functions. Therefore, we need a way to define the functions μi and νi consistent with Definition 1 and Proposition 1. Prior studies have assumed that decision-makers engage in frequent interaction, i.e., the decision-makers can specify yi1 and yi0 within yimin and yimax [14]. Furthermore, previously proposed methods apply only to certain types of fuzzy mappings, such as the popular linear monotonically decreasing and increasing mappings given as

μLi (Fi (x)) = (yi0 − Fi (x)) / (yi0 − yi1)   and   νiL (Fi (x)) = (Fi (x) − yi1 ) / (yi0 − yi1 ).  (8)
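The linear mappings in (8) can be sketched as follows, with the values clipped to [0, 1] outside the tolerance interval; here `y1` and `y0` stand for the bounds yi1 and yi0 of one criterion (an illustrative sketch, not the authors' implementation):

```python
def linear_membership(f, y1, y0):
    """Linear nonincreasing membership mu_i: 1 for f <= y1, 0 for f >= y0."""
    if f <= y1:
        return 1.0
    if f >= y0:
        return 0.0
    return (y0 - f) / (y0 - y1)

def linear_nonmembership(f, y1, y0):
    """Linear nondecreasing non-membership nu_i: 0 for f <= y1, 1 for f >= y0."""
    if f <= y1:
        return 0.0
    if f >= y0:
        return 1.0
    return (f - y1) / (y0 - y1)
```

For y1 < f < y0 the two linear values sum to exactly 1, so the intuitionistic condition 0 ≤ μ + ν ≤ 1 of Definition 1 holds with equality in the interior.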
We instead propose a method that can be effectively applied with any arbitrary membership function. More importantly, our framework adapts to a more realistic environment where frequent interaction with the decision-makers is not feasible. Specifically, we propose to find appropriate values for yi0 and yi1 by considering the following problems:

min x∈X Fi (x), i = 1, . . . , k,  (Pim)
max x∈X Fi (x), i = 1, . . . , k.  (PiM)

Denote by yimin the optimal value of Problem (Pim). Consider the following remark.

Remark 1. If x̂ is a local optimal solution to a pseudoconvex programming problem, then x̂ is also a global optimal solution (see [11]).

Recall that each Fi (x), i = 1, . . . , k, is a pseudoconvex function; therefore, by Remark 1, yimin can easily be found using convex programming tools. Problem (PiM) is non-convex, so finding its optimal solution is challenging. We instead propose to find an upper bound yimax of this problem (see [3] for details and an illustration). Then, for i = 1, . . . , k, yi0 and yi1 are calculated by

yi1 = yimin , yi0 = yimax .  (9)
Now, we can rewrite the deterministic form of Problem (IFMVS) as

Max (μ1 (E∗ (x)), μ2 (V(x)), μ3 (Sr∗ (x)))T
Min (ν1 (E∗ (x)), ν2 (V(x)), ν3 (Sr∗ (x)))T
s.t. x ∈ X.  (BFMVS)
V. D. Nguyen et al.
where μi , νi are the mappings of the intuitionistic fuzzy set corresponding to E∗ (x), V(x), Sr∗ (x). Setting μ̄i (·) = 1 − μi (·), Problem (BFMVS) can be rewritten as

Min (μ̄1 (E∗ (x)), μ̄2 (V(x)), μ̄3 (Sr∗ (x)), ν1 (E∗ (x)), ν2 (V(x)), ν3 (Sr∗ (x)))T
s.t. x ∈ X.  (MFMVS)

This problem is a multiobjective pseudoconvex programming problem, which has been studied in several works (see [16,21]). However, instead of directly solving this problem, which requires significant computational cost, inspired by [4] we propose to convert Problem (MFMVS) into the following problem:

min max{μ̄1 (E∗ (x)), μ̄2 (V(x)), μ̄3 (Sr∗ (x)), ν1 (E∗ (x)), ν2 (V(x)), ν3 (Sr∗ (x))}
s.t. x ∈ X.  (TFMVS)

This single-objective problem has received great attention from many authors (see [18]). However, the pseudoconvexity of the objectives makes the problem challenging, and it has not been fully resolved in previous works. In this paper, we present the properties of this problem as follows.

Proposition 3. (TFMVS) is a pseudoconvex programming problem.

Proof. Recall that E and V are linear and convex, respectively, and Sr∗ is pseudoconvex by Proposition 2. Combined with Proposition 1, the objective functions of Problem (MFMVS) are pseudoconvex (see [2]). Taking the maximum of pseudoconvex functions results in a pseudoconvex function.
For pseudoconvex programming problems, Thang et al. [15] demonstrated convergence to the global solution and proposed an algorithm based on the gradient direction method. The efficiency of this algorithm has been shown by computational results. In this paper, following [15] and Proposition 3, we solve Problem (TFMVS) by the gradient direction method.
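To make the min-max scheme concrete, here is a minimal projected-subgradient sketch for a problem of the form (TFMVS), minimizing the pointwise maximum of several objectives over the simplex X = {x ≥ 0, Σx = 1}. This is a simple stand-in under our own assumptions, not the gradient-direction algorithm of [15]:

```python
import math

def project_simplex(v):
    """Euclidean projection of v onto the simplex {x >= 0, sum x = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, 1):
        css += ui
        t = (css - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

def minimize_max(fs, grads, x0, iters=2000):
    """Projected subgradient descent for min_x max_j f_j(x) over the simplex:
    at each step follow a gradient of an active component, then project back."""
    x = project_simplex(list(x0))
    best_x, best_v = x, max(f(x) for f in fs)
    for k in range(1, iters + 1):
        vals = [f(x) for f in fs]
        j = vals.index(max(vals))          # an active objective
        g = grads[j](x)
        step = 0.5 / math.sqrt(k)          # diminishing step size
        x = project_simplex([xi - step * gi for xi, gi in zip(x, g)])
        v = max(f(x) for f in fs)
        if v < best_v:
            best_x, best_v = x, v
    return best_x, best_v
```

On the toy instance f_j(x) = x_j the minimum of the maximum over the simplex is 1/3 at the barycenter, which the iteration approaches up to the step-size resolution.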
4 Computational Experiment
In this section, to show the effectiveness of our proposed method in real-world scenarios, we consider the following portfolio selection problems.

Example 1 (see [22]). Consider 7 different stocks as a portfolio, with expected returns given in Table 1 and covariance matrix given in Table 2. We solved the two Problems (MV) and (MVS) and their intuitionistic fuzzy versions. Denoting by x the solution of the original unfuzzy problem and by xf the solution of the intuitionistic fuzzy version, the computational results are presented as follows. Table 3 shows the optimal solutions x and xf to both Problems (MV) and (MVS), while Table 4 provides an important insight into the values of E(x), V(x) and Sr(x) yielded by the unfuzzy and fuzzy solutions.
Table 1. Expected return of each stock

Stock            StC1    StC2    StC3    StC4    StC5     StC6    StC7
Expected return  0.0282  0.0462  0.0188  0.0317  0.01536  0.0097  0.01919

Table 2. Covariance matrix of chosen stocks
Stock   StC1      StC2      StC3      StC4      StC5      StC6      StC7
StC1    0.0119    0.0079    0.0017    0.0019    0.0022    −0.0008   0.0032
StC2    0.0079    0.0157    0.0016    0.0013    0.0005    −0.0026   0.0035
StC3    0.0017    0.0016    0.0056    −0.0002   0.0030    0.0017    0.0017
StC4    0.0019    0.0013    −0.0002   0.0093    −0.0007   0.0010    −0.0003
StC5    0.0022    0.0005    0.0030    −0.0007   0.0110    0.0011    0.0024
StC6    −0.0008   −0.0026   0.0017    0.0010    0.0011    0.0067    0.0014
StC7    0.0032    0.0035    0.0017    −0.0003   0.0024    0.0014    0.0130
Table 3. Fuzzy optimal solutions to (MV) and (MVS)

Problem  Sol  Value
MV       x    (0.0287, 0.1150, 0.2274, 0.1857, 0.1111, 0.2653, 0.0668)
         xf   (0.1078, 0.1268, 0.1740, 0.1526, 0.1257, 0.1981, 0.1150)
MVS      x    (0.0289, 0.1147, 0.2274, 0.1857, 0.1111, 0.2654, 0.0668)
         xf   (0.1026, 0.3680, 0.1016, 0.1265, 0.1006, 0.1000, 0.1007)
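As a quick consistency check, the vectors in Table 3 are portfolio weight vectors; assuming the feasible set X imposes the standard budget constraint (nonnegative weights summing to one), each reported solution should sum to 1. A short sketch verifying this:

```python
# Solutions from Table 3; the labels are ours for readability.
solutions = {
    "MV, x":   (0.0287, 0.1150, 0.2274, 0.1857, 0.1111, 0.2653, 0.0668),
    "MV, xf":  (0.1078, 0.1268, 0.1740, 0.1526, 0.1257, 0.1981, 0.1150),
    "MVS, x":  (0.0289, 0.1147, 0.2274, 0.1857, 0.1111, 0.2654, 0.0668),
    "MVS, xf": (0.1026, 0.3680, 0.1016, 0.1265, 0.1006, 0.1000, 0.1007),
}
for name, x in solutions.items():
    # nonnegative weights that sum to one, up to rounding in the table
    assert min(x) >= 0 and abs(sum(x) - 1.0) < 1e-4, name
```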
Table 4. E(x), V(x) and Sr(x) values of the solution

Problem  Solution  E(x)     V(x)    Sr(x)
MV       x         0.02184  0.0022  –
         xf        0.0230   0.0024  –
MVS      x         0.02183  0.0022  0.3562
         xf        0.0302   0.0041  0.3938
From Table 3, it can be seen that the solutions to the unfuzzy Problems (MV) and (MVS) are stable, leading to a very slight difference in the optimal values of E(x) and V(x) even when Sr(x) is taken into consideration, as shown in Table 4. This shows that restricting to crisp objectives severely undermines the impact of the Sharpe ratio on expected return and risk. More importantly, from Table 4, when enabling soft goals we observe a remarkable increase in the optimal values of expected return E(x) and risk V(x) in the intuitionistic fuzzy version. Specifically, the expected return is significantly larger, i.e., 0.0302 compared to 0.02184, which shows that our proposed model softens and effectively exploits the widened goals to achieve a better outcome than the
original rigid versions. The optimal value of the Sharpe ratio from our proposed Problem (MVS) is also reported, which helps investors make better investment decisions in terms of expected return per unit of risk. Moreover, Table 4 also shows that investors may flexibly accept higher risk.

Example 2 (see [18]). In this example, we analyze a dataset consisting of five securities represented by the symbols ULTA, MLM, NFLX, AMZN, and NVDA, covering the period from 1/23/2015 to 6/12/2017. Figure 1 visualizes the data, while Table 5 presents the corresponding expected returns and covariance matrix.
Fig. 1. Five securities code information

Table 5. Expected return and covariance matrix of chosen stocks

Stock  Expected return  Covariance matrix
ULTA   0.15672          4.41513  1.12491  2.31042  1.44398  1.39347
MLM    0.15874          1.12491  4.07482  1.96306  1.28708  1.53560
NFLX   0.20462          2.31042  1.96306  9.13912  2.33831  1.98378
AMZN   0.21693          1.44398  1.28708  2.33831  4.43169  1.67068
NVDA   0.34876          1.39347  1.53560  1.98378  1.67068  5.31435
In this instance, we solve both Problem (MVS) and its intuitionistic fuzzy version experimentally. The notation for the solutions is the same as in Example 1, and the outcomes are outlined in Table 6.
Table 6. E(x), V(x) and Sr(x) values of the solution

Problem  Solution  E(x)    V(x)    Sr(x)   Selection
MVS      x         0.2039  2.1581  0.1238  (0.2766, 0.3135, 0.0006, 0.2426, 0.1664)
         xf        0.2096  2.2694  0.1245  (0.2332, 0.2562, 0.1250, 0.2128, 0.1726)
Based on the findings presented in Table 6, the x solution significantly underestimates the weight allocated to the stock code NFLX, which may not be ideal from an investor's perspective. Investors often consider allocating more investment weight to certain stocks with higher risk but promising returns, which is precisely what the intuitionistic fuzzy version of Problem (MVS) achieves by avoiding the underestimation of NFLX's proportion. Analyzing the results of Tuoi et al. [18], who imposed the condition E(x) ≥ 0.25 on their model, their expected return is 0.05 higher than the outcome obtained from the intuitionistic fuzzy version of Problem (MVS). However, it is important to note that the risk, as measured by V(x), that investors have to accept in their approach is 2.6151, about 16% larger than the result in Table 6.
5 Conclusion
In this article, we consider a generalized multicriteria portfolio selection model and examine it in an intuitionistic fuzzy environment. We first show that this problem is a pseudoconvex programming problem. We then propose a method to convert the problem into an equivalent single-criterion problem. Our method outperforms previous approaches in both simplicity and effectiveness: no assumption needs to be made about frequent interaction with the investors, and no hard constraint is set on the membership functions. Finally, the results show that the intuitionistic fuzzy multicriteria portfolio selection problem we propose gives investors better options regarding actual return expectations. In future work, we aim to solve Problem (MFMVS) over its efficient solution set, from which investors will have more recommendations to choose from when deciding on asset allocation in actual investment.
References

1. Atanassov, K.T.: Intuitionistic fuzzy sets. VII ITKR's Session, Sofia (deposited in Central Science and Technical Library of the Bulgarian Academy of Sciences 1697/84) (1983)
2. Avriel, M., Diewert, W.E., Schaible, S., Zang, I.: Generalized Concavity. SIAM (1988)
3. Benson, H.P.: An outer approximation algorithm for generating all efficient extreme points in the outcome set of a multiple objective linear programming problem. J. Global Optim. 13, 1–24 (1998)
4. Collette, Y., Siarry, P.: Multiobjective Optimization: Principles and Case Studies. Springer, Berlin (2004). https://doi.org/10.1007/978-3-662-08883-8
5. Hasuike, T., Katagiri, H., Ishii, H.: Portfolio selection problems with random fuzzy variable returns. Fuzzy Sets Syst. 160(18), 2579–2596 (2009)
6. Huang, X.: Two new models for portfolio selection with stochastic returns taking fuzzy information. Eur. J. Oper. Res. 180(1), 396–405 (2007)
7. Huang, X.: Mean-variance models for portfolio selection subject to experts' estimations. Expert Syst. Appl. 39(5), 5887–5893 (2012)
8. Khanjani Shiraz, R., Tavana, M., Fukuyama, H.: A random-fuzzy portfolio selection DEA model using value-at-risk and conditional value-at-risk. Soft Comput. 24, 17167–17186 (2020)
9. Li, J., Xu, J.: Multi-objective portfolio selection model with fuzzy random returns and a compromise approach-based genetic algorithm. Inf. Sci. 220, 507–521 (2013)
10. Mangasarian, O.L.: Pseudo-convex functions. In: Stochastic Optimization Models in Finance, pp. 23–32. Elsevier (1975)
11. Mangasarian, O.L.: Nonlinear Programming. Society for Industrial and Applied Mathematics (1994)
12. Osuna-Gómez, R., Chalco-Cano, Y., Rufián-Lizana, A., Hernández-Jiménez, B.: Necessary and sufficient conditions for fuzzy optimality problems. Fuzzy Sets Syst. 296, 112–123 (2016)
13. Pahade, J.K., Jha, M.: Credibilistic variance and skewness of trapezoidal fuzzy variable and mean-variance-skewness model for portfolio selection. Results Appl. Math. 11, 100159 (2021)
14. Sakawa, M.: Fuzzy Sets and Interactive Multiobjective Optimization. Springer Science & Business Media (2013)
15. Thang, T.N., Hai, T.N.: Self-adaptive algorithms for quasiconvex programming and applications to machine learning. arXiv:2212.06379 (2022)
16. Thang, T.N., Kim, N.T.B.: Outcome space algorithm for generalized multiplicative problems and optimization over the efficient set. JIMO 12(4), 1417–1433 (2016)
17. Thang, T.N., Vuong, N.D.: Portfolio selection with risk aversion index by optimizing over Pareto set. In: Intelligent Systems and Networks, pp. 225–232. Springer, Singapore (2021)
18. Tuoi, T.T.T., Khang, T.T., Anh, N.T.N., Thang, T.N.: Fuzzy portfolio selection with flexible optimization via quasiconvex programming. Lecture Notes in Networks and Systems, vol. 471 (2022)
19. Van Pham, H., Khoa, N.D., Bui, T.T.H., Giang, N.T.H., Moore, P.: Applied picture fuzzy sets for group decision-support in the evaluation of pedagogic systems. Int. J. Math. Eng. Manag. Sci. 7(2), 243 (2022)
20. Van Pham, H., Moore, P., Cuong, B.C.: Applied picture fuzzy sets with knowledge reasoning and linguistics in clinical decision support system. Neurosci. Inf. 2(4), 100109 (2022)
21. Vuong, N.D., Thang, T.N.: Optimizing over Pareto set of semistrictly quasiconcave vector maximization and application to stochastic portfolio selection. J. Ind. Manage. Optim. 19, 1999–2019 (2023)
22. Watada, J.: Fuzzy portfolio model for decision making in investment. In: Dynamical Aspects in Fuzzy Decision Making, pp. 141–162 (2001). https://doi.org/10.1007/978-3-7908-1817-8_7
23. Yu, G.F., Li, D.F., Liang, D.C., Li, G.X.: An intuitionistic fuzzy multi-objective goal programming approach to portfolio selection. Int. J. Inf. Technol. Decis. Making 20(05), 1477–1497 (2021)
24. Zhang, W.G., Wang, Y.L.: An analytic derivation of admissible efficient frontier with borrowing. Eur. J. Oper. Res. 184(1), 229–243 (2008)
Research and Develop Solutions to Traffic Data Collection Based on Voice Techniques

Ty Nguyen Thi and Quang Tran Minh

1 Department of Information Systems, Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam
{ty.nguyen.imp212,quangtran}@hcmut.edu.vn
2 Vietnam National University Ho Chi Minh City (VNU-HCM), Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
Abstract. This paper addresses two primary challenges within the context of the current intelligent traffic system, Urban Traffic Estimation System (UTraffic). Firstly, it endeavors to identify and explore an additional traffic data source, supplementing the two existing data sources utilized for training the Automatic Speech Recognition (ASR) model. Secondly, it aims to conduct experiments using the newfound dataset in conjunction with advanced ASR models to ascertain the most optimal ASR model for integration into UTraffic. The key methodologies employed to tackle these issues include collecting traffic reports from a radio station, processing the data for training ASR models, and experimenting with different ASR models. In essence, this research endeavor strives to generate an enhanced dataset comprising authentic real-world data, leading to superior ASR model accuracy compared to the presently deployed ASR model within UTraffic. Keywords: Automatic Speech Recognition · speech data preprocessing · speech enhancement · hybrid ASR approach
1 Introduction
This paper is motivated by two factors. The first is the discovery of a new traffic data source for generating training data for our ASR model. The second is the emergence of the hybrid Connectionist Temporal Classification (CTC)/attention architecture for end-to-end speech recognition, which outperforms traditional methods based on Hidden Markov Models (HMM)/Deep Neural Networks (DNN), attention-based methods, and CTC methods. Additionally, the availability of the Convolutional Time-Domain Audio Separation Network (Conv-TasNet) model pretrained on the CHiME-4 dataset offers the potential to enhance speech from single-channel recordings. Leveraging these motivations, we aim to improve ASR accuracy in the UTraffic [1] intelligent traffic system.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 80–90, 2023. https://doi.org/10.1007/978-3-031-46573-4_8
The integration of ASR in the UTraffic system enables users to submit traffic reports with crucial data such as congestion status, vehicle velocities, and locations. The ASR model then converts these speech reports into text, providing traffic information to other users. Our first challenge is constructing a suitable training dataset for the ASR model. Two approaches were proposed in [4]: user-generated reports through the UTraffic system, and synthesized speech produced with the Vbee tool [5]. However, the first approach relies on user engagement and lacks sufficient data beyond prepared transcripts. The second approach generates a large volume of data but lacks naturalness and contains noise. User reports often lack complete information and suffer from inconsistencies. The current ASR model exhibits a bias towards synthesized data, hampering its recognition of real-life reports. Our second challenge is to improve the accuracy of the deployed ASR model in the UTraffic system. To address these challenges, we propose three key steps. Firstly, we aim to identify an alternative traffic data source with distinct features, encompassing real-life traffic information while ensuring accuracy and timeliness. Secondly, we seek a suitable approach to process the acquired dataset, combining existing and new sources to enhance our ASR model's performance. Lastly, we select an appropriate architecture, prioritizing advanced options from the End-to-End Speech Processing Toolkit (ESPnet) [3]. ESPnet remains at the forefront of ASR research, regularly updating models with state-of-the-art techniques, allowing users to benefit from the latest innovations in the field. The primary contributions of this paper are twofold:
• We present a novel dataset addressing the scarcity of traffic-domain Vietnamese speech data, beneficial for ASR research.
• We offer a pipeline to build a high-performance ASR model using the ESPnet toolkit (v202301).
Through our analysis, researchers can enhance UTraffic’s ASR model. The remainder of the paper is structured as follows: Section 2 presents related work. Section 3 outlines the systematic procedure for constructing the ASR model. Section 4 details the experimental setup and results. Lastly, Sect. 5 provides future directions and conclusions for deploying the ASR model in UTraffic.
2 Related Work
Various ASR technologies used in intelligent traffic systems include DNN, Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Transformer-based models, and their hybrid approaches. Each architecture has strengths and weaknesses. DNNs can suffer from overfitting and require significant computational resources [6]. CNNs excel at local feature extraction but struggle with capturing long-range dependencies [7]. RNNs may encounter gradient issues and struggle with varying input lengths [8]. Transformer-based models require abundant training data and computational power [9]. Hybrid approaches combining HMM/Gaussian Mixture Model (GMM) with neural networks can
be complex to train and maintain [10]. Additionally, these models exhibit general weaknesses, including sensitivity to noise, out-of-vocabulary words, and speaker variability. Overcoming these weaknesses involves data preprocessing and advanced training techniques to improve accuracy and robustness in traffic-related speech recognition tasks.

The hybrid CTC/attention approach combines the strengths of CTC and attention mechanisms in ASR. CTC enables alignment of unsegmented speech with output labels, while attention dynamically aligns input and output sequences during decoding. This approach improves performance, especially with limited labeled data and long speech. It offers alignment flexibility, robustness, effective handling of variable-length sequences, and a simpler architecture compared to traditional approaches combining acoustic modeling with HMMs or GMMs [10].

The Transformer-based, Conformer-based, and Branchformer-based encoders are chosen for the CTC/attention architecture used in our ASR models. All three excel at capturing contextual information, leveraging self-attention mechanisms to focus on relevant parts of the input sequence and effectively capture long-range dependencies. They handle variability in speech data, adapting to noise, speaker variations, and different acoustic conditions, making them suitable for real-world ASR applications. Designed specifically for the CTC/attention approach, these encoders ensure compatibility and optimal integration into ASR architectures [11,12]. They have demonstrated strong performance, achieving state-of-the-art results in recognition accuracy and word error rate (WER).

Speech enhancement techniques improve speech quality in challenging acoustic environments, enabling applications like hands-free communication and automatic speech recognition.
Notable techniques include spectral subtraction, Wiener filtering, and adaptive filtering, which suppress background noise through noise estimation and spectral modifications. Statistical model-based approaches utilize advanced probabilistic models such as GMMs and HMMs to represent the statistical characteristics of both clean speech and noise, enabling the estimation of clean speech from observed noisy signals [13]. Deep learning-based approaches, particularly CNNs and RNNs, have achieved remarkable success in speech enhancement tasks by learning complex mappings between noisy and clean speech using large-scale datasets [14]. Conv-TasNet is a sophisticated deep learning model employed for speech enhancement. The convergence of Conv-TasNet's aptness for single-channel speech enhancement, its proficiency in isolating speech from noise and interference, and its substantiated efficacy in real-world scenarios renders it an exceptionally compelling selection for augmenting the quality of audio data in real-time ASR models [15]. In this paper, we use this model to enhance the quality of our audio data.
3 Definition of Problem and End-to-End ASR System
Our ASR system creation process adheres to a meticulous pipeline. Fig. 1 illustrates the pipeline for creating a highly suitable ASR model with the support of the ESPnet toolkit. This pipeline ensures a systematic approach to developing an ASR model that meets the specific requirements of the UTraffic system.
Each step, including data collection, data preprocessing, language modeling, training the end-to-end ASR model, decoding and transcription, and evaluation, contributes to the overall effectiveness and performance of the ASR system.
3.1 Data Collection
To enhance the ASR system, diverse audio data from various sources is crucial. Our dataset includes community-collected traffic data and data generated with the Vbee tool. However, the synthesized data falls short of accuracy standards. As an alternative, we included traffic reports from the Voice of Ho Chi Minh City (VOH) 95.6 MHz channel [2] in our dataset. Among the channels in Ho Chi Minh City offering real-time traffic updates, notable examples include the VOV Traffic Channel (FM 91.0 MHz), VTC14 - Traffic News (Digital TV Channel), and VOH 95.6 MHz. VOH 95.6 MHz stands out for its extensive coverage, specific focus on traffic-related information, timely reports, integration with other programming, and broad reach catering to commuters. We have included it as an additional data source for training our ASR models due to its strengths in traffic reporting. We now present a more detailed description of the input to our problem. The initial data came from two sources: user contributions on the website or app (3251 s) and speech synthesized by the Vbee tool (122569 s), together amassing about 35 h of audio. An additional 7 h from the VOH 95.6 MHz channel were acquired. Fig. 2 shows the distribution of audio hours among these three data sources.
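The distribution in Fig. 2 can be reproduced from the figures above; a small sketch (the 7 h VOH figure is converted to seconds, and the source labels are ours):

```python
# Audio duration per source, in seconds.
SOURCES_SECONDS = {
    "user reports": 3251,
    "Vbee synthesized": 122569,
    "VOH 95.6 MHz": 7 * 3600,   # "an additional 7 hours"
}

total = sum(SOURCES_SECONDS.values())
for name, sec in SOURCES_SECONDS.items():
    print(f"{name}: {sec / 3600:.1f} h ({100 * sec / total:.0f}%)")
```

The resulting shares (roughly 2%, 81%, and 17%) line up with the test-set composition reported later in Sect. 4.2.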
Fig. 1. ASR Model Creation Pipeline.

Fig. 2. Distribution of Audio Hours among Three Data Sources.

3.2 Data Preprocessing
The collected audio data undergoes preprocessing to enhance its quality and suitability for ASR. This includes resampling and noise removal to ensure consistent audio file characteristics. Most audio files in our dataset have a sampling rate of
16000 Hz and one channel. However, a subset of files from the VOH 95.6 MHz channel has a different sampling rate of 44100 Hz and two channels. To ensure consistency, we convert the VOH files to 16000 Hz and one channel. The ESPnet toolkit, used for our ASR experiments, prioritizes compatibility and recommends preprocessing all training audio files to 16000 Hz and one channel, even though it supports various configurations. This standardization facilitates seamless integration and training of ASR models within the toolkit. Once the sampling rate and channel configuration are standardized, we apply the Conv-TasNet speech enhancement model for continuous data processing. This model utilizes CNNs to capture complex relationships between noisy and clean speech signals. It focuses on enhancing single-channel speech by isolating desired speech from background noise and interference. We choose a pre-trained Conv-TasNet model trained on the CHiME-4 dataset, which closely matches our requirements. Our aim is to enhance audio quality and intelligibility for improved ASR performance. We conducted an experimental study to provide evidence that the performance of the ASR model can be significantly enhanced through the suggested preprocessing of the training data. After standardizing the sampling rate and the number of channels to 16000 Hz and one channel, respectively, we used the Conv-TasNet speech enhancement model to improve audio quality before training. The Transformer-based encoder-decoder (endec) ASR model served as the baseline, and performance was evaluated using WER, a metric for transcription accuracy. Table 1 presents WERs under four scenarios:
1. No sampling rate modification or speech enhancement;
2. Sampling rate modification without speech enhancement;
3. No sampling rate modification with speech enhancement;
4. Sampling rate modification with speech enhancement.
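The standardization step (stereo 44100 Hz to mono 16000 Hz) can be sketched in plain Python. Production pipelines would typically call ffmpeg or sox instead; this naive linear-interpolation resampler is only illustrative and assumes 16-bit PCM WAV input:

```python
import wave
import struct

def to_mono_16k(in_path, out_path, target_rate=16000):
    """Downmix a 16-bit PCM WAV to one channel and resample to target_rate."""
    with wave.open(in_path, "rb") as w:
        n_ch, width, rate = w.getnchannels(), w.getsampwidth(), w.getframerate()
        frames = w.readframes(w.getnframes())
    assert width == 2, "sketch assumes 16-bit PCM"
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    # Average interleaved channels into a single mono track.
    mono = [sum(samples[i:i + n_ch]) // n_ch
            for i in range(0, len(samples), n_ch)]
    # Naive linear-interpolation resampling.
    ratio = rate / target_rate
    out, pos = [], 0.0
    while pos < len(mono) - 1:
        i = int(pos)
        frac = pos - i
        out.append(int(mono[i] * (1 - frac) + mono[i + 1] * frac))
        pos += ratio
    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(target_rate)
        w.writeframes(struct.pack("<%dh" % len(out), *out))
```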
3.3 Language Modeling
In end-to-end ASR models, a language model is integrated to enhance transcription accuracy [16,17]. Accordingly, we incorporate a language model into our end-to-end ASR models.
3.4 Training End-to-End ASR
We conduct three experiments using three distinct hybrid CTC/attention end-to-end ASR architectures to identify the best-performing ASR model for integration within the UTraffic system. Specifically, these architectures consist of a Transformer-based encoder, a Conformer-based encoder, or a Branchformer-based encoder, accompanied by a Transformer-based decoder. To train our ASR models, the prepared dataset is used. Input speech features are fed into the encoder, which transforms them into encoded representations: a Transformer-based encoder uses self-attention and feed-forward layers, a Conformer-based encoder combines convolutional layers, self-attention, and feed-forward layers [18], and a Branchformer-based encoder utilizes parallel branches to capture contextual information [19]. Our training involves an endec architecture with a Transformer-based decoder. Models minimize the discrepancy between predicted
and ground truth transcriptions using CTC loss and attention loss. Gradients flow from decoder to encoder for joint learning, and fine-tuning adjusts hyperparameters based on validation-set evaluations.
3.5 Decoding and Transcription
The ASR decoder in the ESPnet toolkit combines CTC and attention mechanisms to generate transcriptions. For our experiments, we chose a CTC weight of 0.3, emphasizing attention-based output (0.7 weight) for accurate, fine-grained alignments. While CTC captures the overall structure, it may struggle with precise alignments. By assigning a smaller weight to CTC and a larger weight to attention-based output, we achieve a balanced weight distribution, leveraging the strengths of each approach [10]. This decoding configuration maximizes transcription quality and benefits from the collaboration between CTC and attention models, along with acoustic and language models, in the ESPnet toolkit. Furthermore, the Beam search algorithm explores the search space and determines the final transcription. For our experiments, we chose a Beam size of 10, considering the top 10 probable paths at each decoding step, which enhances the accuracy of transcriptions. The hybrid CTC/attention decoder, incorporating a specified CTC weight of 0.3, a language model weight of 0.1, and a beam size of 10, facilitates precise and dependable transcriptions for speech recognition tasks. On the other hand, evaluation metrics are used to quantify ASR system accuracy and performance, guiding improvements. In our experiments, we prioritize WER as it reflects practical relevance in real-world scenarios.
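The weighting described above can be illustrated with a toy decoder. The weights mirror the configuration in the text (CTC weight 0.3, language-model weight 0.1, beam size 10), but this is a self-contained sketch, not ESPnet's actual decoder implementation:

```python
import math

def hybrid_score(att_lp, ctc_lp, lm_lp, ctc_weight=0.3, lm_weight=0.1):
    """Joint hypothesis score: attention and CTC log-probabilities mixed with
    weights (1 - ctc_weight) and ctc_weight, plus a shallow-fusion LM term."""
    return (1 - ctc_weight) * att_lp + ctc_weight * ctc_lp + lm_weight * lm_lp

def beam_search(step_scores, beam_size=10):
    """Toy beam search: step_scores is a list over decoding steps, each a dict
    {token: joint log-prob}.  Keeps the beam_size best prefixes per step."""
    beams = [((), 0.0)]
    for scores in step_scores:
        cand = [(seq + (tok,), lp + s)
                for seq, lp in beams for tok, s in scores.items()]
        cand.sort(key=lambda c: c[1], reverse=True)
        beams = cand[:beam_size]
    return beams[0][0]  # best token sequence
```

A larger beam keeps more candidate paths alive at each step at the cost of more computation, which is the accuracy/latency trade-off behind the chosen beam size of 10.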
4 Experiment

4.1 Experimental Setup
Our pipeline incorporates ESPnet, an open-source ASR framework, with high-performance GPUs for efficient training and inference. Hardware requirements for hybrid CTC/attention ASR models in ESPnet vary based on dataset size and configuration. NVIDIA Tesla GPUs (e.g., V100, P100, T4) are recommended for their computational power. Sufficient GPU memory is crucial, depending on model and dataset size; a GPU with at least 16 GB of memory is recommended. In our experiments, the Tesla T4 GPU was used for optimal performance, ensuring the efficient development and evaluation of ASR systems using ESPnet. Our experiments followed the aforementioned pipeline, employing techniques like resampling audio to 16000 Hz and using a pre-trained Conv-TasNet speech enhancement model. We incorporated a language model into our end-to-end ASR system and explored three hybrid CTC/attention architectures: Transformer-based, trained for 50 epochs (Model 1), and Conformer-based (Model 2) and Branchformer-based (Model 3), each trained for 70 epochs. Operating with a 16 GB memory constraint, we adjusted the training configuration by reducing the batch bin size to 3500000 for the Conformer-based and Branchformer-based encoders. This adaptation ensures efficient utilization of system resources while maintaining optimal performance.
4.2 Experimental Result
We evaluated the ASR systems using the processed test set, which has a total duration of 24,187 s and comprises community-collected data (2%), synthesized data (82%), and data from the VOH 95.6 MHz channel (16%); the proportion of data from each source in the test set closely matches that of our whole dataset. Performance was assessed primarily by word error rate (WER). Table 2 summarizes the WER values for the three ASR models, highlighting their comparative performance. Beyond WER, considering computational resources and training time provides valuable insight into the trade-offs and practical considerations of each model. Table 3 presents details of the three ASR models, including the number of trainable parameters (in millions) and training time (in seconds). Latency and real-time factor (RTF) are also crucial for evaluating real-time ASR models: low latency ensures prompt reporting in time-sensitive applications like our UTraffic system, while lower RTF values indicate faster processing and immediate text reporting. Table 4 presents the latency and RTF values for our experimental ASR models.

Table 1. WER Comparison in Scenarios.

Scenarios     WER (%)
Scenario 1    6.8
Scenario 2    7.0
Scenario 3    7.7
Scenario 4    6.7

Table 2. ASR Models and WER.

ASR Models    WER (%)
Model 1       6.7
Model 2       10.4
Model 3       5.7

Table 3. ASR Model Trainable Parameters and Training Time.

Models     Trainable parameters (Millions)    Training time (Seconds)
Model 1    27.18                              23,524
Model 2    108.63                             40,583
Model 3    109.37                             46,527

Table 4. ASR Model Latency and RTF.

Models     Latency (ms/sentence)    RTF
Model 1    7424.060                 0.931
Model 2    5635.042                 0.707
Model 3    6132.272                 0.769
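The WER metric used above counts word-level substitutions, deletions, and insertions against the reference transcript, normalized by the reference length. A minimal sketch of the computation (not the ESPnet scorer; the example sentences are invented):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("tac duong tren cau", "tac duong tren"))  # one deletion over four words: 0.25
```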
Additionally, to demonstrate the impact of a new data source, namely traffic speech reports from the VOH 95.6 MHz channel, we devised a further experiment. In it, our language model is trained for 40 epochs, and the ASR model with the Branchformer-based encoder architecture is trained for 70 epochs using two different training datasets: one containing audio data collected from the VOH 95.6 MHz channel (Model 3) and one without it (Model 0). Subsequently, we evaluate the performance of these ASR models on two distinct
Traffic Data Collection - Voice Techniques
87
test sets. One test set comprises audio data from the three sources without speech-enhancement processing (Test set 1), while the other contains speech-enhancement-processed audio data from the same three sources (Test set 2). The proportion of data from each source in both test sets matches that of our whole dataset. Table 5 presents the WER values for the two ASR models, one trained with audio data from the VOH 95.6 MHz channel (Model 3) and the other trained without it (Model 0), on the two aforementioned test sets, along with the latency and RTF values of these models on both test sets.

Table 5. ASR Model WER, Latency, and RTF with Different Models and Test Sets.

Metric                   Model      Test set 1    Test set 2
WER                      Model 0    16.3%         16.0%
                         Model 3    8.1%          5.7%
Latency (ms/sentence)    Model 0    5405.215      6691.508
                         Model 3    6076.909      6132.272
RTF                      Model 0    0.678         0.839
                         Model 3    0.762         0.769

4.3 Analysis and Discussion
Finally, we analyze the experimental outcomes in detail, evaluating the accuracy, efficiency, and overall performance of our integrated ASR models. The ASR model with the Branchformer-based encoder achieves the lowest WER of 5.7%, outperforming the other two models in speech recognition accuracy. The ASR model with the Conformer-based encoder exhibits the lowest latency (5635.042 ms/sentence) and RTF (0.707), indicating faster and more efficient processing. However, the Conformer- and Branchformer-based encoder models have more trainable parameters, longer training times, and potentially higher computational costs than the Transformer-based encoder model. The 5.7% WER achieved by the Branchformer-based encoder architecture demonstrates remarkable accuracy in real-time ASR for speech recognition and reporting, even in the presence of challenges like background noise and speech disfluencies. This level of performance is ideal for applications requiring reliable speech recognition, such as real-time communication and transcription services. However, even the lowest latency (5635.042 ms/sentence, achieved by the Conformer-based architecture) renders our ASR models unsuitable as-is for real-time traffic reporting: such latency introduces significant delays between audio input and transcription, leading to the dissemination of outdated information. To meet the expectations of UTraffic users, for whom timely and up-to-date information is vital for informed decision-making, optimizing the ASR model for lower latency is imperative. Nevertheless, the Conformer-based encoder achieved a favorable RTF of 0.707, indicating efficient processing.
88
T. N. Thi and Q. T. Minh
Table 5 indicates that Model 3 outperforms Model 0 with a lower WER of 8.1% on Test set 1 and achieves the best result of 5.7% on Test set 2, demonstrating Model 3’s superiority in accurately recognizing spoken words. However, Model 3 exhibits a slightly higher latency of 6076.909 ms/sentence on Test set 1 and 6132.272 ms/sentence on Test set 2. On the other hand, Model 3 achieves a slightly better RTF, suggesting faster processing speed. Model 0, with a WER of approximately 16% and RTF close to 1.0, remains suitable for real-time applications. The selection of Model 3, trained with speech data from the VOH 95.6 MHz channel, as the optimal ASR model for the intelligent traffic system, is justified by its superior performance in terms of accuracy and speed. Balancing latency, RTF, and accuracy is crucial for reliable reports. In intelligent transport systems, accuracy takes precedence to provide meaningful information for decision-making. Our results demonstrate that the ASR model with its encoder based on Branchformer and trained with the dataset incorporating audio data from VOH 95.6 MHz channel is the optimal choice for our intelligent traffic system. It achieves a low WER of 5.7%, indicating superior accuracy. Additionally, this model exhibits efficient processing with low RTF, making it highly suitable for real-time applications such as UTraffic.
5 Conclusion
While the model with a Branchformer-based encoder and Transformer-based decoder shows promise for our intelligent traffic system, several areas require further investigation to enhance the ASR model in UTraffic. One avenue is exploring hybrid ASR approaches, which combine traditional HMM-based systems with neural models. By leveraging rapid decoding and neural model accuracy, these hybrids can achieve lower latency, improving real-time performance in UTraffic. The dataset imbalance, with over 80% comprising synthesized data, may contribute to the performance challenges in our ASR models. Imbalanced datasets can bias model training, as models become more familiar with the characteristics of the majority class (synthesized data) while lacking exposure to the minority class. To address this, a more balanced dataset should be pursued by incorporating real-world data alongside the synthesized data, including continuously collecting traffic reports from the VOH 95.6 MHz channel; other data sources should also be explored. Additionally, data augmentation techniques can be used to address the dataset imbalance. By artificially increasing the representation of underrepresented classes, such techniques enable the models to better capture the inherent variability in real-world speech, leading to improved performance on diverse input. Acknowledgement. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for supporting this study.
Traffic Data Collection - Voice Techniques
89
References

1. UTraffic (BKTraffic). https://bktraffic.com/home/. Accessed 28 May 2023
2. Voice of Ho Chi Minh City 95.6 MHz channel. https://voh.com.vn/radio-kenh-fm956-fm956mhz.html. Accessed 28 May 2023
3. ESPnet: End-to-End Speech Processing Toolkit. https://espnet.github.io/espnet/. Accessed 28 May 2023
4. Thanh, N.T., Hieu, L.T., Huy, N.G.: Urban traffic condition aware routing approaches. Graduation thesis (2022)
5. Vbee. https://vbee.vn/. Accessed 28 May 2023
6. Lemley, J., Bazrafkan, S., Corcoran, P.: Deep learning for consumer devices and services: pushing the limits for machine learning, artificial intelligence, and computer vision. IEEE Consum. Electron. Mag. 6(2), 48–56 (2017). https://doi.org/10.1109/MCE.2016.2640698
7. Linsley, D., Kim, J., Veerabadran, V., Windolf, C., Serre, T.: Learning long-range spatial dependencies with horizontal gated recurrent units. In: Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada (2018)
8. Jinglong, C., Hongjie, J., Yuanhong, C., Qian, L.: Gated recurrent unit based recurrent neural network for remaining useful life prediction of nonlinear deterioration process. Reliab. Eng. Syst. Saf. 185, 372–382 (2019). https://doi.org/10.1016/j.ress.2019.01.006
9. Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: marrying convolution and attention for all data sizes. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Wortman Vaughan, J. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 3965–3977. Curran Associates Inc. (2021)
10. Watanabe, S., Hori, T., Kim, S., Hershey, J.R., Hayashi, T.: Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE J. Sel. Top. Sig. Process. 11(8), 1240–1253 (2017). https://doi.org/10.1109/JSTSP.2017.2763455
11. Wang, Y., et al.: Transformer-based acoustic modeling for hybrid speech recognition. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, pp. 6874–6878 (2020). https://doi.org/10.1109/ICASSP40776.2020.9054345
12. Sakuma, J., Komatsu, T., Scheibler, R.: MLP-based architecture with variable length input for automatic speech recognition (2022). https://openreview.net/forum?id=RA-zVvZLYIy
13. Vaseghi, S.V.: Bayesian statistical model-based signal processing. In: Advanced Digital Signal Processing and Noise Reduction, Chap. 17, Sec. 2
14. Wang, D., Chen, J.: Supervised speech separation based on deep learning: an overview. IEEE/ACM Trans. Audio Speech Lang. Process. 26(10), 1702–1726 (2018). https://doi.org/10.1109/TASLP.2018.2842159
15. Li, C., et al.: ESPnet-SE: end-to-end speech enhancement and separation toolkit designed for ASR integration. In: 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, pp. 785–792 (2021). https://doi.org/10.1109/SLT48900.2021.9383615
16. Watanabe, S., Hori, T., Hershey, J.R.: Language independent end-to-end architecture for joint language identification and speech recognition. In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE (2017)
17. Kannan, A., et al.: An analysis of incorporating an external language model into a sequence-to-sequence model. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE (2018)
18. Gulati, A., et al.: Conformer: convolution-augmented transformer for speech recognition. In: Proceedings of Interspeech 2020 (2020). https://doi.org/10.48550/arXiv.2005.08100
19. Peng, Y., Dalmia, S., Lane, I., Watanabe, S.: Branchformer: parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S. (eds.) Proceedings of the 39th International Conference on Machine Learning, vol. 162, pp. 17627–17643. PMLR (2022). https://proceedings.mlr.press/v162/peng22a/peng22a. Accessed 28 May 2023
Using Machine Learning Algorithms to Diagnosis Melasma from Face Images

Van Lam Ho1(B), Tuan Anh Vu2, Xuan Viet Tran2, Thi Hoang Bich Diu Pham2, Xuan Vinh Le1, Ngoc Huan Nguyen3, and Ngoc Dung Nguyen1

1 Faculty of Information Technology, Quy Nhon University, Binh Dinh Quy Nhon, Vietnam
{hovanlam,lexuanvinh,nguyenngocdung}@qnu.edu.vn
2 Quyhoa National Leprosy Dermatology Hospital, Binh Dinh Quy Nhon, Vietnam
3 An Nhon Town Committee of the Party, Binh Dinh Quy Nhon, Vietnam
[email protected]
Abstract. This study aims to develop a model for diagnosing melasma based on machine learning algorithms, with facial images as input data. It not only supports dermatologists in diagnosing the disease but also helps reduce treatment costs and supports remote treatment. We built a model that uses machine learning algorithms to detect melasma objects and thereby supports dermatologists in predicting a person's risk of melasma from his/her facial image; people can use this model through an application to monitor their risk of melasma. We built a dataset of facial images, combined with the expertise of melasma experts, to classify the different types of melasma. Based on this dataset, we statistically described the data characteristics and the correlation parameters that may cause melasma, then used YOLO V8 with machine learning algorithms to detect melasma objects and build a diagnostic model for whether a patient has melasma and, if so, which type. The results obtained support the diagnosis of a person who may have melasma of types such as central melasma, butterfly-shaped melasma, or mandibular melasma. From this result, further research can be conducted to apply artificial intelligence in supporting melasma treatment.

Keywords: Object detection · melasma disease · machine learning algorithms · melasma diagnostic model
1 Introduction

Artificial intelligence (AI) plays an increasing role in medicine and healthcare by leveraging computer control, machine learning, and access to massive data from treatment, examination, and the health records of wearable devices. Healthcare spending on AI has been growing at a rate of 40% and was predicted to reach $6.6 billion by 2021 [1]. The abundance of healthcare information is driving the development of AI applications that enhance knowledge for healthcare. Massive healthcare-related data is accessible from sources such as Electronic Medical Records (EMRs) and

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 91–101, 2023. https://doi.org/10.1007/978-3-031-46573-4_9
92
V. L. Ho et al.
health screening devices, and images from patients are analyzed and processed for use in the diagnosis and treatment of the patient. The rise of AI in the age of massive information could provide doctors, especially radiologists, with precision tools to move forward. AI is well suited to handling monotonous workflows, monitoring massive amounts of information, and double-checking readings to eliminate mistakes. AI will become a regular part of radiologists' lives and make their work more efficient and accurate [2, 17]. It is likely that within the next 10 years, most therapeutic images will be pre-analyzed by an AI tool before being reviewed by radiologists. This tool will perform common reading tasks such as measuring, segmenting, and recording. Therefore, AI applied in medicine to support diagnosis and treatment will bring many benefits to patients and doctors [1, 3, 4, 18–20]. The quality of phone cameras improves every year and can produce viable images that artificial intelligence algorithms can analyze; dermatology and ophthalmology were early beneficiaries of this trend. Researchers in the United Kingdom have even developed a tool to identify developmental diseases by analyzing images of a child's face. The algorithm can detect distinctive features, such as the position of the child's jawline, eyes, and nose, and other attributes that may indicate craniofacial abnormalities. We likewise used the facial data of patients examined for melasma at Quyhoa National Dermatology Hospital, in combination with machine learning algorithms, to build a diagnostic model of melasma that supports the diagnoses of doctors in the hospital and can be transferred to other hospitals in need. Melasma is an acquired hyperpigmentation disease with complex etiology and pathogenesis. The primary lesion of the disease is macules and/or dark brown symmetrical patches in sun-exposed areas.
Commonly affected sites are the cheeks, upper lip, chin, and forehead. Although this disease is benign, it has a great impact on the psychological and aesthetic health of patients [5]. In women, the disease can be idiopathic or related to pregnancy [6]. Machine learning, a field of artificial intelligence, is a technique that helps computers learn on their own without explicitly established decision rules. Normally, a computer program needs rules to perform a certain task, but with machine learning the computer can execute the task automatically when it receives input data; in other words, the computer can, in a sense, reason for itself like a human. Another view holds that machine learning is a method of drawing lines representing the relationships within a dataset [7]. Combining the expertise of dermatologists with a dataset of images of melasma patients, we built a model that analyzes melasma types from a patient's facial images. The result is a conclusion on whether the individual has melasma and, if so, which type. The machine learning model is built on YOLO V8, which is considered to have many advantages [8]. This study also focuses on tuning the parameters to optimize the model by analyzing properties such as the confusion matrix, the Precision-Recall curve, and the data variables affecting the detection of melasma objects. That is, we apply several model evaluation methods to assess the results obtained from the model, evaluate whether the model has achieved the set goals, and analyze
Using Machine Learning Algorithms to Diagnosis Melasma
93
the achieved criteria of the model, and make decisions on using the analysis results in practice. The rest of this paper is arranged as follows: Sect. 2 introduces the melasma data and the process of database analysis; Sect. 3 presents the algorithms used to build the machine learning model to diagnose melasma, together with the experimental results; Sect. 4 concludes this study.
2 Diagnostic Data for Melasma

Melasma is an acquired hyperpigmentation disease with complex etiology and pathogenesis. The primary lesion is macules and/or dark brown symmetrical patches in sun-exposed areas. Commonly affected sites are the cheeks, upper lip, chin, and forehead. Although the disease is benign, it greatly affects the psychology and aesthetics of the patient [5]. In women, the disease can be spontaneous or related to pregnancy [6]. Conclusions about melasma are based on clinical indications with the following characteristics [6]:

- Melasma patches on both cheeks, with no itch, no burning, and no scab.
- Hyperpigmented macules in other areas such as the eyebrows, private areas, chin, and nose.

Melasma is classified clinically into [9, 10]:

- Central melasma: involving the cheeks, eyebrows, upper lip, nose, and chin.
- Butterfly-shaped melasma: localized hyperpigmentation on the cheeks and nose.
- Mandibular melasma: related to the lower jaw area.

To treat melasma effectively, a combination of three factors is required: broad-spectrum sun protection, topical lightening agents, and elimination of known risk factors [11, 12]. In any case, melasma treatments are still limited and remain a challenge in dermatology [13]. Therefore, it is extremely important to predict and process the relevant variables [14, 15, 17, 19]. In this study, we use a dataset of 1,624 frontal photographs of melasma patients. The dataset is collected from many sources, with varying image brightness, to diversify the quality of the dataset. Images selected from the dataset help classify patients with central, butterfly-shaped, and mandibular melasma, as well as no melasma, as in Fig. 1.
The dataset is labeled as follows:

- 0: central body
- 1: butterfly body
- 2: mandibular body
- 3: no melasma

Fig. 1. Label for images
To label the images, we use the AnyLabeling application to label them manually; moreover, the experts, who are dermatologists of Quyhoa National Leprosy - Dermatology Hospital, reviewed the images and indicated melasma or no melasma and, if present, the type of melasma (central, butterfly-shaped, or mandibular). Each photo can have one or more labels depending on the type of melasma and the affected area. There is also a melasma cream photo in the set, which is marked with the corresponding melasma area depending on the patient. For the YOLO model, we save each annotation file as .txt, for example:

0 0.384534 0.346535 0.201271 0.250825

Each line in the annotation file has the form <object-class> <x_center> <y_center> <width> <height>, where <x_center> <y_center> are the coordinates of the object's center and <width> <height> its size. These values are normalized, so they always lie in the range [0, 1]. The object class is an index marking the class. For images with many labels, we assign labels in an order agreed upon in advance, because the annotation file stores only the label's index (0, 1, 2, 3, …), not the label name. After labeling is complete, we put each annotation file and its corresponding image in the same folder.
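For illustration, a small helper (hypothetical, not part of AnyLabeling or YOLO) can parse one such annotation line and map the normalized box back to pixel coordinates, assuming the 640-pixel image size used later in Sect. 3.4:

```python
def yolo_to_pixels(line: str, img_w: int, img_h: int):
    """Parse '<class> <x_center> <y_center> <width> <height>' (normalized)."""
    fields = line.split()
    cls = int(fields[0])
    xc, yc, w, h = (float(v) for v in fields[1:])
    # Normalized center/size -> absolute top-left / bottom-right corners.
    x1, y1 = (xc - w / 2) * img_w, (yc - h / 2) * img_h
    x2, y2 = (xc + w / 2) * img_w, (yc + h / 2) * img_h
    return cls, (x1, y1, x2, y2)

cls, box = yolo_to_pixels("0 0.384534 0.346535 0.201271 0.250825", 640, 640)
# cls == 0, i.e. the "central body" label
```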
3 Machine Learning Algorithm

3.1 About YOLO V8

YOLO (You Only Look Once) has been continuously evolving in the computer vision community since it was announced in 2015 by Joseph Redmon. In its early days (versions 1–4), YOLO was maintained in C in a custom deep-learning framework created by Redmon called Darknet. YOLOv5, released by Ultralytics, quickly became widely used thanks to its flexible structure. Over time, many models have branched out from YOLOv5, including Scaled-YOLOv4, YOLOv6, and YOLOv7, and other models with their own unique adjustments, such as YOLOX, have appeared around the world. At the same time, each version of YOLO introduced new strategies to improve the accuracy and productivity of the model. YOLOv8 is the most advanced version of YOLO and can be used for object detection and image classification tasks [16]. YOLOv8, created by Ultralytics, the same team that created YOLOv5, incorporates more changes and advancements in engineering and user engagement than YOLOv5. The YOLOv8 models appear to perform much better than previous YOLO models, standing above not only YOLOv5 but also YOLOv6 and YOLOv7.
Key reasons to consider using YOLOv8 for your next computer vision project:

- YOLOv8 has a high accuracy rate as measured on COCO and Roboflow 100.
- YOLOv8 comes with many developer-convenient features, from an easy-to-use CLI to a well-structured Python package.
- There is a large and growing community around the YOLOv8 model, meaning there are many people in the computer vision world who can assist you when you need guidance [8].

3.2 Anchor-Free Detection

YOLOv8 is an anchor-free model. This means it directly predicts the center of an object instead of the offset from a known anchor box [8].
Fig. 2. Visualization of an anchor box in YOLO
Anchor boxes were a notoriously tricky part of earlier YOLO models, since they may represent the distribution of the target benchmark's boxes but not the distribution of the custom dataset. Anchor-free detection reduces the number of box predictions, which speeds up Non-Maximum Suppression (NMS), a complicated post-processing step that sifts through candidate detections after inference.

3.3 Model for Diagnosing Melasma

To detect and classify melasma on face images taken by the camera, we use the machine learning program YOLO V8, prepared to identify and classify the components associated with melasma on the face. The labels used in the program are 0: central body; 1: butterfly body; and 2: mandibular body. In addition, the program recognizes the label '3' in cases of no facial melasma. Applied to the melasma image dataset, the program identifies and names the melasma locations on the face; the names given help determine the appropriate type of melasma for each defined site of the face. Based on the results of the model, we can distinguish the facial sites affected by melasma and relate them by comparing label names. Specifically, "central body" (name 0) is compared to the phrase 'central site', "butterfly body" (name 1) is compared to
'butterfly-shaped site', and "mandibular body" (name 2) is compared to 'chin site'. When no melasma is recognized, the term 'no melasma' is used. This approach, which improves understanding of the specific melasma condition, supports better classification and telediagnosis.

3.4 Results of Model Evaluation

The model was built on Google Colab with Python 3 and GPU hardware acceleration (NVIDIA A100-SXM4-40GB, 40,514 MiB). The number of images in the training dataset is 1,624; the number of images in the testing dataset is 176. The image size is 640 pixels, and training ran for 50 epochs. The datasets include the four types of labels mentioned above (Figs. 2, 3, 4, 5, 7 and 8).
Fig. 3. Description of melasma object detection data
Fig. 4. Results of training model
An object detection model is evaluated using the Precision, Recall, AP, and mAP50–95 parameters. mAP serves as the measurement criterion of the object detection model, computed via the IoU, Precision, Recall, Precision-Recall curve, AUC, and AP metrics. Precision evaluates the reliability of the model's conclusions (what percentage of them are correct). Recall evaluates the model's ability to find ground truths (what percentage of positive samples the model recognizes).
Fig. 5. Confusion Matrix
IoU (Intersection over Union): measures the overlap between the ground-truth bounding box and the bounding box that the model predicts.
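A minimal sketch of IoU and of the greedy NMS step mentioned in Sect. 3.2, assuming boxes are (x1, y1, x2, y2) tuples (illustrative only, not the YOLOv8 implementation):

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop overlapping candidates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) < iou_thresh for k in keep):
            keep.append(i)
    return keep
```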
Fig. 6. Evaluation of Precision
Figure 6 shows that the precision of the model's conclusions ranges from 0.86 to 1.
Fig. 7. Evaluation of the model’s ability to search all labels
Fig. 8. Evaluation of the model’s reliability
mAP (mean Average Precision) is the average of the AP values over the different classes; the larger the mAP, the better the model. The precision-recall relationship lets mAP evaluate the accuracy of the classification task: precision and recall change as the IoU threshold (the overlap required to count a predicted box as correct) changes. Therefore, at a given IoU value, it is possible to measure and compare the quality of the models.
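As an illustration, AP can be computed from a confidence-ranked list of detections marked as true or false positives (a simplified all-point interpolation sketch; the flags and counts below are invented, not taken from our experiments):

```python
def average_precision(tp_flags, num_gt):
    """AP = area under the interpolated precision-recall curve.

    tp_flags: detections sorted by descending confidence; True means the
    detection matched a ground-truth box (IoU above the threshold).
    num_gt: number of ground-truth boxes for this class.
    """
    precisions, recalls = [], []
    tp = fp = 0
    for flag in tp_flags:
        tp += flag
        fp += not flag
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Precision envelope: best precision achievable at this recall or higher.
    envelope = [max(precisions[i:]) for i in range(len(precisions))]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, envelope):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

ap = average_precision([True, True, False, True], num_gt=4)
```

Averaging such per-class AP values over all classes (and, for mAP50–95, over IoU thresholds from 0.5 to 0.95) yields the mAP figures reported below.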
Fig. 9. Results of model evaluation
According to Fig. 9, at epoch 50 the mAP50 score is 0.839, the mAP50–95 score is 0.52, and the precision score is 0.776, indicating that the model detects melasma objects quite well. The illustrations in Fig. 10, comparing detection results between the training and testing datasets, show that the YOLO V8 model gives quite accurate results.
Fig. 10. Labeled training dataset and the testing dataset
4 Conclusions

A limitation of this study is the small number of observed patients with mandibular melasma, which leads to few mandibular samples and an imbalance between the label classes. In this study, we presented the steps of data labeling, image classification, and building a machine-learning model using YOLO V8 to predict the possibility of an individual having melasma. With this approach, the proposed method exploits the patient's image data to support doctors in screening patients with melasma. The machine learning model achieves a prediction performance of 83.9% (mAP50); it thus helps prevent, diagnose, and treat the disease as well as reduce treatment costs. However, to make the model more accurate, it is necessary to collect community data from more individuals, from many different regions, which will take a lot of effort, time, and cost.
References

1. Meskó, B.: Artificial Intelligence is the Stethoscope of the 21st Century (2017)
2. Recht, M., Bryan, R.N.: Artificial intelligence: threat or boon to radiologists? J. Am. Coll. Radiol. 14(11), 1476–1480 (2017)
3. Houssami, N., Lee, C.I., Buist, D.S., Tao, D.: Artificial intelligence for breast cancer screening: opportunity or hype? Breast 36, 31–33 (2017)
4. Magazine, E., Roach, L.: Starting With Retina. Artificial Intelligence (2017)
5. Thường, N.V.: Bệnh rám má. Bệnh học Da liễu. NXB Y học (2017)
6. Salim, A., Rajaratnam, R., Domanne, E.S.M.: Evidence-Based Dermatology, 85–470 (2014)
7. Dhar, V.: Data science and prediction. Commun. ACM 56(12), 64–73 (2013)
8. Solawetz, J.F.: What is YOLOv8? The Ultimate Guide (2023). Accessed 27 May 2023
9. Balkrishnan, R., McMichael, A.J., Camacho, F.T., Saltzberg, F., Housman, T.S., Grummer, S., et al.: Development and validation of a health-related quality of life instrument for women with melasma. Br. J. Dermatol. 149, 572–577 (2003)
10. Katsambas, A., Antoniou, Ch.: Melasma. Classification and treatment. J. Eur. Acad. Dermatol. Venereol. 4(3), 217–223 (1995)
11. Lynde, C.B., Kraft, J.N., Lynde, C.W.: Topical treatments for melasma and postinflammatory hyperpigmentation. Skin Therapy Lett. 11(9), 1–12 (2006)
12. Yuri, T.J., Schwartz, R.A.: Treatment of Melasma. Evidence-based dermatology (2011)
13. Noh, T.K., Choi, S.J., Chung, B.Y., Kang, J.S., Lee, J.H., Lee, M.W., et al.: Inflammatory features of melasma lesions in Asian skin. J. Dermatol. 41(9), 788–794 (2014). https://doi.org/10.1111/1346-8138.12573
14. Ortonne, J.P., Arellano, I., Berneburg, M., Cestari, T., Chan, H., Grimes, P., et al.: A global survey of the role of ultraviolet radiation and hormonal influences in the development of melasma. J. Eur. Acad. Dermatol. Venereol. 23(11), 1254–1262 (2009). https://doi.org/10.1111/j.1468-3083.2009.03295.x
15. Saumya, P.: Agenda for future research in Melasma: QUO VADIS? J. Pigmentary Disord. 1(5), 1–5 (2014). https://doi.org/10.4172/JPD.1000e103
16. Rath, S.: YOLOv8 Ultralytics: State-of-the-Art YOLO Models
17. Van Lam, H., Anh, V.T., Diu, P.T.H.B., Viet, T.X.: Applying machine learning to predict Melasma. Int. J. Comput. Sci. Inf. Secur. (IJCSIS) 19(11) (2021)
18. Kassem, M.A., et al.: Machine learning and deep learning methods for skin lesion classification and diagnosis: a systematic review. Diagnostics 11, 1390 (2021)
19. Liu, L., et al.: An intelligent diagnostic model for melasma based on deep learning and multimode image input. Dermatol. Ther. (Heidelb.) 13, 569–579 (2023). https://doi.org/10.1007/s13555-022-00874-z
20. Mahbod, A., Ellinger, I.: Special issue on advances in skin lesion image analysis using machine learning approaches. Diagnostics 12(8), 1928 (2022)
Reinforcement Learning for Portfolio Selection in the Vietnamese Market

Bao Bui Quoc, Quang Truong Dang, and Anh Son Ta(B)

School of Applied Mathematics and Informatics, Hanoi University of Science and Technology, Hanoi, Vietnam
[email protected]
Abstract. In this paper, we explore the application of reinforcement learning in the context of Vietnam's rapidly growing financial market, where research on algorithmic trading in general remains limited. We implement and compare state-of-the-art reinforcement learning algorithms such as Proximal Policy Optimization (PPO) and Twin Delayed Deep Deterministic Policy Gradient (TD3) in an effort to improve trading strategies and decision-making. Additionally, we employ the Spectral Residual method to detect anomalies in sequence state spaces and mitigate potential risks. We conclude that Spectral Residual noise filtering delivers the best portfolio performance across the board, and that the Actor-Critic using Kronecker-Factored Trust Region (ACKTR) and the PPO attain dominance in portfolio performance on the training data and testing data, respectively.
Keywords: reinforcement learning · portfolio optimization · algorithm trading

1 Introduction
Algorithmic trading (AT) is the process of executing a trade order in the financial market at a specified moment using computerized programs and algorithms. Algorithmic trading splits time into discrete steps [1] and decides the execution of those steps based on current feedback from the market at that moment of time; hence, the goal of AT is to maximize the return value at the end of the trading period. In 2019, the Securities and Exchange Commission (SEC) concluded that algorithmic trading had reached a dominant status within the US financial industry [2], highlighting its increasing dominance and the growing reliance of market participants on advanced technologies and automated systems for conducting trades. In Vietnam, however, even though the nascent financial market has experienced considerable growth and progress during the 2010s (the Vietnamese market capitalization has reached 186.01 billion USD, approximately 53.7% of GDP [3]), research on algorithmic trading within the country remains surprisingly scarce. This is especially noteworthy given the increasing importance of algorithmic trading in modern financial markets, which has led to extensive

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 102–114, 2023. https://doi.org/10.1007/978-3-031-46573-4_10
Reinforcement Learning for Portfolio Selection in the Vietnamese Market
103
research and utilization in other regions [4]. Consequently, there’s a growing need for more comprehensive investigations and analyses on algorithmic trading in the context of Vietnam’s rapidly evolving financial landscape. There has been research on the application of neural networks to improve quantitative analyses, with the aim of helping traders make better executions. However, the highly competitive landscape has created a demand for more efficient and effective trading decision-makers. As a result, researchers have explored Reinforcement Learning (RL) as a promising approach to tackle this problem [1]. Our main contributions are three-fold and summarized as follows: – The application and comparison of reinforcement learning models in stock trading problems. – Developing a dedicated environment for handling Vietnamese data to address the specific constraints of the typical Vietnamese market, such as payment cycles. – We apply the Spectral Residual method [5] to find anomalies in sequence state spaces, thus mitigating potential risks in the decision-making process and enhancing the results.
2 Overview
2.1 State-of-the-Art Reinforcement Learning
Reinforcement Learning is a machine learning technique that enables an agent to learn decision-making in an uncertain environment. The agent learns by interacting with the environment, receiving rewards for good decisions and penalties for poor ones [6]. This feedback is then used to improve the agent’s decision-making process. The ultimate goal of the agent is to maximize the total reward it receives over time. The decision-making process of an agent can be modeled as a Markov Decision Process (MDP) [7,8]. An MDP is represented by the tuple <S, A, P, R, γ>, where S represents the set of states, A denotes the set of actions, P is the transition probability function, R is the reward function, and γ is the discount factor. The agent’s primary objective is to discover the optimal policy π*, which maximizes the expected return J(π) = E[Σ_{t=0}^{∞} γ^t R_t | π]; that is, π* = arg max_π J(π). This can be achieved by solving for the optimal value function:

V*(s) = max_{a∈A} [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V*(s') ],   (1)

where V*(s) denotes the value of state s under the optimal policy π*. From the optimal value function V*, we can derive the optimal policy π*:

π*(s) = arg max_{a∈A} [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V*(s') ],   (2)
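To make Eqs. (1)–(2) concrete, here is a minimal value-iteration sketch on a toy two-state MDP. The states, actions, transition probabilities, and rewards below are invented for illustration and are not taken from the paper:

```python
# Value iteration: iterate the Bellman optimality operator of Eq. (1),
# then extract the greedy policy of Eq. (2).
GAMMA = 0.9
STATES = ["low", "high"]
ACTIONS = ["hold", "trade"]
# P[(s, a)] = list of (next_state, probability); R[(s, a)] = immediate reward
P = {
    ("low", "hold"): [("low", 1.0)],
    ("low", "trade"): [("high", 0.7), ("low", 0.3)],
    ("high", "hold"): [("high", 1.0)],
    ("high", "trade"): [("low", 0.4), ("high", 0.6)],
}
R = {("low", "hold"): 0.0, ("low", "trade"): -0.1,
     ("high", "hold"): 1.0, ("high", "trade"): 0.5}

def q_value(V, s, a):
    # R(s, a) + gamma * sum over s' of P(s'|s, a) * V(s')
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)])

V = {s: 0.0 for s in STATES}
for _ in range(500):  # fixed-point iteration of Eq. (1)
    V = {s: max(q_value(V, s, a) for a in ACTIONS) for s in STATES}

# Eq. (2): the optimal policy is greedy with respect to V*
policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}
print(policy)
```

With these toy numbers, the iteration converges geometrically (the operator is a γ-contraction), and the greedy policy trades in the "low" state and holds in the "high" state.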
104
B. B. Quoc et al.
where π*(s) indicates the optimal action to take in state s. Model-Free (MF) [9] reinforcement learning has gained traction in the finance industry due to its flexibility and adaptability in complex, dynamic, and uncertain environments, since it requires neither a complete model of the environment nor prior contextual knowledge. MF algorithms consider episodes generated by a predefined strategy to accumulate experience and subsequently improve the strategy. They consist of three main steps, repeated until an optimal strategy is achieved: 1. Generate new samples (episodes) by executing the strategy in the environment; episodes run until the final state or a predetermined number of steps is reached. 2. Estimate the return acquired. 3. Improve the strategy using the gathered samples and estimates from Step 2. These basic steps characterize MF algorithms, and different implementations yield different algorithms. We consider three subclasses of MF algorithms as follows: Policy Optimization methods aim to find a parameterized strategy πθ(a|s). The parameters θ are optimized directly using gradients that maximize the objective function J(πθ), or indirectly by optimizing local approximations of J(πθ). Policy optimization algorithms include Vanilla Policy Gradient (VPG) [10], Trust Region Policy Optimization (TRPO) [11], Proximal Policy Optimization (PPO) [12], and Advantage Actor Critic (A2C) [13]. Q-Learning methods learn an approximation function Qθ(s, a) of the optimal action-value function Q*(s, a).
Typically, these algorithms optimize a target function based on the Bellman equation, and the corresponding strategy is derived through the relationship between Q* and π*: the Q-learning agent acts greedily, a(s) = arg max_a Qθ(s, a). Hybrid algorithms can be constructed to balance the strengths and weaknesses of policy optimization and Q-learning. Hybrid algorithms include Deep Deterministic Policy Gradient (DDPG) [4], Soft Actor-Critic (SAC) [14], and Twin Delayed DDPG (TD3) [15].
2.2 Related Work
The field of Reinforcement Learning offers an extensive and diverse range of methods for developing financial trading strategies, as evidenced by the rich literature and numerous research studies on this topic. The study in [26] deserves particular attention, as it was among the first to explore deep reinforcement learning for hedging in the Vietnamese stock market. The multi-agent reinforcement learning in conjunction with the actor-critic architecture highlighted in the paper
has demonstrated advanced capabilities to safeguard investments amid market turmoil. The paper [16] presents pioneering research that introduces two innovative designs based on the multi-agent deep deterministic policy gradient (MADDPG [17]) framework. These designs exploit the independent decision-making behavior of multiple actor agents to explore novel trading strategies, while a centralized critic network evaluates the performance of these independent actors, maximizing each agent’s profitability and distributing the associated risks evenly. The paper [18], on the other hand, compares two approaches to hedging: reinforcement learning and deep trajectory-based stochastic control, an architecture that uses a deep neural network to optimize the stochastic control problem represented as a computational graph. By treating the hedging actions and the hedging horizon as a stochastic control problem, this approach avoids the disadvantages of large dimensionality. Many studies focus on leveraging the Deep Deterministic Policy Gradient (DDPG) [4] algorithm and its various modifications to develop automated trading strategies. Majidi et al. [19] use the Twin-Delayed DDPG approach on daily close-price data of US stocks (Amazon) and cryptocurrencies (Bitcoin) to learn optimal trading strategies with continuous action spaces. Another notable example is presented in [20], where the authors propose a unique approach that combines a Double Q-Learning [21] algorithm with a feedforward linear network [22] approximation architecture. The method features a saving mechanism specifically designed to handle bearish market conditions, and highlights the potential of deep reinforcement learning for trading applications even without relying on traditional offline training processes.
Notably, [23] focuses on addressing the complex bidding mechanism in the electricity market to achieve a balanced reward. The authors employ a Dual-agent Deep Deterministic Policy Gradient (D2PG) [24] framework and leverage prior domain knowledge to enhance training efficiency. This allows them to tackle the challenges associated with power trading and to develop a more robust solution for the electricity market. Chen et al. [4] process financial data from the S&P 500, CSI 300, SSE Composite, and DJI using techniques introduced in classical Chan Theory. They experiment with the DDPG algorithm alongside DQN and other trading strategies, achieving significant success on the CSI 300 dataset.
3 Method
3.1 Modeling the Stock Trading Problem
First, we will model the stock trading problem as a Markov decision process as follows.
Fig. 1. The interactive loop between Agent and Environment in the stock trading problem.
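The interactive loop in Fig. 1 can be sketched in code as follows. The toy price series, the single-stock setting, and the random placeholder agent are simplifications invented for illustration; a trained policy would replace the random choice:

```python
import random

random.seed(0)

prices = [100.0, 101.5, 99.8, 102.3, 103.1, 101.0]  # toy adjusted-close series

balance, shares = 1000.0, 0      # b_t and h_t for a single stock
portfolio_value = balance

for t in range(len(prices) - 1):
    # Agent: pick an action a in {-1, 0, 1} (sell / hold / buy one share).
    action = random.choice([-1, 0, 1])
    if action == 1 and balance >= prices[t]:   # buy one share
        balance -= prices[t]
        shares += 1
    elif action == -1 and shares > 0:          # sell one share
        balance += prices[t]
        shares -= 1
    # Environment: move to t + 1 and emit the reward of Eq. (3),
    # the change in portfolio value v' - v.
    new_value = balance + shares * prices[t + 1]
    reward = new_value - portfolio_value
    portfolio_value = new_value

print(round(portfolio_value, 2))
```

At every step the portfolio value equals cash plus the market value of the holdings, which is the quantity whose change defines the reward signal fed back to the agent.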
State space S. The state space describes the observations that the agent obtains from the environment. As traders, prior to executing a trade, it is essential for us to examine, process, and analyze a multitude of market information. Similarly, when the agent performs a trade, it observes numerous features to enhance its learning and decision-making while interacting with the environment. We use the following features:
– Balance b_t ∈ R+: the current amount of money in the account at time step t (b_0 is the initial investment amount when starting to trade).
– Shares owned h_t ∈ Z^n_+: the quantity owned of each type of stock at time t, where n is the number of types of stocks.
– p_t ∈ R^n_+: the adjusted closing price of each stock at time t.
– o_t, h_t, l_t, c_t ∈ R^n_+: the opening, high, low, and closing prices of each stock at time t.
– Trading volume v_t ∈ R^n_+: the total number of shares traded at time t.
– Technical indicators: Moving Average Convergence Divergence (MACD), M_t ∈ R^n, and Relative Strength Index (RSI), R_t ∈ R^n_+, etc.
Action space A. The action space A describes the actions of the agent interacting with the environment at time t. In the simplest case, an action takes one of three values, a ∈ {−1, 0, 1}, where −1, 0, 1 represent selling, holding, and buying one share. An action can also be carried out on multiple shares: an action a ∈ A is then a vector in R^n, where n is the number of stocks, with entries in {−k, ..., −1, 0, 1, ..., k}, where k denotes a number of shares; −k means selling k shares and k means buying k shares.
Reward function R(s, a, s'). The reward function R(s, a, s') measures the efficiency of action a by comparing the portfolio value between state s at time t and
state s' at time t + 1. Specifically, action a is chosen when the agent observes state s; the action changes the holdings from h_t to h_{t+1} and the balance from b_t to b_{t+1} at the subsequent time step t + 1. Some of the reward functions involved are:
– The change in portfolio value when action a is taken at state s and the environment transitions to state s':

R(s, a, s') = v' − v,   (3)

where v and v' represent the portfolio values at states s and s', respectively.
– The portfolio log return:

R(s, a, s') = log(v'/v).   (4)

– The Sharpe ratio over T periods:

S_T = mean(R_t) / std(R_t),   (5)

where R_t = v_t − v_{t−1}.
Mechanism of Trading. We consider, at time t, the following actions on stock d (d = 1, ..., n):
– Selling: sell k of the shares held at the moment, where k ∈ [1, h_t[d]] is a positive integer. In this case, h_{t+1} = h_t − k.
– Holding: k = 0 and h_{t+1} does not change, h_{t+1} = h_t.
– Buying: buy k shares, so that h_{t+1} = h_t + k. In this case a_t[d] = −k is a negative integer.
Figure 1 describes the process of taking an agent’s action at time t: the three actions (“Buy”, “Sell”, “Hold”), whose probabilities sum to one, lead to three possible new portfolio values. Based on the action taken and the change in stock value, the portfolio moves from “portfolio value 0” to “portfolio value 1”, “portfolio value 2”, or “portfolio value 3” at time t + 1 (the “hold” action can still result in a change in portfolio value if the value of the stock changes).
Trading Constraints. To better simulate the trading process in reality, we incorporate additional constraints into transactions, such as risk aversion and transaction costs.
– Non-negative balance constraint: before carrying out a list of actions, we need to ensure that the balance b_{t+1} at the next time step is non-negative. We construct a list of indices ind such that its first d1 elements correspond to the indices of stocks with sell orders, and its last d2 elements correspond to the indices of stocks with buy orders:

sell_ind = ind[1 : d1],
buy_ind = ind[D − d2 + 1 : D],

p_t[sell_ind]^T a_t[sell_ind] + b_t + p_t[buy_ind]^T a_t[buy_ind] ≥ 0.   (6)
– Transaction costs: transaction costs can be configured as parameters of the environment, either as a fixed fee (a fixed amount of money for each transaction, regardless of the number of shares traded) or as a per-share percentage (calculated as a percentage of the total number of shares being traded).
– Risk aversion: to manage risks in adverse situations, such as the financial crisis of 2007–2008, we can use a turbulence index to measure market fluctuations:

turbulence_t = (y_t − μ)^T Σ^{−1} (y_t − μ) ∈ R,   (7)

where y_t ∈ R^n is the investment return at the current time period t, and μ ∈ R^n and Σ ∈ R^{n×n} are, respectively, the average return and covariance matrix of the stocks over the entire past period up to time t. The index serves as a control parameter for buying or selling actions: for example, if the turbulence index reaches a predetermined threshold, the agent stops buying and gradually starts selling the stocks it holds.
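The turbulence index of Eq. (7) can be sketched for n = 2 assets, with the 2×2 covariance matrix inverted in closed form. The return history below is invented for illustration:

```python
# Turbulence index of Eq. (7): (y_t - mu)^T Sigma^{-1} (y_t - mu)
# for n = 2 assets, with the 2x2 inverse computed explicitly.
hist = [  # past daily returns, one [asset1, asset2] pair per row
    [0.01, 0.02], [-0.02, -0.01], [0.015, 0.005],
    [0.0, 0.01], [-0.01, -0.02], [0.005, 0.0],
]
n = len(hist)
mu = [sum(r[j] for r in hist) / n for j in range(2)]  # mean return vector

def cov(i, j):
    # sample covariance over the past period
    return sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in hist) / (n - 1)

a, b, d = cov(0, 0), cov(0, 1), cov(1, 1)
det = a * d - b * b
inv = [[d / det, -b / det], [-b / det, a / det]]  # Sigma^{-1}

def turbulence(y):
    dev = [y[0] - mu[0], y[1] - mu[1]]
    return sum(dev[i] * inv[i][j] * dev[j] for i in range(2) for j in range(2))

print(turbulence([0.01, 0.0]))    # a typical day
print(turbulence([-0.08, 0.09]))  # an extreme day, far from the mean
```

Days whose returns lie far from the historical mean (in the Mahalanobis sense) score high, which is what triggers the "stop buying, start selling" rule above.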
3.2 Environment for Vietnamese Market
For the Vietnamese market, we have additional constraints due to the payment cycle. For example, if we buy VND stock on Tuesday (22/06/2021), the shares arrive in our account on Thursday (24/06/2021). However, since the shares arrive in the afternoon, we have to wait until the next trading session on the following day to sell them. For stocks purchased on Thursday, the settlement time excludes Saturday and Sunday. Similarly, when we sell on Tuesday (22/06/2021), the money is available in our account by Thursday morning (24/06/2021), at which point we can withdraw or transfer it from our securities account to our bank. Alternatively, we can use the “advance selling” service on the same day by paying a fee of 0.0375% × advance amount × number of days in advance. For stocks sold on Thursday and Friday, the settlement period likewise excludes weekends. The payment cycle currently applied on all three exchanges, HOSE, HNX, and UPCOM, is T+2 (two trading days). Therefore, before applying these algorithms to develop trading strategies in the burgeoning Vietnamese financial market, we implement the following:
1. In the first step, we create four “bags” to store stocks.
2. Initially, on the first day (time step t = 0), the purchased stocks are placed in the first bag. On the second day (t = 1), the purchased stocks go into the second bag. On the third day (t = 2), the purchased stocks go into the third bag. On the fourth day (t = 3), the purchased stocks go into the fourth bag.
3. On the fourth day, we can sell the stocks purchased on the first day. From the fifth day onward, before trading, we consolidate all the stocks in the second bag into the first bag, then move the stocks in the third bag to the second bag, and the stocks in the fourth bag to the third bag. At this point, the first bag contains stocks that can be sold, and any newly purchased stocks (from the fifth day onward) are placed only in the fourth bag. We sell stocks exclusively from the first bag and buy stocks exclusively into the fourth bag.
4. From the sixth day onward (t ≥ 4), we repeat the same process as on the fifth day.
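The four-bag bookkeeping above can be sketched as a small simulation. The daily purchase quantities are invented, selling is omitted, and only the count of sellable shares is tracked:

```python
bags = [0, 0, 0, 0]          # bags[0] is the sellable bag, bags[3] the newest
buys = [5, 3, 0, 2, 4, 0, 1] # shares bought on each day t = 0, 1, 2, ...
sellable_log = []

for t, bought in enumerate(buys):
    if t >= 4:
        # From the fifth day on: merge bag 2 into bag 1, shift the rest down.
        bags[0] += bags[1]
        bags[1], bags[2], bags[3] = bags[2], bags[3], 0
    if t < 4:
        bags[t] += bought    # days 1-4: fill bags 1-4 in order
    else:
        bags[3] += bought    # later purchases go only into the newest bag
    # Stocks in bag 1 may be sold from the fourth day (t = 3) onward.
    sellable_log.append(bags[0] if t >= 3 else 0)

print(sellable_log)
```

The day-0 purchase of 5 shares first becomes sellable on the fourth day (t = 3), matching step 3, and every later purchase waits for three rotations before reaching the sellable bag.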
3.3 Noise Filter
Noise can manifest in different ways and has undesired impacts on stock data; random noise, outliers, and environmental noise are among the prevalent types found in stock data. Identifying and eliminating noise from stock data is crucial for enhancing data quality and boosting precision in financial market analysis and forecasting. Techniques such as Spectral Residual (SR), Exponential Smoothing (ES), and the Kalman Filter (KF) can be employed for filtering noise in stock data. Consider a sequence of real values x = x_1, x_2, ..., x_n, x_i ∈ R (in particular, x can be a sequence of open, high, low, close, or volume values). An anomaly detection method takes the sequence x as input and outputs a sequence y = y_1, y_2, ..., y_n with y_i ∈ {0, 1}, where y_i = 1 indicates that x_i is an anomaly in the sequence x.
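A compact sketch of the Spectral Residual saliency map follows the general SR recipe (log-amplitude spectrum minus its local average, transformed back to the time domain). The O(n²) DFT, the window size q, and the test signal are illustrative choices, not the paper's implementation:

```python
import cmath
import math

def dft(x, sign=-1.0):
    # Naive O(n^2) discrete Fourier transform; sign=-1 forward, +1 inverse
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]

def spectral_residual(x, q=3):
    n = len(x)
    F = dft(x)
    log_amp = [math.log(abs(v) + 1e-8) for v in F]
    # residual = log spectrum minus its moving average (window radius q)
    avg = [sum(log_amp[max(0, i - q):i + q + 1]) /
           len(log_amp[max(0, i - q):i + q + 1]) for i in range(n)]
    residual = [cmath.exp(log_amp[i] - avg[i] + 1j * cmath.phase(F[i]))
                for i in range(n)]
    # saliency map: magnitude of the inverse transform of the residual spectrum
    return [abs(v) / n for v in dft(residual, sign=1.0)]

# smooth periodic signal with one injected spike
x = [math.sin(2 * math.pi * t / 16) for t in range(64)]
x[40] += 5.0
s = spectral_residual(x)
print(s.index(max(s)))  # index of the most salient point
```

Thresholding the saliency map (e.g. points several times above its mean) then yields the binary sequence y described above.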
4 Experimental Evaluation
The initial data for the problem includes the date, open price, high price, low price, close price, adjusted close price, and trading volume. These are common data fields that are widely used and easily collected for most types of stocks.
4.1 Data Pre-processing
The data pre-processing can be divided into two stages, before and after adding technical indicators. At each stage, we sequentially check for NULL elements in the data and process them by performing forward fill and backward fill (filling empty elements with the value of the preceding element and of the following element, respectively).
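The two fill passes can be sketched in plain Python (a pandas `ffill`/`bfill` pair would do the same; the sample price series is invented):

```python
def forward_backward_fill(values):
    filled = list(values)
    # forward fill: replace each NULL with the preceding observed value
    for i in range(1, len(filled)):
        if filled[i] is None:
            filled[i] = filled[i - 1]
    # backward fill: handle NULLs at the start (no preceding value exists)
    for i in range(len(filled) - 2, -1, -1):
        if filled[i] is None:
            filled[i] = filled[i + 1]
    return filled

prices = [None, None, 101.2, None, 103.4, None]
print(forward_backward_fill(prices))
# -> [101.2, 101.2, 101.2, 101.2, 103.4, 103.4]
```

Forward fill alone leaves leading NULLs untouched, which is why the backward pass is applied afterwards.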
4.2 Experimental Setup
Throughout our experiments, we employed stock datasets sourced from both the United States and Vietnam for training and evaluation. The Vietnamese dataset comprises 21 securities codes drawn from the 30 companies in the VN30 stock index. The data was collected between
January 5, 2017, and May 26, 2021. We partitioned the data into two sets: a training set spanning January 5, 2017, to December 1, 2020, and a testing set from December 1, 2020, to May 25, 2021. For the United States, we utilized stock data of the 30 companies in the DOW 30 index as of June 25, 2021 (source: https://www.cnbc.com/dow-30/). The statistical data covers the period from January 1, 2012, to June 25, 2021, divided into a training set from January 1, 2012, to January 1, 2019, and a testing set from January 1, 2019, to June 25, 2021. We used evaluation data from 2019 to 2021 for the US because this timeframe coincided with the outbreak of the Covid-19 pandemic, which had a significant impact on the economy; assessing data from this period enables more accurate observations of the model’s capabilities. The technical indicators employed in this study’s experiments consist of the Exponential Moving Average (EMA), Moving Average Convergence Divergence (MACD), Relative Strength Index (RSI), and Commodity Channel Index (CCI). We set the initial balance to 1,000,000.
4.3 Experimental Results
Our experiments in this subsection address the following four research questions:
– Q1. What are the optimal hyperparameters for each model in order to maximize performance?
– Q2. What are the outcomes of each model on the validation dataset?
– Q3. Among the three noise filtering algorithms, Spectral Residual (SR), Exponential Smoothing (ES), and Kalman Filter (KF), which yields the best results?
– Q4. How does the application of the best noise filtering algorithm affect the model’s results?
To answer these questions, we carry out comprehensive experiments as follows.
Comparative Study for Choosing the Model’s Hyperparameters (Q1). Each model has a set of hyperparameters that can be adjusted to optimize its performance and stability in various reinforcement learning tasks. In our experiments, within each set of hyperparameters, we select the key parameters that directly impact model performance for tuning (parameters whose default values already yield the best results are not mentioned). In this part, we utilize the larger training dataset from the United States for a better parameter selection process. The fine-tuned parameters for each model are as follows:
– PPO: the number of algorithm iterations, i.e., the timestep coefficient (ts = 50000), the learning rate (lr = 0.00025), the discount factor γ (γ = 0.99), and the clipping coefficient ε (ε = 0.2).
– TRPO: the timestep coefficient (ts = 100000) and the threshold δ (δ = 0.005) of the KL divergence constraint [25] in the optimization problem.
– TD3: the learning rate (lr = 0.0003) and the exploration factor (ef = 0.3).
– SAC: the learning rate (lr = 0.00025).
– ACKTR: the discount factor γ (γ = 0.95).
Here, we present a concrete instance of selecting the ideal hyperparameter value for the PPO model (Fig. 2). Each graph shows the portfolio’s value over time for a given hyperparameter value.
Performance Comparison of Models (Q2). We evaluate and compare the performance of TRPO and PPO, two policy optimization algorithms, along with the hybrid-class algorithms TD3, SAC, and ACKTR (Fig. 3). The Vietnamese dataset is used for both training and testing. In Fig. 3, the portfolios managed by the various algorithms exhibit a similar trend over time: portfolio values gradually increase from day 0 to day 270, followed by a sharp decline around day 300, after which the values recover and reach new heights. Throughout the period, the portfolios managed by SAC and PPO consistently dominate the chart. PPO’s portfolio achieves the highest value of 174,021 after 600 days, while SAC’s portfolio holds second position with a value of around 136,210 over the same period. Notably, TD3’s portfolio consistently underperforms the other models, with the lowest value of approximately 132,465 after 600 days. Interestingly, while the portfolio values of the other models suffer a setback after day 540 and fail to recover, PPO’s portfolio undergoes only a brief drop in value before quickly bouncing back and attaining the highest value at the end of the observation period.
In addition, PPO also exhibits an advantage in terms of runtime (TD3 takes 280.16 s; ACKTR takes 579.65 s; SAC takes 1042 s).

Table 1. Final balance on the testing set with and without noise filtering.

Model | SR     | ES     | KF     | Without noise filtering
PPO   | 174021 | 170711 | 169524 | 162503
TRPO  | 136182 | 136521 | 130029 | 121331
SAC   | 136210 | 131998 | 133226 | 132321
TD3   | 132465 | 133924 | 135912 | 131978
ACKTR | 134021 | 130096 | 129521 | 123222
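The per-model comparison behind Q3 can be checked directly from Table 1. The dictionary below simply transcribes the table's values:

```python
# Final testing balances from Table 1, per model and noise filter.
table1 = {
    "PPO":   {"SR": 174021, "ES": 170711, "KF": 169524, "none": 162503},
    "TRPO":  {"SR": 136182, "ES": 136521, "KF": 130029, "none": 121331},
    "SAC":   {"SR": 136210, "ES": 131998, "KF": 133226, "none": 132321},
    "TD3":   {"SR": 132465, "ES": 133924, "KF": 135912, "none": 131978},
    "ACKTR": {"SR": 134021, "ES": 130096, "KF": 129521, "none": 123222},
}
# For each model, which variant delivered the highest final balance?
best_filter = {model: max(row, key=row.get) for model, row in table1.items()}
print(best_filter)
```

This reproduces the observations discussed next: ES is best for TRPO, KF is best for TD3, and SR delivers the top balance for the remaining models, including the overall winner PPO.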
Fig. 2. Comparison results corresponding to the parameters of PPO.

Table 2. Final balances with and without SR noise filtering.

Model | Training, SR | Training, without SR | Testing, SR | Testing, without SR
PPO   | 2059947      | 1891561              | 174021      | 162503
TRPO  | 1869745      | 1708047              | 136182      | 121331
SAC   | 1951442      | 1792231              | 136210      | 132321
TD3   | 1961201      | 1807044              | 132465      | 131978
ACKTR | 2151997      | 2091508              | 134021      | 123222
Comparison Results when Applying Noise Filtering Algorithms (Q3, Q4). From Table 1, one can conclude that noise filtering leads to a significant improvement in portfolio value, ranging from 3.05% to 12.39%. TRPO achieves its best performance with ES noise filtering, while KF noise filtering offers the largest improvement to TD3’s portfolio balance. SR reigns supreme overall, with PPO delivering the highest portfolio value. Table 2 yields further intriguing findings: first of all, as in Table 1, the models utilizing SR noise
Fig. 3. Performance comparison of the algorithms.
filtering consistently exhibit superior performance compared to those without the filter, with improvements ranging from 2.89% to 8.88% on the training data and from 3.03% to 11.57% on the testing data. Remarkably, PPO dominates the testing data with the highest balance (174,021 with SR and 162,503 without SR), while ACKTR has the best performance on the training data (2,151,997 with SR and 2,091,508 without SR) but vastly underperforms on the testing data, posting the second-lowest balance both with and without SR (134,021 and 123,222, respectively) and only slightly surpassing TD3 in that regard.
5 Conclusion
In this paper, we have presented applications of reinforcement learning algorithms for portfolio selection in the Vietnamese stock market. We effectively applied methods such as noise filtering to enhance model performance, as evidenced by the positive numerical results. Although further practical constraints still need to be applied to the model (such as handling and responding to sudden market fluctuations) and addressed more optimally, the potential application of our approach to real-world portfolio selection problems in the Vietnamese financial market is highly promising.
References
1. Sun, S., Wang, R., An, B.: Reinforcement learning for quantitative trading. ACM Trans. Intell. Syst. Technol. 14, 1–29 (2021)
2. U.S. Securities and Exchange Commission Staff: Staff Report on Algorithmic Trading in U.S. Capital Markets. Technical report, U.S. Securities and Exchange Commission (2020)
3. Market capitalization of listed domestic companies (% of GDP) - Vietnam. https://data.worldbank.org/indicator/CM.MKT.LCAP.GD.ZS?locations=VN
4. Chen, J.-C., Chen, C.-X., Duan, L.-J., Cai, Z.: DDPG based on multi-scale strokes for financial time series trading strategy (2022)
5. Meli, E., Morini, B., Porcelli, M., Sgattoni, C.: Solving nonlinear systems of equations via spectral residual methods: stepsize selection and applications (2021)
6. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. The MIT Press, Cambridge, MA (2018)
7. Bellman, R.E.: Dynamic Programming. Princeton University Press, Princeton, NJ (1957)
8. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Hoboken (1994)
9. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
10. Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Solla, S., Leen, T., Muller, K. (eds.) Advances in Neural Information Processing Systems, vol. 12. MIT Press (1999)
11. Schulman, J., Levine, S., Moritz, P., Jordan, M.I., Abbeel, P.: Trust region policy optimization. CoRR abs/1502.05477 (2015)
12. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms (2017)
13. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning (2016)
14. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor (2018)
15. Fujimoto, S., van Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods (2018)
16. Zhang, H., Shi, Z., Hu, Y., Ding, W., Kuruoglu, E.E., Zhang, X.-P.: Strategic trading in quantitative markets through multi-agent reinforcement learning (2023)
17. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments (2020)
18. Fathi, A., Hientzsch, B.: A comparison of reinforcement learning and deep trajectory based stochastic control agents for stepwise mean-variance hedging (2023)
19. Majidi, N., Shamsi, M., Marvasti, F.: Algorithmic trading using continuous action space deep reinforcement learning. Expert Syst. Appl. 235, 121245 (2022)
20. Lazov, B.: A deep reinforcement learning trader without offline training (2023)
21. van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning (2015)
22. Ehlers, R.: Formal verification of piece-wise linear feed-forward neural networks. In: D’Souza, D., Narayan Kumar, K. (eds.) ATVA 2017. LNCS, vol. 10482, pp. 269–286. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68167-2_19
23. Wang, Y., Swaminathan, V.R., Granger, N.P., Perez, C.R., Michler, C.: Deep reinforcement learning for power trading (2023)
24. Zhan, M., Chen, J., Du, C., Duan, Y.: Twin delayed multi-agent deep deterministic policy gradient. In: 2021 IEEE International Conference on Progress in Informatics and Computing (PIC), pp. 48–52 (2021)
25. Campbell, S.L., Gear, C.W.: The index of general nonlinear DAEs. Numer. Math. 72(2), 173–196 (1995)
26. Pham, U., Luu, Q., Tran, H.: Multi-agent reinforcement learning approach for hedging portfolio problem. Soft Comput. 25(12), 7877–7885 (2021). https://doi.org/10.1007/s00500-021-05801-6
AIoT Technologies
A Systematic CL-MLP Approach for Online Forecasting of Multiple Key Performance Indicators
Pha Le1,2, Triet Le1,2, Thien Pham1,2, and Tho Quan1,2(B)
1 Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam
{ltpha.sdh20,triet.lecsk20,pcthien.sdh20,qttho}@hcmut.edu.vn
2 Vietnam National University Ho Chi Minh City, Ho Chi Minh City, Vietnam
Abstract. The prediction of key performance indicators in mobile networks has helped improve resource utilization through powerful applications of machine learning and deep learning. Based on these forecasts, telecommunications network operators can proactively allocate resources or prevent incidents that affect key performance. However, previous studies often focused on a few specific indicators. In this paper we perform a deep analysis of 4G key performance indicator data covering multiple aspects, such as user traffic, average download speed, service drop rate, handover success rate, and service setup success rate, in real time. Using a deep learning approach and an online learning method, we propose a CL-MLPs model that combines Convolutional Neural Network (CNN), LSTM, and Multi-Layer Perceptron (MLP) architectures and can predict multiple key performance indicators at the same time with high accuracy, making it applicable to predicting anomalies on mobile networks.
Keywords: key performance indicator · online training · time-series forecasting
1 Introduction
4G technology plays a vital role in Vietnam’s digital economy aspirations [1]. It offers high-speed and dependable mobile internet connectivity for millions of Vietnamese users, facilitating their access to various online platforms and resources. 4G also encourages innovation and improves productivity, efficiency, and quality of life across different sectors and communities. 4G is essential for 5G implementation and will coexist with 5G in the long term. The dynamic wireless communication environment challenges many applications, such as adaptive multimedia streaming services [2]. Hence, Vietnam should enhance and streamline its 4G infrastructure and market. 4G key performance indicators (KPIs) depend on various parameters, such as signal strength, bandwidth, latency, jitter, and packet loss. Predicting 4G KPIs can assist network operators in optimizing network performance and service quality, and help users select the optimal network provider and plan.
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 117–127, 2023. https://doi.org/10.1007/978-3-031-46573-4_11
118
P. Le et al.
Forecasting key performance from historical data is crucial for network management. It helps network operators optimize resources, improve service quality, and avoid failures. However, most existing methods use conventional models such as CNN and LSTM, which are not well suited to 4G data. 4G data have high speed, large bandwidth, and diverse services that require more advanced and adaptive models for accurate forecasting. Moreover, conventional models do not support online learning, which is essential for adapting to the dynamic changes in real-time data. Therefore, we need new and innovative models for key performance forecasting that can leverage 4G data and overcome its challenges. We propose a prediction model for real 4G data in Vietnam, a fast-growing mobile market in Southeast Asia. Our model combines convolutional neural networks (CNN) and long short-term memory (LSTM) networks, which extract spatial and temporal features, respectively. Our model also uses online learning to update its parameters dynamically with streaming data. The main results of our research are as follows.
1. This study proposes a multivariate CL-MLPs model that integrates three advanced neural network architectures: CNN, LSTM, and MLPs.
2. The model uses online learning to leverage the continuous collection of multiple key performance indicator data and to model the complex relationships and dynamics among various features and metrics of the network.
3. This study presents a new technique for the task and validates its performance and reliability on a realistic dataset, demonstrating that it surpasses various existing methods in terms of precision and adaptability.
The structure of this paper is as follows. Section 1 provides the background and motivation of our study and introduces our real-world network dataset.
Section 2 formulates the 4G prediction problem and illustrates the input data with an example KPI. Section 3 surveys the existing literature on key performance forecasting and identifies the research gap. Section 4 describes our proposed models with their architectures, parameters, and training methods. Section 5 evaluates the forecasting results and compares the models using various metrics. Section 6 summarizes and suggests future work.
2 Preliminaries
In a 4G network system, network signals are emitted from CELLs installed at different geographical locations. Each CELL serves a certain number of users near its location. To measure the quality of a CELL’s network, we rely on KPIs. In practice, evaluating the network quality of a CELL requires 5 basic KPIs: CSSR, USER UL AVG THPUT, SERVICE DROP ALL, TRAFFIC, and INTRA FREQUENCY HO, as described in Table 1. In reality, the data is collected at the hourly level, represented as time
A Systematic CL-MLP Approach for Online Forecasting
series. Figure 1 illustrates the hourly data collection of the KPI USER UL AVG THPUT. Other KPIs are collected in a similar way.

Table 1. Data sources for the multi-KPI forecasting model.

  KPI                   Description
  CSSR                  Call setup success rate
  USER UL AVG THPUT     Average user throughput on the uplink
  SERVICE DROP ALL      Loss rate of all services
  TRAFFIC               Network traffic
  INTRA FREQUENCY HO    Successful handover rate between cells of the same frequency
Fig. 1. Representing USER UL AVG THPUT data as a time series
Based on this information, the engineer responsible for the area can monitor, assess, analyze, and propose solutions to improve KPIs and network quality. Therefore, forecasting in 4G networks is the problem of predicting the future values of these KPIs based on the values collected previously. This is a multivariate time series forecasting problem. However, for 4G networks, no previous work has used all 5 KPIs described above, due to the lack of real data. Moreover, a forecasting model that handles all 5 KPIs at the same time encounters many difficulties because the nature of the data differs for each KPI. These are the obstacles we tackle in this study.
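To make the multivariate formulation concrete, the sketch below (not from the paper; the function name and random data are illustrative) slices a (T, k) matrix of hourly KPI values into sliding history windows of length N = 24 with one-step-ahead targets, the supervised shape a windowed forecasting model would consume:

```python
import numpy as np

def make_windows(series, n_in=24, n_out=1):
    """Slice a (T, k) multivariate series into supervised (X, y) pairs:
    each X holds n_in past steps of all k KPIs, each y the next n_out steps."""
    T = series.shape[0]
    X, y = [], []
    for t in range(T - n_in - n_out + 1):
        X.append(series[t:t + n_in])
        y.append(series[t + n_in:t + n_in + n_out])
    return np.asarray(X), np.asarray(y)

# 100 hourly observations of 5 KPIs (synthetic placeholder data)
data = np.random.rand(100, 5)
X, y = make_windows(data, n_in=24, n_out=1)
print(X.shape, y.shape)  # (76, 24, 5) (76, 1, 5)
```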
3 Related Works

3.1 Time Series Forecasting Models
Many studies predict key performance indicators using statistical or ML methods [4]. They consider network factors such as KPIs, user mobility, and regional
congestion [5, 6]. This helps to control network functions proactively and improve the Quality of Experience of end users [8]. Time series forecasting is an important technique for predicting future values based on past observations. It has many applications in domains such as finance, economics, and network monitoring. Traditional statistical methods like ARIMA have long been used for time series modeling and forecasting. For example, [3] proposed a model to predict and classify network traffic (the TRAFFIC KPI) using time series forecasting methods with basic machine learning and statistical models such as LSTM and ARIMA. They found that classical algorithms performed better than basic machine learning methods in terms of complexity and accuracy for network traffic prediction. They argued that the data complexity was not high enough for deep learning models (such as LSTM) to outperform classical methods. However, this does not imply that deep learning is ineffective for network index data; the reason is discussed in the next section. In recent years, machine learning approaches like deep neural networks have emerged as powerful techniques, often outperforming classical models. The authors of [9] present DeepAnT, a deep learning approach for detecting time series anomalies. DeepAnT combines a CNN-based predictor with an anomaly detector that compares predictions and observations. DeepAnT can handle anomalies without filtering them out during training. A schematic representation of the model’s structure is shown in Fig. 2. Experiments on 10 datasets show that DeepAnT surpasses 15 other methods in most cases, proving the suitability of CNNs for time series tasks. Hybrid models that combine CNN and LSTM have proven very effective for univariate and multivariate time series forecasting in different domains. For example, [15] proposed a CNN-LSTM model for passenger demand prediction that outperformed ARIMA and single-model RNN baselines.
[16] extensively evaluated deep learning models in multivariate time series classification, finding that convolutional-LSTM architectures achieved the best overall performance.
Fig. 2. DeepAnT architecture for time series prediction [9]
3.2 Online Learning
Online learning methods that update models continuously as new data arrives are also promising for time series problems [11]. They are suitable for large or changing data sets. For example, e-commerce websites use them to recommend products based on recent views. Online learning is beneficial for large, nonstationary datasets like network traffic. Some online learning methods are:

- Stochastic Gradient Descent (SGD): optimizes the model parameters incrementally, in real time, as each new sample arrives [12].
- Passive Aggressive (PA): updates the model parameters only when the model makes a mistake on a new data sample [13].
- Perceptron: updates the model whenever a new data sample is provided and finds the optimal classification equation for the different classes [14].

Our proposed CL-MLP model builds on recent advances in deep neural networks for time series. The CNN layers learn informative features from multivariate traffic data. The LSTM layers capture long-term temporal dependencies. Online learning enables real-time adaptation to new patterns. Our experiments demonstrate CL-MLP’s accuracy for KPI forecasting, outperforming models like DeepAnT. Deep learning holds significant promise for advancing network monitoring capabilities.
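As a minimal illustration of the Passive-Aggressive rule mentioned above (a sketch, not the paper’s implementation), the update below leaves the weights untouched when the margin on a new sample is satisfied and otherwise moves them just enough to correct the mistake:

```python
import numpy as np

def pa_update(w, x, y):
    """Passive-Aggressive step: change w only when the incoming sample
    (x, y in {-1, +1}) is misclassified or falls inside the unit margin."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss > 0.0:
        tau = loss / np.dot(x, x)   # smallest step that fixes the violation
        w = w + tau * y * x
    return w

w = np.zeros(3)
x = np.array([1.0, 0.0, 0.0])
w = pa_update(w, x, +1)   # margin violated -> aggressive step
print(w)                  # [1. 0. 0.]
w = pa_update(w, x, +1)   # margin now satisfied -> passive, no change
```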
4 CL-MLP

4.1 Our Workflow
Fig. 3. A comprehensive framework for estimating various key performance metrics based on the input data (pipeline stages: KPIs → transformed KPIs → normed KPIs → CL-MLP → predicted change → inverse transform)
Our workflow in Fig. 3 improves the quality of our network analysis. We use several indicators of network performance as input variables and construct new time series with Δt(24h) = T_24 − T_0, which represents the difference between the current and the previous day. This transformation yields a more uniform and distinct data distribution between normal and abnormal cases, as
shown in Fig. 4. Next, we normalize our time series and feed them to a CL-MLP model, a type of neural network that can capture complex patterns and relationships. The model outputs a predicted value for each indicator, which we then inverse transform to recover the original scale.
Fig. 4. Data distribution between normal and abnormal cases after transform
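A minimal sketch of the transform–normalize–inverse pipeline described above, assuming simple day-over-day differencing and min-max scaling (the function names and the random series are illustrative, not the paper’s data):

```python
import numpy as np

def day_diff(kpi, period=24):
    """Delta_t(24h) = T24 - T0: subtract the value at the same hour of
    the previous day from each hourly KPI value."""
    return kpi[period:] - kpi[:-period]

def minmax_norm(x):
    """Scale to [0, 1]; return the stats needed for the inverse transform."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo), (lo, hi)

def minmax_inverse(x_norm, stats):
    lo, hi = stats
    return x_norm * (hi - lo) + lo

rng = np.random.default_rng(0)
hourly = rng.normal(size=72).cumsum()   # 3 days of one synthetic hourly KPI
diff = day_diff(hourly)                 # 48 day-over-day differences
normed, stats = minmax_norm(diff)
restored = minmax_inverse(normed, stats)  # recovers diff exactly
```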
4.2 Model Construction
In this paper, we propose a novel method for 4G multiple time-series prediction with deep learning. The research employs a hybrid model of convolutional neural networks (CNNs) and long short-term memory (LSTM) networks for time series analysis [10]. Combining CNN and LSTM networks is an effective method for solving time series prediction problems in machine learning. The hybrid model uses the CNN to extract features from the input data, while the LSTM learns the relationships between the features in the time series. The combination of CNN and LSTM helps the model learn complex features and the relationships between those features in the time series.

We formulate our method as follows. Let z_{i,T} = [x_{i,T-N+1}, x_{i,T-N+2}, ..., x_{i,T}] be the input vector at time step T, where x_{i,T} is the value of the i-th time series and N is the window size of the encoder. We assume that we have N historical observations and we want to predict the next H steps. The CNN takes z_T = [z_{1,T}, ..., z_{n,T}], where n is the number of time series, as input and produces a feature vector f_T as output. The CNN applies a series of filters and pooling operations to extract features from z_T, while the LSTM uses a recurrent structure to capture the temporal dependencies in z_T.

f_T = CNN(z_T)    (1)

h_T = LSTM(z_T)    (2)

The concatenation layer combines f_T and h_T into a single vector z'_T:

z'_T = [f_T ; h_T]    (3)

The MLPs take z'_T as input and produce a prediction vector

\hat{z}_{T+H} = [\hat{x}_{1,T+1}, ..., \hat{x}_{1,T+H}, ..., \hat{x}_{n,T+1}, ..., \hat{x}_{n,T+H}]    (4)

as output:

\hat{z}_{T+H} = MLP(z'_T)    (5)

The model is trained by back-propagation on the mean squared error (MSE) metric:

MSE = (1 / (n × H)) Σ_{i=1}^{n} Σ_{j=1}^{H} (\hat{x}_{i,T+j} − x_{i,T+j})^2    (6)
Our CL-MLP model is illustrated in Fig. 5.
Fig. 5. Our novel framework CL-MLP for the task of multivariate time series forecasting with online learning
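To clarify the shapes flowing through Eqs. (1)–(5), the numpy sketch below mimics one forward pass of the hybrid: a toy 1-D convolution produces f_T, a minimal hand-rolled LSTM cell produces h_T, and an MLP head maps their concatenation to the n × H prediction vector. All sizes and random weights are illustrative, not the paper’s trained model:

```python
import numpy as np

rng = np.random.default_rng(42)
n, N, H = 5, 24, 6            # KPIs, window size, forecast horizon
z = rng.normal(size=(N, n))   # input window z_T, one row per time step

# --- CNN branch: one width-3 filter over time + global max pooling ---
kernel = rng.normal(size=(3, n))
conv = np.array([(z[t:t + 3] * kernel).sum() for t in range(N - 2)])
f = np.array([conv.max()])    # pooled feature f_T

# --- LSTM branch: a single minimal LSTM cell unrolled over the window ---
d = 8                                      # hidden size
Wx = rng.normal(size=(4 * d, n)) * 0.1
Wh = rng.normal(size=(4 * d, d)) * 0.1
h, c = np.zeros(d), np.zeros(d)
sig = lambda v: 1.0 / (1.0 + np.exp(-v))
for t in range(N):
    g = Wx @ z[t] + Wh @ h
    i, fgt, o, u = sig(g[:d]), sig(g[d:2*d]), sig(g[2*d:3*d]), np.tanh(g[3*d:])
    c = fgt * c + i * u
    h = o * np.tanh(c)                     # h_T after the loop

# --- concatenate [f_T ; h_T] and project with an MLP head ---
feat = np.concatenate([f, h])
W1 = rng.normal(size=(16, feat.size)) * 0.1
W2 = rng.normal(size=(n * H, 16)) * 0.1
pred = W2 @ np.maximum(W1 @ feat, 0)       # \hat{z}_{T+H}, length n * H
print(pred.shape)  # (30,)
```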
4.3 Online Learning
Stochastic gradient descent (SGD) is a widely used optimization algorithm for online learning, a form of machine learning where the model parameters are updated continuously based on new data. One way to update our model parameters in real time is to use the following formula:

θ_{t+1} = θ_t − η_t ∇f_t(θ_t)    (7)
where θt is the parameter vector in the time step t, ηt is the learning rate, and ∇ft (θt ) is the gradient of the loss function ft with respect to θt . This formula can be interpreted as moving the parameters in the opposite direction of the gradient by a small step size determined by the learning rate.
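Equation (7) can be sketched as a streaming update on a linear model with squared loss; the learning rate, seed, and synthetic data below are illustrative:

```python
import numpy as np

def sgd_step(theta, x, y, eta=0.05):
    """One online SGD step on f_t(theta) = (theta.x - y)^2 / 2:
    theta_{t+1} = theta_t - eta * grad f_t(theta_t)."""
    grad = (theta @ x - y) * x
    return theta - eta * grad

rng = np.random.default_rng(1)
true_theta = np.array([2.0, -1.0])
theta = np.zeros(2)
for _ in range(2000):                     # simulated stream of noisy samples
    x = rng.normal(size=2)
    y = true_theta @ x + 0.01 * rng.normal()
    theta = sgd_step(theta, x, y)
print(theta)  # close to [2, -1]
```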
5 Experiment Results

5.1 Dataset
A data set of 2200 files was obtained from the process of managing the degradation of key performance indicators (KPIs) from various input sources, such as the property management system (PMS) KPI evaluation, the measurement results KPI evaluation monitoring system (MMS, Mentor), and the mobile network monitoring system. Each file contained quality data for 5 KPIs in the form of hourly time series of KPI values. To detect an abnormality in a CELL, which serves a number of users in its vicinity, we used the past n values of the time series (n = 24).

5.2 Our Results
Online learning offers several advantages for time series forecasting compared to traditional batch learning. It allows continuous learning and adaptation to changes in the underlying patterns of the data, faster response times, scalability, improved accuracy, and greater flexibility in the modeling approach. These benefits make online learning particularly useful for forecasting time series with large amounts of data, like 4G network performance, where real-time monitoring and quick adaptation to changes are essential (shown in Fig. 6). As more data become available, online learning can provide more accurate predictions and enable better decision making, leading to more effective planning and resource allocation.

Table 2. Model Performance Comparison

  Model     Average Loss   Average MAE   Average RMSE
  CL-MLP    0.1686         0.1724        0.4656
  DeepAnT   0.2959         0.1994        0.5438
  LSTM      0.2269         0.1822        0.4753
Table 2 shows that CL-MLP has the lowest average loss and DeepAnT the highest. For MAE, DeepAnT has slightly higher error than CL-MLP and LSTM. For RMSE, CL-MLP has the lowest error while DeepAnT has the highest. Overall, these numerical results on the test set indicate that CL-MLP performs the best of the three models in terms of loss and RMSE, LSTM is comparable to CL-MLP on MAE, and DeepAnT lags on all metrics. Figure 7 (a) illustrates the loss values of the three forecasting models over 100 epochs: DeepAnT, CL-MLP, and LSTM. The figure shows that DeepAnT exhibits the highest loss and the slowest convergence among the models, whereas CL-MLP achieves the lowest loss and the fastest convergence. LSTM performs moderately in terms of loss and convergence. These findings indicate that CL-MLP is the most effective model for this task, followed by LSTM and DeepAnT. A similar pattern is evident in the comparison of MAE and RMSE metrics in Fig. 7 (b). As our prediction outcomes in Fig. 8 show, our CL-MLP model approximates the true values with high accuracy.

Fig. 6. Results of our online training methodology on the INTRA FREQUENCY HO KPI.
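For reference, the MAE and RMSE metrics used in Table 2 and Fig. 7 can be computed as follows (toy arrays, not the paper’s data):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
print(mae(y_true, y_pred), rmse(y_true, y_pred))  # 0.25 0.3535...
```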
Fig. 7. Predictive loss across KPIs for DeepAnT (blue line), LSTM (green line), and CL-MLP (red line) models. The mean absolute error (MAE) and root mean squared error (RMSE) values compare total loss across all five KPIs forecast by the three models. (Color figure online)
Fig. 8. A graphical representation of the comparison between the actual values and the predicted values obtained by our proposed method (CL-MLP). (Color figure online)
6 Conclusion
In this paper, we have tackled the problem of predicting key performance indicators based on continuous learning and our proposed method (CL-MLP). We have presented a novel end-to-end framework that leverages the continuous collection of key performance indicator data to capture the complex patterns and dependencies among different network features and metrics. We have also discussed how our framework can be applied to network management and optimization, as well as its limitations and future directions for further improvement. Our extensive experiments and analysis on real-world datasets have demonstrated that our framework outperforms existing methods in terms of performance and cost-effectiveness. This is significant as the target tasks are already practically used in industry. For future research, we suggest modifying and extending the CL-MLP framework, optimizing the data collection system, and adding more features and metrics to the prediction model. We also encourage exploring other forecasting frameworks that can surpass the CL-MLP framework, and using more refined features to increase the diversity, accuracy, and generalization of the performance metrics and the key performance indicator prediction system.
Acknowledgement. This research is funded by the Vietnam Posts and Telecommunications Group (VNPT) under grant number 169KHCN2021005.
References

1. Nguyen, T.T., Nguyen, T.H., Nguyen, T.H., Nguyen, T.H.: The development of the digital economy in Vietnam. J. Asian Financ. Econ. Bus. 7(12), 1001–1010 (2020)
2. Abou-Zeid, H., Hassanein, H.S., Valentin, S.: Energy-efficient adaptive video transmission: exploiting rate predictions in wireless networks. IEEE Trans. Veh. Technol. 63(5), 2013–2026 (2014)
3. Azari, A., Papapetrou, P., Denic, S., Peters, G.: Cellular traffic prediction and classification: a comparative evaluation of LSTM and ARIMA. In: Kralj Novak, P., Šmuc, T., Džeroski, S. (eds.) DS 2019. LNCS (LNAI), vol. 11828, pp. 129–144. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-33778-0_11
4. Bui, N., Cesana, M., Hosseini, S.A., Liao, Q., Malanchini, I., Widmer, J.: A survey of anticipatory mobile networking: context-based classification, prediction methodologies, and optimization techniques. IEEE Commun. Surv. Tutor. 19(3), 1790–1821 (2017)
5. Fahad Iqbal, M., Zahid, M., Habib, D., John, L.K.: Efficient prediction of network traffic for real-time applications. J. Comput. Netw. Commun. 2019, 4067135 (2019)
6. Elsherbiny, H., Abbas, H.M., Abou-Zeid, H., Hassanein, H.S., Noureldin, A.: 4G LTE network throughput modelling and prediction. In: GLOBECOM 2020 - 2020 IEEE Global Communications Conference, pp. 1–6. IEEE (2020)
7. Gebrie, H., Farooq, H., Imran, A.: What machine learning predictor performs best for mobility prediction in cellular networks? In: 2019 IEEE International Conference on Communications Workshops (ICC Workshops), pp. 1–6. IEEE (2019)
8. Santos, G.L., Endo, P.T., Sadok, D., Kelner, J.: When 5G meets deep learning: a systematic review. Algorithms 13(9), 208 (2020)
9. Munir, M., Siddiqui, S.A., Dengel, A., Ahmed, S.: DeepAnT: a deep learning approach for unsupervised anomaly detection in time series. IEEE Access 7, 1991–2005 (2019)
10. Du, Q., Gu, W., Zhang, L., Huang, S.-L.: Attention-based LSTM-CNNs for time-series classification, pp. 410–411 (2018). https://doi.org/10.1145/3274783.3275208
11. Hoi, S.C.H., Sahoo, D., Lu, J., Zhao, P.: Online learning: a comprehensive survey. Neurocomputing 459, 249–289 (2021)
12. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G. (eds.) Proceedings of COMPSTAT 2010, pp. 177–186. Physica-Verlag HD, Heidelberg (2010). https://doi.org/10.1007/978-3-7908-2604-3_16
13. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S.: Online passive-aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006)
14. Gallant, S.I.: Perceptron-based learning algorithms. IEEE Trans. Neural Netw. 1(2), 179–191 (1990)
15. Cui, Z., Henrickson, K., Ke, R., Wang, Y.: Traffic graph convolutional recurrent neural network: a deep learning framework for network-scale traffic learning and forecasting. IEEE Trans. Intell. Transp. Syst. 21(11), 4883–4894 (2018)
16. Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., Muller, P.-A.: Deep learning for time series classification: a review. Data Min. Knowl. Disc. 33(4), 917–963 (2019). https://doi.org/10.1007/s10618-019-00619-1
Neutrosophic Fuzzy Data Science and Addressing Research Gaps in Geographic Data and Information Systems A. A. Salama1 , Roheet Bhatnagar2 , N. S. Alharthi3 , R. E. Tolba4 , and Mahmoud Y. Shams5(B) 1 Dean of Higher Institute of Computer Science and Information System, 6 October, Giza, Egypt 2 Department of Computer Science and Engineering, Manipal University Jaipur, Jaipur,
Rajasthan, India [email protected] 3 Department of Mathematics, Faculty of Science and Arts, King Abdulaziz University, Rabigh 25732, Saudi Arabia [email protected] 4 Centre for Theoretical Physics, The British University in Egypt (BUE), El-Shorouk City, Cairo, Egypt [email protected] 5 Faculty of Artificial Intelligence, Kafrelsheikh University, Kafrelsheikh 33516, Egypt [email protected]
Abstract. This paper explores the topic of data, information, and geographical knowledge in a neutrosophic environment, where certainty and doubt coexist. Neutrosophy is a modern tool that can handle different types of uncertainty, such as doubt, ambiguity, ignorance, neutrality, and saturation. We present the latest trends and research gaps in the field of information systems and geographic data from a neutrosophic perspective. We also introduce new concepts of topological data and information and apply some neutrosophic topological concepts to GIS. We show some neutrosophic spatial relationships that can capture the complexity and vagueness of real-world phenomena. Finally, we summarize the main findings and contributions of this paper for the advancement of neutrosophic GIS. Keywords: Neutrosophic · Artificial Intelligence · Geographical Information System
1 Introduction

Neutrosophy is a philosophical and mathematical framework that deals with neutrality and the interaction between truth and falsehood. It was introduced by Florentin Smarandache in 1995 and extended by Ahmed Salama to include neutrosophic crisp theory and applications in various fields, such as computer science, information systems, and statistics [1].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 128–139, 2023. https://doi.org/10.1007/978-3-031-46573-4_12
The neutrosophic approach combines fuzzy logic, probability theory, and topology to handle uncertainty and limitations in data and knowledge [2]. NF-topology (GIS) is an emerging field that integrates neutrosophic principles into geographic analysis. Unlike traditional GIS analysis, which uses fuzzy sets, NF-topology allows the representation of geographic data as neutrosophic sets consisting of true, false, and indeterminate subsets [2, 3]. NF-GIS topology can be used for spatial aggregation, interpolation, and prediction, as well as developing decision support systems for managing geographic resources. Salama introduced and studied the NF-GIS topology in graphical analysis, showing how it can deal with uncertain and contradictory geographic data [4]. Neutrosophic techniques are used in various fields, such as data science, artificial intelligence, philosophy, social sciences, and literary criticism, to handle uncertainty and limitations in knowledge and information [5]. NF-topology (GIS) is an emerging field that provides a more accurate representation of geographic reality and helps decision makers in managing geographic resources [6]. The concept of “Neutrosophic fuzzy” is a combination of two mathematical and philosophical frameworks, namely “neutrosophy” and “fuzzy logic”. Neutrosophy is a theory that deals with the interaction between truth, falsehood, and indeterminacy, while fuzzy logic allows for the representation of degrees of truth and falsehood [7]. The combination of these two concepts results in the Neutrosophic fuzzy theory, which deals with the representation and manipulation of uncertain and imprecise data. The Neutrosophic fuzzy theory provides a framework for representing data that is not entirely true or false, but lies in between. Neutrosophic fuzzy sets can be used to represent the degree of membership of an element to a set, where the degree of membership lies between completely true and completely false. 
The degree of indeterminacy can also be incorporated using a third subset, which represents the degree of neither true nor false. The concept of Neutrosophic fuzzy theory has been applied in various fields, including decision-making, image processing, control systems, and expert systems. In decision-making, Neutrosophic fuzzy logic can be used to handle vague and uncertain information, while in image processing, it can be used to handle imprecise and ambiguous data. In control systems, it can be used to handle uncertain and imprecise measurements [8]. Salama [9] was the first to introduce and study the NF-GIS topology in graphical analysis and described how neutrosophic principles of GIS analysis can be applied to deal with uncertain and contradictory geographic data. He stated that this approach gives a more accurate representation of geographic reality and can help decision makers make more informed choices about how to manage and allocate geographic resources, as the NF-GIS topology processes large amounts of data and information through a geographic data representation like neutrosophic sets (NS), allowing for indeterminacy, uncertainty, and inconsistency. An NS consists of three NF-subsets: the truth NF-subset, the indeterminate NF-subset, and the false NF-subset. The indeterminate NF-subset represents uncertainty or ambiguity in the data, while the true and false NF-subsets represent the presence of contradictory information. NF-GIS topology can be applied to a wide range of geographic analysis tasks, including spatial aggregation, spatial interpolation, and spatial prediction. It can also be used to develop decision support systems that can help plan and manage geographic resources. One of
the main advantages of the NF-GIS topology is its ability to deal with data gaps and uncertain and contradictory information [10]. The study presented by Cardone et al. [11] demonstrated a GIS-based framework that uses a multi-criteria decision analysis (MCDA) fuzzy model to identify areas at risk of urban hazards in climate scenarios. The fuzzy-based approach adapts to the hierarchical structure of urban systems and the decision-making process. The criteria are organized hierarchically, with fuzzy numbers assigned to the leaf nodes. Each node is given a weight, called the coefficient of relative significance, which determines its importance in generating the parent node at the higher level. The fuzzy set in the parent node is implemented using a fuzzy operator that combines the fuzzy sets of the child nodes. This is particularly important in geographic analysis, where data can be complex and difficult to interpret. By allowing for indeterminacy and inconsistency, the NF-GIS topology can provide a more accurate representation of geographic reality. In this paper, NF-topology (GIS) refers to the application of neutrosophic principles in geographic information systems (GIS) analysis [12]. NF-topology provides a more accurate representation of geographic reality by allowing for the representation of data that exhibit uncertainty and ambiguity. By using neutrosophic sets, it enables decision-makers to make more informed choices about how to manage and allocate geographic resources, especially in situations where traditional GIS analysis may not be sufficient. NF-topology and NF-GIS topology are related concepts that use neutrosophic sets to represent and reason about uncertainty and imprecision in data. NF-topology is a general framework that can be applied to various domains, while NF-GIS topology is a specific application of that framework to geographic analysis.
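A toy sketch of the hierarchical fuzzy aggregation used in the MCDA framework above: child membership degrees are combined into a parent node using relative-significance coefficients as weights. The node meanings, membership values, and weights below are hypothetical, and the weighted average stands in for whichever fuzzy operator a real model would use:

```python
def aggregate(children, weights):
    """Weighted-average fuzzy operator: combine the child nodes' membership
    degrees into the parent node using relative-significance coefficients
    (assumed here to sum to 1)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(m * w for m, w in zip(children, weights))

# leaf memberships for a hypothetical "flood risk" parent node
children = [0.8, 0.4, 0.6]     # e.g. rainfall, drainage, elevation criteria
weights = [0.5, 0.3, 0.2]      # coefficients of relative significance
print(aggregate(children, weights))  # 0.64
```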
2 Neutrosophic Fuzzy Data Sciences

Neutrosophic data science is a recent discipline in the field of computational and informatics sciences. It aims to develop tools and techniques that enable researchers and specialists to deal with neutrosophic information data, which includes quantitative and qualitative information, as well as spatial and non-spatial data. Neutrosophic data science is characterized by its ability to deal with uncertainty and ambiguity in data that includes changes and contradictions in the qualitative and quantitative values of information, by introducing neutrosophic concepts that allow a better representation of the multiple relationships between data. The applications of neutrosophic data science span many fields, such as environmental science, agriculture, medicine, business, marketing, finance, insurance, commerce, and others. NF Data Sciences is an emerging field of research that combines the principles of neutrosophy and data science. NFS are a generalization of fuzzy sets, which themselves are a generalization of crisp sets [13]. NFS allow for degrees of truth, falsity, and indeterminacy, and are defined on an underlying set. To define a NFS on a set X, we associate with each element x of X a triplet (T(x), I(x), F(x)), where T(x) represents the degree of truth, I(x) the degree of indeterminacy, and F(x) the degree of falsity of the element x in the set X. To be a valid NFS, the triplet must satisfy the following axioms [14]:
1. Non-negativity: 0 ≤ T(x), I(x), F(x) ≤ 1 for all x ∈ X.
2. Lower bound: there exists an element x_0 ∈ X such that T(x_0) + I(x_0) + F(x_0) = 0.
3. Upper bound: there exists an element x_1 ∈ X such that T(x_1) + I(x_1) + F(x_1) = 1.
4. Normalization: T(x) + I(x) + F(x) = 1.

These neutrosophic fuzzy axioms ensure that the degrees of truth, indeterminacy, and falsity of each element in the set are well defined and bounded, and that the set itself is normalized to have a total NF membership value equal to 1. Data science is an interdisciplinary field that involves the use of statistical and computational methods to extract insights from data. NF Data Sciences seeks to develop new methods and techniques for analyzing and interpreting complex and uncertain data. It recognizes that many real-world phenomena exhibit indeterminacy, uncertainty, and contradiction, and aims to develop approaches that can account for these characteristics. Some of the key concepts and techniques used in NF Data Sciences include [7]:

1. NFS: NFS are a generalization of fuzzy sets that allow for the representation of indeterminacy, uncertainty, and contradiction in the data. They consist of three NF-subsets: the truth NF-subset, the indeterminate NF-subset, and the false NF-subset.
2. NF logic: NF logic is a generalization of crisp logic that allows for the representation of indeterminacy, uncertainty, and contradiction. It can be used to reason about data that exhibits these characteristics.
3. NF probability: NF probability is a generalization of crisp probability that allows for the representation of uncertain and ambiguous data. It can be used to estimate probabilities when the data is incomplete or uncertain.
4. NF-clustering: NF-clustering is a technique for grouping data points into clusters based on their similarity. It can be used to identify patterns in data that exhibit indeterminacy, uncertainty, and contradiction.
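A small sketch of the triplet conditions above, checking the non-negativity and normalization axioms for a single element (the function name is illustrative, not from the paper):

```python
def is_valid_triplet(t, i, f, tol=1e-9):
    """Check the non-negativity and normalization axioms for a
    neutrosophic fuzzy triplet (T(x), I(x), F(x))."""
    in_range = all(0.0 <= v <= 1.0 for v in (t, i, f))
    normalized = abs((t + i + f) - 1.0) < tol
    return in_range and normalized

print(is_valid_triplet(0.6, 0.3, 0.1))   # True
print(is_valid_triplet(0.9, 0.5, 0.2))   # False: degrees sum to 1.6
```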
A NF-topology (NT) on a non-empty set X is a family τ of NF-subsets of X satisfying the following axioms, shown in Eqs. (1), (2), and (3):

NT1: 0_N, 1_N ∈ τ    (1)

NT2: G_1 ∩ G_2 ∈ τ for G_1, G_2 ∈ τ    (2)

NT3: ∪ G_i ∈ τ for all {G_i : i ∈ J} ⊆ τ    (3)
In other words, to show that the family τ is a NFT on X, we need to verify that it satisfies the following three axioms:

1. The NF empty set and the whole space belong to τ.
2. The NF intersection of any finite number of NF sets in τ is also in τ.
3. The NF union of any collection of NF sets in τ is also in τ.
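The three axioms can be checked mechanically on a small finite family. The sketch below assumes one common convention for NF intersection and union (min/min/max and max/max/min on the (T, I, F) components; conventions vary in the literature), over a hypothetical two-point universe:

```python
from itertools import combinations

# Each NFS maps a point to its (T, I, F) triplet.
def nf_inter(a, b):
    """Pointwise NF intersection under the assumed min/min/max convention."""
    return {p: (min(a[p][0], b[p][0]), min(a[p][1], b[p][1]),
                max(a[p][2], b[p][2])) for p in a}

def nf_union(a, b):
    """Pointwise NF union under the assumed max/max/min convention."""
    return {p: (max(a[p][0], b[p][0]), max(a[p][1], b[p][1]),
                min(a[p][2], b[p][2])) for p in a}

zero_N = {"p": (0.0, 0.0, 1.0), "q": (0.0, 0.0, 1.0)}   # NF empty set 0_N
one_N  = {"p": (1.0, 1.0, 0.0), "q": (1.0, 1.0, 0.0)}   # whole space 1_N
G      = {"p": (0.7, 0.2, 0.3), "q": (0.4, 0.5, 0.6)}
tau = [zero_N, one_N, G]

# NT1 holds by construction; check NT2/NT3 pairwise for this finite family.
closed = all(nf_inter(a, b) in tau and nf_union(a, b) in tau
             for a, b in combinations(tau, 2))
print(closed)  # True
```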
In this case, the pair (X, τ) is called a NFTS, and any NFS in τ is known as a neutrosophic fuzzy open set (NFO) in X. The elements of τ are called NF-open, and a NFS F is closed if and only if its complement C(F) is NF-open [15]. NF-topology (NFT) layers are a type of GIS layer that combines the concepts of neutrosophic logic and fuzzy topology. Neutrosophic logic is a branch of logic that deals with uncertain, imprecise, and indeterminate information, while fuzzy topology is a branch of topology that deals with the concept of fuzzy sets. NFT layers are used to represent and analyze spatial data that contain uncertainty, ambiguity, and imprecision. They are particularly useful in GIS applications that deal with complex systems and phenomena, such as urban planning, environmental management, and risk assessment. NFT layers can be used to represent a wide range of spatial data, including land cover types, environmental parameters, social and economic variables, and transportation networks. They can also be used to analyze the relationships between different spatial features and to identify patterns and trends in the data. One example of an NFT layer is a NF topological map, which is a graphical representation of spatial data that uses NF-sets to represent the uncertainty and ambiguity of the data. The map is constructed by dividing the spatial domain into a set of NF-regions, each of which is characterized by a degree of membership in different NF-sets. Another example of an NFT layer is a NF-topological network, which is a network representation of spatial data that uses NF-sets to represent the uncertainty and ambiguity of the data. The network is constructed by defining a set of NF-nodes and edges, each of which is characterized by a degree of membership in different NF-sets.
3 Neutrosophic Fuzzy GIS Map

Neutrosophic fuzzy cognitive maps are a type of map that shows how different concepts are related to each other in a way that is uncertain or unclear. They use the concepts of neutrosophic theory to show how true, false, or indeterminate the relationships are. Here is an example of a Neutrosophic fuzzy cognitive map of conjugation technologies: the nodes represent different concepts related to conjugation technologies, and the edges represent the causal relationships between the concepts. The edges are labeled with a truth value, a falsity value, and an indeterminacy value, which represent the degree to which the causal relationship is true, false, or indeterminate. Some other examples of Neutrosophic fuzzy cognitive maps are:

- A Neutrosophic fuzzy cognitive map of the factors affecting climate change, where the nodes represent different factors such as greenhouse gases, deforestation, and solar activity, and the edges represent how they influence climate change in a true, false, or indeterminate way.
- A Neutrosophic fuzzy cognitive map of the preferences of customers for different products or services, where the nodes represent attributes such as price, quality, and brand, and the edges represent how they affect customer satisfaction in a true, false, or indeterminate way.
Neutrosophic Fuzzy Data Science and Addressing
- A Neutrosophic fuzzy cognitive map of the causes and effects of social conflicts, where the nodes represent different causes such as poverty, inequality, and corruption, and the edges represent how they lead to social conflicts in a true, false, or indeterminate way.

Let {τj : j ∈ J} be a family of NF-topologies on X. Then ∩j∈J τj is an NF-topology on X; furthermore, ∩j∈J τj is the finest NF-topology on X contained in every τj.

The NFT layer provides a means to represent such data in a way that captures the uncertainty and imprecision. This layer is designed to deal with the limitations of traditional topology, which assumes that objects are either completely inside or outside a given region. In contrast, NFT allows objects to have degrees of membership in a region, reflecting the uncertainty or imprecision in the data. This can be useful in a variety of applications, such as geographic information systems (GIS), where accurate representation of uncertainty is critical. NFT is a relatively new field of research, and there are many possible definitions and variations of NFTSs and their regions. Here are a few examples of NFT regions:

NF-Open Sets: In an NFTS, a set is called an NF-open set if its NF-interior is equal to itself. The NF-interior of an NFS is a function that assigns an NFS to each point in the set, representing the degree of membership of the point in the set, as shown in Fig. 1.
Fig. 1. The Neutrosophic fuzzy GIS open regions with borders (a), (b), and (c).
NF-Closed Sets: A set is called a NF-closed set if its NF-exterior is a NF-open set. The NF-exterior of a set is a function that assigns a NFS to each point not in the set, representing the degree of non-membership of the point in the set as shown in Fig. 2.
A. A. Salama et al.
Fig. 2. Neutrosophic fuzzy GIS closed regions represented as (a), (b), (c), and (d).
NF-Boundary: The NF-boundary of a set is the set of points whose NF membership degree in the set is neither 0 nor 1. It represents the degree of indeterminacy or ambiguity in whether a point belongs to the set or not (Fig. 3).
(a)
(b)
Fig. 3. The Neutrosophic fuzzy GIS boundary regions (a), and (b).
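The NF-boundary definition translates directly into code: given truth-membership degrees on a grid, the boundary consists of the cells whose membership is strictly between 0 and 1. A minimal sketch (the grid values and numerical tolerance are illustrative assumptions, not from the paper):

```python
import numpy as np

# Truth-membership degrees of an NF-region on a 3x3 grid.
T = np.array([
    [0.0, 0.2, 0.0],
    [0.5, 1.0, 0.6],
    [0.0, 0.3, 0.0],
])

eps = 1e-9                        # numerical tolerance for "exactly 0 or 1"
interior = T >= 1.0 - eps         # cells fully inside the region
exterior = T <= eps               # cells fully outside the region
boundary = ~interior & ~exterior  # membership strictly between 0 and 1
```

Here only the center cell is fully interior, the corner cells are fully exterior, and the four edge cells (degrees 0.2, 0.5, 0.6, 0.3) form the NF-boundary — the zone of indeterminacy the definition describes.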
NF-Connected Sets: A set is called NF-connected if it cannot be expressed as the union of two non-empty NF-disjoint sets. This definition captures the notion of connectedness while allowing for degrees of uncertainty and ambiguity in the membership of points as shown in Fig. 4.
Fig. 4. The Neutrosophic fuzzy GIS connected regions represented as (a), (b), (c), and (d).
These are just a few examples of NFT regions, and there are many other possible definitions and variations depending on the specific properties and applications being considered.
4 Neutrosophic Crisp Open in GIS Topology

Neutrosophic nearly crisp open sets in GIS topology are a generalization of the concept of open sets in crisp topology, where the membership of a point in a set is not limited to only two values, i.e., true or false, but is instead represented by a neutrosophic membership function [16]. In GIS (Geographic Information System) topology, neutrosophic nearly crisp open sets are used to model the uncertainty and imprecision that is inherent in spatial data [17].

Let X be a spatial domain in GIS topology, and let N be a neutrosophic set on X with membership function μN(x), where x is a point in X. A neutrosophic nearly crisp open set U in GIS topology is defined as an NF-subset of X such that for any point x in U, there exists a neutrosophic nearly crisp neighborhood N(x) of x such that μN(x)(y) > α for all y in N(x) and some threshold α [18]. The threshold α is a small positive value that represents the degree of tolerance allowed for the neutrosophic membership function [19]. In other words, a neutrosophic nearly crisp open set U is a set of points in X such that for each point x in U, there exists a neutrosophic nearly crisp neighborhood N(x) of x in which the neutrosophic membership function μN(x) is sufficiently high for all points in N(x) to be considered as belonging to U. The classification of the neutrosophic intuitionistic crisp topology is shown in Fig. 5. The GIS-topologic spatial relations are shown in Figs. 6, 7, and 8.
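Read computationally, the definition says a point belongs to a nearly crisp open set when the neutrosophic membership stays above the tolerance α throughout a neighborhood of that point. The sketch below checks this on a discrete grid using a 3×3 neighborhood; the grid values, α, and neighborhood shape are illustrative assumptions, not values from the cited papers:

```python
import numpy as np

def nearly_crisp_open(mu, alpha):
    """Mark grid cells whose entire 3x3 neighborhood has membership > alpha.

    mu    : 2-D array of neutrosophic membership degrees in [0, 1]
    alpha : tolerance threshold from the definition
    """
    h, w = mu.shape
    out = np.zeros(mu.shape, dtype=bool)
    for i in range(h):
        for j in range(w):
            # Clip the 3x3 neighborhood at the grid border.
            nb = mu[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
            out[i, j] = bool((nb > alpha).all())
    return out

mu = np.array([
    [0.9, 0.8, 0.1],
    [0.9, 0.9, 0.2],
    [0.8, 0.7, 0.1],
])
U = nearly_crisp_open(mu, alpha=0.5)
```

Only the left column of cells qualifies: every cell in their 3×3 neighborhoods clears α = 0.5, while cells adjacent to the low-membership right column do not.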
Fig. 5. The intuitionistic crisp topology
Fig. 6. The GIS-topologic spatial relations of (a) Crisp relation, (b) Fuzzy relation, and (c) Neutrosophic Fuzzy Relation.
Fig. 7. The GIS-topologic spatial relations of (a) Line Intersect Polygon, (b) Fuzzy Line Intersect Fuzzy Polygon, and (c) Neutrosophic Fuzzy Line Intersect Fuzzy Polygon.
Fig. 8. The GIS-topologic spatial relations of (a) Crisp Circles, (b) Fuzzy Circles, and (c) Neutrosophic Fuzzy Circles.
5 Conclusion and Future Work

NFS is well equipped to deal with missing data. By employing NSs in spatial data models, we can express hesitation concerning the object of interest. This article has gone a step forward in developing methods that can be used to define NF spatial regions and their relationships. The main contributions of the paper can be summarized as follows: possible applications have been listed after the definition of NS; links to other models have been shown; and new operators have been defined to describe objects, in particular a simple NF-region. This paper has demonstrated that spatial objects may profitably be addressed in terms of NFS. Implementation of the named applications is necessary as a proof of concept. Geospatial topology studies the rules governing the relationships between the points, lines, and polygons that represent features of a geographic area. For example, when two polygons represent two neighboring countries, typical topological rules require that these two countries share a common boundary without any gaps and
overlaps. Likewise, it would be nonsense to allow two polygons representing lakes to overlap.

Availability of Data and Materials. No data were used in this study.

Competing Interests. There is no conflict of interest.

Authors' Contributions. The researchers participated equally in all parts of the research.

Funding. No funding was received.
References 1. Smarandache, F.: Operators on single-valued neutrosophic oversets, neutrosophic undersets, and neutrosophic offsets. In: Collected Papers Volume IX: On Neutrosophic Theory and Its Applications in Algebra, p. 112 (2022) 2. Salama, A.A., Alblowi, S.A.: Generalized neutrosophic set and generalized neutrosophic topological spaces. In: Infinite Study (2012) 3. Salama, A.A., Smarandache, F., Kroumov, V.: Neutrosophic crisp sets & neutrosophic crisp topological spaces. In: Infinite Study (2014) 4. Salama, A.A., Broumi, S., Alblowi, S.A.: Introduction to neutrosophic topological spatial region, possible application to gis topological rules. Int. J. Inf. Eng. Electron. Bus. 6, 15 (2014) 5. Salama, A.A., Smarandache, F., Kromov, V.: Neutrosophic closed set and neutrosophic continuous functions. In: Collected Papers Volume IX: On Neutrosophic Theory and Its Applications in Algebra, p. 25 (2022) 6. Salama, A.A., Smarandache, F., Alblowi, S.A.: New neutrosophic crisp topological concepts (2014) 7. Bui, Q.-T., Ngo, M.-P., Snasel, V., et al.: Information measures based on similarity under neutrosophic fuzzy environment and multi-criteria decision problems. Eng. Appl. Artif. Intell. 122, 106026 (2023) 8. Das, S., Roy, B.K., Kar, M.B., et al.: Neutrosophic fuzzy set and its application in decision making. J. Ambient Intell. Human Comput. 11, 5017–5029 (2020). https://doi.org/10.1007/ s12652-020-01808-3 9. Salama, A.A.: Basic structure of some classes of neutrosophic crisp nearly open sets and possible application to GIS topology. Neutrosophic Sets Syst. 7, 18–22 (2015) 10. Salama, A.A., Smarandache, F.: Neutrosophic ideal theory neutrosophic local function and generated neutrosophic topology. Neutrosophic Theory Appl. Collected Pap. 1, 213 (2014) 11. Cardone, B., Di Martino, F., Miraglia, V.: GIS-based hierarchical fuzzy MCDA framework for detecting critical urban areas in climate scenarios. 
In: Gervasi, O., Murgante, B., Rocha, A.M.A.C., et al. (eds.) Computational Science and Its Applications – ICCSA 2023 Workshops, pp. 345–358. Springer Nature Switzerland, Cham (2023). https://doi.org/10.1007/978-3-031-37117-2_24 12. Pamucar, D., Ecer, F., Deveci, M.: Assessment of alternative fuel vehicles for sustainable road transportation of United States using integrated fuzzy FUCOM and neutrosophic fuzzy MARCOS methodology. Sci. Total Environ. 788, 147763 (2021). https://doi.org/10.1016/j.scitotenv.2021.147763
13. Garrett, H.: Neutrosophic Duality. Dr. Henry Garrett (2023) 14. Garrett, H.: Beyond Neutrosophic Graphs. Dr. Henry Garrett (2023) 15. Jdid, M., Alhabib, R., Salama, A.A.: The basics of neutrosophic simulation for converting random numbers associated with a uniform probability distribution into random variables that follow an exponential distribution. Neutrosophic Sets Syst. 53, 22 (2023) 16. Salama, A.A., Hanafy, I.M., Dabash, H.E.: Neutrosophic crisp closed region and neutrosophic crisp continuous functions. New Trends Neutrosophic Theory Appl. 1, 403 (2016) 17. Abdulkadhim, M.M., Imran, Q.H., Al-Obaidi, A.H., Broumi, S.: Neutrosophic crisp generalized αg-continuous functions. J. Neutrosophic Fuzzy Systems (JNFS) 6, 08–14 (2023) 18. Salama, A.A., Alhabib, R.: Neutrosophic ideal layers & some generalizations for GIS topological rules. Int. J. Neutrosophic Sci. 8, 44–49 (2020) 19. Salama, A.A.: Neutrosophic crisp points & neutrosophic crisp ideals. In: Neutrosophic Sets and Systems, p. 49 (2013)
Inhibitory Control during Visual Perspective Taking Revealed by Multivariate Analysis of Event-Related Potentials Hirokazu Doi(B) Nagaoka University of Technology, Nagaoka City, Niigata 940-2188, Japan [email protected]
Abstract. To understand another person's mental state, one has to see things from that person's viewpoint. Visual perspective taking is an oft-used task to investigate the computational mechanism of "seeing things from the other's perspective". Previous studies have indicated the possibility that taking the other's visual perspective requires active suppression of information from one's own visual perspective by inhibitory control, an intentional suppression of ongoing cognitive processing. The present study investigated neural correlates of inhibitory control during visual perspective taking by measuring electrophysiological responses while participants simultaneously performed a visual perspective taking task and a Go/NoGo task, a well-established behavioral paradigm for tapping into the process of inhibitory control. Electrophysiological data were analyzed by multivariate analysis of single-trial event-related potentials (ERPs) utilizing machine learning with a linear discriminant classifier. The results revealed that inhibitory control modulated electrophysiological activation at parieto-central and fronto-temporal electrode sites. Modulation of the scalp ERP field was confined to right electrode sites during perspective taking of the other's viewpoint, but was observed bilaterally during self perspective taking. The potential implication of this observation is discussed.

Keywords: ERP · Multivariate Analysis · Machine Learning · Visual Perspective Taking
1 Introduction

Inferring others' mental states is a cognitive function of fundamental importance in navigating through the social environment. In order to compute another person's mental state, one has to see things from that person's standpoint. Several researchers have argued that the ability to imagine how a visual scene looks from another's visual perspective forms the basis of the ability to understand others' mental states [1, 2]. As such, the neural mechanism of perspective taking has long been the focus of intensive research in fields such as social cognition and developmental psychology. In these lines of research, many researchers have utilized the visual perspective taking task [2]. In the visual perspective taking task, a participant is asked to describe a visual scene from the visual perspective of another person who is oriented differently from the participant. Participants who are good at perspective taking can efficiently simulate the visual scene from the other's standpoint.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 140–147, 2023. https://doi.org/10.1007/978-3-031-46573-4_13
Human beings are inherently egocentric, and information computed from one's own viewpoint interferes with and slows computation of the other's visual perspective [3, 4], a phenomenon termed egocentric interference. Egocentric information from one's own perspective is processed without conscious awareness. Thus, during visual perspective taking one has to actively inhibit egocentric information by executive function. In accordance with this hypothesis, a seminal study on a brain-damaged patient [5] reported that a lesion in the inferior frontal region, a locus of inhibitory control, impairs the ability to take the other's visual perspective. However, few studies so far have reported neurophysiological evidence that neural regions responsible for inhibitory control are recruited in visual perspective taking of others in neurologically intact human subjects. The primary aim of the present study is to examine neural correlates of inhibitory control in the visual perspective taking task by measuring event-related potentials (ERPs). Instead of traditional peak analysis focusing on the latency and amplitude of empirically defined ERP peaks, conditional differences in ERPs were examined by multivariate analysis [6, 7]. Multivariate analysis of neural activity patterns first gained popularity in fMRI studies on object recognition [8], but recently an increasing number of studies have adopted this approach for analysis of the ERP scalp field [9, 10]. As mentioned above, the majority of ERP studies have investigated effects of experimental manipulation on pre-defined ERP peaks identified empirically in the literature. In contrast, multivariate analysis reveals conditional differences in ERPs in a data-driven manner. Thus, multivariate ERP analysis has the potential to find hitherto overlooked neural activation associated with the experimental manipulation of interest.
2 Method

2.1 Participants

Nineteen neurotypical males with normal or corrected-to-normal visual acuity participated in the present study after giving written informed consent. The number of participants was determined based on a previous study on multivariate analysis of ERPs induced by facial stimuli [10].

2.2 Stimulus

Female faces and images of numbers ('6' or '9') were used as the visual stimuli. Forty-eight facial images were taken from the SCUT-FBP database [11]. The eye region of two-thirds of the faces was redrawn so that they are perceived to be looking downward. These images were used as the stimuli in the Go condition. The eye regions of the remaining images were left intact, i.e., looking straight, and they were used as the stimuli in the No/Go condition. In the stimulus display of each trial, a number image ('6' or '9') painted on a floor viewed from 45 deg above was presented just below the facial image. When a face was presented simultaneously with a number image in the Go condition, the female model was perceived to be looking down at the number image. Faces in the No/Go condition were perceived to be looking straight.
2.3 Procedure

After participants' arrival at the lab, an EEG net-sensor was placed on the participant's scalp. After explanation of the experimental procedure, the EEG measurement started. After presentation of a fixation cross at the center of the screen, a perspective instruction ("You" or "She") was presented. After disappearance of the perspective instruction, a facial image and a number image were presented simultaneously. The sequence of stimulus presentation is schematically shown in Fig. 1. Participants were instructed to answer the number from their own perspective in the self perspective condition, and from the female viewer's perspective in the other perspective condition. When the female viewer was looking down at the number on the floor, participants made responses as quickly as possible by key-press (Go condition). When the female viewer was looking straight towards the participants, they were instructed to refrain from making any responses (No/Go condition). 128 Go trials and 32 No/Go trials were conducted in each perspective (self and other) condition, yielding a total of 320 trials. The trials were pseudo-randomly ordered. EEG data during the experiment were recorded continuously at 1 kHz and stored on a hard disk.
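The trial structure described above (128 Go and 32 No/Go trials per perspective condition, pseudo-randomly ordered) can be sketched as a simple sequence generator; the condition labels and seed below are illustrative assumptions:

```python
import random

def build_trial_sequence(seed=0):
    """Pseudo-randomly ordered trial list matching the reported design:
    128 Go + 32 No/Go trials in each perspective condition (320 total)."""
    trials = [(persp, cond)
              for persp in ("self", "other")
              for cond, n in (("Go", 128), ("NoGo", 32))
              for _ in range(n)]
    random.Random(seed).shuffle(trials)  # fixed seed for reproducibility
    return trials

seq = build_trial_sequence()
```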
Fig. 1. Schematic representation of sequence of stimulus presentation in Go and No/Go conditions. The facial image is not the one actually used in the experiment.
2.4 Analysis

EEG data were first band-pass filtered and average-referenced. Then four channels (two for EOG and two at the tragus) were deleted from the dataset. Thereafter, artifacts were corrected by independent component analysis. Before being submitted to multivariate analysis, data were down-sampled to 55 Hz. The data were epoched from 100 ms before to 600 ms after stimulus presentation (the onset of the face and number images) and baseline-corrected. Data from trials in which the amplitude did not exceed ±75 µV were retained.
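The epoching, baseline-correction, and artifact-rejection steps can be sketched as follows; the function name, channel counts, and synthetic data are illustrative assumptions (at 55 Hz, the −100 ms to 600 ms epoch corresponds to roughly 6 pre-stimulus and 33 post-stimulus samples):

```python
import numpy as np

FS = 55  # sampling rate after down-sampling (Hz)

def epoch_and_reject(eeg, onsets, n_pre=6, n_post=33, thresh_uv=75.0):
    """Cut epochs around stimulus onsets, baseline-correct against the
    pre-stimulus samples, and reject epochs exceeding the amplitude threshold.

    eeg    : (n_channels, n_samples) array in microvolts
    onsets : stimulus-onset sample indices
    n_pre  : samples before onset (~100 ms at 55 Hz)
    n_post : samples after onset (~600 ms at 55 Hz)
    """
    epochs = []
    for on in onsets:
        ep = eeg[:, on - n_pre:on + n_post].copy()
        ep -= ep[:, :n_pre].mean(axis=1, keepdims=True)  # baseline correction
        if np.abs(ep).max() <= thresh_uv:                # artifact rejection
            epochs.append(ep)
    return np.array(epochs)

# Tiny synthetic demo: 4 channels, 10 s of noise, 5 stimulus onsets.
rng = np.random.default_rng(0)
eeg = rng.normal(0.0, 10.0, size=(4, FS * 10))
onsets = [55, 110, 220, 330, 440]
X = epoch_and_reject(eeg, onsets)  # (n_kept_trials, n_channels, n_times)
```

The retained array `X` is exactly the trials × channels × time-points tensor the multivariate analysis below operates on.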
Multivariate analysis of ERPs consisted of first- and second-level analyses. The flow of analysis is schematically shown in Fig. 2. In the first-level analysis, a linear discriminant classifier was trained at each time point to discriminate trials in the Go condition from those in the No/Go condition based on amplitude data of the scalp ERP field. A linear discriminant classifier was chosen because previous studies have reported superior performance of this classifier over other types of classification algorithms for ERP data [6]. Then, classification performance of the trained classifier was evaluated based on ERP amplitude data of the corresponding time point in test trials. Specifically, the performance of a classifier trained on ERP amplitude data at time point t of the training trials was evaluated on ERP amplitude data at the same time point t of the test trials. Classification performance was evaluated by 5-fold cross-validation, and AUC was computed as the indicator of classification performance at each time point. Consequently, the first-level analysis yielded a time series of AUC for each participant. In the second-level analysis, the time window was determined during which classification performance was above chance (AUC = 0.5). AUC at each time point was tested statistically against AUC = 0.5, and temporal clusters with above-chance classification performance were extracted by cluster-permutation statistics.
Fig. 2. The flow of analytic procedure of the first level analysis in multivariate ERP analysis.
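A minimal sketch of this first-level decoding scheme, assuming scikit-learn and synthetic data in place of the actual single-trial ERPs (trial counts, channel counts, and the injected class effect are illustrative, not the study's values):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def decode_timecourse(X, y, cv=5):
    """First-level analysis: train and evaluate an LDA classifier at each
    time point and return the AUC time series.

    X : (n_trials, n_channels, n_times) single-trial ERP amplitudes
    y : (n_trials,) labels (0 = Go, 1 = No/Go)
    """
    n_times = X.shape[2]
    auc = np.empty(n_times)
    for t in range(n_times):
        # Classifier trained and tested on data from the SAME time point t,
        # scored by 5-fold cross-validated AUC.
        scores = cross_val_score(LinearDiscriminantAnalysis(),
                                 X[:, :, t], y, cv=cv, scoring="roc_auc")
        auc[t] = scores.mean()
    return auc

# Synthetic demo: 60 trials, 8 channels, 20 time points; the two classes
# differ only in the second half of the epoch.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 8, 20))
X[y == 1, :, 10:] += 1.0  # class effect after "stimulus onset"
auc = decode_timecourse(X, y)
```

On such data the AUC time series hovers around chance (0.5) before the injected effect and rises afterwards; the second-level step would then test each time point against 0.5 and extract significant temporal clusters by permutation.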
Multivariate analysis was first applied to all the data, combining both self and other perspective conditions, to examine the conditional difference in the ERP scalp field between the Go and No/Go conditions. Then, the same analysis was applied to the data of the self and other perspective conditions separately.
3 Results

3.1 Go vs No/Go Condition in the Self and Other Conditions Combined

In the comparison between the Go and No/Go conditions, multivariate analysis revealed a statistically significant temporal cluster from 303 to 593 ms after stimulus onset. The topography of the conditional difference is shown in Fig. 3. As can be seen in this figure, the ERP amplitude increased in the central parietal region and decreased in the right fronto-temporal region in the No/Go condition.
Fig. 3. Left: Scalp topography of ERP amplitude difference z-scored across scalp between Go and No/Go condition. The black filled circles represent electrode sites in which there was significant difference in ERP amplitude. Right: Temporal course of AUC and logarithmically transformed p-value. Dotted lines in AUC represent standard error. Black horizontal line shows the temporal window during which AUC is significantly above chance.
3.2 Go vs No/Go Condition in the Self and Other Perspective Conditions

Multivariate analysis revealed a statistically significant temporal cluster in roughly the same temporal range in both the self and other perspective conditions, as shown in Fig. 4 and Fig. 5: 375–593 ms after stimulus onset in the self perspective condition and 430–593 ms after stimulus onset in the other perspective condition. As can be seen in these figures, the ERP amplitude increased in the central parietal region in No/Go trials in both the self and other perspective conditions. At the same time, a significant difference in scalp ERP amplitude was observed in bilateral fronto-temporal regions in the self perspective condition (Fig. 4), but was confined to the right fronto-temporal region in the other perspective condition (Fig. 5), which indicates that the effect of inhibitory control was more widespread in the self than the other perspective condition.
Fig. 4. Left: Scalp topography of ERP amplitude difference between Go and No/Go condition in self perspective condition. The black filled circles represent electrode sites in which there was significant difference in ERP amplitude. Right: Temporal course of AUC and logarithmically transformed p-value. Dotted lines in AUC represent standard error. Black horizontal line shows the temporal window during which AUC is significantly above chance.
Fig. 5. Left: Scalp topography of ERP amplitude difference between Go and No/Go condition in other perspective condition. The black filled circles represent electrode sites in which there was significant difference in ERP amplitude. Right: Temporal course of AUC and logarithmically transformed p-value. Dotted lines in AUC represent standard error. Black horizontal line shows the temporal window during which AUC is significantly above chance.
4 Discussion

The present study investigated neural activation accompanying inhibitory control during visual perspective taking. Multivariate analysis utilizing linear discriminant classification revealed differences in the scalp ERP between the Go and No/Go conditions in centro-parietal and fronto-temporal regions. Many previous studies have consistently identified the inferior frontal region as the locus of inhibitory control [12, 13]. Though the present study did not carry out source estimation, the conditional difference in the scalp ERP field probably originates from activation in the inferior frontal cortex in the No/Go condition. Interestingly, the conditional difference in ERP amplitude was observed in the fronto-temporal region bilaterally in the self perspective condition, but only in the right hemisphere in the other perspective condition. This pattern indicates that more right-lateralized neural regions are recruited in inhibitory control during perspective taking of the other's viewpoint. Previous studies have shown that cortical regions in the right hemisphere play primary roles in various domains of inhibitory control [12, 13]. Because completion of the Go/NoGo task requires inhibition of an inappropriate motor response, it is no surprise that neural activation was observed in right fronto-temporal regions in both the self and other perspective conditions. The conditional difference in the scalp ERP was more widespread in the self than the other condition. This indicates that inhibitory control in the self condition recruits more neural, and hence cognitive, resources, seemingly contradicting the claim that inhibition of egocentric interference is additionally required during perspective taking of the other's viewpoint. In the stimulus sequence of the present study, the perspective instruction was presented before the facial stimuli. Thus, participants were primed to recruit resources of executive control beforehand to efficiently inhibit egocentric interference.
I propose that it was because of this preparedness that inhibitory control in the other perspective condition induced smaller neural activation than in the self perspective condition. However, this is merely an ad-hoc speculation, and this conjecture should be tested empirically in future studies. There are several limitations that qualify the interpretation and robustness of the present findings. Most of the limitations are related to arbitrariness in the multivariate analysis. First, statistical results of multivariate analysis are influenced by the number of time points entered into the analysis. Second, the linear discriminant classifier was selected as the classification algorithm based on previous studies [6], but it remains to be seen whether this is the optimal choice for the current task. These parameters and choices should be searched more systematically to obtain robust findings.
References 1. Hamilton, A.F., Brindley, R., Frith, U.: Visual perspective taking impairment in children with autistic spectrum disorder. Cognition 113(1), 37–44 (2009) 2. Pearson, A., Ropar, D., Hamilton, A.F.: A review of visual perspective taking in autism spectrum disorder. Front. Hum. Neurosci. 7, 652 (2013) 3. Surtees, A., Samson, D., Apperly, I.: Unintentional perspective-taking calculates whether something is seen, but not how it is seen. Cognition 148, 97–105 (2016) 4. Doi, H., Kanai, C., Tsumura, N., Shinohara, K., Kato, N.: Lack of implicit visual perspective taking in adult males with autism spectrum disorders. Res. Dev. Disabil. 99, 103593 (2020)
5. Samson, D., Apperly, I.A., Kathirgamanathan, U., Humphreys, G.W.: Seeing it my way: a case of a selective deficit in inhibiting self-perspective. Brain 128(Pt 5), 1102–1111 (2005) 6. Fahrenfort, J.J., van Driel, J., Van Gaal, S., Olivers, C.N.: From ERPs to MVPA using the Amsterdam decoding and modeling toolbox (ADAM). Front. Neurosci. 12, 368 (2018) 7. Grootswagers, T., Wardle, S.G., Carlson, T.A.: Decoding dynamic brain patterns from evoked responses: a tutorial on multivariate pattern analysis applied to time series neuroimaging data. J. Cogn. Neurosci. 29, 677–697 (2017) 8. Haxby, J.V., Gobbini, M.I., Furey, M.L., Ishai, A., Schouten, J.L., Pietrini, P.: Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293(5539), 2425–2430 (2001) 9. Mares, I., Ewing, L., Farran, E.K., Smith, F.W., Smith, M.L.: Developmental changes in the processing of faces as revealed by EEG decoding. Neuroimage 211, 116660 (2020) 10. Doi, H.: Multivariate ERP analysis of neural activations underlying processing of aesthetically manipulated self-face. Appl. Sci. 12, 13007 (2022) 11. Jin, L., Xu, J., Li, M., Xie, D., Liang, L.: SCUT-FBP: a benchmark dataset for facial beauty perception. In: 2015 IEEE International Conference on Systems, Man, and Cybernetics. https://doi.org/10.1109/SMC.2015.319 12. Hampshire, A., Chamberlain, S.R., Monti, M.M., Duncan, J., Owen, A.M.: The role of the right inferior frontal gyrus: inhibition and attentional control. Neuroimage 50(3), 1313–1319 (2010) 13. D'Alberto, N., Funnell, M., Potter, A., Garavan, H.: A split-brain case study on the hemispheric lateralization of inhibitory control. Neuropsychologia 99, 24–29 (2017)
A Novel Custom Deep Learning Network Combining 1D-Convolution and LSTM for Rapid Wine Quality Detection in Small and Average-Scale Applications

Quoc Duy Nam Nguyen1(B), Hoang Viet Anh Le1, Le Vu Trung Duong2, Sang Duong Thi2, Hoai Luan Pham2, Thi Hong Tran1, and Tadashi Nakano1

1 Graduate School of Informatics, Osaka Metropolitan University, Osaka 558-8585, Japan [email protected], [email protected]
2 Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), Nara, Japan
Abstract. The maintenance of superior quality standards in wine is of paramount importance to both wine producers and consumers. However, traditional approaches to assessing wine quality are characterized by protracted processes and the involvement of specialists with a comprehensive understanding of taste profiles and the determinants of wine quality. In this study, we propose a rapid and precise method designed specifically for monitoring wine spoilage. Four distinct computational experiments are conducted in order to identify the most effective algorithm for the existing wine dataset. The proposed algorithm is a custom deep learning network combining 1D-convolutional and long short-term memory (LSTM) layers. We employ a strict validation strategy based on 10-fold cross-validation in order to assess the efficacy of our design. According to the findings, the proposed architecture is capable of achieving an impressive 93.27% recognition accuracy in just 4 s. These findings contribute significantly to the development of efficient and trustworthy methods for detecting and monitoring wine quality.
Keywords: Wine Quality Detection · 1D-CNN & LSTM · Custom Deep Learning Network

1 Introduction
For both wine producers and consumers, ensuring wine quality is an essential requirement. In order to accomplish this, a comprehensive quality assurance system is implemented throughout the wine-making process, incorporating various cultivation and production factors. In addition, a traditional manual final inspection is conducted, which includes sensory evaluation and the monitoring of

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 148–159, 2023. https://doi.org/10.1007/978-3-031-46573-4_14
technical analysis parameters within predefined limits. In addition to water and ethanol, the complex and nuanced flavor profile of wine is influenced by a multitude of compounds, including more than 20 constituents [1]. Even in momentary concentration variations, these compounds play a significant role in determining the overall quality of the wine. Therefore, the evaluation of organoleptic characteristics by trained experts continues to be the most widely used method for assessing and determining the quality of wine [2–4]. In recent years, numerous technologies have been implemented in the pursuit of wine quality control, including the electronic nose (E-nose). This non-invasive technology, which was inspired by the human olfactory system, has been used to analyze the aromas of wines and identify volatile compounds [5,6]. E-nose technology has applications in diverse fields, including beverages and food products, in addition to wine. Specifically, in [7], E-nose technology was used to diagnose biotic stress in Khasi Mandarin orange plants via sensors designed for this purpose. Another significant contribution by Ozmen et al. was the introduction of a fully portable E-nose replete with quartz crystal microbalance (QCM) sensors and online analysis tools, allowing for on-site gas measurements [8]. This capability is advantageous for both swift assessment tasks and long-term monitoring applications. In addition, E-noses provide viable alternatives to conventional methods for distinguishing wines based on their organoleptic characteristics. By capturing and analyzing the gas mixture signals, analogous to human olfaction, these devices generate response patterns that enable comparative evaluations of various wine samples [9–11]. 
The incorporation of E-nose technology into the field of wine quality control exemplifies the innovative strides made towards augmenting the comprehension and evaluation of aroma profiles, thereby paving the way for advances in quality evaluation and discrimination techniques. There has also been an increase in the use of deep learning techniques in wine research. Gomes et al. accurately determined the sugar content of port wine grape berries using a combination of neural networks (NN) [12,13] and partial least squares regression (PLSR). Liu et al. used a convolutional neural network (CNN) to differentiate grape varieties, achieving an impressive 99.91% classification accuracy by leveraging GoogleNet [12]. Kuntche et al. investigated bottle classification in images, glycerol adulteration detection in red wines, and wine origin studies using various deep learning architectures such as DenseNet, ResNet, artificial neural networks (ANN), and Kohonen's self-organizing neural network (KhNN) [14–16]. Deep learning has demonstrated its applicability and immense potential in the wine research domain, primarily due to its adaptability, self-learning capabilities, and strong nonlinear mapping abilities [17]. However, the network architectures employed in previous research tend to be large and resource-intensive, which makes it difficult to implement deep learning approaches specifically suited for small- or average-scale applications, such as IoT fields or wearable devices. Filling this gap would encourage the widespread adoption and incorporation of deep learning techniques in these domains, thereby improving research outcomes.
150
Q. D. N. Nguyen et al.
This research aims to develop a compact deep learning network architecture by integrating the concepts of 1D-convolutional, LSTM, and ANN networks. The primary objective is to design a model suitable for the classification and monitoring of wine data, taking into account the model's potential future applications. To determine whether the proposed architecture is effective, a strict validation strategy employing 10-fold cross-validation is utilized. Experiments indicate that the proposed network architectures can effectively classify wine samples into distinct categories: High, Average, and Low Quality, as well as Alcohol. These results demonstrate the promising potential of the devised model for improving wine quality assessment and monitoring procedures.
2 Material and Methodology

2.1 Data Description
The database utilized in this research was compiled by Rodriguez Gamboa et al. [18] and was obtained with an electronic olfactory sensing device, commonly referred to as O-NOSE. The device employs an array of six metal-oxide gas sensor channels for detecting volatile compounds (Table 1).
Fig. 1. The flowchart illustrates the experimental setup for collecting samples from the wine database
In their experiment, 22 bottles of commercially available wines were used to collect the necessary samples. Prior to the beginning of the experiment, 13 of these bottles were randomly selected and kept in an uncontrolled environment for approximately six months. Consequently, these bottles were categorized as low-quality (LQ) for the objectives
Rapid Wine Quality Detection Based 1D-Convolution and LSTM
151
of the study. Four of these 22 bottles were opened two weeks prior to the experiment and were labeled as average quality (AQ). The remaining five bottles were labeled as high quality (HQ). A total of 256 samples were collected, comprising 51, 43, and 141 measurements of high-quality (HQ), average-quality (AQ), and low-quality (LQ) samples, respectively. The database also contained 65 samples of ethanol [18].

Table 1. The gas sensor array. The sensors were selected for their heightened sensitivity to organic, natural, ethanol, methanol, and combustible gases, as well as their ease of operation and cost-effectiveness

Name   Number   Resistance   High Sensitivity
MQ-3   1, 4     22 kΩ        Alcohol, with small sensitivity to benzine
MQ-4   2, 5     48 kΩ        CH4 and natural gas
MQ-6   3, 6     22 kΩ        LPG, iso-butane, propane

2.2 Sampling Procedure
The experimental setup for collecting volatile organic compound (VOC) signals for wine analysis is depicted in Fig. 1 [18]. As described above, the setup consists of 22 wine bottles: thirteen contain low-quality (LQ) samples, four contain average-quality (AQ) samples, and five contain high-quality (HQ) samples. To facilitate evaporation prior to the saturation phase, only 1 ml of each wine sample is used. The resultant VOCs are stored in a gas chamber for 30 s to allow volatile accumulation, after which they are transferred to the signal acquisition phase. During the first 90 s of the signal acquisition phase, VOCs are pumped from the gas chamber into the sensor chamber, causing the sensor resistances to change. The pumping is then stopped for the next 90 s, allowing the sensors to desorb. During this phase, the sampling frequency of the sensors is set to 18.5 Hz. After the desorption phase has concluded, the gas purification phase begins: residual VOCs are evacuated from the chamber over a period of 600 s to ensure chamber cleanliness prior to the next sampling cycle. Table 2 shows the influence of volatile acidity (VA) and acetic acid concentration on the flavor, complexity of taste, and odor of each label type.

2.3 Computation Algorithm
Figure 2 illustrates the proposed algorithm for processing the VOC signals, which includes critical phases such as signal preprocessing, training, testing, classification, and result evaluation.
Table 2. The ranges of volatile acidity and acetic acid detected in wine indicate wine spoilage thresholds

Quality Level   Volatile acidity   Acetic acid
HQ              0.15–0.3           ND–0.23
AQ              0.31–0.4           0.24–0.34
LQ              0.80–3.0           0.74–2.75
A. Data Preprocessing
The raw volatile organic compound (VOC) signals first undergo a preprocessing step to reduce the chance that artificial noise introduced during the sampling period propagates further. In particular, the first and last 10 s of the signals are omitted from further analysis. A subsequent down-sampling procedure is applied to reduce noise interference (Fig. 3). This meticulous approach ensures the dependability and precision of the data processing that follows. Considering the small size of the available database, which consists of 321 samples divided into four distinct groups (HQ, AQ, LQ, and Alcohol), this dataset alone is insufficient for developing a deep learning network model capable of monitoring the behavior of VOC signals.
Fig. 2. The flowchart of the proposed algorithm for processing the VOC signals
As a result, the time-slicing window method is utilized as an alternative strategy to address this issue (Fig. 3). Under the time-slicing window method, the original signals are subdivided into signals with durations of 4 s each. Importantly, a 50% overlap is incorporated between consecutive sub-signals to assure data integrity and coherence. By employing this method, the data used for both training and testing are of sufficient scale, enhancing the stability and dependability of our deep learning network model [19].
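For concreteness, the time-slicing step can be sketched in Python with NumPy. The 4-s window and 50% overlap come from the text; applying them directly at the 18.5 Hz acquisition rate (74 samples per window) is an assumption, since the paper does not state the sample rate that remains after down-sampling.

```python
import numpy as np

def slice_windows(signal, fs=18.5, win_sec=4.0, overlap=0.5):
    """Split a 1-D sensor signal into fixed-length windows.

    Consecutive windows overlap by `overlap` (50% in the paper),
    which multiplies the number of training examples.
    """
    win = int(round(win_sec * fs))                     # samples per 4-s window
    step = max(1, int(round(win * (1.0 - overlap))))   # hop size between windows
    windows = [signal[i:i + win]
               for i in range(0, len(signal) - win + 1, step)]
    return np.stack(windows)

# Example: a 180-s acquisition (90 s adsorption + 90 s desorption)
fs = 18.5
x = np.sin(np.linspace(0, 20, int(180 * fs)))  # dummy sensor trace
w = slice_windows(x, fs=fs)
print(w.shape)  # (89, 74): 74 samples = 4 s at 18.5 Hz
```

Each 180-s measurement thus yields dozens of overlapping training windows instead of a single example, which is the point of the augmentation.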
Fig. 3. The pre-processing stage, which comprises the down-sampling and time-slicing window methods
B. Network Architecture
The proposed deep learning architecture comprises three fundamental building blocks: the base block, the long short-term memory (LSTM) block, and the neural network (NN) block. The base block is constructed with a 1D-convolutional layer (1D-Conv), a batch normalization layer, and a rectified linear unit (ReLU) activation layer. Its primary function is to extract unique characteristics from the individual signal channels of the gas sensors, thereby facilitating effective feature representation. The NN block, on the other hand, consists of three fully connected neural network (ANN) layers, one ReLU activation layer, and two sigmoid activation layers. This block generates a more intricate feature representation, enabling the model to handle more complex input data. The LSTM block is composed of two LSTM layers and one dropout layer; it is essential for capturing long-term dependencies, thereby enhancing the overall performance and capability of the deep learning model. In addition, a channel block is incorporated to guarantee a comprehensive analysis of the gas sensor data. The gas sensor array has six separate channels; each channel is independently processed and fed to the base block. After aggregating the extracted features from each channel, a 1D average pooling layer is applied, followed by another base block that further refines the feature representation (Fig. 4). The channel-specific characteristics are then combined via an average layer. This strategy is based on the observation that the shape of each channel's response is similar, while frequency and amplitude differences exist between the signals. In consideration of the varying compound composition and response characteristics of the gas being analyzed,
a more comprehensive and representative feature representation can be attained by extracting features from each channel separately and then averaging them.
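A minimal NumPy sketch of the channel-block idea described above is given below. The learned 1D-Conv filters and batch normalization are replaced by a fixed kernel and a crude normalization, so the sizes and the kernel are illustrative assumptions, not the authors' configuration:

```python
import numpy as np

def base_block(x, kernel):
    """1D convolution + (stand-in) normalization + ReLU for one channel."""
    y = np.convolve(x, kernel, mode="valid")
    y = (y - y.mean()) / (y.std() + 1e-8)   # crude batch-norm stand-in
    return np.maximum(y, 0.0)               # ReLU

def channel_block(signals, kernel):
    """Process each of the six sensor channels independently,
    then average the per-channel feature maps, as in the text."""
    feats = np.stack([base_block(ch, kernel) for ch in signals])
    return feats.mean(axis=0)                # average across channels

channels = np.random.default_rng(0).normal(size=(6, 74))  # 6 sensors, one 4-s window
features = channel_block(channels, kernel=np.ones(5) / 5)
print(features.shape)  # (70,): 74 - 5 + 1 valid-convolution outputs
```

In the actual model the kernels are learned during training and several such blocks are stacked; averaging is attractive here precisely because, as the text notes, the six channels share a similar response shape.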
Fig. 4. The channel block illustration
Fig. 5. The depiction of the four experiments conducted in this study
3 Computation Algorithm
During the experimental session, four network architectures (Fig. 5) were meticulously designed and implemented for the training, validation, and testing of the wine database. The primary objective was to determine the optimal network configuration for enhancing the model’s overall performance. Experiment 1 assessed
the performance of the Channel Block when connected only to a dropout layer and a softmax layer. Experiments 2 and 3 investigated the effect of connecting the output of the dropout layer to the NN and LSTM blocks, respectively. The final experiment connected the output of the dropout layer to the NN and LSTM blocks concurrently. The purpose of these experiments was to determine whether the addition of the Channel Block in conjunction with the NN block, the LSTM block, or both, would result in superior performance in comparison to the Channel Block alone. This exhaustive evaluation revealed the synergistic effects and potential benefits of combining the Channel Block with other architectural components, elucidating the configuration that achieves the best performance in the analysis. Over the 600 epochs spanning the training and validation phases, the learning rate was adjusted strategically. Specifically, after the 200th and 450th epochs, the learning rate was decreased by factors of 5 and 10, respectively. Beginning with the 550th epoch, the learning rate decayed exponentially with a factor of 0.1. The deliberate design of this learning rate schedule contributed to the optimization and convergence of the deep learning network architecture.
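One plausible reading of this schedule, sketched as a Python function. The base learning rate, the epoch unit of the exponential decay, and the step decays being taken relative to the base rate are all assumptions; the paper does not state them:

```python
def learning_rate(epoch, base_lr=1e-3):
    """Piecewise learning-rate schedule paraphrasing the text:
    /5 after epoch 200, /10 after epoch 450 (both relative to the
    base rate), then exponential decay with factor 0.1 from epoch
    550 onward (here per 50 epochs -- an assumed decay unit)."""
    if epoch < 200:
        return base_lr
    if epoch < 450:
        return base_lr / 5
    if epoch < 550:
        return base_lr / 10
    return (base_lr / 10) * 0.1 ** ((epoch - 550) / 50)

# The rate is non-increasing over the full 600-epoch run
rates = [learning_rate(e) for e in range(600)]
```

Such a schedule keeps early training fast while the small final rates help the network settle into a stable minimum.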
4 Validation Strategy
The 321-sample database was divided into three discrete sets prior to the data preprocessing phase: the training set, the validation set, and the test set. Initially, 85% of the database was randomly divided into the training/validation set, while 15% was exclusively designated to the test set and remained unmodified throughout the training and validation periods. Within the training/validation sets, a 10-fold cross-validation (10-FoldCV) strategy was utilized, with samples equitably distributed across the folds. This method guaranteed the robustness and dependability of the results obtained after the training and validation phases were completed. The training set, validation set, and test set were subsequently subjected to the data preprocessing stage, as depicted in Fig. 2.
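The two-stage split can be sketched as follows (plain NumPy, ignoring the per-class stratification that the paper applies when distributing samples equitably across the folds):

```python
import numpy as np

def split_and_fold(n_samples=321, test_frac=0.15, n_folds=10, seed=42):
    """Hold out a fixed test set first, then form 10 CV folds
    from the remaining training/validation samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    test_idx, trainval_idx = idx[:n_test], idx[n_test:]
    folds = np.array_split(trainval_idx, n_folds)  # near-equal fold sizes
    return trainval_idx, test_idx, folds

trainval, test, folds = split_and_fold()
print(len(trainval), len(test), len(folds))  # 273 48 10
```

Holding the test set out before the preprocessing stage matters here: otherwise overlapping 4-s windows cut from the same measurement could appear in both the training and test data, inflating the reported accuracy.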
5 Results and Discussion
In this section, we present a comprehensive analysis of the performance outcomes obtained during the experiment's training, validation, and testing phases. The evaluation metrics considered include training, validation, and testing accuracy scores. In addition, we evaluate the training and validation times, as well as the total number of network parameters. These performance metrics provide valuable insights into the efficacy and efficiency of the proposed model, shedding light on its capacity to classify and predict outcomes in the wine dataset with precision.
A. Results
Table 3 displays the results of the performance evaluations conducted for the four computational experiments described in the previous sections. Notably, experiment 1, with the smallest number of network parameters (22,616), exhibited the lowest accuracy across all metrics, with training, validation, and testing accuracies of 88.36%, 88.52%, and 83.00%, respectively. Experiment 4, which used the most network parameters (82,724), achieved significantly higher accuracy rates than the other three experiments: 96.8% for training, 97% for validation, and 93.27% for testing. It is also interesting to note that experiment 4, despite having more network parameters than experiment 3, required a shorter training time (3,630.518 s as opposed to 3,751.167 s).

Table 3. Classification results for HQ, AQ, LQ, and Alcohol from the 4 computational experiments

Experiment            1          2          3          4
Network Parameters    22,616     39,224     78,680     82,724
Training Time (sec)   2125.156   2127.731   3751.167   3630.518
Training Accuracy     88.36%     93.51%     96.71%     96.8%
Validation Accuracy   88.52%     91.73%     96.56%     97.00%
Testing Accuracy      83.00%     89.02%     92.35%     93.27%
The testing accuracy in Experiment 1 is greater than 80%, which supports the assertion that the wine data clearly distinguish between labels and that improvements to the deep learning network architecture are all that is needed to achieve better performance. In Experiment 2, the addition of the NN Block results in a significant improvement in accuracy, particularly testing accuracy, which rises from 83.00% in Experiment 1 to 89.02% in Experiment 2. Although the addition of the NN Block has a minimal impact on training time, the improvement is remarkable. Replacing the NN Block with an LSTM Block, or using both, enhances the results even further, although Experiments 3 and 4 have nearly double the number of network parameters of Experiments 1 and 2, illustrating the trade-off between performance and model size.
B. Comparison with Other Existing Research
The analysis of Table 4 provides essential information regarding the robustness of our proposed models. First, the long training schedule in our study, which consisted of 600 epochs, helps ensure that the models are resistant to underfitting. In addition, the time-slicing window method utilized in our strategy reduces the recognition time to a mere 4 s, outperforming other comparable outcomes.
Although our proposed work's accuracy of 93.27% falls short of the 97.68% accuracy of the original wine data article [18], it is important to note that their validation strategy utilized a potentially risky leave-one-out approach, which can be negatively affected by the small number of samples per bottle in the wine data. In contrast, our validation strategy is stringent and takes into account the independence of measurements in the wine data, while the gas purification stage (Fig. 1) guarantees the accuracy of each measurement. Overall, our results demonstrate a high level of acceptability and provide a solid foundation for our models' dependability.

Table 4. Comparison of the detection method with other comparable works

                         [20]     [21]     [18]        Proposed Work
Model                    DCNN     DBN      Deep MLP    1DCNN + LSTM
Training Time (sec)      154      N/A      99          3630.518
Recognition Time (sec)   100      25       2.7         4
Testing Accuracy         95.2%    83.7%    97.68%      93.27%

6 Conclusion
This study effectively demonstrates the feasibility of monitoring and detecting wine quality by combining the concepts of 1D-convolution and LSTM in a compact deep learning network. The attained accuracy is up to 93.27%, with an outstanding estimation time of 4 s. Such a compact network architecture enables implementation on small and average-scale software and hardware platforms. In future work, we hope to expand this study by accumulating and analyzing additional samples using the proposed algorithms, ultimately leading to the development of a dedicated wine detection system. Acknowledgments. This work was supported by the Japan Science and Technology Agency (JST) through the Strategic Basic Research Programs, Precursory Research for Embryonic Science and Technology (PRESTO), under Grant JPMJPR20M6.
References

1. Jackson, R.S.: Wine Science: Principles and Applications. Elsevier, Burlington (2008)
2. Aleixandre, M., Cabellos, J.M., Arroyo, T., Horrillo, M.C.: Quantification of wine mixtures with an electronic nose and a human panel. Front. Bioeng. Biotechnol. 6, 14 (2018). https://doi.org/10.3389/fbioe.2018.00014
3. Cretin, B.N., Dubourdieu, D., Marchal, A.: Influence of ethanol content on sweetness and bitterness perception in dry wines. LWT 87, 61–66 (2018). https://doi.org/10.1016/j.lwt.2017.08.075
4. Sáenz-Navajas, M.P., et al.: Sensory-active compounds influencing wine experts' and consumers' perception of red wine intrinsic quality. LWT 60, 400–411 (2015). https://doi.org/10.1016/j.lwt.2014.09.026
5. Santos, J.P., et al.: Threshold detection of aromatic compounds in wine with an electronic nose and a human sensory panel. Talanta 80, 1899–1906 (2010). https://doi.org/10.1016/j.talanta.2009.10.041
6. Lozano, J., Santos, J.P., Aleixandre, M., Sayago, I., Gutierrez, J., Horrillo, M.C.: Identification of typical wine aromas by means of an electronic nose. IEEE Sens. J. 6, 173–178 (2006). https://doi.org/10.1109/JSEN.2005.854598
7. Hazarika, S., Choudhury, R., Montazer, B., Medhi, S., Goswami, M.P., Sarma, U.: Detection of citrus tristeza virus in mandarin orange using a custom-developed electronic nose system. IEEE Trans. Instrum. Meas. 69, 9010–9018 (2020). https://doi.org/10.1109/TIM.2020.2997064
8. Ozmen, A., Dogan, E.: Design of a portable e-nose instrument for gas classifications. IEEE Trans. Instrum. Meas. 58, 3609–3618 (2009). https://doi.org/10.1109/TIM.2009.2018695
9. Peris, M., Escuder-Gilabert, L.: Electronic noses and tongues to assess food authenticity and adulteration. Trends Food Sci. Technol. 58, 40–54 (2016). https://doi.org/10.1016/j.tifs.2016.10.014
10. Zhao, Z., et al.: Vortex-assisted dispersive liquid-liquid microextraction for the analysis of major aspergillus and penicillium mycotoxins in rice wine by liquid chromatography-tandem mass spectrometry. Food Control 73, 862–868 (2017). https://doi.org/10.1016/j.foodcont.2016.09.035
11. Lozano, J., Santos, J.P., Horrillo, M.C.: Chapter 14 - wine applications with electronic noses. In: Rodríguez Méndez, M.L. (ed.) Electronic Noses and Tongues in Food Science, pp. 137–148. Academic Press, San Diego (2016). ISBN 978-0-12-800243-8
12. Gomes, V.M., Fernandes, A.M., Faia, A., Melo-Pinto, P.: Comparison of different approaches for the prediction of sugar content in new vintages of whole port wine grape berries using hyperspectral imaging. Comput. Electron. Agric. 140, 244–254 (2017). https://doi.org/10.1016/j.compag.2017.06.009
13. Lu, B., et al.: Identification of Chinese red wine origins based on Raman spectroscopy and deep learning. Spectrochim. Acta A Mol. Biomol. Spectrosc. 291, 122355 (2023). https://doi.org/10.1016/j.saa.2023.122355
14. Guo, T., et al.: Non-target geographic region discrimination of cabernet sauvignon wine by direct analysis in real time mass spectrometry with chemometrics methods. Int. J. Mass Spectrom. 464, 116577 (2021). https://doi.org/10.1016/j.ijms.2021.116577
15. Dixit, V., Tewari, J.C., Cho, B.-K., Irudayaraj, J.M.K.: Identification and quantification of industrial grade glycerol adulteration in red wine with Fourier transform infrared spectroscopy using chemometrics and artificial neural networks. Appl. Spectrosc. 59, 1553–1561 (2005). https://doi.org/10.1366/000370205775142638
16. Kuntsche, E., Bonela, A.A., Caluzzi, G., Miller, M., He, Z.: How much are we exposed to alcohol in electronic media? Development of the alcoholic beverage identification deep learning algorithm (ABIDLA). Drug Alcohol Depend. 208, 107841 (2020). https://doi.org/10.1016/j.drugalcdep.2020.107841
17. Gao, R., et al.: Classification of multicategory edible fungi based on the infrared spectra of caps and stalks. PLoS ONE 15, 1–14 (2020). https://doi.org/10.1371/journal.pone.0238149
18. Rodriguez Gamboa, J.C., Albarracin, E.S., da Silva, A.J., de Andrade Lima, L., Tiago, T.A.: Wine quality rapid detection using a compact electronic nose system: application focused on spoilage thresholds by acetic acid. LWT 108, 377–384 (2019). https://doi.org/10.1016/j.lwt.2019.03.074
19. Nam, N.Q.D., Liu, A.B., Lin, C.W.: Development of a neurodegenerative disease gait classification algorithm using multiscale sample entropy and machine learning classifiers. Entropy 22(12), 1340 (2020). https://doi.org/10.3390/e22121340
20. Peng, P., Zhao, X., Pan, X., Ye, W.: Gas classification using deep convolutional neural networks. Sensors 18, 157 (2018). https://doi.org/10.3390/s18010157
21. Längkvist, M., Coradeschi, S., Loutfi, A., Rayappan, J.B.B.: Fast classification of meat spoilage markers using nanostructured ZnO thin films and unsupervised feature learning. Sensors 13, 1578–1592 (2013). https://doi.org/10.3390/s130201578
IoT-Enabled Wearable Smart Glass for Monitoring Intraoperative Anesthesia Patients

B. Gopinath1(B), V. S. Yugesh1, T. Sobeka1, and R. Santhi2
1 Kumaraguru College of Technology, Coimbatore, Tamilnadu, India
{gopinath.b.ece,yugesh.19ec,sobeka.19ec}@kct.ac.in 2 PSG College of Arts & Science, Coimbatore, Tamilnadu, India [email protected]
Abstract. Surgeons use many technological advancements to increase the rate of successful surgeries, and adopting sophisticated technologies helps them enhance their operating environment and offer a better surgical as well as patient experience. In this work, a wearable smart glass prototype model has been developed to assist surgeons. The performance of the developed model is tested for the intraoperative anesthesia stage of surgery under laboratory testing conditions. The smart glass display unit is integrated with a set of sensor nodes and a processing unit. Heartbeat rate, respiration rate, and body temperature levels are sensed by the respective sensor nodes connected to a microcontroller. These vital signs are then forwarded to the prototype smart glass model. Since the smart glass system is interfaced with a networkable NodeMCU and a wireless communication facility, the real-time vital sign values from the sensor units are published using the concept of the Internet of Things (IoT) through ThingSpeak. A surgeon wearing the smart glass can thus observe significant changes in the vital signs without looking at various supportive monitors during the intraoperative surgical stage. This procedure helps the surgeon minimize head movements during surgery and avoid surgical difficulties and minor errors. A chief surgeon can access the readings of the sensor nodes from a remote location as plots/charts with timestamps using ThingSpeak. Keywords: Anesthesia · Internet of Things · Smart Glass · Surgery
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 160–170, 2023. https://doi.org/10.1007/978-3-031-46573-4_15

1 Introduction

1.1 Surgical Patient Monitoring System

A surgical patient monitoring system permits surgeons to detect many kinds of abnormalities during surgery so that they can manage the situation quickly [1]. A patient under general anesthesia normally experiences rapid variations in vital signs such as heartbeat rate, respiration rate, body temperature, and blood sugar level. Hence, these vital signs must be monitored persistently by an anesthesiologist and the associated surgeon. This task of surgical monitoring is done by the experts using multiple monitors in the surgical environment. Any surgical procedure includes three sequential phases, namely, the preoperative phase, the intraoperative phase, and the postoperative phase [2]. In the preoperative phase, the need for surgery and the surgical procedure are explained to the patient to stabilize the patient's mental state and reduce anxiety. During the intraoperative phase, general or regional anesthesia is given to the patient. After the anesthesia procedure, the surgical procedure starts, and vital signs are monitored by the surgeon, anesthesiologist, and staff nurse [3]. In the postoperative phase, the patient is transferred to the Post Anesthesia Care Unit (PACU) immediately after the surgery; this stage is mainly focused on the patient's physiological health and surgical recovery. The anesthesia procedure changes the vital signs of patients under surgery and may create critical concerns during the surgery [4]. Hence, the need for surgical patient monitoring systems is increasing. In all surgical theatres, various monitors display all the significant parameters, which are periodically observed by the surgeon, anesthesiologist, and staff nurse. If the vital signs become abnormal, the necessary actions are performed. In a few cases, the message is passed to the chief surgeon, who gives appropriate attention to solve the issue immediately.

1.2 Literature Review

Today, there is a rapid increase in the applications of wireless communication technology in the domain of remote monitoring and control. The Remote Patient Monitoring (RPM) facility is nowadays used to observe the health condition of a patient routinely without the patient needing to be physically present in the clinic. This facility helps patients minimize the expense and time of accessing health services [5].
A wearable display unit was designed and tested [6] for invasive spinal instrumentation surgery, transferring monitor data to display glasses. The experimental results revealed that surgeons with smart glasses performed better during surgery than surgeons without them. Virtual reality and head-mounted displays were tested among medical professionals to analyze patient data for planning the preoperative stage of surgery [7]. The study succeeded in training medical students through the developed system while also enhancing patient care. A detailed review of the role of head-mounted display units in the surgical environment covered 120 research articles [8]; it concluded that only a few of the reviewed experiments used head-mounted displays in the field of anesthesiology. Augmented Reality (AR) based smart glasses were used to enhance accuracy in orthopedic surgery [9]: a pair of smart glasses was integrated with sensor units and display units to assist the surgeon's field of view. In a recent work, the requirements of surgical smart glasses were analyzed, and a theoretical framework was discussed using an extended reality or mixed reality model combining augmented reality and virtual reality concepts [10]. A wearable smart glass based healthcare management system with an IoT facility is presented in this work. In the proposed patient health monitoring system, heartbeat rate, respiration rate, and body temperature values are observed during intraoperative surgery using a set of sensors. These vital signs are displayed on the local LCD, on the smart glass, and on the IoT-based ThingSpeak platform for remote monitoring.
162
B. Gopinath et al.
The transmitter and receiver sections of the NodeMCU with a wireless communication module are used for real-time data transfer during surgery. The vital signs displayed on the smart glass help the surgeons since the information is kept in their field of view. Hence, the surgeon can give full attention to the intraoperative surgical procedure with minimal head movements, which can significantly improve the quality of the surgical process.
2 Experimental Setup and Procedure

The proposed local and remote patient monitoring system is presented in Fig. 1. In this model, the heartbeat sensor, respiration sensor, and temperature sensor gather the vital signs of patients and display them on the local display unit as well as on the smart glasses worn by surgeons. A combination of an infrared Light Emitting Diode (LED) and a phototransistor placed on the finger forms a simple heartbeat sensor (KY-039). An ADP2000 differential pressure sensor, as used in modern respiratory devices, observes the respiratory activity of patients and their spontaneous breathing effort. The LM35 series precision integrated-circuit temperature sensor records the body temperature in Celsius. The sensor outputs are connected to the PIC (Peripheral Interface Controller) microcontroller, which processes the heartbeat rate, respiration rate, and body temperature values of the patient under surgery. The analog-to-digital converter module of the PIC converts the analog output voltages from the sensors to digital form for display on the LCD.
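The ADC conversion step can be illustrated for the temperature channel. This is a sketch only: the LM35's 10 mV/°C scale factor is a datasheet fact, while the 5 V reference and 10-bit resolution are typical values assumed here, not taken from the paper.

```python
def lm35_adc_to_celsius(raw, vref=5.0, adc_bits=10):
    """Convert a raw PIC ADC reading from the LM35 to degrees Celsius.

    The LM35 outputs 10 mV per degree C, so temperature = volts * 100.
    vref and adc_bits are assumed typical values for this sketch.
    """
    volts = raw * vref / (2 ** adc_bits - 1)   # ADC counts -> volts
    return volts * 100.0                        # 10 mV/deg C -> deg C

print(round(lm35_adc_to_celsius(75), 1))  # 36.7
```

The respiration and heartbeat channels follow the same count-to-voltage step, differing only in how the voltage is interpreted (pressure and pulse detection, respectively).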
Fig. 1. Workflow diagram representation of the patient monitoring system (heartbeat, temperature, and respiratory sensors feed a PIC microcontroller; data flows via the transmitter NodeMCU to ThingSpeak and to the receiver NodeMCU driving the smart glass, with a local LCD display and power supply).
Subsequently, the sensor data is supplied to the transmitter NodeMCU wireless module, from which it is transferred to the receiver NodeMCU through the Wi-Fi protocol and the cloud. The sensor data is processed and analyzed in the cloud using ThingSpeak analytics techniques [11]. It is also possible to transmit the data from the receiver NodeMCU to the cloud, but transmitting from the transmitter NodeMCU reduces the data-transfer latency. This enables healthcare providers to observe the health status of patients and identify any potential issues. From the results of the data analytics, the system can generate appropriate alerts to get the attention of caregivers or surgeons. The possible alert services include email for less critical observations and a push Short Message Service (SMS) message through a GSM module. In this work, however, a buzzer unit is attached to the system to generate a sound alert when the observed vital sign values cross the preset values. During critical surgery, it is difficult for the surgeon to watch various monitors to observe the patient's vital signs. The surgeon wears the glass during intraoperative surgery, and the real-time vital signs of the patient are sequentially displayed on the semi-transparent glass placed in the headset. This decreases distraction and saves the surgeon's time. For displaying the observed vital sign values at a remote place, the receiver NodeMCU and the ThingSpeak cloud are utilized to store and retrieve the values. This IoT concept is used for remote monitoring by the chief surgeon during surgery, who sends an alert message whenever the observed vital sign values go beyond the threshold values.
3 Results and Discussions

The proposed surgical patient monitoring system has a PIC-transmitter-NodeMCU section and a receiver-NodeMCU section. In the transmitter section, all the vital sign sensors are interfaced with the PIC microcontroller as shown in Fig. 2.
Fig. 2. Transmitter section with real-time data.
The real-time vital signs data are first collected by the sensors and displayed on the local LCD module of the transmitter section. Then, the data from the microcontroller
Fig. 3. Receiver section of the system.
is forwarded to the transmitter NodeMCU (ESP8266). The transmitter NodeMCU transmits data to the receiver NodeMCU (ESP8266) by establishing Wi-Fi communication, and data transfers are carried out via HTTP requests. The transmitter board serves as a server while the receiver board acts as a client; the SSID access point, password, and IP address parameters of the server board are used during the Wi-Fi data transfer. In the receiver section, the receiver NodeMCU is interfaced with a small OLED screen and a semi-transparent glass placed in front of the OLED screen, as shown in Figs. 3 and 4. The receiver section collects the vital signs data from the transmitter section and displays them on the OLED. The data are then reflected on the semi-transparent glass, which is placed at the appropriate angle from the OLED screen. This smart glass module is worn by the surgeon during intraoperative surgery to observe the real-time vital signs of patients. The limitation of this setup is that the real-time data from the heart rate, respiratory, and temperature sensors are displayed on the semi-transparent glass sequentially; hence, the physician can only observe the values one by one during the surgery.
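Since the exchange between the two boards is plain HTTP, the receiver's parsing step can be illustrated in Python. The payload format and field names (hb, rr, temp) here are hypothetical, chosen only to show the idea, not taken from the paper:

```python
from urllib.parse import parse_qs

def parse_vitals(payload):
    """Parse a query-string style payload of the kind the receiver
    NodeMCU might extract from the transmitter's HTTP response.
    Field names hb/rr/temp are illustrative assumptions."""
    fields = parse_qs(payload)
    return {
        "heartbeat": int(fields["hb"][0]),
        "respiration": int(fields["rr"][0]),
        "temperature": float(fields["temp"][0]),
    }

vitals = parse_vitals("hb=72&rr=16&temp=36.5")
print(vitals)  # {'heartbeat': 72, 'respiration': 16, 'temperature': 36.5}
```

On the actual ESP8266 boards the equivalent parsing is done in Arduino C++, but the request/response structure is the same.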
Fig. 4. Smart glass preliminary prototype setup.
IoT-Enabled Wearable Smart Glass for Monitoring
165
A typical real-time heartbeat rate displayed on the smart glass is shown in Fig. 5. The smart glass worn by one of the authors is photographed from the front and the side and presented in Fig. 6 for the readers' reference.
Fig. 5. Heartbeat rate is being displayed through smart glasses.
Fig. 6. The front and side views of prototype model of smart glass worn by one of the authors.
The transmitter NodeMCU board transmits the sensor data to the ThingSpeak cloud platform. A ThingSpeak account and a new channel are created to store the data. In the newly created channel, the Write API key is noted from the API Keys tab; this key is used to send data to ThingSpeak. The NodeMCU board is programmed through the Arduino IDE to connect to Wi-Fi using the network SSID and password, read data from the sensors, and send it to ThingSpeak over the Wi-Fi connection using the API key. The developed smart glass has been tested under laboratory conditions only. Testing under real-time surgical conditions has not yet been performed in the operation theater
of a hospital environment, since the prototype model is yet to be validated by the surgeon. Meanwhile, a set of 20 observations of the vital signs of a healthy person was recorded using the sensor modules and is presented in Figs. 7, 8 and 9.
Fig. 7. Monitoring the heartbeat rate of intraoperative patients through ThingSpeak.
Fig. 8. Monitoring the respiratory level of Anesthesia patients through ThingSpeak.
Fig. 9. Monitoring the temperature level of Anesthesia patients through ThingSpeak.
Abnormal conditions in heartbeat rate, respiration rate, and temperature are introduced artificially, and the graphical analysis is done using ThingSpeak. The false values, along with the original observations of the vital signs, are tabulated in Table 1, and the performance of the system is evaluated. The normal respiration rate ranges from 12 to 18 breaths per minute, the safe range of heartbeat rate is set between 60 and 100 beats per minute, and the lower and upper temperature limits are fixed at 36 and 37 °C, respectively. Vital sign values above or below these reference ranges are identified by the system visually and graphically. The abnormal heartbeat rates observed are 101 bpm and 55 bpm, as shown in Fig. 7. Similarly, the abnormal respiratory rate observed is 11 bpm, as shown in Fig. 8. The abnormal temperatures read are 35, 38, 41, and 42 °C, as shown in Fig. 9. These abnormal values are monitored on the transmitter LCD screen, on the smart glass by the surgeon using the receiver NodeMCU, and on the cloud by the chief surgeon using IoT-based ThingSpeak. Thus, the proposed smart glass system for monitoring intraoperative anesthesia patients works well under laboratory conditions.
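The threshold check that drives the buzzer alert can be sketched as follows; the reference ranges are the ones stated above (respiration 12–18 breaths/min, heart rate 60–100 bpm, temperature 36–37 °C), and the assumption that the bounds are inclusive is ours.

```python
# Reference ranges from the text; inclusive bounds are an assumption.
NORMAL_RANGES = {
    "heart_rate": (60, 100),     # beats per minute
    "respiration": (12, 18),     # breaths per minute
    "temperature": (36.0, 37.0), # degrees Celsius
}

def classify(heart_rate, respiration, temperature):
    """Return 'Abnormal' if any vital sign leaves its safe range (buzzer case)."""
    readings = {
        "heart_rate": heart_rate,
        "respiration": respiration,
        "temperature": temperature,
    }
    for name, (low, high) in NORMAL_RANGES.items():
        if not (low <= readings[name] <= high):
            return "Abnormal"
    return "Normal"
```

Applied to the rows of Table 1, this rule reproduces the tabulated remarks, e.g. (94, 15, 36.2) is Normal and (101, 11, 38) is Abnormal.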
Table 1. Vital sign values observations under laboratory conditions.
Heartbeat Rate (bpm) | Respiration Rate (bpm) | Temperature (°C) | Remarks
94  | 15 | 36.2 | Normal
94  | 15 | 36.2 | Normal
101 | 11 | 38   | Abnormal
55  | 11 | 35   | Abnormal
93  | 16 | 37   | Normal
93  | 16 | 36.2 | Normal
94  | 16 | 36.2 | Normal
94  | 16 | 36.2 | Normal
94  | 16 | 36.2 | Normal
94  | 16 | 36.2 | Normal
94  | 11 | 42   | Abnormal
94  | 11 | 42   | Abnormal
94  | 16 | 36.2 | Normal
94  | 16 | 36.2 | Normal
94  | 16 | 36.2 | Normal
95  | 16 | 36.2 | Normal
94  | 14 | 41   | Abnormal
94  | 14 | 41   | Abnormal
94  | 16 | 36.2 | Normal
94  | 15 | 36.2 | Normal
Table 2 summarizes the specifications of the developed smart glass in terms of weight, materials, and cost. These specifications reflect the initial, preliminary design. The setup will be redesigned to reduce weight and cost for real-time implementation, and the product will then be tested in the hospital environment during the intraoperative stage of the surgical procedure.
Table 2. Specifications of the smart glass.

Specification | Details
Weight | 40–50 g
Dimensions | 15 × 14 × 4 cm
Material used for frame | Plastic
Material used for display | OLED display
Cost (only smart glass) | $60 (approximate)
4 Conclusion The continuing development of the surgical environment has led to numerous innovations through potentially disruptive technologies in the surgical workplace. One significant surgical requirement is the observation of vital signs during the intraoperative stage. In this work, real-time vital sign data of patients were collected by sensors attached to patients in the surgical environment. Once measured, the values were processed and sent to the smart glass for local use and to a chief surgeon for remote monitoring through a wireless IoT facility, with appropriate timestamps, using ThingSpeak. Thus, the proposed IoT-based patient monitoring system displays a set of vital signs on the semi-transparent smart glass included in a wearable headset. It also alerts the surgeon if an abnormal condition occurs during the intraoperative stage, allowing the surgeon to initiate suitable actions based on the patient's current health status.
References
1. Zeadally, S., Bello, O.: Harnessing the power of internet of things based connectivity to improve healthcare. Internet Things 14, 100074 (2021)
2. Millan, M., Renau-Escrig, A.I.: Minimizing the impact of colorectal surgery in the older patient: the role of enhanced recovery programs in older patients. Eur. J. Surg. Oncol. 46(3), 338–343 (2020)
3. Manta, C., Jain, S.S., Coravos, A., Mendelsohn, D., Izmailova, E.S.: An evaluation of biometric monitoring technologies for vital signs in the era of COVID-19. Clin. Transl. Sci. 13(6), 1034–1044 (2020)
4. Chheang, V., et al.: Toward interprofessional team training for surgeons and anesthesiologists using virtual reality. Int. J. Comput. Assist. Radiol. Surg. 15, 2109–2118 (2020). https://doi.org/10.1007/s11548-020-02276-y
5. Atreja, A., Francis, S., Kurra, S., Kabra, R.: Digital medicine and evolution of remote patient monitoring in cardiac electrophysiology: a state-of-the-art perspective. Curr. Treat. Options Cardiovasc. Med. 21, 1–10 (2019). https://doi.org/10.1007/s11936-019-0787-3
6. Matsukawa, K., Yato, Y.: Smart glasses display device for fluoroscopically guided minimally invasive spinal instrumentation surgery: a preliminary study. J. Neurosurg. Spine 34(1), 150–154 (2020)
7. Kenngott, H.G., et al.: IMHOTEP: cross-professional evaluation of a three-dimensional virtual reality system for interactive surgical operation planning, tumor board discussion and immersive training for complex liver surgery in a head-mounted display. Surg. Endosc. 36(1) (2022)
8. Rahman, R., Wood, M.E., Qian, L., Price, C.L., Johnson, A.A., Osgood, G.M.: Head-mounted display use in surgery: a systematic review. Surg. Innov. 27(1), 88–100 (2020)
9. Fucentese, S.F., Koch, P.P.: A novel augmented reality-based surgical guidance system for total knee arthroplasty. Arch. Orthop. Trauma Surg. 141(12), 1–7 (2021). https://doi.org/10.1007/s00402-021-04204-4
10. Gong, X., JosephNg, P.S.: Technology behavior model—Beyond your sight with extended reality in surgery. Appl. Syst. Innov. 5(2), 35 (2022)
11. Gopinath, B., Boopathy, S., Alagumeenaakshi, M.: Development of an IoT based integrated sensor fusion system to analyze the air pollution level. In: 2021 IEEE International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), pp. 1–5 (2021)
Traffic Density Estimation at Intersections via Image-Based Object Reference Method Hieu Bui Minh1,2 and Quang Tran Minh1,2(B) 1 Department of Information Systems, Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam [email protected] 2 Vietnam National University Ho Chi Minh City (VNU-HCM), Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
Abstract. This study aims to estimate traffic density at intersections using images. The problem involves focusing on two components: the number of vehicles and the area of the region. Counting vehicles is relatively straightforward with various modern techniques available. However, estimating the region’s area is more challenging due to the lack of specific camera configurations for calculations. The proposed technique utilizes traffic vehicles as the reference object to calculate the intersection’s total area dynamically. The study presents two approaches for identifying the region’s area based on the object reference technique and an experimental system architecture for real-time application. Keywords: Traffic density · Region’s area · Object reference
1 Introduction Traffic congestion is a pressing issue in major cities like Ho Chi Minh City, impacting the economy, residents' quality of life, and health [1, 6]. Despite significant investments in closed-circuit television (CCTV) camera systems, their full potential remains untapped. Therefore, the research team aims to develop strategies for evaluating traffic and maximizing the capabilities of these cameras. Various metrics are used to assess traffic conditions; this study focuses on traffic density. Calculating traffic density requires considering both the number of vehicles and the counting area, in this case the junction area. Vehicle counting has been addressed by many studies and modern technologies, particularly machine learning. However, determining the counting area poses additional challenges: each camera has unique characteristics, height, and location, which complicate the calculations. To deal with this problem, we offer a method based on a reference object [7], i.e., using the known size of one object in the scene to derive the size of another. In this study, it was discovered that the reference objects have dynamic rather than static properties: we use traffic vehicles, which are regularly recognized, as the reference objects for the computations. As a result, we developed the two following solutions: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 171–181, 2023. https://doi.org/10.1007/978-3-031-46573-4_16
172
H. B. Minh and Q. T. Minh
– Distance-based method: uses a single object reference, selected by its distance to a fixed point in the image
– Mean-based method: uses all object references for the area calculation by averaging over them
This study yielded several promising results. Firstly, we propose a way to calculate traffic density at intersections. Secondly, we applied and evaluated the above solutions in real-world circumstances. Finally, we developed an experimental system for practical application. The paper has five main parts: the introduction, related work, the problems and solutions for calculating traffic density at intersections, the experimental setup and results, and the conclusion and future work.
2 Related Work Numerous studies have focused on collecting, analyzing, and evaluating traffic status from various sources. In Vietnam, there are existing systems like VTIS (Vietnam Traffic Information System) [8] and UTraffic (Urban Traffic Estimation System) [9]. These web-based systems, including a mobile app for Android, primarily rely on data from VOV (traffic channel FM91 MHz) [8] and VOH (traffic channel FM95.6 MHz) [9], and they depend heavily on user-provided information to scale up and keep their databases up to date. Besides, applications such as Google Maps face challenges in accessing traffic data due to the constraints imposed by Cybersecurity Law 2018 No. 24/2018/QH14 [10]. Regarding the vehicle counting task, several technologies are available. The first is foreground extraction, the simplest and most basic technique for this challenge [2]. It is easy to implement and computationally cheap, but it is very sensitive to external factors such as brightness: the author of [2] had to adjust the brightness parameter between experiments because of changes in pixel intensity. It also requires a clean foreground image, so it is mostly suitable for controlled experiments. The second is machine learning-based approaches, such as Haar cascades [11], Faster R-CNN, Mask R-CNN, ResNet-50, etc. In terms of detection results, the Haar cascade's accuracy is quite low compared to the others because it is sensitive to changes in the input data, while Faster R-CNN, Mask R-CNN, and ResNet-50 are shown to require too much processing time to be applied in real-life cases [3].
Finally, YOLO (You Only Look Once) is one of the state-of-the-art algorithms, improved and further developed with each version; it shows the potential to outperform the other approaches in both performance and ease of use. Converting from pixels to meters is commonly achieved by utilizing the camera's parameter specifications, along with factors like height and distance [4]. While this math-based approach is straightforward to implement and yields low calculation errors, it
necessitates gathering the required parameter information. Unfortunately, certain parameters, such as the camera's focal length or height, can be challenging to obtain due to variations in camera installations at each intersection. Therefore, the concept of a reference object is introduced [7]. This concept allows configuration parameters to be disregarded and only requires the object's actual size. However, it demands that the reference object be stationary and the camera angle perpendicular to the object's plane, which is nearly impractical in the context of a traffic camera. Finally, in [5], the research team presented a data collection and traffic estimation system using CCTV cameras and YOLO v3; however, as the team notes, it is still incomplete in terms of traffic density evaluation. Each of the existing studies and systems discussed above has its own advantages and disadvantages. Based on the above points, this work:
– Proposes an approach to calculate traffic density at intersections.
– Proposes methods to calculate a region's area without relying on camera specifications or static object references.
– Designs an experimental system to collect and estimate traffic density.
3 Problem Definition and Proposed Solutions 3.1 Problem Definition To achieve the goal of evaluating traffic density, this paper addresses two key challenges that need to be resolved, as described below: – The process of counting the number of vehicles needs to be reconsidered. If it were simply a matter of counting vehicles without classification, a car would be treated equally with a motorcycle. This approach fails to capture the spatial occupancy of cars in the area, as it does for trucks and buses. – The calculation of the area needs to be considered and performed independently, without relying on specific camera specifications. It is necessary to generalize the computation for various intersections. 3.2 Proposed Solutions To obtain traffic density, it is required to divide the total number of vehicles in the region (n_vehicles) by the total area of the region in square meters (intersection_area).

density = n_vehicles / intersection_area (1)
Vehicle Counting Related to the first parameter, the total number of vehicles, we propose to use the motorcycle as a standard unit for counting and convert the other vehicle types into it based on the differences in size, as given in Table 1, to overcome the first issue mentioned in the Problem Definition section. Thus, Eq. (2) is improved from Eq. (1), where n_motorcycles is the number of motorcycles after converting.

density = n_motorcycles / intersection_area (2)
Via Eq. (2), intersection_area is the intersection area in square meters (m2) obtained from an image. Thus, based on the object reference concept [7], we propose two methods depending on the conditions described in the next section. In addition, Table 1 is subjective and serves as a reference for conversion purposes based on [14–17]; conducting surveys is necessary to enhance accuracy.

Table 1. Vehicle conversion table based on vehicle area.

No | Vehicle type | Length (mm) | Width (mm) | Conversion ratio
0 | Motorcycle | 1931 | 740 | 1 motorcycle
1 | Car | 3700 | 1500 | 4 motorcycles
2 | Truck | 6000 | 2000 | 6 motorcycles
3 | Bus | 9440 | 2450 | 16 motorcycles
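The conversion step of Eq. (2), using the ratios from Table 1, can be sketched as follows; the example counts and the 120 m2 area are hypothetical inputs.

```python
# Conversion ratios from Table 1 (motorcycle equivalents per vehicle type)
CONVERSION = {"motorcycle": 1, "car": 4, "truck": 6, "bus": 16}

def motorcycle_equivalents(counts):
    # counts: mapping of vehicle type -> number detected in the frame
    return sum(CONVERSION[vtype] * n for vtype, n in counts.items())

def density(counts, intersection_area_m2):
    # Eq. (2): motorcycle-equivalent count divided by the intersection area in m^2
    return motorcycle_equivalents(counts) / intersection_area_m2

d = density({"motorcycle": 10, "car": 2, "truck": 1}, 120.0)
# 10 + 2*4 + 1*6 = 24 motorcycle equivalents over 120 m^2 -> density 0.2
```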
Area Calculation In conducting this research, we divided it into two cases, corresponding to the two methods introduced below:
– Sparse, low-traffic area: the distance-based method is applied.
– Moderately congested area: the mean-based method is applied.
Distance-Based Method This method uses the 2D Euclidean distance as its core. Motorcycles, being common in Vietnam, are chosen as the reference for calculations. Unlike [7], which considers only one reference at a time, images often contain multiple motorcycles, so a mechanism is needed to select the appropriate reference. During our research, we discovered a relationship between an object's position and the number of pixels required to describe it, as seen in Fig. 1. The red dot denotes the image's center, the black dots denote motorcycle positions in general, and the blue dots denote motorcycle positions near the image's center. According to our observations, the pixel-count fluctuation reduces as the motorcycle position nears the image center. We use this finding to select the motorcycle with the shortest distance to the image center using the min function in Eq. (3), where n is the number of identified motorcycles in the region and d denotes the distance between a motorcycle's location and the image center.

min(d_i) = min([d_0, d_1, ..., d_n]) (3)
Having identified the valid object reference, we calculate the conversion ratio between the two units, pixels and square meters, as in Eq. (4). Finally, together with the junction area in pixels (obtained from the segmentation task), we can infer the junction area in m2 via Eq. (5).

ratio = obj_pixel_area / obj_meter_area (4)
Fig. 1. Impact of positions on the pixel intensity in 3D.
intersection_area = intersection_pixel_area / ratio (5)
where: obj_pixel_area is the area of the object reference in pixels; obj_meter_area is the real area of the object reference in square meters (m2); and intersection_pixel_area is the area of the intersection region in pixels. The distance-based method uses the distance comparison to the image center for continuous self-updating, which acts as a bounding condition to avoid explosive values of obj_pixel_area in Eq. (4). However, in cases with numerous vehicles, relying on a single object reference can result in inefficient data utilization. To address this, the mean-based method is introduced.
Mean-Based Method The mean-based method does not identify a single object reference and minimum distance; instead, it takes all the vehicles as references. Thus, Eq. (3) is dropped, Eq. (4) is generalized to Eq. (6), and intersection_area is still calculated via Eq. (5).

ratio = (1/n) * Σ_{i=1..n} obj_pixel_area_{j,i} / obj_meter_area_j (6)
where: n is the number of vehicles and j is the type of vehicle (motorcycle, car, etc.). This method effectively exploits the available data without relying on a fixed reference object. However, it has two weaknesses. Firstly, it requires an even distribution of vehicles within the intersection region to avoid a high error rate in Eq. (5) caused by the dependency shown in Fig. 1. Secondly, it lacks a bounding mechanism like that of the distance-based method, so obj_pixel_area in Eq. (4) may explode.
Comparison of the Two Methods In the low-density state, the distance-based method removes information from vehicles too close to or too far from the camera's perspective, reducing their impact on the ratio
calculation in Eq. (4) based on the impact from Fig. 1. When employing the mean-based method in this condition, assuming most traffic vehicles are concentrated close to the camera’s perspective, the calculated ratio in Eq. (6) heavily depends on the pixel values of vehicles too close to the camera. This leads to a significant increase in the ratio due to the need for more pixels to represent vehicles when they are closer to the camera, and vice versa. In high-density conditions, with sufficient vehicles distributed across the intersection area, the influence from Fig. 1 no longer affects the ratio calculation in Eq. (6) using the mean-based method. Alternatively, applying the distance-based method in high-density conditions is feasible but excludes information from other vehicles like cars, trucks, etc., leading to inefficient data utilization.
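Both ratio computations can be sketched compactly. The detection records below are hypothetical (in practice they come from the segmentation model); the real-world areas approximate Table 1 (a motorcycle of 1.931 m × 0.74 m is about 1.43 m2, a car about 5.55 m2).

```python
import math

def pixel_ratio_distance(detections, image_center):
    # Distance-based method: pick the motorcycle nearest the image center (Eq. 3)
    # and use it alone to form the pixel-per-m^2 ratio (Eq. 4).
    motorcycles = [d for d in detections if d["type"] == "motorcycle"]
    ref = min(motorcycles, key=lambda d: math.dist(d["center"], image_center))
    return ref["pixel_area"] / ref["meter_area"]

def pixel_ratio_mean(detections):
    # Mean-based method: average the per-vehicle pixel-per-m^2 ratios (Eq. 6).
    return sum(d["pixel_area"] / d["meter_area"] for d in detections) / len(detections)

def intersection_area_m2(intersection_pixel_area, ratio):
    # Eq. (5): convert the segmented intersection area from pixels to m^2.
    return intersection_pixel_area / ratio

detections = [
    {"type": "motorcycle", "center": (310, 250), "pixel_area": 2900,  "meter_area": 1.43},
    {"type": "motorcycle", "center": (60, 40),   "pixel_area": 5200,  "meter_area": 1.43},
    {"type": "car",        "center": (400, 300), "pixel_area": 11000, "meter_area": 5.55},
]
r = pixel_ratio_distance(detections, image_center=(320, 240))
area = intersection_area_m2(140_000, r)
```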
4 Experiment Setup and Result The experiments were conducted on a personal computer with an Intel i3 processor and 12 GB of RAM, using Google Drive as the image storage database. The system was developed in Python. It is currently at the prototype stage and has been tested in real-life cases, including the Ba Huyen Thanh Quan - Vo Thi Sau and Ly Chinh Thang - Truong Dinh intersections. 4.1 Overall System Architecture Figure 2 shows a general picture of the experimental system used for collecting data from CCTV cameras and estimating traffic density. The overall system includes three main sections: Data collection, Training server, and Data analysis, while Diagnosis is developed for debugging purposes. 4.2 Automatic Access Due to cybersecurity laws [10], the database cannot be accessed automatically through code; shell commands must be used instead. To overcome this issue, a new process is opened inside the Python script via the subprocess and wget packages [12]. 4.3 Data Setup To address external factors, we segment both vehicles and intersection regions. This approach treats intersections as objects, unlike [2], where they are considered foregrounds. In Fig. 3, the blue segment represents the intersection region, while the other colors indicate vehicle types. For the experiment, we retrained the model with 100 images per intersection and collected data from three intersections. Real-time images are obtained from cameras provided by the transportation department. Interestingly, we found that just 50 images of a new intersection (the minimum requirement) are enough for the system to autonomously recognize the intersection area.
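The subprocess-plus-wget workaround described in Sect. 4.2 can be sketched as below; the camera URL and output path are hypothetical, and the wget flags used are `-q` (quiet) and `-O` (output file).

```python
import subprocess

def build_wget_cmd(url, out_path):
    # -q suppresses wget's progress output; -O writes the snapshot to out_path
    return ["wget", "-q", "-O", out_path, url]

def fetch_snapshot(url, out_path):
    # Launch wget in a child process instead of issuing the HTTP request
    # from Python itself; check=True raises if the download fails.
    subprocess.run(build_wget_cmd(url, out_path), check=True)

cmd = build_wget_cmd("http://camera.example/snapshot.jpg", "/tmp/frame.jpg")
```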
Fig. 2. Overall system architecture.
Fig. 3. Sample of how to segment the data.
4.4 Error Rate Calculation Finally, to evaluate the accuracy of the area calculation, we need to know the real area of the region. To obtain that information, an application called FieldAreaMeasure [13] is used for measurement. Equation (7) is used to calculate the error rate, where real_intersection_area is the real intersection area taken from the application.

error_rate = |intersection_area − real_intersection_area| / real_intersection_area × 100% (7)
4.5 Result and Evaluation
Fig. 4. Detection result at Ba Huyen Thanh Quan – Vo Thi Sau intersection.
Table 2. Survey between traffic density and traffic status at Ba Huyen Thanh Quan – Vo Thi Sau intersection.

No of motorcycles | Traffic density | Traffic status
0–5 | 0–0.055 | Free
6–13 | 0.075–0.21 | Normal
14~ | 0.22~ | Busy
Figure 4 shows that vehicles and the intersection region can be segmented. Using a contour area threshold, we eliminate noise and identify regions of interest for automated traffic density calculation. In Fig. 4, only a few motorcycles are detected, so the distance-based method is applied. The calculated intersection area is 68.7 m2, while the actual area is 73 m2, resulting in an error rate of 5.89%. The traffic density in this sample is 0.029. Table 2 provides survey results comparing the system-calculated traffic density with actual traffic conditions at the intersection. These results are subjective and based on observations, since multiple traffic parameters must be considered along with traffic density to determine the traffic situation accurately.
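The reported figures can be checked against Eqs. (2) and (7); the two-motorcycle count below is our inference from the reported density (0.029 × 68.7 ≈ 2), not a value stated in the text.

```python
calculated_area = 68.7  # m^2, from the distance-based method
real_area = 73.0        # m^2, measured with FieldAreaMeasure

# Eq. (7): relative error between the calculated and measured areas
error_rate = abs(calculated_area - real_area) / real_area * 100
# 4.3 / 73 * 100 ≈ 5.89%, matching the reported error rate

# The reported density of 0.029 is consistent with two detected motorcycles
density = 2 / calculated_area  # motorcycle equivalents per m^2
```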
Fig. 5. Results of 1000 experiments for the distance-based method.
To demonstrate the impact of the dependency shown in Fig. 1 on the error rate, we conducted 1000 experiments during non-rush hours (from 11:00 to 14:00) using the distance-based method. However, even in cases where the distance is nearly zero, the error rate remains high, as depicted in Fig. 5. This error stems from the reliance of intersection_pixel_area in Eq. (5) on the outcome of the machine learning model; additional training is required to address this issue. Furthermore, we performed 1000 experiments during rush hour (from 7:00 to 11:00) to showcase the influence of the number of vehicles on the mean-based method. Figure 6 highlights that the mean-based method lacks stability due to two factors: the absence of boundary conditions and the distribution of vehicles in the region, as discussed in the Mean-Based Method section. Additionally, the dependency of intersection_pixel_area in Eq. (5), as with the distance-based method, contributes to this issue.
Fig. 6. Results of 1000 experiments for the mean-based method.
5 Conclusion and Future Work This work was investigated and developed to validate the proposed solutions. Its key contribution is a set of methods for converting the area of a given region from pixels to m2 that do not depend on camera specifications. Besides that, the paper also contributes an experimental system architecture that runs all the processes automatically or as conveniently as possible. Regarding future work, we would like to integrate this system into a bigger one (UTraffic) from Ho Chi Minh City University of Technology (HCMUT). Moreover, the distance-based and mean-based methods still need further improvement to achieve higher performance. Lastly, image processing techniques will be considered to extract more information from images. Acknowledgement. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for supporting this study.
References
1. Ghazali, W.N.W.B.W., Zulkifli, C.N.B., Ponrahono, Z.: The effect of traffic congestion on quality of community life. In: Wahid, P.A.J., Aziz Abdul Samad, P.I.D.A., Sheikh Ahmad, P.D.S., Pujinda, A.P.D.P. (eds.) Carving The Future Built Environment: Environmental, Economic and Social Resilience, vol. 2, pp. 759–766. European Proceedings of Multidisciplinary Sciences (2017)
2. Karthik Srivathsa, D.S., Kamalraj, R.: Vehicle detection and counting of a vehicle using Opencv. Int. Res. J. Mod. Eng. Technol. Sci. IRJMETS 03(05), 04 (2021)
3. Tahir, H., Shahbaz Khan, M., Owais Tariq, M.: Performance analysis and comparison of faster R-CNN, mask R-CNN and RESNET50 for the detection and counting of vehicles. In: 2021 International Conference on Computing, Communication, and Intelligent Systems, ICCCIS (2021)
4. Vipin, J., Ashlesh, S., Aditya, D., Lakshminarayanan, S.: Traffic density estimation from highly noisy image sources. In: Transportation Research Board 91st Annual Meeting, no. 12–1849. TRB, Washington, D.C. (2012)
5. Mai-Tan, H., Pham-Nguyen, H.N., Long, N.X., et al.: Mining urban traffic condition from crowd-sourced data. SN Comput. Sci. 1, 225 (2020)
6. Thiệt hại của ùn tắc giao thông tại TPHCM [The cost of traffic congestion in Ho Chi Minh City]. https://laodong.vn/giao-thong/un-tac-giao-thongkhien-tphcm-thiet-hai-6-ti-usdnam-1067354.ldo. Accessed 28 May 2023
7. Measuring size of objects in an image with OpenCV. https://pyimagesearch.com/2016/03/28/measuring-size-of-objects-in-an-image-with-opencv/. Accessed 28 May 2023
8. VTIS. https://vtis.vn/. Accessed 28 May 2023
9. UTraffic. https://bktraffic.com/. Accessed 28 May 2023
10. Luật an ninh mạng 2018 số 24/2018/QH14 [Law on Cybersecurity 2018 No. 24/2018/QH14]. https://luatvietnam.vn/an-ninh-quoc-gia/luat-anninh-mang-2018-164904-d1.html. Accessed 28 May 2023
11. Vehicle Detection and Counting System Using OpenCV. https://www.analyticsvidhya.com/blog/2021/12/vehicle-detection-and-counting-system-using-opencv/. Accessed 28 May 2023
12. Using Python and wget to Download Web Pages and Files. https://www.scrapingbee.com/blog/python-wget/. Accessed 28 May 2023
13. GPS Fields Area Measure. https://apps.apple.com/us/app/gps-fields-area-measure/id1123033235. Accessed 28 May 2023
14. Kích thước xe máy [Motorcycle dimensions]. https://bilparking.com.vn/article/kich-thuoc-bai-do-xe-may-tieu-chuan. Accessed 23 July 2023
15. Kích thước xe hơi [Car dimensions]. https://nghiencar.com/kich-thuoc-xe-hoi/. Accessed 23 July 2023
16. Kích thước xe tải [Truck dimensions]. https://vanchuyenachau.com.vn/tin-tuc/kich-thuoc-cac-loai-xe-trong-vantai/. Accessed 23 July 2023
17. Kích thước xe buýt [Bus dimensions]. https://shac.vn/kich-thuoc-xe-buyt-cac-loai. Accessed 23 July 2023
Improving Automatic Speech Recognition via Joint Training with Speech Enhancement as Multi-task Learning Nguyen Hieu Nghia Huynh1,2, Huy Nguyen-Gia1,2, Tran Hoan Duy Nguyen1,2, Vo Hoang Thi Nguyen1,2, Tuong Nguyen Huynh3, Duc Dung Nguyen1,2, and Hung T. Vo1,2(B)
1 Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam {nghia.huynhbachkhoa,nthduy.sdh222,nddung,vthung}@hcmut.edu.vn
2 Vietnam National University Ho Chi Minh City, Ho Chi Minh City, Vietnam
3 Industrial University of Ho Chi Minh City, Ho Chi Minh City, Vietnam [email protected]
Abstract. Multi-Task Learning (MTL) has proven its effectiveness for decades. By combining certain related tasks, neural networks are likely to perform better due to the inductive biases obtained from concrete tasks. Thus, many AI systems (such as GPT) have been developed with MTL as the de facto solution. MTL was applied early in the field of automatic speech recognition (ASR) and has made some significant advances. To continue this work and improve the performance of ASR systems, we propose an MTL-style method that addresses both automatic speech recognition and speech enhancement, where speech enhancement is used as an auxiliary task. We use the Conformer acoustic model as the default architecture in this study; it is also modified to satisfy both tasks. With the proposed method, the performance of the ASR task improved by about 11.5% on the VIVOS dataset and 10.2% on the LibriSpeech 100 h test-other set.
1 Introduction
The performance of an ASR system largely depends on the representations of the acoustic model, known as the encoder in any ASR model. In common situations, training data is not clean; it may contain noise samples or not be large enough to robustify neural networks. Uncertain features that could be classified as either speech or noise features may cause the encoder to produce ambiguous representations. In other words, if the model is unable to reduce the effect of irrelevant features, it may get stuck in an overfitting state. To overcome this issue, the encoder requires more knowledge to represent features better. Data augmentation is one of the most effective solutions. However, to robustify neural networks via inductive biases, we implemented a Multi-Task Learning (MTL) approach. In real-world situations, noisy environments are prevalent and can noticeably affect © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 182–191, 2023. https://doi.org/10.1007/978-3-031-46573-4_17
ASR-SE Joint Training ASR with Speech Enhancement
183
the accuracy and efficiency of ASR systems, especially in languages that lack of resources. Therefore, by implementing an auxiliary task that aims to eliminate noise from speech, our goal is to enhance the overall performance and resilience of our ASR model for low-resource languages. The rest of the paper is as follows. The Sect. 2 shows the related work. The Sect. 3 shows the approach. The experimental setup and results are presented in Sect. 4. Finally, Sect. 5 is the conclusion and discussion about future work.
2 Related Work
Self-Attention-based and convolution-based ASR models have gained popularity and are continually improving. The Conformer [11] architecture, which was introduced in 2020, is a combination of the Self-Attention and Convolution-based approaches and has become a leading ASR model. Over the years, denoising has been tackled with various methods, many of which rely on the Short-time Fourier transform (STFT). This approach has gradually gained popularity and is commonly used in recent studies [1,17]. Typically, this involves converting the input data with noisy signals into a spectrogram using STFT and passing it through a deep neural network to restore the original data without noise. In 2014, the concept of autoencoder was officially introduced as a type of deep neural network that specializes in unsupervised learning tasks [19]. Its architecture comprises three components, namely an encoder, a bottleneck, and a decoder. The primary objective of an autoencoder is to learn the process of reducing data dimensionality via an encoder, representing data in a compressed format through a bottleneck, and reconstructing data from compressed representations using a decoder. Multi-task Learning (MTL) was first introduced in 1997 [2]. Through many experiments, the authors have shown that the hypothesis of sharing low-level representations to improve performance on related tasks is reasonable, laying the foundation for many future studies and applications of MTL. Over the years, MTL has proven to be effective and useful in a wide range of fields, from image processing [9] to natural language processing [3], and especially in speech processing and recognition [6]. In data reconstruction problems, the U-Net network, which is a variant of the autoencoder architecture, is often chosen as the default architecture. 
While U-Net was originally proposed for medical image processing [18], it has proven effective in many other segmentation problems, such as separating objects from the background and detecting object boundaries. In particular, U-Net has shown great potential for speech enhancement (SE), which has been studied in [8,13]. Research on combining ASR and SE as an MTL task has appeared recently [7,14]. The general approach of these studies is to integrate the SE task into the ASR models in a cascaded manner. In this case, the whole model can be divided into two modules, SE and ASR, and data is forwarded through the SE module
N. H. N. Huynh et al.
for enhancement before being recognized by the ASR module. In recent years, U-Net and Transformer-based architectures have become the de facto architectures in many fields of image enhancement as well as speech enhancement. In speech enhancement, the combination of U-Net and Transformer has gained popularity [5,13]. The Transformer-based architecture, which is effective at representing data, increasingly plays a critical role in SE models. This approach has a high potential for further expansion.
3 ASR-SE: An MTL Approach
In this research, we integrated the speech enhancement module as an auxiliary task into the ASR model. As SE requires an appropriate architecture for feature extraction and data reconstruction, we expanded the encoder to accommodate this task.
Fig. 1. The proposed architecture. The Conformer acoustic model is expanded with a Sub-encoder and a Sub-decoder, while the Conformer blocks act as the bottleneck of the U-Net architecture.
The network responsible for the SE task is designed based on the U-Net architecture, which comprises three main components: an encoder, a decoder,
and a bottleneck. Additionally, the ASR architecture we investigate in this study is the Conformer-based Transducer [10] model, which includes three modules: a Conformer-based encoder, a Predictor, and a Joiner. In our architecture, the encoder serves as the bottleneck component of the U-Net and is expanded with a Sub-encoder and a Sub-decoder (Fig. 1). The Sub-encoder and Sub-decoder play distinct roles. The Sub-encoder extracts features from the input data and creates compact, abstract representations that retain essential information while filtering out irrelevant components; it consists of several Subsampling blocks, each reducing the feature dimension by half. The Sub-decoder, on the other hand, reconstructs data from these compact representations; it contains as many Upsampling blocks as there are Subsampling blocks in the Sub-encoder, with each block expanding the data by a factor of two.
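The dimension bookkeeping described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the block count and feature dimension are hypothetical values chosen only to show how the halving/doubling and the reversed skip-connection pairing work out.

```python
# Sketch (assumed values): each Subsampling block halves the feature dimension,
# each Upsampling block doubles it, and skip connections pair blocks in reverse.

def subencoder_dims(feat_dim, n_blocks):
    """Feature dimension after each Subsampling block (halved per block)."""
    dims = [feat_dim]
    for _ in range(n_blocks):
        dims.append(dims[-1] // 2)
    return dims

def skip_pairs(n_blocks):
    """Subsampling block i feeds Upsampling block n_blocks - 1 - i."""
    return [(i, n_blocks - 1 - i) for i in range(n_blocks)]

dims = subencoder_dims(80, 3)   # e.g. an 80-dim spectrogram, 3 blocks
pairs = skip_pairs(3)
print(dims)    # [80, 40, 20, 10] -> the bottleneck sees 10 dims
print(pairs)   # [(0, 2), (1, 1), (2, 0)]
```

The reversed pairing mirrors standard U-Net practice: the earliest (highest-resolution) encoder output is combined with the latest decoder stage.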
Fig. 2. The proposed Subsampling block (left) and Upsampling block (right). Both modules have two residual convolution blocks; the Subsampling block uses a strided convolution to downsample, while the Upsampling block expands the data with a transposed convolution.
Each Subsampling block is connected to its corresponding Upsampling block via a skip connection. For instance, the first Subsampling block is connected to the last Upsampling block, and the last Subsampling block is connected to the first Upsampling block in reverse. The skip connection uses a sum operator to combine the feature maps. Figure 2 illustrates the detailed architecture of the Subsampling and Upsampling blocks, which were inspired by Defossez’s study [5]. The Subsampling blocks follow the principle of reducing the size of feature maps before capturing spatial features with convolution layer(s), while the Upsampling blocks perform these operations in the reverse order. This study differs from the previous studies [5,13] that were conducted on waveform data. Instead, this study performs the SE task on spectrogram data because of the ASR task’s requirements. Therefore, we use two 2D convolution layers instead of one 1D convolution to better represent local features. Each convolution layer has a residual connection, where
its input and output are summed and passed through a BatchNorm layer followed by a ReLU activation. In Subsampling blocks, a strided convolution is employed as a downsampler, while in Upsampling blocks, a transposed convolution is utilized as an upsampler. The objective used for the SE task is the Mean Square Error (MSE) loss. The overall objective of both tasks is the weighted sum of the Transducer (RNNT) and MSE losses, expressed by Eq. (1), where the coefficient λ takes values in the range [0, 1]:

L_total = (1 − λ) L_RNNT + λ L_MSE    (1)
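A minimal sketch of the weighted objective in Eq. (1); the loss values below are placeholders standing in for the actual Transducer and reconstruction losses computed by the model.

```python
# Hedged sketch of Eq. (1); l_rnnt and l_mse are placeholder scalars here.

def total_loss(l_rnnt, l_mse, lam):
    """L_total = (1 - lambda) * L_RNNT + lambda * L_MSE, with lambda in [0, 1]."""
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * l_rnnt + lam * l_mse

print(total_loss(2.0, 0.5, 0.3))  # 0.7 * 2.0 + 0.3 * 0.5 = 1.55
```

Setting λ = 0 recovers pure ASR training, and λ = 1 pure speech enhancement, so λ interpolates between the two tasks.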
Fig. 3. The data pipeline for both the ASR and SE tasks. The components enclosed by the blue boundary form the data flow for the SE task, while the components enclosed by the green boundary form the data flow for the ASR task.
Figure 3 shows the data pipeline for both the ASR and SE tasks. The required inputs are a clean signal and the corresponding transcript. The clean signal is perturbed with noise before being passed through the Sub-encoder and the Conformer encoder. The output of the Conformer encoder is split into two branches: one is passed through the Sub-decoder for reconstruction, and the other through the Joiner for recognition. Two types of noise are used: real noise and white noise. In this study, we use the real-noise corpus published by the Queensland University of Technology (QUT noise) [4]. This corpus contains recordings of various living spaces, such as streets, coffee shops, kitchens, airports, etc. Noise is mixed into the clean signal via a signal-to-noise algorithm. White noise, on the other hand, is a common occurrence in many electronic devices, and adding white noise to clean speech helps model these situations. At each training step, one of these noise types is randomly chosen and applied to the clean signal. In addition, speech has to be converted into spectrograms to perform the ASR task; therefore, the SE task is also performed on this data form.
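The signal-to-noise mixing step can be sketched as below. This mirrors the usual SNR-based mixing procedure rather than the authors' exact implementation: the noise is scaled so that the power ratio between clean signal and added noise matches a target SNR in dB.

```python
import numpy as np

# Sketch (assumed procedure): scale `noise` so that
# 10 * log10(P_clean / P_noise) equals `snr_db`, then add it to the clean signal.

def mix_at_snr(clean, noise, snr_db):
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for one second of clean speech
noise = rng.standard_normal(16000)   # stand-in for a QUT noise excerpt
noisy = mix_at_snr(clean, noise, snr_db=10)
```

A lower SNR value means the scaled noise carries more power relative to the speech, i.e., a harder example.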
4 Experiments and Results
Experimental trials were carried out on the VIVOS [16] and LibriSpeech [15] datasets. The VIVOS dataset, published by AILAB, a laboratory of the Vietnam National University - Ho Chi Minh City, comprises 15 h of audio. The LibriSpeech dataset, which contains multiple English voices, is a highly popular speech dataset; we use its 100 h subset. Our model and method are evaluated using the word error rate (WER) [12], which measures the number of incorrectly predicted words per total number of words in the reference transcriptions. The WER is given by Eq. (2):

WER = (S + D + I) / N    (2)
where:
– S: number of words that must be replaced (substitution errors),
– D: number of words that must be added (deletion errors),
– I: number of words that must be removed (insertion errors),
– N: number of words in the reference.
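Eq. (2) can be computed with a standard word-level edit distance; the sketch below is a generic implementation of that definition, not the evaluation script used in the paper.

```python
# Sketch of Eq. (2): word-level Levenshtein distance divided by reference length.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic programming over substitutions, deletions, and insertions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution over 3 words
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is usually reported as a percentage rather than capped at 100%.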
The initial experimental group, conducted on the VIVOS dataset, used two model sizes, presented in Table 1: a small model with 17.6 million parameters and a large model with 92.5 million parameters. The variable d_model denotes the number of hidden dimensions, Heads refers to the number of heads in the Multi-Head Self-Attention modules, and Layers indicates the number of layers in the network. We utilized the Conformer-based Transducer model mentioned above [10,11] without modification. Two distinct recipes were implemented in this experimental group; their detailed specifications are provided in Table 2. At each step, either real noise or white noise was randomly selected with equal probability and mixed into the clean signal. Mixing real noise into the clean signal requires a signal-to-noise ratio (SNR). The SNR range was set between 5 and 15, with an integer SNR value randomly selected at each step when real noise was applied; both Recipe 1 and Recipe 2 share this SNR range. When white noise is applied, a standard deviation (std) value is randomly chosen between 0 and an upper bound specified for each recipe; artificial noise with mean 0 and the selected standard deviation is then generated and mixed into the signal. The upper bound for std is 0.05 in Recipe 1 and 0.01 in Recipe 2. The random values for SNR and std are uniformly distributed. The last configuration concerns the real-noise mixing level: Recipe 1 mixes noise at the batch level, while Recipe 2 mixes noise at the sample level. Each model size has its own baseline, which is the default configuration of the Conformer-based Transducer model.
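The per-step sampling described above can be sketched as follows. The recipe names and bounds follow Table 2; the function itself is an illustrative reconstruction, not the authors' training code.

```python
import random

# Sketch (assumed logic): each step picks real or white noise with equal
# probability; real noise gets an integer SNR in U{5..15}, white noise gets a
# std in U(0, upper_bound), where the upper bound depends on the recipe.

RECIPES = {"recipe1": {"std_upper": 0.05}, "recipe2": {"std_upper": 0.01}}

def sample_noise_config(recipe, rng=random):
    if rng.random() < 0.5:
        return ("real", rng.randint(5, 15))                         # integer SNR
    return ("white", rng.uniform(0.0, RECIPES[recipe]["std_upper"]))  # white-noise std

random.seed(0)
kind, value = sample_noise_config("recipe1")
```

Sampling a fresh configuration at every step means each utterance (or batch, for Recipe 1) sees a different corruption, which is what gives the perturbation its regularizing effect.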
Table 1. Model Configurations

Size         Small    Large
Parameters   17.6 M   92.5 M
d_model      176      512
Heads        4        8
Layers       16       8

Table 2. Noise Configurations

                  Recipe 1       Recipe 2
Real noise SNR    U{5, 15}       U{5, 15}
White noise std   U(0.0, 0.05)   U(0.0, 0.01)
Mix noise on      batch          sample
Table 3. WER of ASR-SE method on VIVOS dataset

Method                           Configuration   Test clean   Test noise
Small size  Baseline             –               23.8         39.1
            Sub-encoder          –               22.5         33.9
            Noise Perturbation   Recipe 1        23.1         25.0
            Noise Perturbation   Recipe 2        21.6         23.6
            ASR-SE               Recipe 1        22.8         24.1
            ASR-SE               Recipe 2        21.2         23.7
Large size  Baseline             –               22.7         36.0
            ASR-SE               Recipe 1        21.8         23.8
            ASR-SE               Recipe 2        20.1         21.8
The results presented in Table 3 indicate that our proposed method outperforms the corresponding baseline. These findings suggest that ASR models that cannot detect and ignore noise features may not perform well in real-world environments, which are often noisy. Specifically, for the small model size, ASR-SE using Recipe 1 and Recipe 2 exhibited relative improvements of 4.2% and 10.9% (22.8% and 21.2% WER in absolute scale), respectively, compared to the baseline (23.8%). For the large model size, ASR-SE with Recipe 1 and Recipe 2 showed improvements of 4.0% and 11.5% (21.8% and 20.1% in absolute scale), respectively, compared to the baseline (22.7%). For the small model, we conducted experiments called Noise Perturbation to verify that the effectiveness of our method comes not only from noise perturbation but also from the MTL approach. Noise Perturbation differs from ASR-SE in that it does not use the MSE loss, so the encoder is not trained to reconstruct the original data or actively remove noise; in other words, these experiments correspond to a pure data augmentation method. As expected, Noise Perturbation is partially effective, but less so than ASR-SE. Noise Perturbation yielded relative improvements of 2.9% and 9.2% over the baseline (23.8%), corresponding to absolute WERs of 23.1% and 21.6%, respectively. However, these improvements were lower than those of ASR-SE with the corresponding recipe, by a relative margin of 1.3% (23.1% vs. 22.8%) and 1.9% (21.6% vs. 21.2%). The results also show that Recipe 2 outperforms Recipe 1 in both the Noise Perturbation experiments (by approximately 6.5% in relative scale) and ASR-SE (by approximately 7.0% for the small model and 7.8% for the large model, both in relative scale). This could be due to the large standard deviation, which can make the model noisy, or to mixing noise at the batch level, causing the model to struggle to converge to a better local minimum. In addition, to evaluate the impact of the Sub-encoder, which acts as a feature extractor, an experiment called Sub-encoder was conducted; its results appear in Table 3. In this experiment, neither noise addition nor denoising was applied; the only difference from the Baseline is the feature extractor architecture. The results show a relative improvement of about 5.5%, corresponding to an absolute WER of 22.5% against the baseline's 23.8%. This demonstrates that our proposed architecture extracts features better for the ASR task. To assess the effectiveness of our method on noisy speech, we generated a test set by combining the VIVOS test set with noise. The noise corpus consisted of both real-world noise (the QUT noise corpus mentioned above) and synthetic white noise. For each test sample, we selected an SNR value of 10 and added either real noise or white noise with an std of 0.01. The results demonstrate a significant improvement in accuracy using the ASR-SE method, with WER decreasing by 38.4% for Recipe 1 (24.1% in absolute scale) and 39.4% for Recipe 2 (23.7% in absolute scale) relative to the baseline (39.1%). For Recipe 1, ASR-SE outperformed the Noise Perturbation method by approximately 3.6% (24.1% vs. 25.0%). For Recipe 2, however, there was no significant difference between the ASR-SE and Noise Perturbation methods.

Table 4. WER of ASR-SE method on LibriSpeech 100 h dataset

Method                 Configuration   test-clean   test-other
Large size  Baseline   –               12.9         34.4
            ASR-SE     Recipe 2        12.0         30.9
To evaluate the effectiveness of our proposed method in other languages, we conducted experiments with LibriSpeech 100 h, an English dataset. This experimental group used the same network architecture as the previous one, employing the large-size model and Recipe 2. Similar to VIVOS, LibriSpeech 100 h contains relatively clean, low-noise recordings. The experiments on the LibriSpeech 100 h set, shown in Table 4, demonstrated positive results similar to those on VIVOS. ASR-SE improved by approximately 7% on the test-clean set and
10.2% on the test-other set compared to the baseline. This implies that the proposed method could be applied to other languages, especially low-resource languages.
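The relative improvements quoted in this section follow the usual convention for WER (lower is better); the numbers below are taken from Tables 3 and 4.

```python
# Helper reproducing the relative-improvement figures quoted in the text.

def rel_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent (lower WER is better)."""
    return (baseline_wer - new_wer) / baseline_wer * 100

print(round(rel_improvement(22.7, 20.1), 1))  # VIVOS, large model, ASR-SE Recipe 2 -> 11.5
print(round(rel_improvement(34.4, 30.9), 1))  # LibriSpeech test-other -> 10.2
```

Reporting relative rather than absolute reductions makes results comparable across test sets whose baseline WERs differ widely (e.g. 22.7% on VIVOS vs. 34.4% on test-other).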
Fig. 4. Clean (left), noisy (center), and reconstructed (right) speech shown as mel-spectrograms
Figure 4 compares the mel-spectrogram representations of clean, noisy, and reconstructed speech. The noisy speech is heavily corrupted by noise, making it challenging to recognize compared to clean speech. It was cleaned by the SE task, resulting in the reconstructed mel-spectrogram. However, due to limitations in the SE model's accuracy, it may not be possible to remove all noise features while preserving all relevant speech features.
5 Conclusion
Noise is a critical issue in the field of ASR, particularly for low-resource languages. Datasets collected in laboratories or limited in size may not cover all real-life scenarios, and the presence of noise in speech can result in lower recognition accuracy if ASR models cannot appropriately handle noisy features. To address this concern, we have proposed a joint model that simultaneously addresses the ASR and SE tasks using Multi-task Learning. By incorporating SE as an auxiliary task, the joint model is expected to acquire the inductive bias of this task and achieve superior performance on the ASR task by reaching better local minima. Our proposed method has demonstrated a remarkable relative improvement of about 11.5% on the VIVOS dataset and 10.2% on the LibriSpeech test-other set. However, our proposed method has a limitation regarding the impact of the auxiliary task on enhancing the encoder's robustness. In future work, we will enhance the data pipeline to overcome this issue. Acknowledgements. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for supporting this study.
References

1. Carbajal, G., Richter, J., Gerkmann, T.: Disentanglement learning for variational autoencoders applied to audio-visual speech enhancement. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE (2021). https://doi.org/10.1109/waspaa52581.2021.9632676
2. Caruana, R.: Multitask learning. Mach. Learn. 28, 41–75 (1997). https://doi.org/10.1023/A:1007379606734
3. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning, pp. 160–167 (2008). https://doi.org/10.1145/1390156.1390177
4. Dean, D., Sridharan, S., Vogt, R., Mason, M.: The QUT-noise databases and protocols (2010). https://doi.org/10.4225/09/58819f7a21a21
5. Defossez, A., Synnaeve, G., Adi, Y.: Real time speech enhancement in the waveform domain (2020)
6. Deng, L., Hinton, G., Kingsbury, B.: New types of deep neural network learning for speech recognition and related applications: an overview. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599–8603 (2013). https://doi.org/10.1109/ICASSP.2013.6639344
7. Eskimez, S.E., et al.: Human listening and live captioning: multi-task training for speech enhancement (2021)
8. Fu, Y., et al.: Uformer: a Unet based dilated complex & real dual-path conformer network for simultaneous speech enhancement and dereverberation (2022)
9. Girshick, R.: Fast R-CNN (2015)
10. Graves, A.: Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711 (2012)
11. Gulati, A., et al.: Conformer: convolution-augmented transformer for speech recognition (2020). https://doi.org/10.48550/arXiv.2005.08100
12. Klakow, D., Peters, J.: Testing the correlation of word error rate and perplexity. Speech Commun. 38(1), 19–28 (2002). https://doi.org/10.1016/S0167-6393(01)00041-3
13. Kong, Z., Ping, W., Dantrey, A., Catanzaro, B.: Speech denoising in the waveform domain with self-attention (2022)
14. Ma, D., Hou, N., Pham, V.T., Xu, H., Chng, E.S.: Multitask-based joint learning approach to robust ASR for radio communication speech (2021)
15. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: LibriSpeech: an ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015). https://doi.org/10.1109/ICASSP.2015.7178964
16. Quan, P.V.H.: VIVOS: 15 hours of recording speech prepared for Vietnamese automatic speech recognition, Ho Chi Minh, Vietnam (2016)
17. Richter, J., Carbajal, G., Gerkmann, T.: Speech enhancement with stochastic temporal convolutional networks. In: Interspeech (2020)
18. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation (2015)
19. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015). https://doi.org/10.1016/j.neunet.2014.09.003
Solving Feature Selection Problem by Quantum Optimization Algorithm

Anh Son Ta(B) and Huy Phuc Nguyen Ha

School of Applied Mathematics and Informatics, Hanoi University of Science and Technology, Dai Co Viet, Hanoi, Vietnam
[email protected], [email protected]
Abstract. This study proposes a method for feature selection with a Quadratic Unconstrained Binary Optimization (QUBO) formulation, applying Conditional Value at Risk (CVaR) combined with the Quantum Approximate Optimization Algorithm (QAOA), using a hybrid Differential Evolution (DE)-Trotterized Quantum Annealing (TQA) initialization method to solve the QUBO formulation of feature selection. This is a new approach to feature selection, which is very important for machine learning research. The method is applied to 11 real-life datasets and the results improve significantly.
Keywords: Feature selection · QAOA · Conditional Value at Risk · CVaR · QUBO · TQA · DE

1 Introduction
In recent years, the development of big data and machine learning has led to the need for methods that analyze and process information, especially for classification, regression, and clustering problems. One such method, feature selection, has become increasingly crucial in information processing, as the presence of irrelevant features can lead to overfitting and more expensive computation. Feature selection means selecting a subset of features that are highly relevant to the output; it can thus help training models decrease computing time and increase accuracy. However, the process of choosing the most relevant features requires massive computational resources: the straightforward evaluation of all possible feature subsets is an NP-hard problem whose complexity grows with the number of features. Feature selection techniques can be categorized into three main groups: filter methods, wrapper methods, and ensemble methods. Despite their effectiveness, these methods have not yet resolved the computational challenges that plague the feature selection process. In recent papers, the feature selection problem is reformulated as a quadratic binary optimization problem. Naghibi T. et al. [6] use a semidefinite programming algorithm to find approximate solutions to feature selection problems based on the Mutual Information measure. Another approach to this problem is using quantum optimization algorithms. Ferrari Dacrema M. et al. [26] use the Quantum Annealing method on three formulations of the feature selection problem over 15 publicly available datasets; their results are comparable with classical methods. Mucke S. et al. [7] focus on Mutual Information and show similar results to [26]. Turati G. [2] considered QAOA for solving three formulations and reported numerical results on 7 real-life datasets. However, Turati G. [2] did not consider the initialization of QAOA parameters, which can lead to locally optimal solutions. In this article, we emphasize the correlation formulation of feature selection and use CVaR optimization and QAOA with a hybrid Trotterized Quantum Annealing [3]-Differential Evolution [8] initialization for classification problems. Numerical results on benchmark datasets illustrate the efficiency of our proposed method.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 192–201, 2023. https://doi.org/10.1007/978-3-031-46573-4_18
2 Feature Selection Model
The quadratic programming model of feature selection was first introduced in [1], where the proposed solution is used for ranking features and serves as a relaxed solution of the QUBO problem. In this article, we consider the classification problem with n features f_1, f_2, ..., f_n and label y. The primary task is to find the k features that are most relevant to the label y, which can then be used for classification. This approach helps maintain the accuracy of classification on the original data while reducing the complexity of the model. We consider it in the form of a QUBO problem:

min x^T Q x
s.t. Σ_{i=1}^n x_i = k,  0 < k < n,

where x = (x_1, x_2, ..., x_n) ∈ {0, 1}^n and x_i represents the selection of feature f_i: x_i = 1 means f_i is chosen and x_i = 0 otherwise. Q is a symmetric matrix or an upper-triangular matrix, and can be built from the Pearson correlation, Mutual Information, or SVC [2]. In this article, Q is defined by the Pearson correlation r(f_i, f_j) [2]:

Q_{i,j} = r(f_i, f_j),  Q_{i,i} = −r(f_i, y).

The Pearson correlation measures the correlation between two features, and between each feature and the labels in the dataset; the values r(f_i, f_j) lie in the range [−1, 1]. If all r(f_i, f_j) were equal to 1 or −1, all present features would be chosen, since they are important indicators that influence the output of the model and we would have to use all of them for the machine learning task. The condition Σ_{i=1}^n x_i = k ensures that exactly k features are selected.
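The construction of Q from Pearson correlations can be sketched as follows; the toy data below is hypothetical and stands in for one of the benchmark datasets.

```python
import numpy as np

# Sketch of building the QUBO matrix described above: off-diagonal entries
# Q[i, j] = r(f_i, f_j) penalize redundant feature pairs, diagonal entries
# Q[i, i] = -r(f_i, y) reward features correlated with the label.

def build_q(features, y):
    n = features.shape[1]
    q = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                q[i, i] = -np.corrcoef(features[:, i], y)[0, 1]
            else:
                q[i, j] = np.corrcoef(features[:, i], features[:, j])[0, 1]
    return q

def qubo_value(Q, x):
    """Objective x^T Q x for a 0/1 selection vector x."""
    return float(x @ Q @ x)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = X[:, 0] + 0.1 * rng.standard_normal(200)  # feature 0 is highly relevant
Q = build_q(X, y)
```

Because the relevant feature has a strongly negative diagonal entry, any selection vector including it lowers the objective, which is exactly what the minimization exploits.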
3 Solving Feature Selection Problems by CVaR-QAOA

3.1 Quantum Approximate Optimization Algorithm
The Quantum Approximate Optimization Algorithm (QAOA) was first introduced in 2014 by Farhi et al. [9]. QAOA is a hybrid classical-quantum algorithm for solving optimization problems. It is designed to run on near-term quantum computers and works by encoding the problem as a cost function that is minimized via a series of quantum gates and measurements. The algorithm alternates between classical optimization of the parameters that control the quantum gates and quantum evolution under those gates. Its output is a quantum state that approximates the solution of the optimization problem. QAOA has been used for a variety of optimization problems, including graph partitioning, MaxCut, portfolio optimization, and many other combinatorial problems. The state function is defined as

|ψ(β, γ)⟩ = U(β_1) U(γ_1) ... U(β_k) U(γ_k) |+⟩^{⊗n}

with U(β) = e^{−i H_B β} and U(γ) = e^{−i H_f γ}. U(β) and U(γ) are parameterized quantum gates, where H_f is the problem Hamiltonian

H_f = Σ_{i,j=1}^n a_{i,j} σ_i^Z σ_j^Z + Σ_{i=1}^n b_i σ_i^Z + c

with σ_i^Z the Pauli Z matrix; H_f corresponds to the objective function of the combinatorial optimization problem. H_B is the mixer Hamiltonian

H_B = Σ_{i=1}^n σ_i^X

with σ_i^X the Pauli X matrix. The vector |+⟩ is defined by

|+⟩ = (1/√2)(|0⟩ + |1⟩)

with |0⟩ = (1, 0)^T and |1⟩ = (0, 1)^T the two basis states of a qubit. The QAOA objective function is defined as follows:

min_{β,γ ∈ [0,2π]} ⟨ψ(β, γ)| H_f |ψ(β, γ)⟩    (1)

The expectation can be represented as follows:

⟨ψ(β, γ)| H_f |ψ(β, γ)⟩ = Σ_i λ_i |⟨ψ(β, γ)|x_i⟩|^2,    (2)
with x_i a feasible solution represented by a bitstring encoded in the quantum computer, λ_i the value of the objective function corresponding to x_i, and |⟨ψ(β, γ)|x_i⟩|^2 the probability of measuring x_i. Notice that all solutions are encoded into columns of the identity matrix of size 2^n × 2^n, with n the bitstring length corresponding to the feasible solutions of the combinatorial optimization problem, and the values of the objective function are the eigenvalues of H_f. The algorithm uses classical computers to optimize the expectation, with gradient-based and gradient-free optimizers such as gradient descent, COBYLA, SPSA [24], etc., starting from initial parameters (β_0, γ_0).

3.2 CVaR Optimization for QAOA
In quantum mechanics, observables are defined by the expectation value ⟨ψ|H|ψ⟩; this concept underlies the calculation of mean values in QAOA. As a result, we can use the minimum of the observations, min{H_1, H_2, ..., H_k}, as the objective function. However, this function is non-smooth for finite k; applying classical optimization algorithms to it is therefore challenging due to its complexity and non-smoothness. Consequently, we use the CVaR (Conditional Value at Risk) function in order to find the optimal solution. In general, the CVaR of a random variable X for a confidence level α ∈ (0, 1] is written

CVaR_α = E[X | X ≤ F_X^{−1}(α)]

with F_X the cumulative distribution function of X. From this equation, we see that CVaR is the expectation over the lower tail of the distribution of X. Consider the samples H_k sorted in non-decreasing order. CVaR_α is then defined as

(1/⌈αK⌉) Σ_{k=0}^{⌈αK⌉} H_k.

As α decreases toward 0, this function approaches the minimum value; for α = 1, it equals the expectation of X. In QAOA, H_k is the value of the objective function corresponding to the bitstring x_k sampled from the parameterized QAOA circuit with probability |⟨ψ(β, γ)|x_k⟩|^2, k = 1, ..., N, with N the number of samples. The main idea of using CVaR in QAOA is that instead of using all measurement outcomes and computing the expectation, we use the objective function on the tail of the energy distribution; in other words, we use only a small fraction of the measurement outcomes. By concentrating on the lowest-energy states rather than improving the average energy of all outcomes, we amplify the best measurement outcomes, since plain QAOA treats all measurement outcomes with equal importance, even outlier states whose energy is far from our goal.
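The CVaR objective over sampled energies can be sketched directly from the formula above; the sample values below are illustrative placeholders, not measurements from a real circuit.

```python
import math

# Sketch of the CVaR_alpha objective: average only the lowest alpha-fraction
# of sampled energies instead of all of them.

def cvar(energies, alpha):
    k = max(1, math.ceil(alpha * len(energies)))  # size of the lower tail
    tail = sorted(energies)[:k]
    return sum(tail) / k

samples = [4.0, -2.0, 1.0, -5.0, 0.0, 3.0, -1.0, 2.0]
print(cvar(samples, 0.25))  # mean of the 2 lowest: (-5 + -2) / 2 = -3.5
print(cvar(samples, 1.0))   # plain expectation over all 8 samples = 0.25
```

Note the two limits described in the text: α = 1 recovers the ordinary QAOA expectation, while a small α approaches the minimum sampled energy.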
3.3 Apply CVaR-QAOA to Feature Selection Problem
In order to use the Quantum Approximate Optimization Algorithm, we have to transform the feature selection problem into the Ising model [5]. From the formulation, we transform the binary variables to {−1, 1} via s = 2x − e with e = (1, 1, ..., 1)^T, and obtain the following formula:

min (1/4) s^T Q s + (1/2) s^T Q e + c
s.t. Σ_{i=1}^n s_i = 2k − n

The penalty problem has the following formulation:

f(s) = min (1/4) s^T Q s + (1/2) s^T Q e + c + ρ (Σ_{i=1}^n s_i − 2k + n)^2    (3)
with s_i ∈ {−1, 1}, i = 1, 2, ..., n, the components of the vector s, and ρ a positive constant; ρ can be chosen based on Lemma 2.1 in [10]. In this article, we substitute each s_i with σ_z^i, the Pauli Z matrix. The problem Hamiltonian formed from the feature selection problem has the following formula:

H_f = Σ_{i,j=1, i≠j}^n a_{i,j} σ_z^i σ_z^j + Σ_{i=1}^n b_i σ_z^i + c
with ai,j , bi and c = i=1 qii + nρ + ρ(2k − n)2 are the coefficents of (3). After transforming the feature selection problem into Hamiltonian, we apply QAOA to calculate all energies and probabilities of all solutions. From [11] article we know that finding the optimal value of QAOA is an NP-hard problem. The hardness of the problem is to find the initial parameters since the QAOA function is not convex and using a classical optimizer can lead to a local solution if we do not have a good initialization. In order to improve QAOA, we have to use the QAOA parameters initialization method by using hybrid Trotterized Quantum Annealing (TQA) [3]- Differential Evolution (DE) [8]. TQA method helps us to find good initialization parameters for the QAOA expectation and it shows its effectiveness with the Max-Cut problem [3]. This method is to find the best time value t that has a high approximate ratio of the problem. We used the TQA method by finding the best time value t of Quantum Annealing and used the formula γj = pj Δt, βj = (1 − pj )Δt for each layer with Δt = Tp is the time step, T is the time that Quantum Annealing has the highest approximate ratio. Using this method, we have this estimation shows the error of using the Trotter formula for the state function of Quantum Annealing: j
j
j
j
e−i[(1− p )ΔtHB + p ΔtHf ] ≈ e−i(1− p )ΔtHB e−i p ΔtHf + O(Δt2 ).
(4)
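The $O(\Delta t^2)$ behaviour in (4) can be observed numerically. A small numpy sketch (random unit-norm Hermitian matrices stand in for $H_B$ and $H_f$; the matrix exponential is implemented via eigendecomposition) checking that halving $\Delta t$ shrinks the splitting error by roughly a factor of four:

```python
import numpy as np

rng = np.random.default_rng(1)

def rand_unit_herm(n):
    """Random Hermitian matrix, normalized to unit spectral norm."""
    A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    H = (A + A.conj().T) / 2
    return H / np.linalg.norm(H, 2)

def expm_herm(H, t):
    """e^{-i t H} for Hermitian H via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * t * w)) @ V.conj().T

HB, Hf = rand_unit_herm(8), rand_unit_herm(8)
lam = 0.3  # plays the role of j/p in Eq. (4)

def trotter_error(dt):
    exact = expm_herm((1 - lam) * HB + lam * Hf, dt)
    split = expm_herm(HB, (1 - lam) * dt) @ expm_herm(Hf, lam * dt)
    return np.linalg.norm(exact - split, 2)

# first-order Trotter splitting: halving dt shrinks the error ~4x
ratio = trotter_error(0.08) / trotter_error(0.04)
assert 3.0 < ratio < 5.0
```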
Furthermore, we also have the inequality below, which shows the upper bound of the error when using the Trotter formula:
Solving Feature Selection Problem by Quantum Optimization Algorithm
197
Proposition 1. With $H_f = \sum_{i,j=1,\,i\neq j}^{n} a_{i,j}\sigma_z^i\sigma_z^j + \sum_{i=1}^{n} b_i\sigma_z^i + c$, $M$ defined by $M = \max\{\|H_f\|, \|H_B\|\}$, $j = 1, 2, \ldots, p$ with $p$ the depth of the QAOA circuit, $\Delta t$ the time step of Quantum Annealing, and $\|A\| = \max_{\|x\|_2 = 1}\|Ax\|$, we have:

$$\Big\|e^{-i[(1-\frac{j}{p})\Delta t H_B + \frac{j}{p}\Delta t H_f]} - e^{-i(1-\frac{j}{p})\Delta t H_B}\, e^{-i\frac{j}{p}\Delta t H_f}\Big\| \le \frac{1}{2}\Big(\frac{T}{p}\Big)^2\,\|[H_B, H_f]\|\, e^{\frac{2MT}{p}}$$

Proof. Using Corollary 2 in [12], we have:

$$\Big\|e^{-i[(1-\frac{j}{p})\Delta t H_B + \frac{j}{p}\Delta t H_f]} - e^{-i(1-\frac{j}{p})\Delta t H_B}\, e^{-i\frac{j}{p}\Delta t H_f}\Big\| \le \Big|\big(1-\tfrac{j}{p}\big)\tfrac{j}{p}\,\Delta t^2\Big|\,\|[H_B, H_f]\|\, e^{|(1-\frac{j}{p})\Delta t|\,\|H_B\| + |\frac{j}{p}\Delta t|\,\|H_f\|}$$
$$\le \frac{1}{2}\Delta t^2\,\|[H_B, H_f]\|\, e^{|(1-\frac{j}{p})\Delta t| M + |\frac{j}{p}\Delta t| M} \le \frac{1}{2}\Big(\frac{T}{p}\Big)^2\,\|[H_B, H_f]\|\, e^{\frac{2MT}{p}}$$

Proposition 1 shows the upper bound of the error when using the Trotter formula for Quantum Annealing and helps us choose $p$ so as to minimize this error. The value of $p$ is chosen much larger than $T$ to minimize the Trotter error and to ensure that $(1-\frac{j}{p})\Delta t$ and $\frac{j}{p}\Delta t$ are less than 1. After finding the best value of $T$, we use Differential Evolution (DE) [8] to find initialization parameters for the QAOA circuit by searching $\beta_j \in [(1-\frac{j}{p})\Delta t, 2\pi]$, $\gamma_j \in [\frac{j}{p}\Delta t, 2\pi]$. We use the result as initial parameters to optimize the expectation with classical optimizers. After optimizing the expectation, we use CVaR to obtain optimal solutions. With a chosen $\alpha$, we take the solutions in the $\alpha$ tail of the distribution and calculate the expected value; the resulting solution indicates which features to use for classification. The process is summarized in Algorithm 1.

Algorithm 1. Hybrid Differential Evolution-Quantum Annealing initialization for QAOA-CVaR
1: Build the QAOA circuit with depth $p$ and the two Hamiltonian operators $H_f$ and $H_B$
2: Find the time step $\Delta t$ of TQA by solving $\arg\min_T \langle\psi(t)|H|\psi(t)\rangle \to T^{opt}$ with $|\psi(t)\rangle = e^{-i(1-\frac{j}{p})\Delta t H_B}\, e^{-i\frac{j}{p}\Delta t H_f}$ and set $\Delta t = \frac{T^{opt}}{p}$
3: Set $\gamma_j = \frac{j}{p}\Delta t$, $\beta_j = (1-\frac{j}{p})\Delta t$
4: Use Differential Evolution to find $\gamma_j^* \in [\frac{j}{p}\Delta t, 2\pi]$, $\beta_j^* \in [(1-\frac{j}{p})\Delta t, 2\pi]$
5: Run a classical optimizer for the QAOA circuit with $(\gamma_j^*, \beta_j^*)$ as initial parameters
6: Calculate the feature selection objective values for the bitstrings sampled from the QAOA circuit and sort them in increasing order
7: Apply CVaR optimization with the chosen $\alpha$ to the distribution of bitstrings
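Steps 2-4 of Algorithm 1 amount to building a linear TQA ramp and turning it into per-layer search intervals for DE. A sketch of that bookkeeping (the QAOA expectation and the DE search itself are omitted; `tqa_init` is a hypothetical helper name):

```python
import numpy as np

def tqa_init(p, T):
    """Given the best annealing time T found by TQA, build the
    linear-ramp initial QAOA angles (step 3 of Algorithm 1) and the
    per-layer intervals handed to Differential Evolution (step 4)."""
    dt = T / p
    j = np.arange(1, p + 1)
    gamma = (j / p) * dt            # gamma_j = (j/p) * dt
    beta = (1 - j / p) * dt         # beta_j  = (1 - j/p) * dt
    gamma_bounds = [(g, 2 * np.pi) for g in gamma]
    beta_bounds = [(b, 2 * np.pi) for b in beta]
    return gamma, beta, gamma_bounds, beta_bounds

gamma, beta, gb, bb = tqa_init(p=5, T=2.0)
assert np.isclose(gamma[-1], 2.0 / 5)   # gamma_p = dt
assert np.isclose(beta[-1], 0.0)        # beta_p = 0
```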
A. S. Ta and H. P. N. Ha

4 Numerical Simulation
In order to show that this feature selection method is efficient, we test it on several benchmark datasets with machine learning models: Logistic Regression (LR), XGBoost (XGB), and Histogram Gradient Boosting (HistGB), and use the Simulated Annealing algorithm for parameter tuning, following [25]. The 11 benchmark datasets are taken from UCI: Wilt [4], Wireless Indoor Localization [13], Raisin grain [14], Rice [15], Accent [16], Stability [17], HTRU2 [18], Occupancy [19], seeds [20], Room Occupancy Estimation [21], and one dataset is taken from [22]. Furthermore, Accent [16] and seeds [20] are not balanced, so we use the SMOTE method [23] to oversample the minority class. We ran QAOA on IBM's Qiskit Aer matrix-product-state simulator with 100 qubits and COBYLA as the classical optimizer. For CVaR optimization, we choose α equal to 0.01%. The accuracies of classification models using this feature selection method are reported and compared with previous articles that use all features in Table 1.

Table 1. Model performance on datasets

| Dataset | Accuracy (%) | Prev. result (%) | No. of features | Model |
|---|---|---|---|---|
| Wilt [4] | 95.45 | 90 | 4 | XGB |
| Wireless Indoor Localization [13] | 96.834 | 95.16 | 6 | XGB |
| Raisin grain [14] | 87.037 | 86.44 | 5 | HistGB |
| Rice [15] | 93.07 | 93.02 | 5 | LR |
| Accent [16] | 92.42 | – | 8 | XGB |
| Stability [17] | 99.97 | 80 | 7 | XGB |
| HTRU2 [18] | 97.933 | 97.8 | 5 | HistGB |
| Occupancy [19] | 99.57 | 99 | 3 | XGB |
| Room Occupancy Estimation [21] | 99.5 | 98.4 | 10 | XGB |
| seeds [20] | 92.06 | 92 | 4 | XGB |
| pv fault [22] | 99.66 | 90.05 | 4 | HistGB |
Here, No. of features is the number of features selected from the original datasets by the TQA-DE-CVaR-QAOA method, and Model is the machine learning model used on each dataset. Table 1 shows that the accuracy of our proposed method is higher, and the number of features used for the classification models is smaller, than previous results on 11 well-known benchmark datasets. This can be explained by the method choosing the features most relevant to the output, while Simulated Annealing finds the best parameters for each machine learning model. Furthermore, we compare the probabilities obtained from random initial parameters and the three methods in Table 2. From Table 2, we can see that for every dataset the hybrid DE-TQA method yields a higher probability of the optimal solution than TQA initialization and random initialization. For instance, using random initialization for the QAOA circuit on Accent [16], Stability [17], and HTRU2 [18] leads to a probability of the optimal solution equal to 0.
Furthermore, Room Occupancy Estimation [21] has a probability equal to 0 with both the TQA and random methods. CVaR can still lead to the optimal solution of the feature selection problem in this case, since it computes the expectation over the tail of the value distribution generated by the QAOA circuit. By contrast, the probability of optimal solutions using the Differential Evolution-TQA method is better, since we restrict the interval of each parameter by using the TQA results as the lower bound of each interval, and Differential Evolution can find better initial parameters within these intervals. Table 2 shows the probability of the optimal solution for the three initialization methods.

Table 2. Probability of optimal solution of the feature selection problem taken from the QAOA circuit with 3 initialization methods

| Dataset | TQA initial | Random initial | TQA-DE initial | Model |
|---|---|---|---|---|
| Wilt [4] | 0.10058594 | 0.015625 | 0.14160156 | XGB |
| Wireless Indoor Localization [13] | 0.10839844 | 0.02734375 | 0.14160156 | XGB |
| Room Occupancy Estimation [21] | 0 | 0 | 0.00195312 | XGB |
| Raisin grain [14] | 0.0078125 | 0.00488281 | 0.02929688 | HistGB |
| Rice [15] | 0.01660156 | 0.00683594 | 0.08007812 | LR |
| Accent [16] | 0.00195312 | 0 | 0.00390625 | XGB |
| Stability [17] | 0.00097656 | 0 | 0.00097656 | XGB |
| HTRU2 [18] | 0.00195312 | 0.00097656 | 0.00878906 | HistGB |
| Occupancy [19] | 0.10644531 | 0.06835938 | 0.22265625 | XGB |
| pv fault [22] | 0.03613281 | 0.00488281 | 0.06933594 | HistGB |
| seeds [20] | 0.01367188 | 0.00585938 | 0.02636719 | XGB |
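The CVaR aggregation used above can be sketched as follows (a generic CVaR-α over sampled energies; in the paper the energies would come from evaluating the objective (3) on bitstrings measured from the QAOA circuit):

```python
import numpy as np

def cvar_expectation(energies, probs, alpha):
    """CVaR_alpha: expected energy over the best (lowest-energy)
    alpha-tail of the sampled bitstring distribution."""
    order = np.argsort(energies)                 # increasing, best first
    e = np.asarray(energies, float)[order]
    p = np.asarray(probs, float)[order]
    total, acc = 0.0, 0.0
    for ei, pi in zip(e, p):
        w = min(pi, alpha - acc)                 # clip the last slice
        if w <= 0:
            break
        total += w * ei
        acc += w
    return total / alpha

# with alpha covering only the single best outcome, CVaR is its energy
e = [3.0, 1.0, 2.0]
p = [0.5, 0.2, 0.3]
assert cvar_expectation(e, p, alpha=0.2) == 1.0
# with alpha = 1, CVaR reduces to the ordinary expectation
assert np.isclose(cvar_expectation(e, p, alpha=1.0),
                  0.5 * 3.0 + 0.2 * 1.0 + 0.3 * 2.0)
```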
Finally, we show the TQA approximation-ratio graphs of two instances in Fig. 1. From these graphs, we can choose the initial parameters for each depth of the QAOA circuit p ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} based on the Quantum Annealing time. We choose p based on the approximation ratio and the error of the approximation (4), in order to minimize the error of the Trotter formula and reduce the complexity of computing the expectation (1). The approximation ratio in Fig. 1 fluctuates considerably because TQA accepts non-feasible solutions of the feature selection problem, which pushes the expected value far from the optimal solution. TQA accepts non-feasible solutions because its initial state assigns all solutions the same probability. This is why the approximation ratio changes considerably over the time interval.
Fig. 1. The TQA optimal time for 2 datasets: Room Occupancy Estimation and Wireless Indoor Localization
5 Conclusion and Future Work
In this article, we show that it is possible to use DE-TQA-CVaR-QAOA to perform feature selection based on a Pearson-correlation-coefficient QUBO formulation. To illustrate the effectiveness of our method, we tested it on 11 real-life datasets with machine learning models and compared the results with the solutions proposed in previous articles. Furthermore, we showed that the probabilities of optimal solutions increase when using DE-TQA for parameter initialization. Our approach can be readily employed with other measures of significance, such as mutual information, entropy, or other information-theoretic measures. In the future, we plan to apply it to other real-life datasets and develop a new initialization method for QAOA parameters.
References

1. Rodriguez-Lujan, I., Elkan, C., Cruz, C.S., Huerta, R.: Quadratic programming feature selection. J. Mach. Learn. Res. 11, 1491–1516 (2010)
2. Turati, G., Dacrema, M.F., Cremonesi, P.: Feature selection for classification with QAOA. In: 2022 IEEE International Conference on Quantum Computing and Engineering (QCE). IEEE (2022)
3. Sack, S.H., Serbyn, M.: Quantum annealing initialization of the quantum approximate optimization algorithm. Quantum 5, 491 (2021)
4. Johnson, B.A., Tateishi, R., Hoan, N.T.: A hybrid pansharpening approach and multiscale object-based image analysis for mapping diseased pine and oak trees. Int. J. Remote Sens. 34(20), 6969–6982 (2013)
5. Lucas, A.: Ising formulations of many NP problems. Front. Phys. 2, 5 (2014)
6. Naghibi, T., Hoffmann, S., Pfister, B.: A semidefinite programming based search strategy for feature selection with mutual information measure. IEEE Trans. Pattern Anal. Mach. Intell. 37(8), 1529–1541 (2014)
7. Mucke, S., Heese, R., Muller, S., Wolter, M., Piatkowski, N.: Quantum feature selection. arXiv preprint: arXiv:2203.13261 (2022)
8. Storn, R., Price, K.: Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11(4), 341 (1997)
9. Farhi, E., Goldstone, J., Gutmann, S.: A quantum approximate optimization algorithm. arXiv preprint: arXiv:1411.4028 (2014)
10. Lasserre, J.B.: A max-cut formulation of 0/1 programs. Oper. Res. Lett. 44(2), 158–164 (2016)
11. Bittel, L., Kliesch, M.: Training variational quantum algorithms is NP-hard. Phys. Rev. Lett. 127(12), 120502 (2021)
12. Moler, C., Van Loan, C.: Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Rev. 45(1), 3–49 (2003)
13. Rohra, J.G., Perumal, B., Narayanan, S.J., Thakur, P., Bhatt, R.B.: User localization in an indoor environment using fuzzy hybrid of particle swarm optimization & gravitational search algorithm with neural networks. In: Deep, K., et al. (eds.) Proceedings of Sixth International Conference on Soft Computing for Problem Solving. AISC, vol. 546, pp. 286–295. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-3322-3_27
14. Cinar, I., Koklu, M., Tasdemir, S.: Classification of raisin grains using machine vision and artificial intelligence methods. Gazi Muhendislik Bilimleri Dergisi 6(3), 200–209 (2020)
15. Cinar, I., Koklu, M.: Classification of rice varieties using artificial intelligence methods. Int. J. Intell. Syst. Appl. Eng. 7(3), 188–194 (2019)
16. Fokoue, E.: UCI machine learning repository (2020). [WebLink]
17. Arzamasov, V., Bohm, K., Jochem, P.: Towards concise models of grid stability. In: 2018 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), pp. 1–6. IEEE (2018)
18. Lyon, R.J., Stappers, B., Cooper, S., Brooke, J.M., Knowles, J.D.: Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach. Mon. Not. R. Astron. Soc. 459(1), 1104–1123 (2016)
19. Candanedo, L.M., Feldheim, V.: Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Energy Build. 112, 28–39 (2016)
20. Charytanowicz, M., Niewczas, J., Kulczycki, P., Kowalski, P.A., Lukasik, S., Zak, S.: Complete gradient clustering algorithm for features analysis of x-ray images. In: Pietka, E., Kawa, J. (eds.) Information Technologies in Biomedicine. Advances in Intelligent and Soft Computing, vol. 69, pp. 15–24. Springer, Berlin (2010). https://doi.org/10.1007/978-3-642-13105-9_2
21. Singh, A.P., Jain, V., Chaudhari, S., Kraemer, F.A., Werner, S., Garg, V.: Machine learning-based occupancy estimation using multivariate sensor nodes. In: 2018 IEEE Globecom Workshops (GC Wkshps), pp. 1–6. IEEE (2018)
22. Lazzaretti, A.E., et al.: A monitoring system for online fault detection and classification in photovoltaic plants. Sensors 20(17), 4688 (2020)
23. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
24. Spall, J.C.: Overview of the simultaneous perturbation method for efficient optimization. Johns Hopkins APL Tech. Digest 19(4), 482–492 (1998)
25. Sartakhti, J.S., Zangooei, M.H., Mozafari, K.: Hepatitis disease diagnosis using a novel hybrid method based on support vector machine and simulated annealing (SVM-SA). Comput. Methods Program. Biomed. 108(2), 570–579 (2012)
26. Ferrari Dacrema, M., Moroni, F., Nembrini, R., Ferro, N., Faggioli, G., Cremonesi, P.: Towards feature selection for ranking and classification exploiting quantum annealers. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2814–2824 (2022)
A Methodology of Extraction DC Model for a 65 nm Floating-Gate Transistor

Thinh Dang Cong1,2 and Trang Hoang1,2(B)

1 Department of Electronics Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City 72506, Vietnam
[email protected]
2 Vietnam National University, Ho Chi Minh City 71308, Vietnam
Abstract. Floating-gate Metal-Oxide Semiconductor (MOS) transistors have been investigated and applied in many fields such as artificial intelligence, analog mixed-signal, neural networks, and memory. This study proposes a methodology for extracting a DC model for a 65 nm floating-gate MOS transistor. The method combines a MOS transistor, a capacitance, and a voltage-controlled voltage source, and can achieve high accuracy. Moreover, an advantage of the method is that the MOS transistor is a complete model, which enhances the flexibility and accuracy between a fabricated device and the modeled architecture. In our work, the industrial-standard Berkeley Short-channel IGFET Model (BSIM) 3v3.1, level 49 was deployed, and the DC simulation was performed with the LTspice tool.

Keywords: Floating-gate transistor · modeling extraction · tunneling effects · CMOS technology · gate leakage current
1 Introduction

In recent decades, the floating-gate transistor has been studied and developed rapidly owing to its attractive characteristics. One of the most important is non-volatility: the device retains data even without a power supply. Many designs and applications adopt the transistor, such as neural networks, analog designs, and memories [1–4]. Nowadays, the use of floating-gate transistors in deep neural networks has emerged as a promising application as a result of the explosion of artificial intelligence [5, 6]. Therefore, proposing a precise model of the floating-gate transistor has become an essential problem. On the other hand, looking at the development of the semiconductor industry, because the floating-gate transistor is based on Complementary Metal-Oxide Semiconductor (CMOS) technology, the transistor follows the scaling-down trend [7]. As CMOS scales down to a few nanometers, the feature sizes of the transistor shrink dramatically. Hence, physical phenomena and leakage currents become dominant, making it difficult to model the transistor with high accuracy.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 202–213, 2023. https://doi.org/10.1007/978-3-031-46573-4_19
Much research has been published in this field to address the problems mentioned above. Regarding the leakage current problem, in 2015 the works [8, 9] provided a method that models the gate leakage current (GLC) in sub-100 nm technology. The model consists of a voltage-dependent current source, where the channel current is the sum of the gate-source and gate-drain currents. In addition, a current-mirror circuit was designed to validate the model. The results are promising when the leakage current is considered. However, the model targets a 90 nm process and may not be appropriate for more advanced ones. Moreover, in 2019 the authors of [10] proposed a method to extract the leakage current factors, including Fowler-Nordheim tunneling and Poole-Frenkel emission. However, the paper modeled with Verilog-A only and did not use commercial formats. In 2020, the study in [11] applied a floating-gate transistor in neuron circuits with promising performance. Although the transistor was designed and its model extracted successfully, the study did not describe the procedure for modeling the transistor, which is important for further improvements. While many works have focused on enhancing leakage current or transient models, little attention has been paid to DC modeling. Regarding the DC model of the floating-gate transistor, most published studies use the capacitive coupling methodology, where the floating-gate potential is a function of the control gate, source, drain, and bulk voltages. There are three main categories of the method: the dummy cell, the floating-gate cell, and a newer approach that avoids coupling coefficients. According to the literature review in [12], the third approach, presented in [13], is the most accurate and simple one. The authors in [13] gave for the first time a new compact model that performs without using a constant capacitive coupling coefficient.
That model uses a MOS transistor along with the floating gate-control gate capacitance. The validation was performed on EEPROM and Flash memory and is appropriate for 0.35 µm and 0.25 µm technologies. However, the study did not give a detailed modeling procedure, and the MOS transistor came from an industrial library, which is not flexible enough to adapt to different fabricated floating-gate transistors. In other words, the method requires a fabricated floating-gate transistor that follows the device structure of the industrial MOS transistor. Hence, in this paper, a methodology is suggested to extract the DC model of the floating-gate transistor for 65 nm CMOS technology. The method deploys the combination of a MOS transistor, a capacitance, and a voltage-controlled voltage source. While the MOS transistor was fully extracted following the industrial standard BSIM3v3.1, level 49, the capacitance and voltage-controlled voltage source were determined based on the virtual fabrication results in TCAD tools. The resulting simulations after modeling include drain current versus control gate voltage under three conditions (initial condition, varying VSB, and varying VD) and drain current versus drain-source voltage when VCG varies. The extracted model can achieve high accuracy. Furthermore, since the MOS transistor is a complete model, the method provides flexibility: the parameters can be adjusted to obtain the target results.
This article is organized as follows: Sect. 2 describes the basic concepts of the floating-gate transistor including the device structure and the DC operation. Next, the methodology of the extraction model is presented in Sect. 3. Following that, Sect. 4 gives the results and discussion of the DC model of the virtual fabricated device after extracting. Finally, the concluding remarks are shown in Sect. 5.
2 Floating-Gate Transistor Concepts

2.1 Device Structure

A floating-gate transistor is shown in Fig. 1. The transistor has two gates: the control gate and the floating gate. The purpose of the floating gate is to store charge during operation, or even when the power supply is removed. The charge can be retained in the floating gate for many years, depending on the quality of the floating-gate transistor and the application. Besides, a layer of Inter-Poly Dielectric (IPD) is fabricated between the control gate and the floating gate. It is not designed for charge transport but must prevent charge leakage under expected conditions. Moreover, the layer provides an important parameter of the transistor, the capacitive coupling ratio, which lowers the required supply voltages. The material of the IPD layer is known as Oxide-Nitride-Oxide (ONO).
Fig. 1. Floating-gate transistor structure [13]
One of the most important layers of the transistor is the one placed under the floating gate: the tunnel oxide layer. This layer not only determines the speed of the device during writing and erasing but also protects the device against charge leakage from the floating gate. Compared to the IPD, the tunnel oxide is thinner for the purpose of high speed. Basically, SiO2 is used for that layer. Nowadays, many studies have investigated new materials for the tunnel oxide layer, such as Al2O3, ZrO2, and HfO2 [14, 15].

2.2 DC Operation

As mentioned in the above section, many studies have modeled the write and erase operations. In contrast, very few studies have worked on the DC model. In this section, the two main approaches to DC modeling, the classical floating-gate-voltage calculation method and the charge balance model, are presented.
A Methodology of Extraction DC Model for a 65 nm Floating-Gate Transistor
205
First, for the classical floating-gate-voltage calculation method, based on the schematic in Fig. 1, the floating-gate voltage is derived from the equation below [16]:

$$Q_{FG} = 0 = (V_{FG} - V_{CG})C_{CG} + (V_{FG} - V_D)C_D + (V_{FG} - V_S)C_S + (V_{FG} - V_B)C_B \qquad (1)$$

where the total charge is assumed to be zero; $V_{FG}$, $V_{CG}$, $V_D$, $V_S$, $V_B$ are the voltages at the floating gate, control gate, drain, source, and substrate, respectively; and $C_{CG}$, $C_D$, $C_S$, $C_B$ are the capacitances of the floating gate-control gate, floating gate-drain, floating gate-source, and floating gate-bulk, respectively. After several calculations, the floating-gate voltage and the drain current in the triode and saturation regions are determined as follows:

$$V_{FG} = \alpha_{CG}(V_{CG} + f\,V_D) \qquad (2)$$

In the triode condition:

$$V_{DS} < \alpha_{CG}\,(V_{CG} + f\,V_{DS} - V_T^{CG}) \qquad (3)$$

$$I_{DS} = \beta\,\alpha_{CG}\Big[(V_{CG} + f\,V_{DS} - V_T^{CG})\,V_{DS} - \frac{1}{2\,\alpha_{CG}}V_{DS}^2\Big] \qquad (4)$$

In the saturation condition:

$$V_{DS} \ge \alpha_{CG}\,(V_{CG} + f\,V_{DS} - V_T^{CG}) \qquad (5)$$

$$I_{DS} = \frac{\beta}{2}\,\alpha_{CG}^2\,(V_{CG} + f\,V_{DS} - V_T^{CG})^2 \qquad (6)$$

where the total capacitance is $C_T = C_{CG} + C_D + C_S + C_B$, with $\alpha_i = C_i / C_T$ the coupling coefficient relative to electrode $i$, where $i$ can be one among CG, D, S, and B. The parameter $f$ is defined as the ratio between the capacitances $C_D$ and $C_{CG}$. $V_T^{CG}$ is the threshold voltage of the device. However, the accuracy of this method is not high, owing to a limitation of the device structure: the floating gate cannot be accessed because of the isolating materials. Hence, the coupling coefficients, which depend on the bias conditions, must account for this bias dependence in order to achieve higher accuracy [17]. On the other hand, the charge balance model shows better accuracy by using a MOS transistor and a voltage-controlled voltage source to model the floating-gate transistor. The schematic for that model is given in Fig. 2 [13]. The charge model can be illustrated by the equation relating the charge on the MOS gate, the charge on the bottom plate of $C_{CG}$, and the charge added to or removed from the floating gate during transient operations:

$$Q_G(V_{FG}, V_S, V_B, V_D) + C_{CG}(V_{FG} - V_{CG}) = Q_{FG} \qquad (7)$$

The floating-gate potential is calculated without a constant capacitive coefficient, providing more accuracy and a dramatic enhancement compared to the first approach.
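Equations (2)-(6) can be collected into a single drain-current routine. A minimal sketch (illustrative parameter values, not extracted ones; it also checks that the two branches meet continuously at the triode/saturation boundary):

```python
def fg_drain_current(VCG, VDS, alpha_CG, f, VT, beta):
    """Drain current from Eqs. (2)-(6): the classical coupling-coefficient
    model of the floating-gate MOSFET (symbol names follow the text)."""
    VOV = VCG + f * VDS - VT          # overdrive seen from the control gate
    if VOV <= 0:
        return 0.0                    # cut-off (not covered by Eqs. (3)-(6))
    if VDS < alpha_CG * VOV:          # triode region, Eqs. (3)-(4)
        return beta * alpha_CG * (VOV * VDS - VDS**2 / (2 * alpha_CG))
    return 0.5 * beta * alpha_CG**2 * VOV**2  # saturation, Eqs. (5)-(6)

# the two branches meet continuously at the boundary V_DS = alpha_CG * VOV
I_tri = fg_drain_current(2.0, 0.957, 0.6, 0.1, 0.5, 1e-4)  # just in triode
I_sat = fg_drain_current(2.0, 0.958, 0.6, 0.1, 0.5, 1e-4)  # just in saturation
assert abs(I_tri - I_sat) / I_sat < 0.01
```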
Fig. 2. DC model of the device in charge balance model
3 Methodology in Model Extraction

In this section, we propose a method to extract the model of the virtual fabricated floating-gate transistor for the CMOS 65 nm process. The modeled structure consists of a capacitance C1, a voltage-controlled voltage source E1, and a MOS transistor. The schematic of the model is given in Fig. 3.
Fig. 3. The proposed model of the floating-gate transistor
The capacitance C1 represents the capacitance between the control gate and the floating gate; the dielectric material is ONO, as mentioned in the previous sections. Besides that, the voltage-controlled voltage source E1 has a value of 2.5E+16, calculated from the potential relation between the control and floating gates. In order to determine the E1 value, the virtual fabrication results in the TCAD tool have to be taken into account. Next, the MOS transistor was extracted based on the virtual fabrication processes and parameters. Moreover, the industrial BSIM3v3.1, level 49 was studied to achieve high model accuracy. The fabricated parameters are given in Table 1 below, where the essential parameters of the process were used for the model extraction.
Table 1. Input configuration of extraction from a virtual fabricated simulation

| Parameter | Value | Unit |
|---|---|---|
| L | 0.065 | nm |
| W | 1 | µm |
| Area of drain and source | 0.005425 | µm² |
| Squares between contact and channel | 0.3594 | µm |
| Squares between LDD/N+ and channel | 0.0388 | µm |
| Aluminum gate | 1 | |
| N+ poly gate | 1 | |
| P+ poly gate | 0 | |
| N well | 0 | |
| P well | 1 | |
| Vt adjust dose | 0 | cm−2 |
| Gate oxide thickness | 90 | Å |
| NSS | 3E+11 | cm−2 |
| Starting wafer resistivity | 10 | ohm-cm |
| Well dose | 8E+12 | cm−2 |
| Well drive time | 1330 | min |
| Well drive temperature | 1100 | C |
| LDD source and drain dose | 2.45E+12 | cm−2 |
| LDD source and drain drive time | 175 | min |
| LDD source and drain drive temperature | 1000 | C |
| Field oxide thickness | 6000 | Å |
| Minority carrier lifetime in the well | 1 | µs |
| Drain and source dose | 2.45E+12 | cm−2 |
| Drain and source siliside | 0 | |
Regarding the BSIM3v3.1, level 49 model, 21 parameters were extracted in this work. While following the formulas of BSIM3v3.1, level 49, minor adjustments were made in order to model the behavior of the floating-gate transistor. Moreover, the threshold voltage parameter had to be manually modified to reproduce the characteristics of the device [13]. The extracted model with its parameter values is shown in Table 2. The model was imported into the LTspice tool for validation and evaluation. In the next section, the simulation results of the model and the comparison to the virtual fabrication (TCAD) simulation results are presented.
Table 2. Extracted parameters for BSIM3v3.1, level 49

| Parameter | Value | Unit |
|---|---|---|
| LEVEL | 49 | |
| VERSION | 3.1 | |
| MOBMOD | 1 | |
| CAPMOD | 2 | |
| TOX | 9.00E−09 | m |
| XJ | 1.84E−07 | m |
| NCH | 4.23E+16 | cm−3 |
| NSUB | 4E18 | cm−3 |
| XT | 2.39E−07 | m |
| VTH0 | 0.2E−001 | V |
| U0 | 12.32 | cm2/V-s |
| WINT | 1.8E−06 | m |
| LINT | 1.84E−07 | m |
| PCLM | 5.00 | |
| NGATE | 5.00E+20 | m−3 |
| RSH | 1082.55 | ohm/sq |
| JS | 1.35E+11 | A/m2 |
| JSW | 1.35E+11 | A/m |
| CJ | 3.97E−04 | F/m2 |
| MJ | 0.5 | |
| PB | 0.92 | V |
| *CJSW | 3.40E−10 | F/m |
| *MJSW | 3.40E−10 | F/m |
| *PBSW | 5.70E−10 | F/m |
4 Result

This section presents the DC simulation results of the floating-gate transistor with the extracted model, together with the results of the simulated fabrication structure. The simulations include drain current versus control gate voltage at the initial condition, when VSB varies, and when VD varies, as well as drain current versus drain-source voltage when VCG varies. While the fabricated structure uses the TCAD tool for simulations, the extracted model of the floating-gate transistor uses the LTspice tool. The testbench circuits and results are shown in the following paragraphs.
4.1 Drain Current Versus Control Gate Voltage at Initial Condition

First, the drain current versus control gate voltage was simulated with the model extracted in Sect. 3. While the drain voltage was set to 1 V, the source and substrate were forced to zero. The control gate voltage was swept from 0 V to 3 V in order to study the characteristics of the transistor.
Fig. 4. Testbench of drain current versus control gate voltage with initial condition
Fig. 5. Drain current versus control gate voltage with initial condition
Looking at the results, the curves from the TCAD simulation and LTspice are almost identical. The threshold voltage, one of the critical parameters of the floating-gate transistor, is around 0.2 V in both simulations. The testbench circuit and results are presented in Fig. 4 and Fig. 5.
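For reference, a threshold voltage like the ~0.2 V read off Fig. 5 can be extracted from a simulated I_D-V_CG sweep with the constant-current convention (one common lab convention, not necessarily the one used in this paper; the sweep below is synthetic square-law data, not the actual TCAD output):

```python
import numpy as np

def vth_constant_current(vg, idrain, w_over_l, i_ref=1e-7):
    """Constant-current threshold extraction: V_T is the gate voltage
    where I_D crosses i_ref * (W/L). Assumes I_D is non-decreasing in V_G."""
    target = i_ref * w_over_l
    return float(np.interp(target, idrain, vg))

# synthetic square-law sweep with V_T = 0.2 V
vg = np.linspace(0.0, 3.0, 301)
k = 1e-4
idrain = np.where(vg > 0.2, 0.5 * k * (vg - 0.2) ** 2, 0.0)
vth = vth_constant_current(vg, idrain, w_over_l=1 / 0.065)
assert 0.3 < vth < 0.45  # slightly above the true 0.2 V, as expected
```

The constant-current crossing sits a little above the square-law onset, which is the usual bias of this convention.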
4.2 Drain Current Versus Control Gate Voltage When VSB Varies

To study the impact of the body effect, this work shows the drain current versus the control gate voltage as VSB changes. VSB varies from 0 V to 0.45 V with a step of 0.15 V. The testbench circuit and results are given in Fig. 6 and Fig. 7.

Fig. 6. Testbench of drain current versus control gate voltage when VSB varies
Fig. 7. Drain current versus control gate voltage when VSB varies
Figure 7 illustrates that the threshold voltage of the transistor increases when VSB increases. The behavior matches the results from the TCAD tool and that of a traditional Metal-Oxide-Semiconductor Field-Effect Transistor (MOSFET). Although the threshold voltages change only slightly, the currents of the LTspice simulations are higher than those of the TCAD simulation when VSB increases from 0.15 V to 0.45 V.
4.3 Drain Current Versus Control Gate Voltage When VD Varies

Figure 8 and Fig. 9 show the schematic used to study the VD effects and the simulated results. The drain currents decrease when VD is reduced from 1.4 V to 0.8 V with a step of 0.2 V.
Fig. 8. Testbench of drain current versus control gate voltage when VD varies
Fig. 9. Drain current versus control gate voltage when VD varies
It is clear that when VD varies from 1 V to 1.2 V, the drain currents are not significantly different between the two simulation tools. However, the gaps become bigger when reducing the voltage to 0.8 V or increasing it to 1.4 V, especially beyond a control gate voltage of 1 V.

4.4 Drain Current Versus Drain Voltage When VCG Varies

Finally, the testbench used to investigate the relation between the drain current and drain voltage, and the simulation results when VCG varies, are given in Fig. 10 and 11. A small step of VCG from 6 V down to 5.25 V results in a decrease in current. However, the results of the TCAD and LTspice simulations do not match completely. Whereas
the currents of the former are larger than those of the latter when the drain voltage is less than 0.5 V, the trend is inverted when the drain voltage is greater than 0.5 V.
Fig. 10. Testbench of drain current versus drain voltage when VCG varies
Fig. 11. Drain current versus drain voltage when VCG varies
In short, the results of the extracted model show high accuracy at the default condition, which is the main condition of the virtual fabricated floating-gate transistor. Whereas BSIM3v3, level 49 is flexible and not complicated for modeling the transistor, its accuracy for advanced technologies should be considered. Moreover, ignoring the zero-bias capacitance between the gate of the MOS transistor and the source, drain, and substrate per meter of gate width leads to a practical mismatch.
5 Conclusion
This paper proposed a methodology for extracting a DC model of a 65 nm floating-gate transistor. The model consists of a MOS transistor, a capacitance, and a voltage-controlled voltage source. While the MOS transistor was completely modeled
by deploying the BSIM3v3.1 level 49 model, the capacitance and voltage-controlled voltage source values were proposed based on the virtual fabrication results. Regarding the results, the DC simulation was performed with high accuracy using the LTspice tool, covering the drain current versus control gate voltage at the initial condition, when VSB varies, and when VD varies, as well as the drain current versus drain voltage when VCG varies.
Acknowledgment. We acknowledge the support of time and facilities from Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for this study.
imMeta: An Incremental Sub-graph Merging for Feature Extraction in Metagenomic Binning Hong Thanh Pham3 , Van Hoai Tran1,2(B) , and Van Vinh Le4 1 Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam [email protected] 2 Vietnam National University Ho Chi Minh City, Linh Trung Ward, Thu Duc City, Ho Chi Minh City, Vietnam 3 Faculty of Information Technology, Hoa Sen University, 8 Nguyen Van Trang Street, District 1, Ho Chi Minh City, Vietnam [email protected] 4 Faculty of Information Technology, Ho Chi Minh City University of Technology and Education, 1 Vo Van Ngan Street, Linh Chieu Ward, Thu Duc City, Ho Chi Minh City, Vietnam [email protected]
Abstract. Metagenomic binning is a crucial step in understanding microbial communities without culturing. Many unsupervised binning methods follow the two-phase paradigm. In the first phase, specific features of metagenomic sequences, also known as reads, are extracted without relying on reference databases. The second phase involves applying clustering methodologies to group the reads into likely similar species, which are further studied in subsequent metagenomic steps, such as assembly and annotation. Specific well-studied methods refrain from building features for individual reads to improve computation performance and reduce input sensitivity. Instead, these methods create overlapping graphs that illustrate the closeness of reads based on their k-mer frequency distribution. Read nodes with high connectivity are then merged into sub-graphs, generating a feature for each sub-graph. This study introduces a novel unsupervised algorithm that incrementally merges sub-graphs into larger sub-graphs, expecting to obtain variable-sized groups. This approach differs from the fixed-size-based methods proposed previously. Empirical results demonstrate that the proposed approach achieves higher accuracy than other well-known short-read methods.
Keywords: Metagenomic binning · incremental graph merging · weighted clustering
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 214–223, 2023. https://doi.org/10.1007/978-3-031-46573-4_20
imMeta: An Incremental Sub-graph Merging in Metagenomics
215
1 Introduction
Metagenomics is a field of study that has revolutionized our understanding of microbial communities and their functional potential. With the development of high-throughput sequencing technologies, environmental samples are sequenced directly without the need for cultivation and isolation in laboratories. However, due to the mixture of genetic material from multiple species within a single sample, the analysis of metagenomic data poses significant computational challenges for research communities.
Metagenomic binning is the task of classifying sequences that represent individual microbial organisms. The partitioning results help biologists understand the composition of microorganisms in the environment. Moreover, researchers can use the results to perform further analysis steps in metagenomic projects such as DNA annotation and assembly [9].
Binning approaches can be divided into supervised and unsupervised methods. Supervised methods classify sequences based on the guidance of reference information from available databases. MEGAN [1,4] is one of the earliest binning algorithms, which utilizes homology information between analyzed and reference sequences for the classification. The method can group metagenomes effectively, but it requires much computation time due to the homology search using the BLAST [17] or DIAMOND [2] algorithm. Recently proposed approaches such as Kraken [16] and Ganon [6] deal with the problem by extracting and comparing only k-mers from the sequences instead of whole sequences. Some other supervised methods, such as DeepMicrobes [5] and NBC [8], use genomic signatures (e.g. k-mer frequency) as classification features. NBC applies the Naïve Bayes method to identify and separate short reads, while DeepMicrobes utilizes the strength of a deep learning strategy to achieve better classification quality. However, reference databases are often incomplete or unavailable, which may reduce the quality performance of supervised methods.
To overcome this challenge, several unsupervised approaches have been proposed in the literature. MetaBinner [13] and LRBinner [14] concentrate on extracting clusters from data of contigs or long sequences using multiple types of features. Focusing on analyzing short reads, MetaCluster 5.0 [12] is a two-round binning approach which aims to cluster sequences using a genomic signature of k-mer frequency. In addition, BiMeta [11] and MetaProb [3] are two-phase algorithms which group sequences into fixed-size groups based on overlapping information in their first phases. These approaches then apply different strategies to merge the groups into clusters using composition features.
Some unsupervised binning methods utilize the strength of a graph-based strategy in their clustering process. In its first step, TOSS [10] builds a graph of unique k-mers extracted from sequences and relies on this graph to group sequences using a graph partitioning algorithm. The second step of the approach applies MCL, a graph-based method, to continue merging the sequence groups. OBLR [15] and METAMVGL [18] also apply graph-based algorithms to classify metagenomic sequences. While OBLR generates a graph based only
216
H. T. Pham et al.
on overlapping information between reads, METAMVGL uses information on sequence overlap and the paired-end feature of reads.
This work proposes an unsupervised binning method called imMeta, which applies an incremental sub-graph merging approach for feature extraction in metagenomic binning. The approach builds a weighted overlapping graph of sequences and incrementally merges sub-graphs into larger sub-graphs, expecting to obtain variable-sized groups. The following section presents the details of the proposed method. Experimental results show the strength of the proposed method on metagenomic datasets of short sequences. Conclusions are stated in the final section.
2 Methods
2.1 Fundamentals and Notations
A metagenomic dataset R is a set of metagenomic reads, each with a unique identifier RID. An l-mer is a short sequence of length l. With the nucleotide alphabet {A, C, G, T}, there are 4^l possible l-mers of length l, denoted by Q. Besides, let Q_r be the collection of l-mers of a read r. A read of length n consists of n − l + 1 l-mers in total. For example, if r = TCTAGA and l = 3, then Q_r = {TCT, CTA, TAG, AGA}. Let N be the set of natural numbers. freq(r) ∈ N^|Q| is defined as the frequency vector of l-mers in read r.
The overlapping level of two reads r_i and r_j is the number of l-mers they share. A read r_i overlaps with another read r_j, denoted r_i ∼_m r_j, if they share at least m l-mers, where m is a threshold parameter:

r_i ∼_m r_j ≡ |Q_{r_i} ∩ Q_{r_j}| ≥ m

This work uses a weighted overlapping graph G_m(R, E_m) to represent the overlaps between DNA sequences. Given the vertex set R, E_m is the edge list containing pairs of reads sharing at least m l-mers:

E_m ⊂ R × R, E_m = {(r_i, r_j) | r_i, r_j ∈ R ∧ r_i ≠ r_j ∧ r_i ∼_m r_j},

in which (r_i, r_j) ∈ E_m ≡ r_i ∼_m r_j. In addition, the vertex and edge lists of a graph G are denoted by V(G) and E(G). G(P) denotes the subgraph of G induced by a vertex set P ⊆ V(G). Two parameters α_min and α_max set the lower and upper bounds on the number of vertices in each graph.

Proposition 1. Let G_m(R, E_m) and G_n(R, E_n) be two overlapping graphs of a set of metagenomic reads R, where m and n are overlap threshold parameters. G_m is a subgraph of G_n whenever m is greater than or equal to n:

G_m ⊆ G_n when m ≥ n

Proof. G_m(R, E_m) and G_n(R, E_n) are two graphs with the same vertex set R, so every vertex in G_m is also a vertex in G_n, and vice versa. The edge sets E_m and E_n may differ but must consist of pairs of vertices from R.
It can be deduced from the definition of r_i ∼_m r_j that if two reads overlap by at least m l-mers, they also overlap by at least n l-mers for any n ≤ m. Hence, E_m is a subset of E_n:

r_i ∼_m r_j ⇒ |Q_{r_i} ∩ Q_{r_j}| ≥ m ≥ n ⇒ r_i ∼_n r_j, and therefore E_m ⊆ E_n.
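The l-mer set Q_r and the overlap relation ∼_m defined above can be made concrete with a short Python sketch (illustrative only, reusing the example read from Sect. 2.1):

```python
def lmer_set(read, l=3):
    """Q_r: the set of l-mers occurring in a read."""
    return {read[i:i + l] for i in range(len(read) - l + 1)}

def overlaps(r1, r2, m, l=3):
    """r1 ~_m r2: true when the two reads share at least m l-mers."""
    return len(lmer_set(r1, l) & lmer_set(r2, l)) >= m

# The example from the text: r = TCTAGA, l = 3.
q = lmer_set("TCTAGA")  # {"TCT", "CTA", "TAG", "AGA"}
```

For r = TCTAGA this reproduces the four 3-mers listed in the text, and `overlaps` directly mirrors the condition |Q_{r_i} ∩ Q_{r_j}| ≥ m.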
2.2 Algorithms
The proposed algorithm classifies sequences by gradually combining smaller sub-graphs of related reads into larger ones to produce groups of varying sizes. This approach differs from previous methods based on fixed-size criteria for clustering. imMeta uses a parameter m to control the edge density of sub-graphs and iteratively merges them based on similarity scores. If G_m and G_n are two graphs with different edge densities, a connected component of G_m belongs entirely to a connected component of G_n if m ≥ n. It means that the partitions of the nodes are merged in a tree-like structure as m decreases.
The binning process begins with a weighted overlapping graph constructed from the set of reads in the dataset. The input parameters specify the upper and lower limits of the number of shared reads, the shared reads reduction rate, and the maximum component size. The output is a list of merged groups of related reads. The imMeta method consists of three phases: grouping reads, incremental sub-graph merging, and group clustering.
Phase 1: Grouping Reads. The preliminary step groups the reads by their overlap level, measured by the number of l-mers they have in common. The more l-mers two reads share, the more likely they are from the same species.
Phase 2: Incremental Sub-graph Merging. Phase 2 of imMeta, presented in Algorithm 1, iterates over the values of m representing the overlap level. Each iteration determines all the connected components of G_m using the updated value of m (Line 4). For each connected component c of G_m, the algorithm identifies the set of partitions in P contained in c, denoted by P(G_m^c) (Line 6). The algorithm proceeds differently depending on the state of the partition set P(G_m^c) after it is computed. The algorithm moves on if the partition set has one element (Line 7) or the graph G_m(P(G_m^c)) is disconnected (Line 9).
Note that G_m(P(G_m^c)) may differ from G_m because some partitions may have been fixed earlier, once they reached the maximum component size α_max. The algorithm skips the merging step when the partition set P(G_m^c) reaches the maximum component size α_max (Line 11) and marks its partitions as fixed. Otherwise, the method substitutes the old partitions P(G_m^c) with the merged partition {∪_{P_j ∈ P(G_m^c)} P_j} (Lines 15–16).
Algorithm 1: Incremental Sub-Graph Merging for Feature Extraction

Input: Graph G(R, E), maximum shared reads MAX_SHARED_READS, minimum shared reads MIN_SHARED_READS, shared reads reduction rate β, maximum component size αmax
Output: List of sub-graphs P

1   m ← MAX_SHARED_READS
2   P ← {{r}, ∀r ∈ R}
3   repeat
4       G_m = ∪ G_m^c
5       foreach c ∈ G_m do
6           P(G_m^c) = {P_j | P_j ∈ P ∧ P_j ⊆ V(G_m^c) ∧ ¬fixed(P_j)}
7           if |P(G_m^c)| = 1 then
8               Do nothing
9           else if G_m(P(G_m^c)) is disconnected then
10              Do nothing
11          else if |P(G_m^c)| ≥ αmax then
12              foreach P_i ∈ P(G_m^c) do
13                  fixed(P_i) = TRUE
14          else
15              P ← P \ P(G_m^c)
16              P ← P ∪ {∪_{P_j ∈ P(G_m^c)} P_j}
17      m ← m − β
18  until m < MIN_SHARED_READS
Phase 3: Group Clustering. In the final phase of imMeta, the merged groups are clustered using a distance-based clustering algorithm known as weighted k-means. This algorithm considers the distances between the centroids of groups to determine an optimal clustering.
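Phase 3 can be sketched as a weighted variant of Lloyd's k-means, where each point is a group centroid and its weight is the group size. This is an assumed formulation for illustration; the paper does not spell out the exact variant it uses.

```python
def weighted_kmeans(points, weights, k, iters=20):
    """Lloyd-style k-means where each point (a group centroid) carries a
    weight (e.g. the number of reads in the group)."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = [list(p) for p in points[:k]]    # naive deterministic init
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: d2(p, centers[c]))
            buckets[j].append((p, w))
        for j, members in enumerate(buckets):
            if not members:
                continue                       # keep an empty center as-is
            total = sum(w for _, w in members)
            centers[j] = [sum(w * p[d] for p, w in members) / total
                          for d in range(len(points[0]))]
    return [min(range(k), key=lambda c: d2(p, centers[c])) for p in points]
```

Weighting the centroid update by group size keeps large merged groups from being dragged toward small outlier groups.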
3 Experimental Results
The proposed approach is compared with three available binning algorithms: MetaCluster 5.0 [12], BiMeta [11], and MetaProb [3]. The parameters of imMeta are set as follows: maximum component size αmax = 5000, minimum shared reads MIN_SHARED_READS = 5, maximum shared reads MAX_SHARED_READS = 80, and shared reads reduction rate β = 5. The three remaining binning algorithms use their default parameters. The experiments were performed on a computer system with 64 GB RAM and an Intel(R) Xeon(R) E5-4627 v4 CPU running the CentOS Linux operating system.
3.1 Dataset
Nine short-read datasets named S1 to S9, presented in Table 1, are used in this experiment. The MetaSim [7] tool generates the datasets from real bacterial genomes downloaded from the NCBI (National Center for Biotechnology Information) database.

Table 1. Paired-end short-read datasets

Dataset  No. of reads  No. of species  Abundance ratio
S1       96,367        2               1:1
S2       195,339       2               1:1
S3       338,725       2               1:1
S4       375,302       2               1:1
S5       325,400       3               1:1:1
S6       713,388       3               3:2:1
S7       1,653,550     5               1:1:1:4:4
S8       456,224       5               3:5:7:9:11
S9       2,234,168     15              1:1:1:1:1:2:2:2:2:2:3:3:3:3:3

3.2 Performance Metrics
This work uses three performance metrics, precision, recall, and F-measure, to assess the quality of binning algorithms. Let n be the number of species in a metagenomic dataset and C be the number of clusters a binning algorithm returns. Let A_ij be the number of reads from species j assigned to cluster i. The metrics are calculated as follows:

precision = (Σ_{i=1}^{C} max_j A_ij) / (Σ_{i=1}^{C} Σ_{j=1}^{n} A_ij)

recall = (Σ_{j=1}^{n} max_i A_ij) / (Σ_{i=1}^{C} Σ_{j=1}^{n} A_ij + #unassigned reads)

F-measure consolidates precision and recall into one single number:

F-measure = 2 · (precision · recall) / (precision + recall)
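These formulas translate directly into code; here A is the cluster-by-species assignment matrix, with A[i][j] the number of reads of species j placed in cluster i (an illustrative sketch, not the paper's evaluation script):

```python
def binning_metrics(A, unassigned=0):
    """Precision, recall, and F-measure from the assignment matrix A."""
    assigned = sum(sum(row) for row in A)
    # Each cluster is credited with its dominant species.
    precision = sum(max(row) for row in A) / assigned
    n = len(A[0])
    # Each species is credited with the cluster that captured most of it;
    # reads never assigned to any cluster count against recall.
    recall = sum(max(row[j] for row in A) for j in range(n)) / (assigned + unassigned)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```

For instance, A = [[9, 1], [2, 8]] with no unassigned reads gives precision = recall = F-measure = 0.85.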
3.3 Results
Table 2 presents the results of MetaCluster 5.0, BiMeta, MetaProb, and imMeta on the short-read datasets. It can be seen in the table that imMeta outperforms the remaining approaches. In more detail, the proposed method achieves a higher F-measure than the three remaining approaches on 6 out of 9 samples. imMeta also returns an average F-measure that is higher by 3.5% to 33.2% compared with MetaCluster 5.0, BiMeta, and MetaProb. In the case of dataset S9, with a large number of species, imMeta demonstrates the ability to analyze complex metagenomic data better than the other approaches.
The proposed approach gets the lowest F-measure on sample S7 and the highest on sample S6. Considering the visualization of the clustering results on the two datasets in Fig. 1, items belonging to different genomes in dataset S6 are well separated, while reads of different genomes in dataset S7 lie very close together, which may explain why sample S7 is challenging to analyze.

Table 2. F-measure on short paired-end read datasets

Dataset  MetaCluster 5.0  BiMeta  MetaProb  imMeta
S1       0.672            0.978   0.991     0.990
S2       0.631            0.581   0.901     0.906
S3       0.415            0.978   0.928     0.986
S4       0.460            0.994   0.908     0.870
S5       0.643            0.690   0.832     0.918
S6       0.492            0.858   0.970     0.996
S7       0.652            0.843   0.782     0.752
S8       0.529            0.743   0.769     0.867
S9       0.639            0.791   0.719     0.833
Average  0.570            0.828   0.867     0.902

3.4 Parameter Evaluation
The Impact of the Maximum Component Size. This section considers the effect of the maximum component size αmax on the clustering results of imMeta. Figure 2 shows the F-measure values of the method on five short-read datasets (S1–S4, S8) with values of αmax from 500 to 8000. It can be seen in the line chart that imMeta achieves a lower F-measure when αmax is small (less than 3000). Most cases return better results with αmax between 4000 and 6000. The quality of the proposed algorithm decreases slightly with αmax greater than 6000.
Fig. 1. Illustration of clusters on dataset S6 and S7
Fig. 2. Results of imMeta with different values of maximum component size parameter
The Impact of the Shared Reads Reduction Rate. Dataset S9 is used to test the performance of imMeta under different values of β, the shared reads reduction rate. Throughout the tests, the maximum group size αmax is set to 5000, and the number of shared reads is gradually decreased from 80 to 5. As Table 3 shows, reducing the number of shared reads too fast can lower the quality of the results. As the β value decreases, the algorithm takes more iterations to converge, but the number of groups produced by Phase 2 also declines. There are fewer groups because the merged groups grow more slowly and with less variation: a parent graph containing oversized sub-graphs occurs less often, so a slowly growing group does not reach the fixed status early because of its size.
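The number of merging iterations implied by a given β follows directly from the threshold schedule, which decreases from MAX_SHARED_READS to MIN_SHARED_READS in steps of β. A tiny sketch (illustrative helper, not from the paper):

```python
def threshold_schedule(max_shared=80, min_shared=5, beta=5):
    """Overlap thresholds m visited by Phase 2, from high to low."""
    schedule = []
    m = max_shared
    while m >= min_shared:
        schedule.append(m)
        m -= beta
    return schedule
```

With the defaults used for the β = 5 row of Table 3, this gives 16 iterations (m = 80, 75, ..., 5), whereas β = 40 gives only two (m = 80, 40), which matches the pattern of coarser, lower-quality merging at large β.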
Table 3. The impact of the shared reads reduction rate on dataset S9

β   Total groups  Avg. size  Std Dev  Precision  Recall  F-measure
40  1,050,071     2.1        10.0     0.455      0.483   0.469
35  111,861       20.0       187.2    0.880      0.774   0.823
30  197,882       11.3       110.8    0.843      0.793   0.817
25  59,792        37.4       282.1    0.855      0.851   0.853
20  187,975       11.9       113.8    0.861      0.777   0.817
15  38,371        58.2       363.1    0.935      0.827   0.878
10  54,325        41.1       278.6    0.910      0.777   0.838
5   30,457        73.4       418.2    0.912      0.767   0.833

4 Conclusion
This study utilizes the strength of a graph-based strategy for the metagenomic binning problem. The proposed algorithm carefully considers the internal connectivity inside each connected component of the graph to make better partitioning decisions. Given the effectiveness of the proposed approach on the tested datasets, it can be used as a promising tool for analyzing complex metagenomic data. In future work, a possible strategy to improve the performance of imMeta is to separate small groups and large groups and process them differently. In addition, enhancing the proposed method for analyzing long sequences or contigs is also part of our future research.
Acknowledgement. We acknowledge the support of time and facilities from Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for this study.
References
1. Bağcı, C., Patz, S., Huson, D.H.: DIAMOND+MEGAN: fast and easy taxonomic and functional analysis of short and long microbiome sequences. Curr. Protoc. 1(3), e59 (2021)
2. Buchfink, B., Xie, C., Huson, D.H.: Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12(1), 59–60 (2015)
3. Girotto, S., Pizzi, C., Comin, M.: MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics 32(17), i567–i575 (2016)
4. Huson, D.H., Auch, A.F., Qi, J., Schuster, S.C.: MEGAN analysis of metagenomic data. Genome Res. 17(3), 377–386 (2007)
5. Liang, Q., Bible, P.W., Liu, Y., Zou, B., Wei, L.: DeepMicrobes: taxonomic classification for metagenomics with deep learning. NAR Genom. Bioinform. 2(1), lqaa009 (2020). https://doi.org/10.1093/nargab/lqaa009
6. Piro, V.C., Dadi, T.H., Seiler, E., Reinert, K., Renard, B.Y.: ganon: precise metagenomics classification against large and up-to-date sets of reference sequences. Bioinformatics 36(Supplement 1), i12–i20 (2020)
7. Richter, D.C., Ott, F., Auch, A.F., Schmid, R., Huson, D.H.: MetaSim - a sequencing simulator for genomics and metagenomics. PLoS ONE 3(10), e3373 (2008)
8. Rosen, G.L., Reichenberger, E.R., Rosenfeld, A.M.: NBC: the Naive Bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics 27(1), 127–129 (2011)
9. Roumpeka, D.D., Wallace, R.J., Escalettes, F., Fotheringham, I., Watson, M.: A review of bioinformatics tools for bio-prospecting from metagenomic sequence data. Front. Genet. 8, 23 (2017)
10. Tanaseichuk, O., Borneman, J., Jiang, T.: Separating metagenomic short reads into genomes via clustering. Algorithms Mol. Biol. 7(1), 1–15 (2012)
11. Vinh, L.V., Lang, T.V., Binh, L.T., Hoai, T.V.: A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads. Algorithms Mol. Biol. 10(1), 2 (2015)
12. Wang, Y., Leung, H.C., Yiu, S.M., Chin, F.Y.: MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics 28(18), i356–i362 (2012)
13. Wang, Z., Huang, P., You, R., Sun, F., Zhu, S.: MetaBinner: a high-performance and stand-alone ensemble binning method to recover individual genomes from complex microbial communities. Genome Biol. 24(1), 1 (2023)
14. Wickramarachchi, A., Lin, Y.: Binning long reads in metagenomics datasets using composition and coverage information. Algorithms Mol. Biol. 17(1), 14 (2022)
15. Wickramarachchi, A., Lin, Y.: Metagenomics binning of long reads using read-overlap graphs. In: Jin, L., Durand, D. (eds.) RECOMB-CG 2022. LNCS, vol. 13234, pp. 260–278. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-06220-9_15
16. Wood, D.E., Lu, J., Langmead, B.: Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 1–13 (2019)
17. Ye, J., McGinnis, S., Madden, T.L.: BLAST: improvements for better sequence analysis. Nucl. Acids Res. 34(suppl 2), W6–W9 (2006)
18. Zhang, Z., Zhang, L.: METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs. BMC Bioinform. 22, 1–14 (2021)
Virtual Sensor to Impute Missing Data Using Data Correlation and GAN-Based Model Nguyen Thanh Quan1,2(B) , Nguyen Quang Hung1,2 , and Nam Thoai1,2 1
High Performance Computing Laboratory, Faculty of Computer Science and Engineering, Advanced Institute of Interdisciplinary Science and Technology, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam {ntquan.sdh20,nqhung,namthoai}@hcmut.edu.vn
2 Vietnam National University Ho Chi Minh City, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam

Abstract. Continuous automation of conventional manufacturing and industrial processes is an inevitable trend of the Fourth Industrial Revolution. Every aspect of life has been affected by the Internet of Things (IoT), which refers to the growing number of connected and functional devices that rely on data gathered by IoT sensors. In practice, a system's activity may be terminated or become faulty as a result of a sensor failure because the data flow is disrupted. Therefore, this work leverages machine learning methods, which have lately been used in a number of applications and produced cutting-edge outcomes, to resolve the missing data problem with a virtual sensor solution. The approach uses the correlation of collected data to support a Generative Adversarial Network (GAN) based model in imputing missing values. Compared to other recent virtual sensor techniques using other machine learning models, our approach achieved around 20% better performance on the considered datasets with different metrics.
Keywords: IoT · virtual sensor · machine learning · GAN · missing data imputation

1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 224–233, 2023. https://doi.org/10.1007/978-3-031-46573-4_21

One of the most significant and promising technical areas at the moment is the Internet of Things (IoT). According to market researchers at IHS Markit, there are currently over 20 billion linked devices, and that figure is expected to grow to 73 billion IoT devices by 2025 [1]. IoT enables people to live and work more intelligently while also regaining control over their lives. In essence, the IoT gives organizations a real-time view of how their systems actually function, allows them to automate procedures, and lowers personnel costs. To collect data from the environment, different applications need different sensors. Sensors are used to build a front end for the entire process, with their
Virtual Sensor to Impute Missing Data
225
primary goal being to collect data from their surroundings. These sensors have direct or indirect connections to IoT networks. Therefore, any issue with sensors can result in missing data and affect how these applications function.
This paper proposes a novel approach inspired by Generative Adversarial Imputation Nets (GAIN) [2], based on GAN [3], to create virtual sensors which can replace faulty physical sensors. Our proposal leverages the statistical correlation properties of data collected from neighboring physical sources to select the most correlated data to join the virtual sensor creation. In general, this work brings two main contributions:
– Practical contribution: the approach is an essential data provisioning method with virtual sensors, as it assures the continuous operation of applications and systems relying on complete data.
– Academic contribution: the approach is a new missing-data imputation proposal which is more accurate and reliable when data correlation is considered.
There are seven sections in this paper. The related work is discussed in Sect. 2. Section 3 forms the general problem description. The components of the virtual sensor are presented in Sect. 4. Section 5 describes the algorithm and the configuration. We analyze and discuss the experimental results in Sect. 6. Section 7 discusses the drawbacks of our work and proposes some future directions to address them.
2 Related Work
Virtual sensors have been developed and used in a variety of areas. Eushay Bin Ilyas et al. [4] introduced a virtual sensor as an independent estimator trained through machine learning algorithms, namely Artificial Neural Network (ANN), Linear Regression (LR), and Support Vector Regression (SVR), using historical data as well as data from nearby sensors. Similarly, the authors of [5] developed a way to keep an indoor tracking system operational in the event that one physical sensor fails. Vitale et al. [6] offered a method for creating a virtual sensor that produced entirely new data by leveraging other accessible parameters. In another study [7], a virtual sensor was viewed as a bridge between communication interfaces and applications. The relationships between data sources can also be mined to draw new conclusions. According to the authors of [8], there are three methods to develop virtual sensors: a white box model [9], in which physical relations must be defined as mathematical equations since all relations should be predefined; a black box model [6], where data is purely used to train neural networks; and a grey model [8] with a data correlation assumption. The virtual sensor solution mentioned in [10] was researched with the purpose of solving an optimization problem in which unnecessary physical sensors can be replaced by virtual ones once the best subset of sensors worth keeping in a given room is determined. Additionally, a virtual sensor solution to reduce the criticality of high-risk failures and increase detection ability in [11], and the work in [12] that uses virtual sensors to handle additional responsibilities that physical sensor devices cannot, demonstrate the practical application of virtual sensors in reality.
226
N. T. Quan et al.
3 Problem Description
Consider a space where sensors observe environmental information, creating a network with a d-dimensional space S = S_1 × ... × S_d, where S_i is the dimension of sensor S_i. Let X = (X_1, ..., X_d) be the observed variable of sensor S (continuous due to the characteristics of the environmental datasets) taking values in S, whose distribution will be denoted P(X). Following the GAIN architecture, M = (M_1, ..., M_d) is a random variable taking values in {0, 1}^d derived from X. We call X the data vector observed by sensor S, and M the mask vector.
For each i ∈ {1, ..., d}, a new space S̃_i = S_i ∪ {∗} is established, where ∗ is simply a point not in any S_i, representing a missing value at the time sensor S has trouble. Define S̃ = S̃_1 × ... × S̃_d. A new random variable X̃ = (X̃_1, ..., X̃_d) ∈ S̃ is also defined in the following way (Eq. (1)):

X̃_i = X_i if observed, and X̃_i = ∗ if unobserved.    (1)

Note that M expresses which components of X are observed by sensor S, so M can be easily recovered from X̃.
4 Virtual Sensor Components
The virtual sensor developed in this paper is a machine learning GAN-based model [3], inspired by GAIN [2], so it also consists of two components, a generator and a discriminator. These components are trained in an adversarial process in which two neural networks compete with each other to achieve high accuracy in missing data imputation.

4.1 Generator

The generator, G, of the virtual sensor takes three variables as inputs: X̃ with the Pearson correlation Pc arrangement, M, and noise N. It then produces the predicted X̄. Basically, the random variables X̄ and X̂ are defined as in Eq. (2) and Eq. (3):

X̄ = G(X̃ | Pc, M, (1 − M) ⊙ N)    (2)

X̂ = M ⊙ X̃ + (1 − M) ⊙ X̄    (3)

where ⊙ is the element-wise multiplication. X̄ is the vector of imputed values, i.e., the data points that should have been observed by sensor S. X̂ is the complete data vector after imputation for the faulty sensor, which includes the historical data extracted from the partial observation X̃ with ∗ replaced by the corresponding values of X̄.
The noise passed into the generator is (1 − M) ⊙ N, because the target distribution is P(X | X̃).
Virtual Sensor to Impute Missing Data
4.2
227
Discriminator
As previously mentioned, the discriminator, D, is considered as an adversary to train G. The discriminator tries to distinguish which values are observed by sensor S and which values are estimated by the virtual sensor. Basically, the discriminator can be described by the function D: X → [0, 1]d ˆ corresponding to the probability that i-th value with the i-th component of D(X) ˆ of X is observed normally by the sensor S. 4.3
Data Correlation Arrangement
We position the most correlated sensors next to the sensor that is missing data in order to enable our virtual sensor to understand the trending distribution of the dataset. By doing so, we expect the virtual sensor to impute missing values more reliably. The sensor network space can be described as follow Eq. (4): X = (Xf s , Xp1 , Xp2 , ..., Xpd )
(4)
where Xf s is the variable observed/unobserved of the missing data sensor, Xp−ith are respectively from the most correlative sensor to the worst one. By calculating the correlation, “noise” or “bias” can be eliminated in the process of imputing missing data, so the performance of estimators trained by machine learning algorithms like our virtual sensor, ANN and Support Vector Machine or any other models tends to be higher [13,14]. We can absolutely write a sensor’s observed and missing values as a matrix as Eq. (5): ⎡ f s p1 p2 p3 ⎤ x1 x1 x1 x1 ... xp1d ⎢ f s p1 p2 p3 ⎥ ⎢x2 x2 x2 x2 ... xp2d ⎥ ⎢ ⎥ ⎢ p2 p3 pd ⎥ (5) Mat = ⎢ ∗ xp1 ⎥ x x ... x 3 3 3 3 ⎢ ⎥ ⎢ ⎥ ⎣ ... ... ... ... ... ... ⎦ p2 p3 pd ∗ xp1 n xn xn ... xn where xfi s are observed values and ∗ unobserved values by sensor S, xpi i |i ≤ d are the values of correlated sensors that are arranged left to right corresponding to the most correlative to the worst correlative ones, d is the total number of sensors joining the imputation process. 4.4
Hint
We inherit the hint described by Yoon et al. as a random variable H which gets its values in a hint space H. H is obtained using equation Eq. (6): H = B M + 0.5 (1 − B)
(6)
228
N. T. Quan et al.
where B ∈ {0, 1}d is a random variable gotten by uniformly sampling k from {1, 2, ..., d} and applying the equation Eq. (7). 1 if j = k (7) Bj = 0 otherwise 4.5
Objective
D is trained to have the highest possible chance of properly predicting M. G, on the other hand, is trained with the hope that it can reduce the likelihood that D would correctly predict M. The general objective function and loss function of our solution are Eq. (5) and Eq. (6) below:
5
minmaxL(D, G) G D
(8)
T ˆ L(D, G) = EX ˆ |Pc, M, H[M logD((X|Pc, H) T ˆ +(1 - M) log(1 − D(X|Pc, H)))]
(9)
Algorithm
We first train D with a fixed G using mini-batches of (128 samples) data, 12800 iterations by default. At each loop, n independent samples of N, B, M with ˜ Pearson correlation arrangement are drawn to produce the imputed data X according to Eqs. (2) and (3). Apply the hint H mechanism, then the estimated ˆ is calculated by D(X|Pc, ˆ mask M H) based on the process of training D. Next, update G by preserving the recent trained D fixed. Repeat the steps until the training loss converges. Details of the algorithm is described in Table 1.
6
Experiments
We conducted several experiments on Solar Power (21 sensors) [4,15] dataset. This data collection was also used in a recent work (ANN/LR/SVR virtual sensor solution) [4] which is considered as the baseline result to evaluate our virtual sensor approach. Besides, our method was verified with an indoor temperature (12 sensors) [10] dataset to prove its ability in working with a diversity of datatypes. Table 2 shows more about the characteristics of the faulty sensor having missing data used in this work. 6.1
Performance of the Proposed Virtual Sensor
The performance of the approach was given in Table 3 and 4. We reported the Normalized Root Mean Square Error (NRMSE), Mean Absolute Error (MAE)
Virtual Sensor to Impute Missing Data
229
Table 1. Virtual Sensor Algorithm Algorithm. Pseudo-code of the algorithm Calculate Pearson correlation and sensor arrangement Input: Data of the faulty sensor Sf and data frame of sensor candidates Sc Output: Numpy 2-dimensional array with Pearson arrangement for i = 0..nSc do pi ← Pc(Sf, Sci ) end for Rearrange Sc list with descending order based on Pearson Convert Sc data frame to 2-d numpy array Begin training process Input: D size of batch nD , G size of batch nG Output: Imputed data. Stochastic gradient descent (SGD) is applied. while until training loss reaches convergence: (A) Optimize D x|pc, m}, noise and hint Generate nD samples through {˜ for i = range(nD ): xi |pc, mi , zi ) xi ← G(˜ ˜i + (¯ x−x ¯ mi ) x ˆ i ← mi x hi ← ui mi + (0.5 − 0.5 ui ) end for Use SGD to update D. (B) Optimize G Generate nD with the same way above for i = range(nG ): hi ← ui mi + (0.5 − 0.5 ui ) end for Use SGD to update G. end while
which are the primary metrics to verify our virtual sensor’s performance, the total missing points and the number of physical sensors having the correlative score passing the defined threshold joined the imputation process. We evaluated our virtual sensor on the datasets with 10%, 15%, and 33% missing data percentages respectively to show the robustness of the approach on various missing proportions. Tables 3 and 4 display the RMSE and MAE comparison between using GAIN and applying our virtual sensor on the datasets
230
N. T. Quan et al.
with 10%, 15% missing data proportions. In order to prevent model over-fitting or biases, 5-fold cross validation was used in all of the trials. Precisely, our solution consistently scored (Normalized) RMSE below 0.1 which is deemed to be ideal. There is up to 23% of RMSE improvement compared to GAIN for the temperature collection whereas in terms of the remaining dataset, the percentage of improvement was lower. Similarly, MAE score is also around 20% better with the temperature dataset and around under 10% with the rest. All of the results, though, continue to be favorable and comparable to GAIN. The two datasets’ properties help to understand and provide an explanation for the disparate performance. Temperature data has a high degree of correlation between its components because sensors were used to gather information placed in an indoor space, resulting in close proximity between spots for the data’s trending variability. Contrarily, the sensors used in the Solar dataset were placed outside in a specific region; as a result, the distance between each device was considerable, which unquestionably affected how similar the data were. Table 2. Characteristics of the Faulty Sensor’s Dataset Dataset Sample Mean STD Solar
24000 9255
Tem.
769658 30.9
Min Max
13978 0 1.3
25% 50% 75%
83988 0
14.3
2669 12550
38 30.3 30.8 31.5
Table 3. Virtual Sensor Performance of 10% Missing Data Dataset Model RMSE
MAE
Solar
GAIN 0.041 ± 0.005 530 ± 82 VS 0.040 ± 0.002 510 ± 39
Temp.
GAIN 0.057 ± 0.025 VS 0.05 ± 0.01
0.49 ± 0.2 0.45 ± 0.16
#Missing points #Predictors 2297 2297
20 18
834 834
11 10
Table 4. Virtual Sensor Performance of 15% Missing Data Dataset Model RMSE
MAE
Solar
GAIN 0.051 ± 0.003 689 ± 94 VS 0.049 ± 0.001 628 ± 21
Temp.
GAIN 0.052 ± 0.019 VS 0.04 ± 0.003
#Missing points #Predictors 3446 3446
20 18
0.43 ± 0.15 1251 0.33 ± 0.03 1251
11 10
Virtual Sensor to Impute Missing Data
231
Table 5 shows the MAE, R2 between our approach with the virtual sensor created by ANN, LR and SVR mentioned in the related work. Our virtual sensor gave better MAE score with around 10% of improvement against the results of ANN, LR, SVR. R2 score still looks comparative with the ANN/LR/SVR virtual sensor. Besides, RMSE is 0.072, also an ideal result. In order for the result to make sense, we performed exactly the same scenario verified in [4] with 33% of the missing data separation for hold-out test data, 5-fold cross validation was selected, and the validation was done on the same sensor. Table 5. Virtual Sensor Performance Comparison With 33% Missing Data Model
RMSE
MAE
R2
ANN
N/A
13992
0.77
LR
N/A
14298
N/A
Virtual sensor 0.072 ± 0.01 12544 ± 3012 0.74 ± 0.08
In order to make the comparison result more convincing, only 8 highest correlating sensors were chosen to become predictors and participate in predicting missing values. This strategy was applied because the authors in [4] limited that number as the maximum threshold when selecting physical sensors’ data to train ANN/LR/SVR models. When the number of predictors was 18 as ideally defined in Tables 3 and 4, the result was much significantly better with RMSE: 0.052 ± 0.002, MAE: 650 ± 36, and R2 : 0.86 ± 0.009. In terms of the temperature dataset, the result still looks promising with RMSE: 0.06 ± 0.003, MAE: 0.51 ± 0.05, and R2 : 0.78 ± 0.02 on 33% missingdata proportion which is equivalent to 2751 missing points. Obviously, the longer a physical sensor is inactive, the worse the data is imputed by a virtual one. 6.2
Virtual Sensor Prediction Accuracy
We chose the highest missing rate 15% in our experiments to visualize the distance between the actual observed values and the imputed values.
Fig. 1. Virtual sensor’s performance on Temperature dataset - 15% missing data
232
N. T. Quan et al.
Fig. 2. Virtual sensor’s performance on Solar Power dataset - 15% missing data
The imputed and real data are depicted in Figs. 1 and 2. It is clear that there is not much of a gap between expected and observed data. The imputed ones continue to guarantee the datasets’ overall value distribution for missing data percentages. Therefore, it is possible that the virtual sensor will be able to produce data that is more accurate. By removing the least correlated sensors based on Pearson correlation, it helps the virtual sensor understand the trending variability of a dataset so that it can make better predictions, hence, this removes bias in the data. The experimental results also show a possible benefit for enterprises when the actual number of sensors installed can be reduced, therefore the deployment and management costs are undoubtedly lower, and the productivity is higher. However, as previously indicated, our method is expected to function with data from environmental sensors, so it necessitates that the data input be pleased with Pearson since Pearson is best suited for measurements made using an interval scale. On the other hand, accuracy may be compromised.
7
Conclusions and Future Work
With the use of historical data from the problematic sensor and data correlation calculations from adjacent data sources, the novel virtual sensor approach, used in this study is able to deal with the missing-data problem. Numerous studies using real-world datasets have been conducted to support two advantageous hypotheses: enhancing GAIN’s imputation accuracy, defeating ANN/LR/SVR virtual sensor’s performance and guaranteeing the accurate operation of applications and systems that rely on complete data. Future research will look into solutions to apply the virtual sensor to the optimization problem in order to reduce the quantity of physical sensors required for data collection. Furthermore, we consider extending the solution so that it can work with multiple data dimensions where a sensor can collect more than one kind of data such as temperature, humidity, wind speed, etc. in parallel.
Virtual Sensor to Impute Missing Data
233
Acknowledgement. This research was conducted within 58/20-DTDL.CN-DP Smart Village project sponsored by Ministry of Science and Technology of Vietnam. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNUHCM for supporting this study. We acknowledge the support of time and facilities from High Performance Computing Laboratory, HCMUT. We also acknowledge the support and collaboration from TIST Lab.
References 1. IHS Markit. 8 in 2018: The top transformative technologies to watch this year (2018). https://cdn.ihs.com/www/pdf/IHS-Markit-2018-Top-TransformativeTechnology-Trends.pdf 2. Yoon, J., Jordon, J., Schaar, M.: Gain: missing data imputation using generative adversarial nets. In: International Conference on Machine Learning, pp. 5689–5698. PMLR (2018) 3. Goodfellow, I., et al.: Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27 (2014) 4. Ilyas, E.B., Fischer, M., Iggena, T., T¨ onjes, R.: Virtual sensor creation to replace faulty sensors using automated machine learning techniques. In: 2020 Global Internet of Things Summit (GIoTS), pp. 1–6. IEEE (2020) 5. Pedrollo, G., Konzen, A.A., de Morais, W.O., de Freitas, E.P.: Using smart virtualsensor nodes to improve the robustness of indoor localization systems. Sensors 21(11), 3912 (2021) 6. Vitale, A., Corraro, F., Genito, N., Garbarino, L., Verde, L.: An innovative angle of attack virtual sensor for physical-analytical redundant measurement system applicable to commercial aircraft. Adv. Sci. Technol. Eng. Syst. J. 6, 698–709 (2021) 7. Furdik, K., Lukac, G., Sabol, T., Kostelnik, P.: The network architecture designed for an adaptable IoT-based smart office solution. Int. J. Comput. Netw. Commun. Secur. 1(6), 216–224 (2013) 8. Yoon, S., Choi, Y., Koo, J., Hong, Y., Kim, R., Kim, J.: Virtual sensors for estimating district heating energy consumption under sensor absences in a residential building. Energies 13(22), 6013 (2020) 9. Guzm´ an, C.H., et al.: Implementation of virtual sensors for monitoring temperature in greenhouses using CFD and control. Sensors 19(1), 60 (2018) 10. Brunello, A., Urgolo, A., Pittino, F., Montvay, A., Montanari, A.: Virtual sensing and sensors selection for efficient temperature monitoring in indoor environments. Sensors 21(8), 2728 (2021) 11. Wei, C., Song, Z.: Virtual sensors of nonlinear industrial processes based on neighborhood preserving regression model. 
IFAC-PapersOnLine 53(2), 11926–11931 (2020) 12. Mukherjee, N., Bhunia, S.S., Bose, S.: Virtual sensors in remote healthcare delivery: some case studies. In: Healthinf, pp. 484–489 (2016) 13. Luor, D.-C.: A comparative assessment of data standardization on support vector machine for classification problems. Intell. Data Anal. 19(3), 529–546 (2015) 14. Anysz, H., Zbiciak, A., Ibadov, N.: The influence of input data standardization method on prediction accuracy of artificial neural networks. Procedia Eng. 153, 66–70 (2016) 15. ITK Digital. Solar power panel dataset at open data dk. https://www.opendata. dk/city-of-aarhus/solcelleanlaeg
An Edge AI-Based Vehicle Tracking Solution for Smart Parking Systems Doan Viet Tu1,2 , Pham Minh Quang1,2 , Huynh Phuc Nghi1,2(B) , and Tran Ngoc Thinh1,2 1
Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam {tu.doanqsb,quang.pham24,nghihp,tnthinh}@hcmut.edu.vn 2 Vietnam National University - Ho Chi Minh City (VNU-HCM), Thu Duc, Ho Chi Minh City, Vietnam
Abstract. Smart Parking technology has emerged as a prominent and influential trend with the potential to alleviate traffic congestion in urban areas. In the field of Internet of Things (IoT) and computer vision, extensive research has been conducted on this topic. Particularly, in numerous parking lots, the tracking of vehicles entering and exiting using surveillance cameras has become a primary task to ensure the security of the premises. In this study, we propose an efficient vehicle tracking solution by leveraging deep learning technology on a low-latency and costeffective IoT Edge device, NVIDIA Jetson Nano, a well-known platform with accurate detection in computer vision tasks. Our approach is based on Convolutional Neural Networks (CNNs) and enables the long-term tracking of multiple objects. We train the system using various learning rates on our customized dataset to enhance the accuracy of vehicle recognition from diverse angles. Subsequently, we evaluate the system’s performance in a practical setting, specifically tracking vehicles as they move in and out of an outdoor parking lot through surveillance cameras. The experimental results demonstrate high performance, achieving a mean MOTA (Multiple Object Tracking Accuracy) of 59% and MOTP (Multiple Object Tracking Precision) of 76.86% at an average frame rate of 4–5 FPS (frames per second) on Jetson Nano. These outcomes validate the system’s efficiency and indicate its potential for future development and deployment. Keywords: smart parking systems
1
· CNN · object tracking · embedded
Introduction
In the field of computer vision, object tracking stands out as one of the most prominent tasks. The primary goal of visual object tracking is to assign a distinct identifier to an object within a video frame, detect its presence or absence in the current frame, and accurately determine its spatial location. This area of c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 234–243, 2023. https://doi.org/10.1007/978-3-031-46573-4_22
An Edge AI-Based Vehicle Tracking Solution for Smart Parking Systems
235
research finds extensive applications in real-world scenarios, including but not limited to autonomous vehicle tracking, robotics, traffic monitoring, and smart parking. Object tracking encompasses both short-term tracking and long-term tracking. Overcoming the challenges associated with tracking objects in long videos involves addressing several issues. These include short-term association with unobstructed objects and long-term association with objects that undergo occlusion and subsequently reappear. Existing methods typically focus on these tasks separately and are tailored to specific scenarios, resulting in solutions that heavily depend on engineering and lack universal applicability. There are diverse challenges in object tracking such as change of illumination, partial or full occlusion, change of target appearance, blurring caused by camera movement, presence of similar objects to the target, changes in video image quality through time, etc. Conventional computer vision methods prove inadequate in achieving robust tracking performance, especially in long-term tracking scenarios. In this paper, we propose an effective solution for vehicle tracking by leveraging deep learning technology on embedded edge devices. These devices have characteristics such as low latency, low cost, and relatively good detection accuracy, making them suitable for computer vision tasks. Our system aims to address the practical challenge of vehicle management in specific garages or parking lots. To train and evaluate our approach, extensive experiments have been conducted using the VERI-776 [3] and VIRAT [4] datasets. The results demonstrate high accuracy with a processing speed of approximately 4–5 frames per second on NVIDIA Jetson Nano, surpassing the performance of some existing methods. These results highlight the potential of our approach for future advancements in this field. 
The main contributions of this paper are as follows: – We propose a vehicle tracking system for parking lots implemented on a costeffective edge computing platform. – To enhance vehicle detection accuracy, we address a limitation of the DeepSORT framework [8] by applying a CNN that we trained to discriminate car re-identification on the VERI-776 [3] dataset. Unlike the original DeepSORT, which focuses on pedestrian tracking, our approach expands its applicability to vehicles. During the training phase, we iteratively adjust parameters, such as the learning rates, to achieve better results. The rest of the paper is organized as follows. Section 2 discusses the related works on object tracking and Edge computing. Section 3 outlines our proposed method in detail. Section 4 presents the results and evaluation of our approach. Lastly, Sect. 5 presents the concluding remarks.
2
Related Work
We have made some research into several object detection and tracking architectures that could potentially be used for deployment on resource-limited Edge
236
D. V. Tu et al.
devices. In general, all of these proposed models struggle with the trade-offs between accuracy and real-time performance. In [7], the authors present a solution to tackle the implementation challenges of Deep Neural Networks (DNNs) on Edge devices. Specifically, the proposed object detection system is designed for deployment on the MAX78000 DNN accelerator. The system emphasizes the need for conciseness, effectiveness, and comprehensiveness by incorporating various stages such as model training, quantization, synthesis, and deployment. Notably, experimental results demonstrate the system’s capability to produce highly accurate real-time detection using a compact 300-KB DNN model. The inference time for this model is recorded at a remarkable 91.9 ms, accompanied by an energy consumption of merely 1.845 mJ. In [5], the authors address the challenge of accommodating the substantial model size associated with contemporary Deep Learning tasks during deployment on Edge devices with limited resources. The paper explores the implementation of loop fusion and post-training quantization techniques to achieve real-time performance without compromising prediction accuracy. The findings demonstrate that the application of optimization techniques, such as quantization, can lead to a notable 2x-6x enhancement in the speed of inference execution, which is crucial for practical deployment on Edge devices. However, it is important to note that this speed gain is accompanied by a significant decline in accuracy, raising concerns about the trade-off between speed and precision. In [2], the paper focuses on a real-life scenario involving parking space detection utilizing the YOLOv3 network. The authors introduce a residual structure to extract deep vehicle parking space features and use four different scale feature maps for object detection so that deep networks can extract more fine-grained features. 
The experimental results demonstrate that this approach enhances the accuracy of vehicle and parking space detection while simultaneously reducing the missed detection rate. In [1], a multi-object tracking framework is proposed, which incorporates Kalman filtering in image space and frame-by-frame data association using the Hungarian method. The association metric employed in this approach quantifies the overlap between bounding boxes. This straightforward methodology emphasizes the influence of object detector performance on tracking outcomes by achieving favorable results at high frame rates. It also provides valuable insights for practitioners engaged in multi-object tracking, underscoring the pivotal role of the object detector in achieving successful tracking results on datasets such as the MOT challenges. Although the framework achieves good performance in terms of tracking precision and accuracy, it tends to have a relatively high number of identity switches. This is primarily due to the limitation of the association metric, which is effective only when there is low uncertainty in state estimation. Consequently, the framework encounters challenges when tracking objects through occlusions, particularly in scenarios captured by frontalview cameras. In general, numerous related studies propose diverse techniques for object detection and tracking, and conduct various experiments, all aiming to address the trade-off among different metrics. In our specific research, considering the constraints of resource-limited Edge devices, we prioritize accuracy as a key consideration in making trade-offs during our implementation process.
An Edge AI-Based Vehicle Tracking Solution for Smart Parking Systems
3
237
Proposed Method
This section presents the recommended system architecture that can be deployed on IoT Edge devices that speed up neural network computation. The system is then fine-tuned with our customized dataset for greater accuracy when detecting vehicles. During the training phase, we also change the model’s learning rates to get the best results (Fig. 1).
Fig. 1. System architecture of DeepSORT [8].
In this system, frame sequences are utilized as input, which can be sourced from video files or camera feeds. These sequences are then fed into the first CNN model. The YOLOv5 handles the detection phase, then it crops the objects in the bounding box and transfers them into the Data Association or Association phase where the second CNN model, without a classifier layer, extracts the feature of the objects. Next, the features are passed to Matching Cascade strategy section in order to increase the accuracy of data association. This system manages the life cycle of a track using a variable with 3 states called: tentative, confirmed, deleted : – The state tentative is when the new track is initialized with a value to scout. – If the value still exists for the next 3 frames, it will be changed to the confirmed state. – The tracks with the confirmed value if disappears will not be deleted for the next 30 frames. – If the tracks are lost in less than 3 frames, it will be deleted.
238
D. V. Tu et al.
Fig. 2. Fine-tune CNN feature extractor block with customized Veri776 dataset
Matching Cascade compares the tracked objects from previous frames to build Cost Matrix. During computing Cost Matrix, two results are obtained. The first result is the unmatched measurement in the frame, which means this object track is new, so the new track is assigned a tentative variable which is managed in Track Management phase. If the track appears consecutively in the next three frames, the tentative variable is changed into the confirmed variable. On the other hand, if the track exists for less than three frames, the deleted variable is activated. In the worst case, the track may exist in the next 30 frames. Following this process, those track objects along with their output conditions, such as the expected output class of ’vehicles’, are drawn into bounding boxes with track ids. Finally, the tracks that have just output go into KF Estimation state for further frames and are assigned the tentative variables again which are the inputs of Cost Matrix computing and the cycle is looped again. In the first part, the DeepSORT method utilizes motion and appearance metrics to associate valid tracks. In the second part, the same data association strategy employed in the SORT method is employed to link recently created tentative tracks with unmatched detections. The motion metric is incorporated by the (squared) Mahalanobis distance between the predicted states and the detections. The second metric is based on the smallest cosine distance which measures the distance between each track and its appearance features. The appearance features are computed using a pre-trained CNN model. This pre-trained model is a Wide Residual Network (WRN) [9], described in Table 1. In the training phase, we both consider this original WRN and a new modified version proposed by ZQPei [10], described in Table 2, for comparison.
An Edge AI-Based Vehicle Tracking Solution for Smart Parking Systems Table 1. Wide Residual Network [9]
239
Table 2. Modified WRN[10]
Name
Size/Stride Output Size
Name
Conv1
3 × 3/1
32 × 128 × 64
Conv1
3 × 3/1
64 × 64 × 32
Conv2
3 × 3/1
32 × 128 × 64
MaxPool2
3 × 3/2
64 × 64 × 32
MaxPool3
3 × 3/2
32 × 64 × 32
Residual3
3 × 3/1
64 × 64 × 32
Residual4
3 × 3/1
32 × 64 × 32
Residual4
3 × 3/1
64 × 64 × 32
Residual5
3 × 3/1
32 × 64 × 32
Residual5
3 × 3/2
128 × 32 × 16
Residual6
3 × 3/2
64 × 32 × 16
Residual6
3 × 3/1
128 × 32 × 16
Residual7
3 × 3/1
64 × 32 × 16
Residual7
3 × 3/2
256 × 16 × 8
Residual8
3 × 3/2
128 × 16 × 8
Residual8
3 × 3/1
256 × 16 × 8
Residual9
3 × 3/1
128 × 16 × 8
Residual9
3 × 3/2
512 × 8 × 4
Dense10
128
Batch & L2
128
Residual10 3 × 3/1 AvgPool11 3 × 3/1
512 × 8 × 4 512
Size/Stride Output Size
The customized Veri776 dataset is then used to fine-tune the CNN feature extractor, aiming to enhance the accuracy of vehicle detection, as depicted in Fig. 2. To get the best model, we also adjust the model’s learning rates during the training phase.
4 4.1
Experimental Results Training Phase
Experimental Setups. This paper leverages the NVIDIA Jetson NanoTM Developer Kit as the Edge computing platform of choice. For the training phase, we utilize the VERI-776 dataset [3] to train our models. In order to evaluate the performance and effectiveness of our approach, we employ the VIRAT dataset [4] which encompasses diverse video sequences capturing vehicles in parking lots under varying weather conditions. The benchmark considers a few important evaluation metrics, which are MOTA (Mul-tiple Object Tracking Accuracy), MOTP (Multiple Object Tracking Precision), IDF1 (ID F1 Score), MT (Mostly Tracked). ML (Mostly Lost), FN (False Negatives), FP (False Positives) and IDSW (ID Switches). Training Phase. As mentioned earlier in Sect. 3, the DeepSORT method incorporates two CNNs. To enhance the effectiveness of the re-identification task, we propose applying a fine-tuning approach to train the feature extractor WRN using the VERI-776 dataset [3]. This dataset comprises 776 distinct cars, each captured from 10 to 20 different angles, with a total of 49,357 images. By utilizing this training method, we can significantly improve the accuracy of vehicle detection, as the original DeepSORT method primarily focuses on pedestrian
240
D. V. Tu et al.
tracking. We employ the CrossEntropy loss function and the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.1, with an input size of 128 × 64. Furthermore, we explore alternative optimizers, such as Adam with variations of the learning rate for better results. 4.2
Evaluation Table 3. Comparison between WRN models. Method
MOTA MOTP IDF1
MT
ML
FP
FN
IDsw
Original WRN 59.08% 76.84% 76.6% 30.1% 36.3% 12211 33628 196 ZQPei WRN
59.0%
76.84% 77.4% 30.8% 37.0% 12437 33609 143
Comparing WRN Models. This is the comparison between the original WRN [9] and the modified WRN by ZQPei [10]. The setup uses input images from the VERI-776 [3] dataset with the size 128 × 64 and the optimizer is SGD (lr = 0.1) with Cross Entropy Loss function. The whole DeepSORT system is tested on the customized VIRAT [4] dataset with input videos of size 1920 × 1080. From Table 3 we saw that ID switches in the WRN model by ZQPei are around 35% less than the original model, although keeping nearly the same MOTA and MOTP. We also tried on some more parameter modifiers on ZQPei’s model such as changing the optimizer into Adam and also tried with different learning rates. Because the system used the same YOLOv5 detector, we will not notice a big difference on either MOTA, MOTP, IDF1, MT or ML metrics. However, the change of optimizer in the WRN model used for appearance integrating for ReID purposes will result in the ID switches of the system. From Table 4, WRN with SGD(lr = 0.05) will yield the best result with the least ID switches, only 100 switches compared to the others. And Adam(lr = 0.01) is the worst one. Below are the line graphs explaining how different optimizers can achieve different results. With the SGD Optimizer from Fig. 3, we could see that at the 10–20 epochs for each learning rate, there was a gap between each other and the learning rate of 0.01 was the first to reach its stable accuracy (96.06%) on epoch 20. The training accuracy for the WRN model using SGD optimization with an LR of 0.1 was 91.41% after 20 epochs. For the remaining training epochs, the accuracy remained comparatively constant and increased up to 97.16%. Similar results were obtained with an LR of 0.05, which achieved a peak validation accuracy of 97.24% at epoch 50, as opposed to the smallest LR (0.01), which can only achieve 97.17% accuracy at epoch 47. However, in the end, all the 3 learning rates reached a very small loss. 
In terms of accuracy, the learning rates behaved much as in the loss graph, with the learning rate of 0.1 having the lowest accuracy at the beginning. After epoch 20, however, all learning rates became stable, especially 0.05 and 0.1, and the accuracy increased drastically from 60–73% to 84% and then 93%. All of them also reached an accuracy of more than 99% at the end.
An Edge AI-Based Vehicle Tracking Solution for Smart Parking Systems
241
Fig. 3. SGD Optimizer Learning Rate Comparison (Red is lr = 0.1, Green is lr = 0.05, Blue is lr = 0.01; solid line is training, dotted line is validation).
Additionally, we trained the modified WRN using Adam optimization with three different learning rates: 0.01, 0.005, and 0.001 (Fig. 4). The loss graph shows that with a learning rate of 0.01, the training process has a clear gap in loss compared with learning rates of 0.001 or 0.005. The loss with lr = 0.01 decreased drastically by epoch 20 (2.8746) and continued to decrease gradually afterward. With learning rates of 0.005 and 0.001, the loss dropped tremendously within the first 10 epochs (0.8715 and 0.8021, respectively), and from epoch 20 onward they became stable and changed little. The error graph on the right shows that with lr = 0.01 the training and validation accuracies are quite low, at only about 15%; however, they still increase gradually and peak at epoch 50, with training accuracy at approximately 52% and validation accuracy at 66%. With learning rates of 0.005 and 0.001, the accuracy reached a stable state after just 20 epochs and topped out at epoch 50 at approximately 99.99% for both.
The Speed. When we ran inference on a computer with a CPU i7-9750H and a GPU GTX 1650 4 GB on a 74-second video, the fine-tuned DeepSORT method achieved a fairly high 19.65 FPS on the GPU and a relatively low 3.14 FPS on the CPU when tracking a car entering the parking lot. When running on the Jetson Nano Developer Kit deployed with TensorRT (INT8), the result is 4.93 FPS, which is expected since our work focuses on accuracy. This is around 4 times slower than running on the GPU and around 1.6 times faster than running on the CPU (Table 5).
D. V. Tu et al.
Fig. 4. Adam Optimizer Learning Rate Comparison (Red is lr = 0.01, Green is lr = 0.005, Blue is lr = 0.001; solid line is training, dotted line is validation).

Table 4. Comparison of different optimizers with different learning rates for the ZQPei WRN model.

Method             MOTA    MOTP    IDF1    MT      ML      FP     FN     IDsw
SGD (lr = 0.1)     59.0%   76.84%  77.4%   30.8%   37.0%   12437  33609  143
SGD (lr = 0.01)    58.0%   76.86%  77.0%   30.8%   35.6%   13113  33876  256
SGD (lr = 0.05)    59.0%   76.84%  77.4%   30.0%   35.6%   12480  33506  100
Adam (lr = 0.01)   57.5%   76.86%  77.7%   30.7%   36.36%  13372  34133  314
Adam (lr = 0.005)  58.84%  76.82%  77.32%  30.0%   37.0%   12551  33599  115
Adam (lr = 0.001)  59.0%   76.84%  77.4%   30.76%  36.36%  12330  33608  164
Table 5. Comparison of inference speed on different hardware.

Hardware device  CUDA Cores  FP32 Performance  FPS
GTX 1650         1,280       4.4 TFLOPS        19.65
CPU i7-9750H     –           –                 3.14
Jetson Nano      384         845 GFLOPS        4.93
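The speed ratios quoted above can be checked directly from the FPS values reported in Table 5 (simple arithmetic on the reported numbers, not a new measurement):

```python
# FPS values as reported in Table 5.
fps = {"GTX 1650": 19.65, "i7-9750H CPU": 3.14, "Jetson Nano": 4.93}

# Jetson Nano relative to the discrete GPU and to the CPU:
vs_gpu = fps["GTX 1650"] / fps["Jetson Nano"]      # ~4.0x slower than the GPU
vs_cpu = fps["Jetson Nano"] / fps["i7-9750H CPU"]  # ~1.6x faster than the CPU
print(f"{vs_gpu:.2f}x slower than GPU, {vs_cpu:.2f}x faster than CPU")
```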
5 Conclusion
This paper primarily focuses on the tasks of multi-object tracking and long-term object tracking within the domain of video surveillance. Our objective is to deploy a suitable framework on edge devices and refine the model through fine-tuning on a customized dataset tailored to the specific real-life scenario of vehicle tracking in parking lots. Overall, we thoroughly investigate and evaluate the effectiveness of the fine-tuned DeepSORT framework. The obtained results serve as compelling evidence of the potential and practicality of our proposed solution in effectively addressing the challenges prevalent in video surveillance, with particular emphasis on vehicle tracking in parking lots.
Acknowledgement. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for supporting this study.
References
1. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468 (2016). https://doi.org/10.1109/ICIP.2016.7533003
2. Ding, X., Yang, R.: Vehicle and parking space detection based on the improved Yolo network model. J. Phys.: Conf. Ser. 1325(1), 012084 (2019). https://doi.org/10.1088/1742-6596/1325/1/012084
3. Liu, X., Liu, W., Mei, T., Ma, H.: PROVID: progressive and multimodal vehicle reidentification for large-scale urban surveillance. IEEE Trans. Multimedia 20(3), 645–658 (2018). https://doi.org/10.1109/TMM.2017.2751966
4. Oh, S., Hoogs, A., Perera, A., et al.: A large-scale benchmark dataset for event recognition in surveillance video. In: CVPR 2011, pp. 3153–3160 (2011). https://doi.org/10.1109/CVPR.2011.5995586
5. Stein, E., Liu, S., Sun, J.Z.: Real-time object detection on an edge device (final report) (2019). https://api.semanticscholar.org/CorpusID:215777549
6. Vaquero, L., Brea, V.M., Mucientes, M.: Real-time siamese multiple object tracker with enhanced proposals. Pattern Recogn. 135, 109141 (2023). https://doi.org/10.1016/j.patcog.2022.109141
7. Wang, G., et al.: BED: a real-time object detection system for edge devices. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM '22), pp. 4994–4998. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3511808.3557168
8. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, pp. 3645–3649 (2017). https://doi.org/10.1109/ICIP.2017.8296962
9. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: Wilson, R.C., Hancock, E.R., Smith, W.A.P. (eds.) Proceedings of the British Machine Vision Conference 2016 (BMVC 2016), York, UK, September 19–22, 2016. BMVA Press (2016). https://www.bmva.org/bmvc/2016/papers/paper087/index.html
10. ZQPei: The modified wide residual network implementation (2022). https://github.com/ZQPei/deep_sort_pytorch
11. Ren, S., et al.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2015)
Low-Light Image Enhancement Using Quaternion CNN

Truong Quang Vinh1,2(B), Tran Quang Duy1,2, and Nguyen Quang Luc1,2

1 Faculty of Electrical and Electronics Engineering, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
[email protected]
2 Vietnam National University – Ho Chi Minh, Ho Chi Minh City, Vietnam
Abstract. Images captured in low-light conditions suffer severely from degradations such as poor visibility and unexpected noise, which cause challenges for computer vision systems. This paper presents an algorithm for low-light enhancement using a Quaternion Convolutional Neural Network (QCNN). The proposed algorithm integrates the 3-tuple (RGB) and one zero/gray channel into a Unet model to extract informative details of color and texture for image reconstruction. In addition, an attention module for the QCNN has been proposed to help the model focus on learning the important parts of the Quaternion feature map. The experimental results show that the proposed method outperforms conventional methods based on deep learning algorithms.
Keywords: Low light image · Image enhancement · Quaternion · Unet · CNN
1 Introduction

Low-light image enhancement plays an important part in computer vision systems such as video surveillance systems, object detection and recognition systems, and automated driving systems. Enhancing low-light images is still a challenging task, since it must deal with color correction, contrast equalization, brightness intensification, and noise reduction simultaneously, given only the low-quality input [1]. To tackle these difficulties, many low-light enhancement methodologies have been studied in recent years. In general, there are several approaches to low-light image enhancement, such as histogram equalization (HE) approaches [2], Retinex theory (RT) based approaches [3], and deep learning-based approaches [4, 5]. HE-based methods aim to increase the contrast by simply applying dynamic range stretching, while Retinex-based methods estimate an illumination map to recover the original brightness and contrast. Kuldeep Singh et al. [2] proposed two recursive histogram equalization methods for low-light image enhancement. These methods apply recursive histogram decomposition based on exposure-based thresholds together with individual sub-histogram equalization, and thus give very effective results for low-exposure images. A well-known Retinex-based method is LIME (Low-light Image Enhancement via Illumination Map Estimation) [3]. LIME performs well on low-light images by estimating the illumination of each pixel as the maximum value over the three color channels of the image. However, most HE-based and RT-based methods only focus on recovering brightness and contrast and ignore the impact of noise. That is the reason for the growth of deep learning-based research in this field. Chen et al. [4] introduced notable work using a Fully Convolutional Network (FCN) to reconstruct images with high quality. The only disadvantage of this method is that it has to use raw images collected from specific camera types such as SONY or FUJI, which is not very convenient for users who do not have a DSLR camera. Ai et al. [5] introduced an upgraded version of the FCN that combines the Unet architecture [6] with the attention technique [7]. This Attention Unet is trained on the SID dataset, the same dataset as the previous research, and has outperformed many traditional methods. Feifan Lv et al. [8] also applied an attention-based technique to build a complex model and achieved great performance by training on their own synthetic dataset.

In general, most previous research treated the three channels of color images as independent components, which means the same operations are applied to each channel to obtain convolution results. This does not exploit the interrelationship between color channels and may lead to a loss of feature information. To address this aspect, the Quaternion Convolutional Neural Network (QCNN) has been proposed. This number field has proved its advantages in many different computer vision tasks such as image denoising [9], color image classification [10], and color image forensics [11]. In a QCNN model, a color image can be represented as a Quaternion matrix, each element of which is a Quaternion consisting of a 3-tuple (RGB) and one zero/gray channel [12]. Convolution layers and other components of a traditional CNN are also reconstructed in the Quaternion field with Quaternion operations.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 244–254, 2023. https://doi.org/10.1007/978-3-031-46573-4_23
In this way, feature maps are represented in a higher-dimensional space, so the neural network can capture richer feature information. Motivated by these advantages, we propose a novel Quaternion-based Attention Unet (QAUNET) for low-light image enhancement. This work integrates an Attention Unet architecture with Quaternion versions of the convolutional network components. By using Quaternion operations and image representation, we can extract more informative details of color and structure and propagate them along the deep fully convolutional network. Based on these features, the Unet architecture with the attention mechanism can reconstruct an image from a low-light, noisy input. Overall, our main contributions can be summarized in three aspects:
• The Quaternion CNN algorithm with an RGB tuple and a zero/gray channel is modeled mathematically to represent an image in a higher-dimensional space.
• An attention module for the Quaternion CNN has been proposed to help the model focus on learning the important parts of the Quaternion feature map.
• The proposed Quaternion-based Unet model combined with the attention module is implemented and applied to low-light image enhancement.
The remaining parts of this paper are organized as follows. Section 2 presents the background of Quaternion CNNs and CNN approaches for low-light image enhancement. Section 3 describes the proposed method for low-light image enhancement. The experimental results are provided in Sect. 4. Finally, the conclusion is given in Sect. 5.
2 Background

2.1 Quaternion Algebra

The quaternion is a kind of hypercomplex number, first proposed by Hamilton in 1843 [13]. Quaternions can be considered an extension of the real and complex domains, consisting of one real part and one imaginary part, where the imaginary part usually has three components. A Quaternion number q in the Quaternion domain H, i.e., q ∈ H, can be defined as q = a·1 + b·i + c·j + d·k, where a, b, c, d are real numbers and 1, i, j, k are the Quaternion unit basis. The imaginary components i, j, k obey Hamilton's rules:

i² = j² = k² = ijk = −1; ij = −ji = k; jk = −kj = i; ki = −ik = j  (1)

We can define Quaternion operations, analogous to those on real or complex numbers, by the following rules.

Addition or subtraction:

q1 ± q2 = (a1 ± a2) + (b1 ± b2)i + (c1 ± c2)j + (d1 ± d2)k  (2)

Conjugation:

q* = a − bi − cj − dk  (3)

Norm:

‖q‖ = √(a² + b² + c² + d²)  (4)

Hamilton product:

q1q2 = (a1a2 − b1b2 − c1c2 − d1d2) + (a1b2 + b1a2 + c1d2 − d1c2)i + (a1c2 − b1d2 + c1a2 + d1b2)j + (a1d2 + b1c2 − c1b2 + d1a2)k  (5)
According to the result of the Hamilton product, each component of the output is related to all Quaternion components of the two input factors. This can be used to build a Quaternion convolution operation that allows the convolution filters to extract interrelationship information across different color channels. The original image color space has only 3 dimensions while the Quaternion domain has 4 dimensions. Zhu et al. [9] proposed a method to map a color image into Quaternion space by ignoring the real part and using only the 3 imaginary parts to represent the R, G, B channels respectively. Parcollet et al. [10] presented another way to deal with the problem, concatenating a zero channel before R, G, B as a zero real part of the Quaternion. This method is more meaningful in that it uses all 4 dimensions of the Quaternion space with its powerful operations. This is also the method used in the proposed QAUNET model.
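The operations (1)–(5), together with the Parcollet-style zero-real-part encoding of an RGB pixel, can be sketched as follows. This is a didactic pure-Python version working on single quaternions as 4-tuples; real implementations vectorize these operations over whole feature maps.

```python
def hamilton(q1, q2):
    """Hamilton product (5) of quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

def conjugate(q):
    """Conjugation (3): negate the three imaginary components."""
    a, b, c, d = q
    return (a, -b, -c, -d)

def norm(q):
    """Norm (4): Euclidean length of the 4-component vector."""
    return sum(x * x for x in q) ** 0.5

def rgb_to_quaternion(r, g, b):
    """Parcollet-style encoding: zero real part, RGB as the imaginary parts."""
    return (0.0, r, g, b)

# Sanity check: q * q's conjugate is purely real and equals ||q||^2.
q = rgb_to_quaternion(0.2, 0.5, 0.7)
assert abs(hamilton(q, conjugate(q))[0] - norm(q) ** 2) < 1e-12
```

The final assertion is a direct consequence of Hamilton's rules (1) and is a convenient unit test when porting these formulas to tensor code.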
2.2 Quaternion Convolutional Neural Network

Based on the Hamilton product, Parcollet et al. [10, 6] presented the Quaternion Convolutional layer, which takes a 4-channel image as input. The convolution operation uses the Hamilton product instead of the traditional one. In a QCNN, activation functions are applied to each channel separately. Let γ_ab^l and S_ab^l be the activation output and the convolution output at layer l at index (a, b), respectively. The convolution process is defined as:

γ_ab^l = α(S_ab^l)  (6)

where α is a Quaternion split activation function, and S_ab^l is computed as follows:

S_ab^l = Σ_{c=0}^{K−1} Σ_{d=0}^{K−1} w^l γ_{(a+c)(b+d)}^{l−1}  (7)

In this equation, w^l is a Quaternion-valued kernel of size (K × K) and the products are Hamilton products.

2.3 CNN Approaches for Image Enhancement

With the development of deep learning, many architectures have been proposed for image enhancement. In this section, we analyze some recent CNN architectures that can be effective for low-light image enhancement. Unet was developed by Olaf Ronneberger [6] for biomedical image segmentation and won the ISBI cell tracking challenge in 2015. Since then, it has been found to be very effective for tasks where the output has a similar size to the input and needs that amount of spatial resolution. A Unet has two symmetrical parts: an Encoder with convolutional and pooling layers for extracting features, and a Decoder for reconstructing the image from the features extracted by the Encoder. ResNet [14] is a CNN architecture developed by Microsoft in 2015; it achieved first place in the ILSVRC 2015 challenge with a 3.57% top-5 error rate. With its Residual Block (ResBlock), ResNet became popular in the research community because it mitigates the vanishing gradient problem when training a CNN and allows models to reach higher accuracy. The combination of ResNet and Unet, called ResUnet, was proposed by FastAI. By replacing the convolution block at each level of the Unet with a ResBlock, the new network achieves better performance than the original Unet almost every time. Unet, ResNet, and ResUnet are thus potential CNN architectures for low-light image enhancement. However, these architectures need to be improved for feature extraction from color low-light images, so that they can produce high-quality reconstructed images.
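To make the quaternion convolution of Sect. 2.2 concrete, the following pure-Python sketch computes one output position of (7) followed by the split activation (6), with quaternions as 4-tuples and ReLU as the split function. Practical layers express the same computation with vectorized real-valued tensor algebra; this loop form is only for illustration.

```python
def hamilton(q1, q2):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

def split_relu(q):
    """Quaternion split activation: apply ReLU to each component, as in (6)."""
    return tuple(max(0.0, x) for x in q)

def qconv_at(fmap, w, a, b):
    """S_ab in (7): sum of Hamilton products of a KxK quaternion kernel w
    with the quaternion feature map fmap, then the split activation."""
    K = len(w)
    s = (0.0, 0.0, 0.0, 0.0)
    for c in range(K):
        for d in range(K):
            p = hamilton(w[c][d], fmap[a + c][b + d])
            s = tuple(si + pi for si, pi in zip(s, p))
    return split_relu(s)
```

With a 1 × 1 kernel equal to the identity quaternion (1, 0, 0, 0), the (non-negative) input quaternion passes through unchanged, which is a handy sanity check.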
3 Proposed Quaternion Attention Unet

3.1 Quaternion ResUnet

In the previous section, we discussed Unet, ResNet, and ResUnet. Since ResUnet outperforms the other architectures, we apply Quaternion convolutions to this network to take advantage of the color interrelationship for color image feature extraction. We also discussed Quaternion algebra and two ways to represent a color image in the Hamilton domain. In this paper, we use Parcollet's representation of color images and apply the Quaternion product based on this representation to build our Quaternion ResUnet. In the encoder part, we use a Quaternion convolution layer instead of a real-valued convolution layer in each encoder block in order to build a fully Quaternion feature-extraction branch. The activation function is also a Quaternion split activation as in (6), with α being the ReLU [15] function. In the symmetric decoder part, we use nearest-neighbor interpolation for up-sampling, followed by a Quaternion convolutional layer to correct the distortion introduced by the interpolation.

q_att^l = ψ^T σ1(W_x^T x_i^l + W_g^T g_i + b_g) + b_ψ
α_i^l = σ2(q_att^l(x_i^l, g_i; Θ_att))  (8)
where x_i^l is the i-th feature map at level l, g_i is the gating signal from the encoder branch, and W_x^T, W_g^T are the transposed weight matrices of the 1 × 1 × 1 convolution layers corresponding to x_i^l and g_i. The summation is then fed into a ReLU activation function σ1. ψ^T and b_ψ are the transposed weight matrix and bias of a 1 × 1 × 1 convolution layer. Finally, a Sigmoid activation function σ2 is applied to calculate the attention factor α_i^l for weighting the feature maps, as described in Fig. 1.

3.2 Quaternion Attention Module

The Attention Module [7] in a CNN is designed to focus on the most important spatial regions of feature maps. For each feature map, an attention map with the same height and width is produced. The weight α of the attention map at position (x, y) evaluates the importance of the corresponding pixel of the feature map: the larger α is, the more important the pixel. Using this method, the network only needs to learn to focus on the important parts of the image. Therefore, in this research, we propose a Quaternion Attention module, which is integrated into our Quaternion ResUnet to enhance model performance. The attention gate can steadily decrease the feature responses in irrelevant background regions and increase the feature responses in relevant foreground regions [16]. The update of the convolution parameters at layer l − 1 can be calculated as follows.
∂(x̂_i^l)/∂(Θ^{l−1}) = ∂(α_i^l f(x_i^{l−1}; Θ^{l−1}))/∂(Θ^{l−1}) = α_i^l ∂(f(x_i^{l−1}; Θ^{l−1}))/∂(Θ^{l−1}) + x_i^l ∂(α_i^l)/∂(Θ^{l−1})  (9)

The gating signal g and the feature map x^l are used to calculate the attention map of x^l, as shown in Fig. 1. During the backward pass, the gradients originating from background regions are reduced, as shown in (9), where ∂(x̂_i^l)/∂(Θ^{l−1}) is the derivative of the attention-weighted output x̂_i^l with respect to the weights Θ^{l−1} of the previous layer, and α_i^l stands for the attention factor of the i-th feature map at level l.

Fig. 1. Attention Module [16]

The Quaternion Attention module is built by applying Quaternion convolutional layers and Quaternion split activation functions to the conventional real-valued attention gate shown in Fig. 1. Because of the nature of Quaternion functions, the attention map now has 4 weights for each pixel of x^l, whereas the meaning of an attention map requires only one weight per pixel. To satisfy this requirement, after calculating the 4-dimensional Quaternion attention map, we map it to a 1-dimensional map by computing the Quaternion norm as shown in (10).

‖q‖ = √(a² + b² + c² + d²)  (10)
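The collapse of the four-component quaternion attention map to a single weight per pixel via the norm in (10) can be sketched as follows (pure Python over nested lists, where a_map[y][x] holds the four quaternion components of one pixel; real code would do this over a tensor in one vectorized call):

```python
def quaternion_norm_map(a_map):
    """Map an HxW quaternion attention map to an HxW scalar map via (10):
    each pixel's weight is the Euclidean norm of its 4 components."""
    return [[sum(comp * comp for comp in q) ** 0.5 for q in row]
            for row in a_map]

# One-pixel example: norm of (0, 3, 0, 4) is sqrt(9 + 16) = 5.
a_map = [[(0.0, 3.0, 0.0, 4.0)]]
scalar_map = quaternion_norm_map(a_map)
```

Note that, unlike a sigmoid output, a norm is not bounded by 1, so the resulting scalar weights may need rescaling depending on how they multiply the feature maps; the paper does not spell this detail out, so the sketch stops at the norm itself.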
3.3 The Proposed Quaternion Attention Unet Model

We combine the Quaternion ResUnet and the Quaternion Attention module to build the Quaternion Attention Unet (QAUNET) model. The proposed model takes advantage of Quaternion convolutions to obtain richer feature information from color images. Besides, the Quaternion Attention module helps the model focus on the important textures of color images, which further improves the performance of the proposed model.
4 Experimental Results

4.1 Datasets

Most existing methods for processing low-light images were evaluated on synthetic data or on real low-light images without ground truth. In our experiments, we used the Low-light Image Enhancement (LOL) dataset, which was collected in real scenes with both low-light and ground-truth images. This dataset was first introduced and used to train RetinexNet [17, 18]. It was constructed by capturing images in the daytime with different camera configurations: first the normal-light images are captured, and then, for the same scenes, the low-light images are captured by decreasing the exposure and ISO. The dataset contains 500 low/normal-light PNG image pairs and was split into two subsets: 485 pairs for the training set (97% of the total data) and 15 pairs for the evaluation set (3% of the total data).

4.2 Training of the Quaternion CNN

In order to train the proposed QAUNET model faster, we utilize Batch Normalization [19] (BN) layers in the CNN. Yin et al. [11] introduced the equations for building a Quaternion version of BN. We used these equations to re-build Quaternion BN layers and evaluated them experimentally.
We referred to the QCNN architecture from [10, 20] to compare the performance of models with and without Quaternion BN. The models have 4 Quaternion Convolution layers, each followed by BN in the BN version. They were trained on the classification task of the CIFAR-10 [21] dataset. After 150 epochs, we obtained the results shown in Fig. 2. In general, the BN model performs better in both loss and accuracy, which means that BN is necessary for the architecture to reach better performance in Quaternion space.
Fig. 2. Quaternion Batch Normalization Learning Curves
Fig. 3. Learning Curves of QAUNET and AUNET
To compare with the proposed QAUNET model, we implemented a real-valued Attention Unet architecture based on Ai et al. [5], which performs well on the LOL dataset. Training is performed for 600 epochs with the Adam optimizer [22], a learning rate of 0.0001 divided by 10 after the first 150 epochs, and the Mean Square Error loss function defined in [23]. We used the Google Colaboratory Pro service to train our network on a Tesla P100 GPU. Our model is trained on the LOL dataset using Mean Square Error loss, aiming to reconstruct a high-quality version of each low-light image. After finishing the training phase, we obtained the learning curves shown in Fig. 3. The performance of our QAUNET is consistently better than that of the real-valued Attention Unet: the loss of QAUNET converges more quickly in both the training and validation phases and finally reaches a smaller value, although QAUNET's learning curves have some unstable points in the training phase. The training results demonstrate the advantages of the Quaternion convolution operation discussed above.
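The learning-rate schedule described above (1e-4, divided by 10 after the first 150 of 600 epochs) can be written as a small framework-agnostic helper; the function name and signature are our own, only the numbers come from the training setup:

```python
def lr_at(epoch: int, base_lr: float = 1e-4, drop_epoch: int = 150,
          factor: float = 10.0) -> float:
    """Step schedule: base_lr for the first drop_epoch epochs,
    then base_lr / factor for the rest of training."""
    return base_lr if epoch < drop_epoch else base_lr / factor

# Schedule over the 600-epoch run described in the text:
schedule = [lr_at(e) for e in range(600)]
```

In a typical training loop this value would be assigned to the optimizer's learning rate at the start of each epoch.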
4.3 Performance Evaluations

In the experiments, we used three metrics to evaluate the performance of the proposed QAUNET model: MSE, Peak Signal-to-Noise Ratio [24] (PSNR), and Structural Similarity Index Measure [25] (SSIM). PSNR and SSIM are well-known benchmarks for image restoration in general and for low-light image enhancement in particular. Both the training-phase and validation-phase PSNR and SSIM of QAUNET are better than those of the original Attention Unet model [5], as shown in Fig. 4 and Fig. 5. According to the results in Table 1, the improvements in validation PSNR and SSIM of the proposed method are 2.59 dB and 7%, respectively. The better results of the proposed model are due to the color feature maps obtained from the Quaternion convolution layers together with the attention module. Results on the low-light image enhancement task of the LOL dataset are shown in Fig. 6. It is worth noticing that QAUNET produced better image restoration than the real-valued model: after 600 epochs, QAUNET performed well in color reconstruction, while the real-valued model still could not reconstruct color areas smoothly. This is a result of using the Hamilton product instead of the dot product in the convolution operation; color space information and the interrelationship between different channels have been extracted and used for restoring the image's color, structure, and details.
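PSNR is defined directly from the MSE between the enhanced image and its ground truth. For images scaled to [0, 1], a minimal reference implementation is the following (SSIM involves local means, variances, and covariances and is usually taken from an image-processing library rather than hand-written):

```python
import math

def mse(img_a, img_b):
    """Mean squared error over flat sequences of pixel values in [0, 1]."""
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(img_a, img_b)
    return float("inf") if err == 0 else 10.0 * math.log10(max_val ** 2 / err)

# A uniform per-pixel error of 0.1 gives MSE = 0.01, i.e. 20 dB.
example = psnr([0.5, 0.5], [0.6, 0.6])
```

Higher PSNR means the enhanced image is numerically closer to the ground truth; the 2.59 dB gap reported in Table 1 is measured on this scale.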
Fig. 4. PSNR Evaluation
Fig. 5. SSIM Evaluation
Table 1. Performance evaluation of the proposed models

Model                      Phase       MSE loss    PSNR (dB)  SSIM
Quaternion Attention Unet  Training    0.00032916  32.104114  0.913958
                           Validation  0.00806230  23.576250  0.860455
Attention Unet [5]         Training    0.00133201  29.958316  0.878463
Fig. 6. Restoration results: a, e: Ground truth image; b, f: low light image; c, g: enhanced images from Attention Unet; d, h: enhanced images from Quaternion Attention Unet.
5 Conclusion and Future Work

This paper proposes a Quaternion Attention Unet network, a combination of a Quaternion Convolutional Network, an attention mechanism, and the ResUnet architecture. Our work is a solution to the low-light enhancement problem in image processing: the QAUNET model can enhance a low-light image with low PSNR and SSIM into a better-quality image. The key approach is using Quaternion versions of the convolution operation and the attention mechanism to map color images into Quaternion space. This helps the network learn the interrelationship between different channels, which in turn helps the model restore the image's color. In general, our model has almost the same number of parameters as the Attention Unet, but still performs better than the real-valued version on the LOL dataset. Our model can be used in applications such as night-sight autonomous cars or surveillance cameras.

Acknowledgements. This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number C2022-20-15.
References
1. Wang, W., Wu, X., Yuan, X., Gao, Z.: An experiment-based review of low-light image enhancement methods. IEEE Access 8, 87884–87917 (2020)
2. Singh, K., Kapoor, R., Sinha, S.K.: Enhancement of low exposure images via recursive histogram equalization algorithms. Optik 126(20), 2619–2625 (2015)
3. Guo, X., Li, Y., Ling, H.: LIME: low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 26(2), 982–993 (2017)
4. Chen, C., Chen, Q., Xu, J., Koltun, V.: Learning to see in the dark. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, pp. 3291–3300 (2018)
5. Ai, S., Kwon, J.: Extreme low-light image enhancement for surveillance cameras using attention U-Net. Sensors 20(2), 495 (2020)
6. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. arXiv:1505.04597 (2015)
7. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_1
8. Lv, F., Li, Y., Lu, F.: Attention guided low-light image enhancement with a large scale low-light simulation dataset. arXiv:1908.00682 (2020)
9. Zhu, X., Xu, Y., Xu, H., Chen, C.: Quaternion convolutional neural networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11212, pp. 645–661. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01237-3_39
10. Parcollet, T., Morchid, M., Linarès, G.: Quaternion convolutional neural networks for heterogeneous image processing. arXiv:1811.02656 (2018)
11. Yin, Q., Wang, J., Luo, X., Zhai, J., Jha, S.K., Shi, Y.-Q.: Quaternion convolutional neural network for color image classification and forensics. IEEE Access 7, 20293–20301 (2019)
12. Grigoryan, A.M., Agaian, S.S.: Quaternion and Octonion Color Image Processing with MATLAB. SPIE (2018)
13. Hazewinkel, M., Gubareni, N., Kirichenko, V.V.: Algebras, Rings and Modules, vol. 1. Springer, Dordrecht (2004). https://doi.org/10.1007/1-4020-2691-9
14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 770–778 (2016)
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
16. Oktay, O., et al.: Attention U-Net: learning where to look for the pancreas. arXiv:1804.03999 (2018)
17. Parthasarathy, S., Sankaran, P.: An automated multi scale retinex with color restoration for image enhancement. In: 2012 National Conference on Communications (NCC), Kharagpur, India, pp. 1–5 (2012)
18. Wei, C., Wang, W., Yang, W., Liu, J.: Deep retinex decomposition for low-light enhancement. arXiv:1808.04560 (2018)
19. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167 (2015)
20. Theis, L., Shi, W., Cunningham, A., Huszár, F.: Lossy image compression with compressive autoencoders. arXiv:1703.00395 (2017)
21. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto (2009)
22. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv:1412.6980 (2017)
23. Sammut, C., Webb, G.I.: Encyclopedia of Machine Learning. Springer, New York (2010). https://doi.org/10.1007/978-0-387-30164-8
24. Huynh-Thu, Q., Ghanbari, M.: Scope of validity of PSNR in image/video quality assessment. Electron. Lett. 44(13), 800 (2008)
25. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Leverage Deep Learning Methods for Vehicle Trajectory Prediction in Chaotic Traffic

Tan Chau, Duc-Vu Ngo, Minh-Tri Nguyen, Anh-Duc Nguyen-Tran, and Trong-Hop Do(B)

Faculty of Information Science and Engineering, University of Information Technology, Vietnam National University of Ho Chi Minh City, Ho Chi Minh City, Vietnam
{20520926,20520950,20522052,20521198}@gm.uit.edu.vn, [email protected]

Abstract. Vehicle trajectory prediction is a crucial task in the field of intelligent transportation systems, with applications in traffic management, autonomous driving, and advanced driver assistance systems. In this project, we aim to demonstrate the capabilities of this technology by collecting and analyzing video footage of traffic in Thu Duc District, Ho Chi Minh City. Utilizing YOLOv7 for object detection and DeepSORT for object tracking, we are able to accurately gather the coordinates of vehicles in the scene. Using these coordinates, we then employ hybrid models such as CNN-LSTM and CNN-GRU to predict the future trajectories of the vehicles. The results are then visualized and superimposed onto the original video footage to showcase the capabilities of these predictive models.

Keywords: Deep Learning · Vehicle Detection and Tracking · Vehicle Trajectory Prediction · Time Series Regression

1 Introduction

1.1 Vehicle Trajectory Prediction
Vehicle trajectory prediction is an important and challenging task in the field of intelligent transportation systems [1]. The goal is to predict the future positions of vehicles in a given scene, which plays a crucial role in various applications such as traffic management, autonomous driving, and advanced driver assistance systems.

1.2 The Challenges in Vietnamese Traffic
Nowadays, with the rapid advancement in technology, the development of self-driving cars has become a major focus in the automotive industry. One of the key
256
T. Chau et al.
components of a self-driving car is the ability to accurately predict the trajectory of other vehicles on the road. This is crucial for the safe and efficient operation of self-driving cars, as they need to be able to anticipate the actions of other vehicles in order to handle unexpected situations that may arise. However, the implementation of self-driving cars in Vietnam poses a number of significant challenges. One of the biggest obstacles is the chaotic and dense nature of the country’s roads. The heavy traffic and high number of motorcycles on the roads make it difficult for autonomous vehicles to navigate and predict the actions of other vehicles. Furthermore, the lack of lane markings and clear road signage adds to the complexity of the operating environment. Additionally, the density of the traffic can also cause problems for the sensors and cameras used by self-driving cars to perceive their surroundings. These factors make it a difficult task for self-driving cars to operate safely and efficiently in Vietnam [2]. The high number of vehicles on the road can cause sensors to become overwhelmed, making it difficult for the self-driving car to accurately detect and track other vehicles.
2 Related Work
Our team set out to conduct an experiment. We traveled through Ho Chi Minh City, specifically the Thu Duc district, to record footage of the traffic. These videos captured the scenery as well as the movement of other vehicles on the road. We then used a combination of YOLOv7 [3] and DeepSORT [4] to detect and track the vehicles in the video. The coordinates of the bounding boxes and the types of vehicles were recorded and written to a file, which was then preprocessed into a tidy dataset for further analysis and prediction. The input of this problem is the coordinates of the centers of the vehicle bounding boxes; the output is the predicted coordinates of those centers in the next frame. A time series approach suits this problem best because we can predict the future position based on the past positions of the vehicles. Vehicle trajectory prediction can be considered both a time-series forecasting problem and a time-series regression problem: both involve using historical data on the movement of the vehicle to predict its future path or location. The difference is that time-series regression may include additional features beyond the predicted values in the series, such as the type of vehicle, the weather, and many other factors that can affect the trajectory of the vehicle. We choose the time-series regression approach because the trajectory of a vehicle may be affected by many outside and inside factors.
3 Methods
Vehicle trajectory prediction involves three sub-problems: vehicle detection, tracking, and trajectory prediction. Detection identifies vehicles in a video, tracking monitors their movement, and trajectory prediction uses past coordinates to anticipate future positions. The detection module identifies vehicles, the tracking module supervises their coordinates, and the prediction module employs CNN-LSTM and CNN-GRU models to forecast trajectories based on the data from the previous stages.

3.1 Vehicle Detection
YOLO [5] (You Only Look Once) is a real-time object detection system developed by Joseph Redmon and Ali Farhadi in 2016. It is a convolutional neural network (CNN [6]) based approach to object detection and is considered one of the fastest and most accurate object detection algorithms in the field of computer vision. YOLO divides an image into a grid and uses CNNs to predict the class and bounding box coordinates of objects within each grid cell. The key advantage of YOLO is its real-time processing capability, making it suitable for applications such as self-driving cars and real-time video surveillance. Since its release, there have been several versions of YOLO, such as YOLOv2 [7], YOLOv3 [8], and YOLOv7 [3], each with increasing accuracy and performance improvements. YOLOv7 [3], the latest version, is considered the most accurate and efficient and boasts several improvements over its predecessors. These include a new architecture called SPP-Net, which allows the system to efficiently process images of any size, and a new loss function called CIoU loss, which improves the accuracy of bounding box predictions. Additionally, YOLOv7 includes several other modifications such as a new anchor system and a new data augmentation technique.

3.2 Vehicle Tracking
DeepSORT [4] is a real-time object tracking algorithm that utilizes deep learning to track objects in video sequences. It is an extension of the popular SORT (Simple Online and Realtime Tracking) algorithm, which uses the Kalman filter [9] to track objects in videos; DeepSORT improves upon SORT by incorporating a deep appearance metric into the association step.

Kalman Filter. The Kalman filter [9] is an algorithm that uses a series of measurements observed over time, containing statistical noise and other inaccuracies,
and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone. To understand how the Kalman filter [9] works, it is necessary to go through its implementation step by step. The first step is building a mathematical model; we have to make sure that the problem fits the Kalman filter [9] conditions:

X_t = A X_{t-1} + B u_t + W_t   (1)

Y_t = H X_t + V_t   (2)

where X_t is the actual state, Y_t is the measured value, u_t is the control signal (an external factor acting on the system, usually 0), W_t is the stochastic (process) error, and V_t is the measurement error. The matrices A, B, and H model the system dynamics and the observation model: A is the state transition matrix, B is the control matrix, and H is the observation matrix. When (1) and (2) are satisfied, we can implement the algorithm. The process has two distinct sets of equations, both applied at each timestamp t: the Time Update, also known as the prediction process, and the Measurement Update, also known as the correction process.

The Time Update (prediction) process predicts the value of the state ahead, \hat{X}_t^-, and the value of the error covariance ahead, P_t^-. The process is defined by Eq. (3) and Eq. (4):

\hat{X}_t^- = A \hat{X}_{t-1} + B u_t   (3)

P_t^- = A P_{t-1} A^T + Q   (4)

where \hat{X}_t^- is the predicted state at timestamp t, P_t^- is the predicted error covariance at timestamp t, \hat{X}_{t-1} and P_{t-1} are the corrected estimates from timestamp t - 1, and Q is the stochastic error variance, i.e., the variance of W_t.

The Update state (or correction process) refers to updating the state estimate based on the measurements obtained from the system; it compares the estimated state with the actual measurement. The process is defined by Eqs. (5), (6), (7), and (8):

\tilde{z}_t = z_t - H \hat{x}_t^-   (5)

K_t = P_t^- H^T (R + H P_t^- H^T)^{-1}   (6)

\hat{x}_t = \hat{x}_t^- + K_t (z_t - H \hat{x}_t^-)   (7)

P_t = (I - K_t H) P_t^-   (8)
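As a concrete illustration (not part of the original paper), one predict-correct cycle of the filter can be sketched in NumPy; the function name and the numeric values below are hypothetical:

```python
import numpy as np

def kalman_step(x_hat, P, z, u, A, B, H, Q, R):
    """One Kalman filter cycle: time update (3)-(4), then measurement update (5)-(8)."""
    # Time update (prediction)
    x_pred = A @ x_hat + B @ u                              # Eq. (3)
    P_pred = A @ P @ A.T + Q                                # Eq. (4)
    # Measurement update (correction)
    residual = z - H @ x_pred                               # Eq. (5)
    K = P_pred @ H.T @ np.linalg.inv(R + H @ P_pred @ H.T)  # Eq. (6)
    x_new = x_pred + K @ residual                           # Eq. (7)
    P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred           # Eq. (8)
    return x_new, P_new

# Illustrative 1-D constant-state example using "initial ignorance" (large P0)
A = B = H = np.eye(1)
Q = np.array([[1e-4]])
R = np.array([[0.1]])
x, P = np.zeros(1), np.eye(1) * 1000.0
for z in [5.1, 4.9, 5.0, 5.2]:
    x, P = kalman_step(x, P, np.array([z]), np.zeros(1), A, B, H, Q, R)
```

With a very large P0, the first measurement dominates the estimate, and P shrinks quickly as more measurements arrive, as the "initial ignorance" strategy suggests.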
Equation (5) calculates the measurement residual \tilde{z}_t, Eq. (6) calculates the Kalman gain K_t, Eq. (7) calculates the updated state estimate \hat{x}_t, and Eq. (8) calculates the updated error covariance P_t. The superscript minus denotes predicted (a priori) estimates. The output of timestamp t becomes the input for timestamp t + 1. The implementation of the Kalman filter [9] requires an initial setup phase: providing initial estimates for the state, denoted \hat{x}_0, and for the error covariance matrix, P_0. We can use the "initial ignorance" strategy, which suggests that a larger value of P_0 should be selected for faster convergence of the Kalman filter [9].

Cosine Metric Learning. In order to develop an efficient feature extraction system for each detected bounding box, the authors devised an architecture known as the Wide Residual Network, trained on extensive person re-identification datasets to achieve superior performance. Recognizing the time-intensive nature of traditional deep neural networks during both training and inference, the authors opted for a shallow network: despite having a modest number of layers (16), this architecture surpasses much deeper counterparts in performance, while its training and inference times are significantly reduced.

In the DeepSORT framework, a cosine softmax classifier [10] (10) is employed instead of the conventional softmax classifier (9). This specialized classifier determines the similarity between the features extracted from a detection and those of a track:

p(y = k | r) = exp(\omega_k^T r + b_k) / \sum_{n=1}^{C} exp(\omega_n^T r + b_n)   (9)

p(y = k | r) = exp(\kappa \tilde{\omega}_k^T r) / \sum_{n=1}^{C} exp(\kappa \tilde{\omega}_n^T r)   (10)

By utilizing the cosine softmax classifier [10], DeepSORT exhibits enhanced capabilities in dealing with occlusions and other challenges encountered in multiple object tracking scenarios, enabling it to effectively address the complexities inherent in such scenarios and enhance tracking accuracy.

3.3 Vehicle Trajectory Prediction
A Convolutional Neural Network (CNN) [6] has emerged as a prominent deep learning architecture renowned for its efficacy in extracting salient features from input data. Comprising convolutional layers, CNNs identify localized patterns and features through the application of filters. Accompanied by pooling layers, which reduce spatial dimensions while preserving crucial features, and fully connected layers, which facilitate the final classification based on learned features, CNNs offer a comprehensive framework for data analysis and pattern recognition.
In the domain of time series prediction, a Long Short-Term Memory (LSTM) [11] network, a variant of Recurrent Neural Networks (RNNs) [12], demonstrates remarkable suitability. In contrast to conventional RNNs, LSTM networks possess the ability to capture and retain information over extended durations, rendering them particularly adept at handling time series prediction tasks. The sequential nature of the LSTM architecture allows for step-by-step processing of input data, wherein the hidden state from the preceding step is integrated. Similarly, a Gated Recurrent Unit (GRU) [13], another variant of Recurrent Neural Networks (RNNs), exhibits efficacy in addressing time series prediction challenges. GRUs, akin to LSTMs, are specifically designed to handle long-term dependencies present in time series data. Notably, GRUs employ a distinct architecture featuring two gates: a reset gate, which determines the extent to which previous information should be disregarded, and an update gate, which determines the extent to which new information should be incorporated into the hidden state. In the context of the current project, our objective revolves around leveraging the collective capabilities of Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) paradigms to achieve precise vehicle trajectory predictions. By harnessing CNN’s proficiency in extracting relevant features from input data and RNN’s capacity for handling time series forecasting, we envision the development of robust hybrid models in the form of CNN-LSTM and CNN-GRU architectures. Our aim is to seamlessly integrate the strengths of these models, yielding a solution that is both effective and efficient in its predictive abilities (Fig. 1).
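For reference, the gating mechanism described above can be written out explicitly (standard GRU formulation from [13], not given in the text; W and U denote weight matrices, \sigma the sigmoid, and \odot element-wise multiplication):

```latex
z_t = \sigma(W_z x_t + U_z h_{t-1})                       % update gate
r_t = \sigma(W_r x_t + U_r h_{t-1})                       % reset gate
\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))        % candidate state
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t     % new hidden state
```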
4 Experiment

4.1 Experimental Setup and Implementation

Our research team conducted an experimental study focused on capturing video footage of traffic in the Thu Duc district of Ho Chi Minh City. The recorded videos encompassed the surrounding environment and the movement of vehicles on the road. To detect and monitor vehicles within the footage, we employed the YOLOv7 [3] and DeepSORT algorithms. The resulting output provided us with bounding box coordinates and vehicle types, which were then stored in a structured dataset for subsequent use in the trajectory prediction phase. The vehicle trajectory prediction problem consists of three distinct tasks: Vehicle Detection, Vehicle Tracking, and Vehicle Trajectory Prediction. The initial task involves the detection of vehicle bounding boxes on the street. We fine-tune the pre-trained YOLOv7 model using a dataset collected through the OIDv4 Toolkit [14], together with some additional pictures we captured and labeled. The dataset comprised approximately 10,000 images and labels, focusing on the five most prevalent vehicle types observed on Vietnamese streets, namely motorcycles, buses, trucks, bicycles, and cars. After fine-tuning the model with the new
Fig. 1. The process of this project
dataset for 100 epochs, we observed significant improvements in both speed and accuracy compared to using only the pre-trained weights. The subsequent task pertains to establishing a vehicle tracking framework. We obtained pre-trained weights designed specifically for DeepSORT and configured the necessary environment to ensure the smooth execution of the DeepSORT algorithm [4]. Following a meticulous evaluation of its performance, we decided to integrate the outputs of DeepSORT into our vehicle tracking approach. By combining vehicle detection and tracking mechanisms, we were able to extract relevant information from the recorded video footage. We extracted essential details such as bounding box center coordinates, Track ID, and vehicle type, performed preprocessing on the extracted data, and stored it in a CSV file. To predict the x and y coordinates of the vehicle in the next frame, we constructed a hybrid model that includes an input layer, a convolutional layer, a pooling layer, an LSTM/GRU layer, and a dense layer with two units. The dataset underwent preprocessing, resulting in the inclusion of 21 distinct features. Among these, five features were dedicated to encoding the vehicle type using a one-hot representation, while the remaining 14 features captured the x and y coordinates of the preceding seven frames. The final two features represented the target variables to be predicted, namely the x and y coordinates. By training the constructed model with this feature-rich dataset, we achieved accurate predictions of the vehicle’s coordinates. The resulting visualization,
depicted in Fig. 2, provides a comprehensive representation of various aspects, including the total count of vehicles traversing the street, corresponding bounding boxes, the center point for each vehicle, and the anticipated center point of the bounding box. These elements are visually connected by line segments, illustrating the predicted trajectory of the vehicle.
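The 21-feature windowing described above (5 one-hot type features, 14 past-coordinate features, and 2 targets) can be sketched as follows; the helper function is hypothetical and assumes per-track arrays of bounding-box center coordinates:

```python
import numpy as np

def make_windows(centers, vehicle_onehot, n_past=7):
    """Build training samples from one track: 5 one-hot type features plus
    14 past-coordinate features as inputs, next frame's (x, y) as targets."""
    X, y = [], []
    for t in range(n_past, len(centers)):
        past_xy = centers[t - n_past:t].reshape(-1)          # 7 frames * (x, y) = 14
        X.append(np.concatenate([vehicle_onehot, past_xy]))  # 5 + 14 = 19 inputs
        y.append(centers[t])                                 # 2 target values
    return np.array(X), np.array(y)

centers = np.arange(20, dtype=float).reshape(10, 2)  # a toy track of 10 frames
onehot = np.array([1, 0, 0, 0, 0], dtype=float)      # e.g., "motorcycle"
X, y = make_windows(centers, onehot)
```

Each row of X together with its target row in y gives the 21 features per sample described in the text.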
Fig. 2. The visualization of the results in the video frame
4.2 Metrics
mAP - mean Average Precision. The mAP (mean Average Precision) is a performance indicator that is widely used to evaluate object detection models. This metric calculates the average precision of a model's detection results for a given set of objects. Precision is defined as the proportion of True Positives out of all detections made, including both True Positive and False Positive detections. A detection is a True Positive when the IoU (11) between the prediction and the ground truth is higher than a threshold; when the IoU is below the threshold, the detection is considered a False Positive.

IoU(A, B) = |A \cap B| / |A \cup B|   (11)

Average Precision (AP) is a commonly used performance metric for evaluating object detection models. It is calculated by generating a Precision-Recall curve and computing the area under the curve (AUC), here approximated by 11-point interpolation:

AP = \frac{1}{11} \sum_{r \in \{0.0, 0.1, \dots, 1.0\}} p_{interp}(r)   (12)
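The IoU of Eq. (11) for axis-aligned bounding boxes can be computed directly; this small helper is illustrative and not from the paper:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2), per Eq. (11)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])      # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])      # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)         # |A ∩ B| / |A ∪ B|
```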
where p_{interp}(r) = \max_{\tilde{r} \ge r} p(\tilde{r}).

The mean Average Precision (mAP) (13) is simply the average of the Average Precision over all classes:

mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i   (13)
where n is the number of classes.

Mean Square Error. Mean Square Error (MSE) (14) measures the average of the squared differences between the predicted and actual target values. The squared difference punishes larger errors more heavily, which can help prevent over-fitting. The smaller the MSE, the better the model's performance, indicating that it provides more accurate predictions.

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2   (14)
Mean Absolute Error. Mean Absolute Error (MAE) (15) measures the average absolute difference between the predicted values and the actual values. The MAE is expressed as the average absolute deviation of the predicted values from the actual values and provides a scalar representation of the overall prediction error of the model.

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|   (15)
Mean Absolute Percentage Error. MAPE (16) stands for Mean Absolute Percentage Error and is a commonly used evaluation metric for measuring the prediction accuracy of time series models. MAPE is calculated as the average absolute percentage deviation of the predicted value from the actual value, where the percentage deviation is the absolute difference between the actual and predicted values divided by the actual value:

MAPE = \frac{100}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{|y_i|}   (16)

where n is the number of observations in the time series, and y_i and \hat{y}_i are the actual and predicted values, respectively.
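The three metrics of Eqs. (14)-(16) are one-liners in NumPy; the function names and sample values below are illustrative only:

```python
import numpy as np

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))               # Eq. (14)

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))              # Eq. (15)

def mape(y, y_hat):
    return float(np.mean(np.abs((y - y_hat) / y)) * 100)  # Eq. (16), in percent

y = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])
```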
Table 1. The results of the vehicle detection model

P      R     mAP@[.5]  mAP@[.5:.95]
0.861  0.83  0.892     0.668

4.3 Experimental Result
We obtained the vehicle detection model by fine-tuning the pre-trained weights of the YOLOv7 [3] model, training it on 9738 images and 20520 labels for 100 epochs. The results of the model are shown in Table 1, where [email protected] is the mAP with an IoU threshold of 0.5 and [email protected]:.95 is the mAP averaged over IoU thresholds from 0.5 to 0.95. Table 2 shows the evaluation of the two hybrid models, CNN-LSTM and CNN-GRU, applied to vehicle trajectory prediction in Vietnam, using three metrics: MSE, MAE, and MAPE. In Vietnam, traffic is quite dense and chaotic, which makes this a tough challenge for a deep learning model. The results are not outstanding, but they are acceptable, and the difference in performance between the two models is small. CNN-LSTM performs better than CNN-GRU on all metrics; however, the training time of CNN-GRU is three times faster than that of CNN-LSTM.

Table 2. The results of implementing CNN-LSTM and CNN-GRU

Model     MSE      MAE    MAPE
CNN-LSTM  2239.61  26.34  0.052
CNN-GRU   2630.7   28.98  0.053
5 Conclusion
This research project aimed to develop an experiment and devise a strategy for predicting vehicle trajectories in Ho Chi Minh City, Vietnam. To gather relevant data, we conducted field recordings of natural scenery and vehicular movement in specific locations within the city. The recorded video footage was subsequently processed using DeepSORT [4] with fine-tuned YOLOv7 [3], enabling the extraction of pertinent information. In our analysis, we observed that CNN-LSTM exhibited superior performance in the vehicle trajectory prediction stage, although CNN-GRU demonstrated faster training capabilities. Reflecting on our experimental findings, we identified limitations and areas for improvement in predicting vehicle trajectories. Relying solely on camera-based information may not capture all relevant factors, necessitating additional models for obstacle detection, traffic light recognition, and traffic sign detection. Alternative data collection approaches, such as using black box cameras and leveraging car sensors, offer improved quality and valuable information. Expanding the model repertoire to include critical event detection can further enhance performance.
In conclusion, this project encompassed the establishment of an experimental framework and the formulation of a trajectory prediction strategy for vehicles in Ho Chi Minh City. By capturing video footage and employing DeepSORT [4] with fine-tuned YOLOv7 [3], we were able to extract relevant information for analysis. While CNN-LSTM exhibited superior performance in trajectory prediction, CNN-GRU demonstrated faster training. However, limitations were identified, prompting the need for additional models and data collection approaches to address factors that impact vehicle trajectories. These insights serve as a foundation for future research endeavors aimed at refining and advancing trajectory prediction methodologies in similar contexts.
References
1. Izquierdo, R., Quintanar, A., Parra, I., Fernandez-Llorca, D., Sotelo, M.A.: Vehicle trajectory prediction in crowded highway scenarios using bird eye view representations and CNNs. In: 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1–6 (2020)
2. Trng, D., Kajita, Y.: Traffic congestion and impact on the environment in Vietnam: development of public transport system – experience from actual operation of bus in Hanoi. J. Civil Environ. Eng. 8, 1 (2018)
3. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors (2022)
4. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. IEEE (2017)
5. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. CoRR, vol. abs/1506.02640 (2015)
6. O’Shea, K., Nash, R.: An introduction to convolutional neural networks (2015)
7. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. CoRR, vol. abs/1612.08242 (2016)
8. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. CoRR, vol. abs/1804.02767 (2018)
9. Kim, Y., Bang, H.: Introduction to Kalman filter and its applications. In: Govaers, F. (ed.) Introduction and Implementations of the Kalman Filter, Chapter 2. IntechOpen, Rijeka (2018)
10. Wojke, N., Bewley, A.: Deep cosine metric learning for person re-identification. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 748–756. IEEE (2018)
11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
12. Jordan, M.I.: Serial order: a parallel distributed processing approach. Technical report, June 1985–March 1986
13. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling (2014)
14. Vittorio, A.: Toolkit to download and visualize single or multiple classes from the huge open images v4 dataset (2018). https://github.com/EscVM/OIDv4_ToolKit
AIoT System Architectures
Wireless Sensor Network to Collect and Forecast Environment Parameters Using LSTM

Phat Nguyen Huu1(B), Loc Bui Dang2, Trong Nguyen Van1, Thao Dao Thu Le1, Chau Nguyen Le Bao3, Anh Tran Ha Dieu3, and Quang Tran Minh4,5

1 Hanoi University of Science and Technology (HUST), Hanoi, Vietnam
{phat.nguyenhuu,thao.daothule}@hust.edu.vn, [email protected]
2 High School for Gifted Students (HNUE), Hanoi, Vietnam
3 Hanoi-Amsterdam High School for the Gifted, Hanoi, Vietnam
4 Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam
[email protected]
5 Vietnam National University Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam
Abstract. The paper proposes the design of a mobile sensor network that collects and predicts air environment parameters using a node consisting of multiple sensors. The parameters include temperature, humidity, PM2.5 dust concentration, and GPS coordinates; they are sent to an IoT platform via a SIM module. We use a Python server to fetch the data and predict three parameters (temperature, humidity, and PM2.5 dust concentration) using a long short-term memory (LSTM) network. The parameters are displayed and updated via a web interface, including a map that shows the locations of the node and charts of the atmospheric parameters. The sensor node collects data and predicts with a mean squared error (MSE), root mean square error (RMSE), and mean absolute error (MAE) of 0.0051, 0.0711, and 0.0460, respectively. The results show that the system can be applied in a real environment.

Keywords: Mobile sensor network · IoT platform · LSTM · AI · parameters prediction
1 Introduction

Human health is a significant concern of environmental applications. Air pollution is a persistent problem given the rapid economic development and urbanization in Vietnam. It directly affects humans, and dust sources cause extremely serious air pollution in large cities, harming residents' health over the long term. Notification and warning of air quality are essential to help people monitor the level of air pollution and take measures to prevent it, thereby minimizing the adverse effects on human health. Air pollution tends to grow in proportion to the economic growth rate. The number of people gathered in large urban areas such as Hanoi has caused large amounts of toxic emissions and fine dust, causing serious air pollution. In Vietnam, a system of automatic air monitoring stations has been installed to assess the state of air pollution. However, it still has many weaknesses, such as high investment and construction costs.
270
P. N. Huu et al.
The density of air monitoring stations is still sparse and uneven, unable to cover and reflect the details of all areas, and there is no automatic air quality forecast system. Improving the efficiency of collecting detailed air environment data therefore remains a pressing problem, and an effective solution is needed to collect detailed data and obtain an overview of air quality. The purpose of the paper is to build a mobile sensor network that collects wide-range air environment parameters, improves on the weaknesses of fixed monitoring stations, and also predicts these parameters. The rest of the paper is organized as follows. In Sect. 2, we present related work. In Sects. 3 and 4, we present and evaluate the effectiveness of the proposed model, respectively. Finally, we conclude in Sect. 5.
2 Related Work

Currently, air environment monitoring systems are becoming more popular [1–8]. Studies show that the main impact on air quality in Vietnam is fine dust (PM2.5) [5]. In [1], the authors focus on developing a weather monitoring system. The main goal of the study is to use MQTT for the communication layer instead of a direct database connection, which decouples the system from the complexity of the RDBMS. They build an information system that reduces data collection overhead by using an MQTT server and distributing data to users who want to collect it separately. However, performance still needs to be improved for real-time data access when applied in a larger system. In [3], the authors perform real-time air pollution monitoring with wireless sensors on public transport. This study was part of the GreenIoT project in Sweden, using the Internet of Things (IoT) to measure the level of air pollution in the center of Uppsala. Through the deployment of low-cost wireless sensors, the level of air pollution can be obtained over time in different locations, which reduces installation and operation costs compared to fixed sensor networks. In the paper, they use a 4G network to send data via the HTTP protocol and can display the last 10 measurements on vehicles. Finally, they compare the performance of the mobile sensors against reference sensor data. The system still has several weaknesses, such as a success rate of only 70% when transmitting data; it needs a storage-and-retransmission mechanism to achieve a receiving rate of 100%. In addition, deep artificial neural network models have made certain progress in predicting air environment parameters. A recurrent neural network (RNN) is a typical model for predicting time-related data such as weather. Long short-term memory (LSTM) is an improved RNN designed to overcome the problems the RNN encounters, offering the ability to process and store data over longer time spans and to learn non-linear relationships among the elements. In [6], the authors compare three types of models, namely ANFIS, LSTM, and RNN, to predict greenhouse gas (GHG) emissions. The compared models have the same parameters during training and testing. As a result, the LSTM and ANFIS networks produce better mean squared error (MSE) and root mean square error (RMSE) than the RNN. The authors of [2] proposed a lightweight stacked LSTM network architecture and compared it to one-way CNN and LSTM models.
Fig. 1. The proposed system model.
The results obtained by the stacked LSTM were significantly better than those of the one-way networks for short-term prediction. In the study, they compare the proposed model with the one-way models. The evaluation criterion is MSE, with inputs including four parameters, namely temperature, relative humidity, wind speed, and dew point. The results showed that the MSE of the three predictions (one, two, and three hours ahead) was the lowest when compared to the two remaining models. In [4], the authors address the importance of air quality forecasting using data from 66 monitoring stations. Although the RMSE is high, the LSTM network model is better than the ANN model when comparing RMSE.
3 Proposed System

3.1 System Overview

Figure 1 depicts the general block diagram of the hardware design for the sensor node.
– Power block: uses a 5V/2A adapter or a backup battery with an LM2596 step-down module (5V/3A) to supply power.
– Processing block: uses the STM32F103C8T6 micro-controller to process the signals from the sensors and send the data through the communication and display blocks.
– Sensor block: uses an AHT-21 temperature and humidity sensor and a GP2Y1014AU PM2.5 dust sensor to collect environmental parameters. A GY-NEO 6M v2 GPS module tracks the position of the sensor node.
– Display block: uses an LCD2004 screen to show the measured parameters, the SD card connection status, and the time of signal transmission from the processing block.
– Communication block: uses a micro-SD card to store the measured data and a SIM module to transfer them to the IoT cloud server.
– Real-time block: uses the DS3231 real-time IC to track the measurement time of the sensors and the time to send data to the server.
272
P. N. Huu et al.
Fig. 2. The diagram of processing data.
Figure 2 depicts the flow of the stored data. The function details of each block are as follows.

– Storage block: The IoT platform is ThingSpeak, which can store, visualize, and analyze data using the Matlab language. In this paper, we use it to store data from the sensor nodes so that their responses can be monitored. To get data from the IoT platform, an API request is required.
– Data processing and prediction block: In this block, we use a web server to receive data from the IoT platform through APIs and process them; the Uvicorn web server uses the Python language to build the APIs and process the data. A deep learning model can be used to predict parameters. In this server, we have built two main APIs, namely data processing and data prediction.
– Data processing: JSON data will be saved into variables and arrays to display and predict parameters. In addition, PM2.5 data will be converted to the AQI index.
– Prediction of parameters: The received data will be fed into the model for prediction. A private API allows the front end to get the data after the prediction.
– Display block uses the API to provide atmospheric parameters, forecasts, and location information of the most recent measurements.
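The PM2.5-to-AQI conversion mentioned in the data-processing step can be sketched as piecewise-linear interpolation. The paper does not state which AQI scale it uses, so the US EPA PM2.5 breakpoints below are an assumption:

```python
def pm25_to_aqi(c: float) -> int:
    """Convert a PM2.5 concentration (ug/m3) to an AQI value with the
    piecewise-linear formula; breakpoints follow the US EPA table (assumed)."""
    c = round(c, 1)  # EPA breakpoints are defined to 0.1 ug/m3 precision
    # (C_lo, C_hi, I_lo, I_hi) per category, from "Good" up to "Hazardous"
    breakpoints = [
        (0.0, 12.0, 0, 50),
        (12.1, 35.4, 51, 100),
        (35.5, 55.4, 101, 150),
        (55.5, 150.4, 151, 200),
        (150.5, 250.4, 201, 300),
        (250.5, 350.4, 301, 400),
        (350.5, 500.4, 401, 500),
    ]
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= c <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    return 500  # concentrations beyond the table are capped at the AQI maximum

print(pm25_to_aqi(35.4))  # -> 100 (upper edge of the "Moderate" band)
```

The same helper could run inside the server's data-processing API before the values are returned to the front end.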
3.2 System Details
Figure 3 describes the hardware operation. First, the command blocks initialize the reading and writing functions. The data is then collected every 15 min; it is stored on the micro-SD card, sent to the ThingSpeak IoT platform, and displayed on the LCD before the node continues to read data from the sensors.

Designing Deep Learning Model

– Preparing the dataset: The dataset was collected from the sensors over two months, with data points spaced 15 min apart. The data includes three parameters of the air environment, namely temperature, humidity, and PM2.5. It will be normalized.
– Training model: The model predicts atmospheric parameters using time series and the sliding window method. Figure 4 illustrates how the sliding window works for a time data series. The input (X) receives the data of the previous 12 h, corresponding to 48 data points of each parameter. The output (Y) is the data of the following hour, corresponding to four data points of each parameter.
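The sliding-window preparation described above (48 input points per parameter mapped to the following 4 points) can be sketched as follows. The window sizes follow the paper; the random array is only a stand-in for the collected two-month dataset:

```python
import numpy as np

def make_windows(series: np.ndarray, n_in: int = 48, n_out: int = 4):
    """Slice a (T, features) time series into (X, Y) pairs:
    X holds the previous 12 h (48 samples at 15-min spacing),
    Y holds the following 1 h (4 samples)."""
    X, Y = [], []
    for start in range(len(series) - n_in - n_out + 1):
        X.append(series[start:start + n_in])
        Y.append(series[start + n_in:start + n_in + n_out])
    return np.array(X), np.array(Y)

# Stand-in for two months of normalized (temperature, humidity, PM2.5) data
data = np.random.rand(5760, 3)          # 60 days x 96 samples/day
X, Y = make_windows(data)
print(X.shape, Y.shape)                 # (5709, 48, 3) (5709, 4, 3)

# 70/20/10 train/test/validation split as in the paper
n = len(X)
X_train, X_test, X_val = np.split(X, [int(0.7 * n), int(0.9 * n)])
```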
Fig. 3. Algorithm flowchart.
Fig. 4. Process of forecasting data.
The dataset will be divided into three parts, namely 70% training, 20% test, and 10% validation data. They will be fed into the LSTM model.
LSTM. Every recurrent network takes the form of a sequence of repeating modules. In standard RNNs, these modules have a very simple structure. The LSTM network [9] is a special type of RNN that is capable of learning long-range dependencies. However, it has a more complex architecture: each module has up to four interacting layers instead of a single neural network layer. The first step of the LSTM is to decide what information to remove from the cell state. This decision is made by a sigmoid layer called the forget gate. It takes h_{t-1} and x_t as input and outputs a number in [0, 1] for each element of the cell state c_{t-1}. An output of 1 indicates that all the information is kept; an output of 0 means that all the information is removed. The next step is to decide what new information to store in the cell state. This includes two parts. The first is a sigmoid layer that decides which values to update. Next, a tanh layer creates a vector of new candidate values c_t to add to the state. These two values are then combined to create an update for the state. Finally, we decide on the output, which is based on the cell state. First, a sigmoid layer decides what part of the cell state to output. We then pass the cell state through a tanh function to scale it to [-1, 1] and multiply it by the output of the sigmoid gate to obtain the output.
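The gate computations described above can be condensed into a single-step NumPy sketch. The weights are random and the dimensions are illustrative, not the paper's trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b pack the four interacting layers
    (forget f, input i, candidate g, output o) row-wise."""
    z = W @ x_t + U @ h_prev + b
    n = len(h_prev)
    f = sigmoid(z[0:n])          # forget gate: what to drop from c_{t-1}
    i = sigmoid(z[n:2*n])        # input gate: which values to update
    g = np.tanh(z[2*n:3*n])      # candidate values to add to the state
    o = sigmoid(z[3*n:4*n])      # output gate: what part of the state to emit
    c_t = f * c_prev + i * g     # updated cell state
    h_t = o * np.tanh(c_t)       # hidden state, scaled to [-1, 1] then gated
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 8               # 3 environment parameters, assumed hidden size
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```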
Fig. 5. The proposed LSTM model.
Fig. 6. Hardware circuit.
Figure 5 shows the layer structure, with LSTM 0 to LSTM 47 receiving the values at times x_0 to x_47, respectively, according to the data processed by the sliding window based on [10]. The LSTM output then feeds a dense layer with a linear activation function, producing a matrix of the values of the three prediction parameters. Finally, it goes through a reshape layer that transforms the matrix into a three-dimensional tensor covering one hour.
4 Simulation and Results

4.1 Product
We proceeded to build the complete model after designing all systems. Figure 6 shows the hardware. The circuit is relatively stable. There are several display errors, such as loss of text on the LCD when starting; however, these do not affect the data collection from the atmosphere. During the data collection and sending process, storage still takes place normally. The data is also stored on the IoT platform, where it can be tracked. The web server also receives data from the IoT platform and displays it on the web page. Figure 7 shows the web interface.
Fig. 7. Website interface.
When data is displayed, the user can also view the details of the data in the last 2 h sent by the sensor node. We can also view the last 12 h of historical data, as shown in Fig. 7.

4.2 Training Result

In this paper, we use the MSE, RMSE, and mean absolute error (MAE) to evaluate model quality. The accuracy of the method is estimated by [11]

MAE = \frac{1}{N} \sum_{i=1}^{N} |X_i - Y_i|,   (1)

MSE = \frac{1}{N} \sum_{i=1}^{N} (X_i - Y_i)^2.   (2)
where X_i is the observation parameter with mean \bar{X}; Y_i is the prediction parameter with mean \bar{Y}; and N is the number of instances. More details can be found in [11]. Table 1 shows the model's prediction quality results. To better visualize the prediction accuracy of the model, Fig. 8 shows the predicted and actual temperature graphs with blue and red lines. The corresponding humidity and PM2.5 charts are shown in Figs. 9 and 10. From the three charts in Figs. 8, 9, and 10, it can be seen that the model predicts the parameters close to reality. The model fits the predictions of these parameters.
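Equations (1) and (2), together with RMSE = sqrt(MSE), can be computed directly. The short arrays below are illustrative stand-ins, not the paper's data:

```python
import numpy as np

def mae(x, y):
    """Mean absolute error, Eq. (1)."""
    return np.mean(np.abs(x - y))

def mse(x, y):
    """Mean squared error, Eq. (2); RMSE is its square root."""
    return np.mean((x - y) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])    # observations X_i
y = np.array([1.1, 1.9, 3.2, 3.8])    # predictions Y_i
print(mae(x, y))                      # ~0.15
print(mse(x, y), np.sqrt(mse(x, y)))  # MSE and RMSE
```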
Table 1. Predictive quality of the model.

Criteria | Value
MSE      | 0.0051
RMSE     | 0.0711
MAE      | 0.0460
Fig. 8. Prediction of the temperature.
4.3 Discussion
The article focuses on the research and design of a sensor network that collects and predicts three air environment parameters. There are several display errors on the hardware. The network does not have many nodes to collect data due to economic limitations, and the web server does not offer many functions and capabilities for users and administrators. It is not possible to predict far into the future. To overcome these limitations, we plan measures such as data enhancement and reasonable interface optimization for both hardware and software. Besides, we will build more functions to manage sensor nodes for administrators and users.
Fig. 9. Prediction of the humidity.
Fig. 10. Prediction of the PM2.5.
5 Conclusion

The article designs a network that collects and predicts air environment parameters and uses an IoT platform to store data. It predicts the air environment parameters one hour ahead and displays the obtained results and forecast results on the website. Based on evaluation criteria such as an MSE of about 0.0051, an RMSE of about 0.0711, and an MAE of about 0.0460, the prediction model is suitable and quite close to reality. However, there are still several weaknesses in the system. The minimum PM2.5 dust concentration has not been measured accurately, and the user interface is not intuitive. There are not many functions, and the server cannot train automatically. It cannot deploy many nodes at the same time due to limitations in terms of time and cost of the hardware circuit. These problems will be addressed in the future.

Acknowledgment. This research is funded by the Hanoi University of Science and Technology (HUST) under project number T2022-PC-013. The authors would like to thank HUST for the financial support. This research is also funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number NCM2021-20-02.
References

1. Tsao, Y.-C., Tsai, Y.T., Kuo, Y.-W., Hwang, C.: An implementation of IoT-based weather monitoring system. In: 2019 IEEE International Conferences on Ubiquitous Computing and Communications (IUCC) and Data Science and Computational Intelligence (DSCI) and Smart Computing, Networking and Services (SmartCNS), pp. 648–652 (2019)
2. Al Sadeque, Z., Bui, F.M.: A deep learning approach to predict weather data using cascaded LSTM network. In: 2020 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 1–5 (2020)
3. Kaivonen, S., Ngai, E.: Real-time air pollution monitoring with sensors on city bus. Digit. Commun. Netw. 6, 23–30 (2019)
4. Tsai, Y.-T., Zeng, Y.-R., Chang, Y.-S.: Air pollution forecasting using RNN with LSTM. In: 2018 IEEE 16th International Conference on Dependable, Autonomic and Secure Computing (DASC/PiCom/DataCom/CyberSciTech), pp. 1074–1079 (2018)
5. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention (2016)
6. Ludwig, S.: Comparison of time series approaches applied to greenhouse gas analysis: ANFIS, RNN, and LSTM, pp. 1–6, June 2019
7. Huu, P.N., Dang, D.D., Khai, M.N., Trong, H.N., Ngoc, P.P., Minh, Q.T.: Monitoring and forecasting water environment parameters for smart aquaculture using LSTM. In: 2022 RIVF International Conference on Computing and Communication Technologies (RIVF), pp. 53–58 (2022)
8. Huu, P.N., Tuan, H.D., Ngoc, H.L.: Development of warning and predicting system for quality of air in smart cities using RNN. In: 2020 7th NAFOSTED Conference on Information and Computer Science (NICS), pp. 108–113 (2020)
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
10. Li, T., et al.: PM2.5: the culprit for chronic lung diseases in China. Chronic Dis. Transl. Med. 4(3), 176–186 (2018)
11. Apaydin, H., et al.: Comparative analysis of recurrent neural network architectures for reservoir inflow forecasting. Water 12, 1–18 (2020)
SCBM: A Hybrid Model for Vietnamese Visual Question Answering

Hieu Le Trung1, Tuyen Dao Cong1, Trung Nguyen Quoc1, and Vinh Truong Hoang2(B)

1 FPT University, Ho Chi Minh City, Vietnam
{hieuLTSE150560,tuyenDCSE150561,trungnq46}@fpt.edu.vn
2 Ho Chi Minh City Open University, Ho Chi Minh City, Vietnam
[email protected]
Abstract. Visual Question Answering (VQA) is a popular research topic that has gained attention from diverse fields like computer vision and natural language processing. While there are existing English foundation models that transfer well to downstream tasks like visual question answering, there is a lack of research in Vietnamese VQA (ViVQA). Attention-based models have shown high efficiency in generating spatial maps for relevant image regions or sentence parts, which contributes to their success. In this article, we propose a joint modeling approach for attention-based language and vision in ViVQA, called the SCBM system, which achieves an Accuracy, WUPS 0.9 score, and WUPS 0.0 score of 0.6201, 0.6814, and 0.8719 on the ViVQA benchmark dataset, respectively. These results are twice as good as previous baselines (Co-attention combined with PhoW2V), opening up possibilities for further advancements in ViVQA. We also discuss the development path of ViVQA systems to achieve breakthroughs in this field.
Keywords: Vietnamese visual question answering · attention-based model · hybrid model

1 Introduction
Visual question answering (VQA) [4] is a research field that has gained popularity and made significant progress in recent years. VQA involves a system inferring an answer to a text-based question about an image, and requires reasoning about images and performing computer vision tasks. VQA has potential uses for blind and visually impaired individuals, as well as for enhancing human-computer interaction and image retrieval systems. It is also a significant area for basic research and can be seen as a type of visual Turing test. While there has been considerable research on VQA in English, Japanese, and a few other languages, there are limited studies in Vietnamese due to data and resource constraints. This research aims to establish a baseline for Vietnamese VQA systems, focusing on adapting attention-based models to improve performance on

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 280–287, 2023.
https://doi.org/10.1007/978-3-031-46573-4_26
benchmark datasets for the ViVQA task. The study also highlights the potential and current limitations of Vietnamese multimodal tasks and encourages further research in Vietnamese-related topics. The rest of this paper is organized as follows. Section 2 presents the related works of this study. Section 3 describes the proposed approach. Finally, Sections 4 and 5 give the experimental results and conclusion.
2 Related Works

2.1 Visual Attention-Based Models
The attention mechanism is a process that adaptively selects relevant input features in computer vision tasks, such as image classification, object identification, and semantic segmentation. There are four core categories of attention in computer vision: channel attention, spatial attention, temporal attention, and branch attention, which can be combined in various ways. The Visual Attention Network (VAN) [3] is a state-of-the-art convolutional neural network architecture that leverages attention to selectively focus on important regions of an image, improving accuracy and efficiency. VAN can dynamically adjust its attention mechanism and handle occlusions, making it robust to variations in image content and lighting conditions.

Self-attention [12], originally used in natural language processing (NLP), is becoming important in computer vision for capturing long-term dependencies and adaptations. However, it has drawbacks in computer vision tasks, such as treating images as 1D sequences and ignoring their 2D structure, quadratic complexity for high-resolution images, and ignoring channel-dimension adaptation. In the Vision Transformer (ViT) [2], the input data is split into smaller 2D patches, flattened into vectors, and augmented with positional embeddings. These embedded visual tokens are then fed into the multi-head self-attention (MSA) network to generate attention maps, which help the network focus on important regions of the image.

The Convolutional Vision Transformer (CvT) [13] is a novel architecture that improves the performance and efficiency of ViT by incorporating convolutional neural networks (CNNs) into the ViT architecture. CvT introduces a new convolutional token embedding that incorporates local spatial information, enabling the model to better capture shift, scale, and distortion invariance. It also uses a convolutional Transformer block with a convolutional projection to capture local feature interactions and reduce computational cost.
By combining CNNs and Transformers, CvT achieves the benefits of both models in capturing local spatial information and feature interactions, as well as capturing global context and improving generalization. The Swin Transformer [7] architecture overcomes limitations of the original ViT by using a hierarchical approach to create feature maps that merge from one layer to another, resulting in lower spatial dimensions. Patch merging is used for convolution-free downsampling. Swin Transformer replaces the traditional multi-head self-attention (MSA) module with a Window MSA (W-MSA) and a Shifted Window MSA (SW-MSA) module. The window-based MSA technique reduces
complexity to linear with the number of patches, but has a limitation of reduced global feature learning. To address this, Swin Transformer introduces a Shifted Window MSA (SW-MSA) module after each W-MSA module, which introduces cross-connections across windows and improves network performance.
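The window partitioning that gives W-MSA its linear complexity can be illustrated as a pure reshape: attention is computed only inside each M×M window, so cost grows with the number of windows rather than quadratically with all patches. The sizes below are illustrative (Swin itself uses M = 7):

```python
import numpy as np

def window_partition(x: np.ndarray, m: int) -> np.ndarray:
    """Split an (H, W, C) feature map into non-overlapping (m*m, C) windows.
    Self-attention is then applied independently within each window."""
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c)   # break H and W into m-sized blocks
    x = x.transpose(0, 2, 1, 3, 4)           # group the two window-grid axes
    return x.reshape(-1, m * m, c)           # (num_windows, m*m, C)

x = np.arange(8 * 8 * 2).reshape(8, 8, 2).astype(float)
wins = window_partition(x, 4)
print(wins.shape)  # (4, 16, 2): four 4x4 windows of a tiny 8x8 map
```

The shifted-window variant simply rolls the feature map by m//2 before partitioning, which is what creates the cross-window connections described above.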
2.2 Language Attention-Based Models
BERT [1], a language representation model developed by Google Research in 2018, has transformed the field of natural language processing (NLP) with its bidirectional approach that considers both left and right context during training. It uses the Transformer architecture with multiple encoder blocks, each having self-attention mechanisms for better modeling of long-range dependencies. PhoBERT [8], a state-of-the-art pre-trained language model developed by the Vietnamese AI community in 2020, utilizes subword tokenization and masked language modeling to effectively capture the morphology of Vietnamese words and generate coherent text. CLIP (Contrastive Language-Image Pre-training) [9] is a foundation network model that addresses flaws in current computer vision approaches by training on a broad variety of images using natural language supervision. CLIP embedding can be used for multimodal downstream tasks, including ViVQA (Vietnamese Visual Question Answering), by understanding the context of the connection between different data distribution areas.
3 Methods
Our proposed SCBM model (shown in Fig. 1) is a novel approach to visual question answering (VQA) that combines four cutting-edge architectures: the window and shifted-window multi-head attention layers from the Swin Transformer (S), the convolution embedding layer from CvT (C), the PhoBERT architecture for question encoding (B), and the merge attention mechanism (M). This unique combination of architectures offers several advantages over traditional VQA models. We fine-tune the pre-trained weights of the visual models trained on ImageNet, as well as the entire PhoBERT architecture. The proposed model has several advantages. Firstly, it can handle complex and diverse inputs due to the flexibility of the convolution embedding layer in CvT, which can handle images of varying sizes and resolutions. This makes the model more versatile and robust compared to other VQA models that are limited to specific image sizes or resolutions. The window and shifted-window multi-head attention layers in the Swin Transformer also provide advantages by allowing the model to attend to different regions of the image at multiple scales, extracting detailed features while reducing computational overhead. The BERT architecture for the question encoder is another strength, as it captures complex semantic and syntactic relationships in natural language questions, enabling the model to better understand the question's meaning. The joint attention at the fusion module (the merge attention mechanism) further enhances the model's ability
Fig. 1. SCBM Model Architecture.
to combine information from both the image and question effectively, resulting in more accurate answers. However, there are also drawbacks to training the model. Fine-tuning pre-trained models, which were trained on English language and culture, on a dataset in a different language like Vietnamese can be challenging. This may affect the model’s ability to generalize to a different language and culture, potentially introducing biases in its predictions and other challenges to overcome.
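The fusion step can be illustrated with a generic cross-attention sketch in which question tokens attend over image patches. This is a minimal stand-in under assumed dimensions, not the exact merge attention used in SCBM:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_text, kv_image, d, rng):
    """Question tokens (queries) attend over image patches (keys/values)."""
    Wq = rng.normal(size=(q_text.shape[-1], d))
    Wk = rng.normal(size=(kv_image.shape[-1], d))
    Wv = rng.normal(size=(kv_image.shape[-1], d))
    Q, K, V = q_text @ Wq, kv_image @ Wk, kv_image @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))         # (n_text, n_patches) attention map
    return A @ V                              # text tokens enriched with visual info

rng = np.random.default_rng(0)
text = rng.normal(size=(6, 768))     # e.g. 6 PhoBERT token embeddings (assumed size)
image = rng.normal(size=(49, 1024))  # e.g. 49 Swin patch features (assumed size)
fused = cross_attention(text, image, d=64, rng=rng)
print(fused.shape)  # (6, 64)
```

The fused representation is what the answer classifier would consume; SCBM's actual fusion may differ in projection sizes and head count.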
4 Experiments and Results

4.1 Dataset
In this article, we utilize the ViVQA benchmark [11], a dataset designed for Vietnamese Visual Question Answering, which is derived from the COCO-QA dataset [10]. The images in the dataset are sourced from the widely recognized
MS COCO dataset [5], known for its extensive and diverse image collection. The ViVQA dataset comprises 10,328 photos and 15,000 question-answer pairs that pertain to the content of the images. The training data can be found in Fig. 2. The dataset is divided into training and test sets in an 8:2 ratio, and the questions fall into four categories: object, number, color, and location. To conform to the dataset’s annotation rules, we approach the task as an open-ended VQA problem with a fixed-size output list of words.
Fig. 2. Training samples in ViVQA benchmark dataset.
4.2 Experimental Settings
In this paper, we conducted experiments with four different models: pre-trained CvT, PhoBert, and Merge attention hybrid model; pre-trained VAN, PhoBert, and Merge attention hybrid model; CLIP-ViT, PhoBert, and Merge attention hybrid model; and our proposed model SCBM. Our implementation is based on PyTorch.

– The combination of Convolutional Vision Transformer (CvT) and pre-trained PhoBert allows the model to effectively reason about the relationship between the image and the text by extracting visual features from the input image and capturing contextual and semantic information from the input text in Vietnamese. This is achieved by merging the attention mechanism.
– Another advantage of using VAN and PhoBERT for Vietnamese VQA is their ability to handle complex language and visual information. PhoBERT is specifically trained on Vietnamese text, enabling it to understand and process the nuances of the language, while VAN is a pre-trained CNN model designed for image captioning tasks, allowing it to extract visual features from images
and understand their content. By combining VAN and PhoBERT and synthesizing the information using the merge attention mechanism, the model can process both visual and textual information for output prediction.

– The Clip-ViT model is a variant of the ViT architecture used as the visual encoder for training the CLIP model. It is pre-trained on a large dataset of images using self-supervised learning, which enables it to learn visual information without relying on explicit labels. The PhoBert language model, on the other hand, is trained on Vietnamese text. The two encodings are fused using a cross-modal attention mechanism, allowing the model to capture the correlation between visual and textual information.

However, using English pre-trained models for non-English languages can have a significant disadvantage of language bias. These models are trained on large-scale datasets, which can introduce biases and stereotypes from the training data. When applied to non-English languages, these biases can result in inaccuracies and errors in the model's outputs.
4.3 Results
Our proposed system SCBM outperformed previous baselines (Co-attention combined with PhoW2V) and achieved state-of-the-art results on this benchmark. The Accuracy, WUPS 0.9 score, and WUPS 0.0 score were 0.6201, 0.6814, and 0.8719, respectively. These results are twice as good as the previous baselines, indicating the effectiveness of our model in encoding both types of data and combining them to produce excellent results. Please refer to Table 1 for the detailed results of our experiments.

Table 1. Comparison of our model to the baseline methods on the ViVQA benchmark dataset.

Methods                           | Accuracy | WUPS 0.9 | WUPS 0.0
LSTM + W2V                        | 0.3228   | 0.4132   | 0.7389
LSTM + FastText                   | 0.3299   | 0.4182   | 0.7464
LSTM + ELMO                       | 0.3154   | 0.4114   | 0.7313
LSTM + PhoW2Vec                   | 0.3385   | 0.4318   | 0.7526
Bi-LSTM + W2V                     | 0.3125   | 0.4252   | 0.7563
Bi-LSTM + FastText                | 0.3348   | 0.4268   | 0.7542
Bi-LSTM + ELMO                    | 0.3203   | 0.4247   | 0.7586
Bi-LSTM + PhoW2Vec                | 0.3397   | 0.4215   | 0.7616
Co-attention + PhoW2Vec           | 0.3496   | 0.4513   | 0.7786
CvT + PhoBert                     | 0.3805   | 0.5382   | 0.7943
CLIP-ViT + PhoBert                | 0.5227   | 0.5641   | 0.8308
VAN + PhoBert                     | 0.5979   | 0.6157   | 0.8623
Swin Transformer + PhoBert (SCBM) | 0.6201   | 0.6814   | 0.8719
However, there are still many incorrect answers in our predictions. There are two possible reasons for this. First, the COCO-QA dataset we used for benchmarking contains a large number of questions with inherent grammar mistakes, despite our efforts to refine it for incorrect translations. This makes it difficult to determine whether the translations achieve the desired outcome, especially when the quality of the data source is not guaranteed. Second, the dataset used for fine-tuning the pre-trained transformer models for English, such as BERT, RoBERTa [6], and CLIP, may not be large and diverse enough. Data mismatch can occur when the distribution of data in the training set is significantly different from the distribution of data in the fine-tuning set, leading to poor generalization performance of the models, as they may not have learned to handle the specific characteristics of the new scenario. Furthermore, our experiments have shown that model errors often occur in questions belonging to categories 1 (Number) and 3 (Location), while the models perform well with questions from categories 0 (Object) and 2 (Color). This suggests that questions in categories 1 and 3 often require the model to pay attention to different positions in the image and synthesize all relevant information to generate accurate answers.
5 Conclusion
In this study, we performed experiments on attention-based models for Vietnamese Visual Question Answering (ViVQA) tasks. We utilized state-of-the-art visual and language processing components in a hybrid setup. Additionally, we introduced a new hybrid hierarchical attention-based model called SCBM for ViVQA. Our system demonstrated superior performance compared to the baseline (which combined Co-attention with PhoW2V), achieving an accuracy of 0.6201, a WUPS 0.9 score of 0.6814, and a WUPS 0.0 score of 0.8719; its performance was nearly twice that of the baseline. To further advance our development, we can pursue two strategies. Firstly, we need to address the limitations of Vietnamese VQA due to smaller and less diverse datasets. The dataset used for training VQA models must be sufficiently large and diverse, covering a wide range of images and questions to capture the variability of both vision and language domains. Unlike English, which has abundant annotated data, Vietnamese has a smaller corpus, making it challenging to train large-scale models. Additionally, the quality of available datasets may be inconsistent and noisy. To overcome this, we can gather data from various sources like social media platforms, news websites, and online forums. Furthermore, improving data labeling techniques, such as crowd-sourcing or semi-supervised learning, can enhance dataset quality. Secondly, there is a critical need for baseline models that can effectively learn and represent the embedding relationship between images and text in Vietnamese. This can be achieved by developing models that integrate multimodal features and contrastive learning approaches, trained on a diverse range of images with natural language supervision. The absence of such models limits their application in tasks that require understanding both text and images, such as visual question answering.
In addition, increased investment in research for computational resources to train large-scale models is essential.
References

1. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019)
2. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. OpenReview.net (2021)
3. Guo, M.H., Lu, C.Z., Liu, Z.N., Cheng, M.M., Hu, S.M.: Visual attention network. arXiv preprint arXiv:2202.09741 (2022)
4. Kafle, K., Kanan, C.: Visual question answering: datasets, algorithms, and future challenges. Comput. Vis. Image Underst. 163, 3–20 (2017)
5. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part V. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
6. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019)
7. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021, pp. 9992–10002. IEEE (2021)
8. Nguyen, D.Q., Nguyen, A.T.: PhoBERT: pre-trained language models for Vietnamese. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020, pp. 1037–1042. Association for Computational Linguistics (2020)
9. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (2021)
10. Ren, M., Kiros, R., Zemel, R.S.: Exploring models and data for image question answering. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, Montreal, Quebec, Canada, 7–12 December 2015, pp. 2953–2961 (2015)
11. Tran, K.Q., Nguyen, A.T., Le, A.T., Nguyen, K.V.: ViVQA: Vietnamese visual question answering. In: Hu, K., Kim, J., Zong, C., Chersoni, E. (eds.) Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation, PACLIC 2021, Shanghai International Studies University, Shanghai, China, 5–7 November 2021, pp. 683–691. Association for Computational Linguistics (2021)
12. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
13. Wu, H., et al.: CvT: introducing convolutions to vision transformers. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021, pp. 22–31. IEEE (2021)
A High-Performance Pipelined FPGA-SoC Implementation of SHA3-512 for Single and Multiple Message Blocks

Tan-Phat Dang1,2(B), Tuan-Kiet Tran1,2, Trong-Thuc Hoang3, Cong-Kha Pham3, and Huu-Thuan Huynh1,2

1 University of Science, Ho Chi Minh City, Vietnam
2 Vietnam National University, Ho Chi Minh City, Vietnam
{dtphat,trtkiet,hhthuan}@hcmus.edu.vn
3 University of Electro-Communications (UEC), Tokyo, Japan
{hoangtt,phamck}@uec.ac.jp
Abstract. The secure hash algorithm 3 (SHA-3) is an important technique for ensuring data authentication, integrity, and confidentiality. Improving the round function to enhance speed and resource efficiency has been a primary concern in most studies. However, processing long messages can consume a significant amount of time when retrieving data from external memory. Specifically, after completing one block of the message, the processor, such as the Central Processing Unit (CPU), is required to prepare the input for handling the next block. In this research, we present a high-performance and flexible hardware architecture for SHA3-512, specifically designed for applications with short and long messages without software intervention. We contribute two techniques. Firstly, we introduce an architecture designed to handle multiple messages, each with either a single block or multiple blocks. Secondly, we utilize a sequential processing technique for padding, catering to both short and long messages. Additionally, we implement the pipeline technique for the round function. The proposed SHA3-512 architecture is synthesized on Cyclone 5CSXFC6D6F31C6, achieving an impressive throughput of 12.05 Gbps at a clock frequency of 125.57 MHz, with each hash computation taking six clock cycles.
Keywords: SHA-3 · multiple blocks · high-performance · FPGA

1 Introduction
SHA-3 is the latest candidate announced by the National Institute of Standards and Technology (NIST) in the cryptographic hash function series and is widely used in applications in daily life. Since hash functions are so widely deployed, attacks against them are inevitable. For instance, secure hash algorithm 1 (SHA-1) has been entirely broken, and concerns about secure hash algorithm 2 (SHA-2), an upgraded version with a similar structure, prompted the creation of a more secure hash function, SHA-3. SHA-3 relies on the Keccak algorithm published by NIST [1], which aims to enhance the security of the previous generation of hash functions. The SHA-3 family consists of four hash functions: SHA3-224, SHA3-256, SHA3-384, and SHA3-512. All four functions are based on a common implementation structure called the sponge construction [2].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 288–298, 2023. https://doi.org/10.1007/978-3-031-46573-4_27
Many studies have implemented SHA-3 on field-programmable gate arrays (FPGAs) to improve performance and resource utilization; here we focus only on high-throughput designs. Research on high-performance FPGA design techniques concentrates on methods such as pipelining, unrolling, and exploiting resources already present on the FPGA, such as digital signal processors (DSPs), random access memory (RAM), or the shift register look-up table (SRL) on Xilinx products. The designs in [3,4] process SHA-3 from padding through the round function, whose five sub-functions (theta, rho, pi, chi, and iota) are repeated 24 times, corresponding to 24 clock cycles. The pipelining technique is applied in [5], which aims to test and select the most suitable number of pipeline stages. The pipeline in [5] uses one or more round functions separated by registers; for example, with two pipeline stages there are two chains of the five sub-functions. With this technique, all cases complete after 24 clock cycles. The paper [6] describes a technique to reduce the number of clock cycles from 24 to 12 by concatenating adjacent rounds and processing them within a single clock cycle. To achieve this, the counter used to select the round constant for iota is divided into even and odd parts. That study experimented with round-function designs ranging from 2 to 24 clock cycles. Additionally, internal pipeline registers have been implemented in some studies, as in [7,8].
These registers are inserted within the round function, specifically between pi and chi, to divide the critical path into two parts, with a second register at the end of the round.
In this study, our goal is to design a high-speed hardware solution that tackles the challenge of handling multiple blocks of messages. For architectures that prioritize resource efficiency or do not employ pipeline techniques, hashing messages with multiple blocks is not problematic, since the design processes one block at a time. For pipeline-based architectures, however, multiple blocks of various messages are executed concurrently across different stages. This necessitates scheduling or managing messages with multiple blocks to attain precise outcomes. The majority of existing research has resolved this issue with software solutions [5,9].
Proposal 1: We propose a multiple-block solution for multiple messages. To minimize software intervention in monitoring the design's processing, which increases processing time and degrades performance, we propose a self-managed and versatile design that handles various scenarios, including short and long messages.
Proposal 2: We propose a sequential padding technique and a pipelined round function. The padding technique, which we call sequential padding, computes the padding value from the block size while the data is received sequentially. This technique is well suited to hardware implementation; however, four special cases arise when applying it.
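As a rough illustration of the pipelined organization promised in Proposal 2, the following toy occupancy model (our own sketch, not the paper's RTL; stage names and the single-block assumption are ours) shows how a five-stage pipeline with six cycles per stage sustains one digest every six clock cycles once the pipeline is full:

```python
# Toy cycle-level occupancy model: padding plus four round-function stages,
# six clock cycles per stage, ideal feeding, single-block messages assumed.
STAGES = ["pad", "RF0", "RF1", "RF2", "RF3"]
CYCLES_PER_STAGE = 6

def schedule(n_msgs):
    """(start, finish) cycles per message: a new block enters every 6 cycles."""
    return [(m * CYCLES_PER_STAGE,
             m * CYCLES_PER_STAGE + len(STAGES) * CYCLES_PER_STAGE)
            for m in range(n_msgs)]

sched = schedule(4)
# Per-message latency is 30 cycles, yet finish times land 6 cycles apart,
# i.e. a sustained rate of one 576-bit block every six clock cycles.
gaps = [b[1] - a[1] for a, b in zip(sched, sched[1:])]
assert gaps == [6, 6, 6]
```

The model only captures timing, not data; it is meant to make the "six clock cycles per hash" figure in the abstract concrete.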
290
T.-P. Dang et al.
Furthermore, we employ the pipeline technique. We form the pipeline stages by splitting the iterations of the hash function across stages. This does not shorten the critical path, but it increases the design's throughput. We treat padding as a stage in the pipeline, contributing to concurrently processing a larger number of blocks. The remainder of this paper is organized as follows: Sect. 2 presents the theoretical foundations of SHA-3, and Sect. 3 describes the proposed design in detail. Section 4 reports the design and integration results of the system. Section 5 concludes the paper.
2 SHA-3 Background
SHA-3 is defined based on the Keccak function [2], which employs the sponge construction depicted in Algorithm 1. The sponge construction consists of three main parts: padding, absorbing, and squeezing, and internally processes a fixed total number of bits, b = 1600. We can express b as the sum of the rate r and the capacity c, where r represents the amount of input data that can be absorbed in one iteration and c is the remaining capacity. After padding, the data is fed to absorption, where r contains the input data padded according to the padding rule and c is filled with zeros.

Algorithm 1. SPONGE
Input: N
Output: Z
P ← N || 0110*1
n ← len(P) / r
P ← P0 || P1 || ... || P(n−1)
S ← 0
for i ← 0 to n − 1 do
  S ← f(S ⊕ (Pi || 0^c))
end for
Z ← Trunc_r(S)
If the input data size is not suitable, it is adjusted in the padding step so that the input data size becomes a multiple of r. Padding appends the two bits 01 to the input data, followed by a 1, zero or more 0s, and another 1. During the absorbing phase, a sequence of five sub-functions, theta, rho, pi, chi, and iota, is executed in order; these five sub-functions are repeated 24 times. After the 24 rounds of the five sub-functions are completed, the squeezing phase is performed. The high-order bits are truncated, leaving the d = c/2 low-order bits that represent the hash output for a single-block message. For messages consisting of multiple blocks, this result is fed back into the absorbing phase to be XORed with the next message block, and the process is repeated until all blocks are processed, generating the final result.
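For readers who want to experiment with the parameters above, here is a compact reference model of SHA3-512 in plain Python (our own sketch, unrelated to the hardware design; variable names are ours). It uses rate r = 576 bits and capacity c = 1024 bits, and is cross-checked against Python's hashlib:

```python
# Hedged software model of the SHA3-512 sponge (r = 576, c = 1024 bits).
import hashlib

RC = [0x0000000000000001, 0x0000000000008082, 0x800000000000808A,
      0x8000000080008000, 0x000000000000808B, 0x0000000080000001,
      0x8000000080008081, 0x8000000000008009, 0x000000000000008A,
      0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
      0x000000008000808B, 0x800000000000008B, 0x8000000000008089,
      0x8000000000008003, 0x8000000000008002, 0x8000000000000080,
      0x000000000000800A, 0x800000008000000A, 0x8000000080008081,
      0x8000000000008080, 0x0000000080000001, 0x8000000080008008]
ROT = [[0, 36, 3, 41, 18], [1, 44, 10, 45, 2], [62, 6, 43, 15, 61],
       [28, 55, 25, 21, 56], [27, 20, 39, 8, 14]]  # rho offsets, ROT[x][y]
M64 = (1 << 64) - 1

def rol(v, n):
    return ((v << n) | (v >> (64 - n))) & M64 if n else v

def keccak_f(A):  # A[x][y]: 5x5 array of 64-bit lanes, mutated in place
    for rnd in range(24):
        C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
        D = [C[(x - 1) % 5] ^ rol(C[(x + 1) % 5], 1) for x in range(5)]
        for x in range(5):
            for y in range(5):
                A[x][y] ^= D[x]                                    # theta
        B = [[0] * 5 for _ in range(5)]
        for x in range(5):
            for y in range(5):
                B[y][(2 * x + 3 * y) % 5] = rol(A[x][y], ROT[x][y])  # rho + pi
        for x in range(5):
            for y in range(5):
                A[x][y] = B[x][y] ^ (~B[(x + 1) % 5][y] & B[(x + 2) % 5][y])  # chi
        A[0][0] ^= RC[rnd]                                         # iota

def sha3_512(msg):
    r_bytes = 72                                # 576-bit rate
    p = bytearray(msg)
    pad = r_bytes - len(p) % r_bytes
    p += b"\x06" + b"\x00" * (pad - 1)          # 01 || 10*1, byte-aligned
    p[-1] |= 0x80
    S = [[0] * 5 for _ in range(5)]
    for off in range(0, len(p), r_bytes):       # absorb one block per f call
        for i in range(r_bytes // 8):
            lane = int.from_bytes(p[off + 8 * i:off + 8 * i + 8], "little")
            S[i % 5][i // 5] ^= lane
        keccak_f(S)
    return b"".join(S[i % 5][i // 5].to_bytes(8, "little") for i in range(8))

assert sha3_512(b"abc") == hashlib.sha3_512(b"abc").digest()  # cross-check
```

The multi-block feedback path described in the text corresponds to the XOR of each padded block into the rate lanes before every call to keccak_f.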
High-perf. Pipelined Impl. of SHA3-512 for Single/Multiple Message Blocks
291
Fig. 1. a) Overall design diagram. b) Storage unit. c) The proposed architecture of SHA3-512.
3 Comprehensive Proposed Design

3.1 Pre-processing: Multiple-Block Solution
Overall of the Proposed Architecture: Figure 1.a depicts the proposed architecture at the system level. The architecture is divided into three main parts: the components that interact with external elements such as the CPU, bus, and external memory, namely the configuration (CFG), READ DMA (RDMA), and WRITE DMA (WDMA) blocks; pre-processing, which manages data sorting to address the issue of multiple blocks belonging to one or more messages and supplies data to the processing blocks as fast as possible; and processing, which performs the data hashing, including padding and the round functions. The CFG block of SHA3-512 receives instructions from the CPU containing the information needed to control SHA3-512 automatically. This enables the system to fetch data from external memory and compute the hash value accordingly. Subsequently, the pre-processing stage acquires instructions from the CFG FIFO and sends the necessary requests to retrieve input data through the RDMA. Once the pre-processing stage has gathered enough data for the hash computation, it transfers this data to the processing
phase. Upon completing the hash computation, the resulting hash value is sent to the parallel-input serial-output (PISO) stage, and the WDMA stores this digest in external memory.
Storage Unit: The Storage Unit (SU) in Fig. 1.b is responsible for storing data, instructions, and other information needed for management purposes. The pre-inst is responsible for managing the instruction that fetches data from external memory and consists of RDMA address, current len, and next len. RDMA address is the address from which data is read from memory. Since the storage capacity of the data unit is limited, if the data unit does not have enough capacity to store the data, RDMA address is updated with a new value for the next DMA operation. The current len is the amount of data for the current DMA operation; this value lets the logic receiving data from the RDMA know when an SU has received enough data. The next len is the amount of data for the next DMA operation. The occupy flag indicates whether the SU is in use. The post-inst consists of WDMA address, ordinal number, and multiple block. The WDMA address is the address where the result is stored in memory and is used only by the WDMA block. The ordinal number is the index of each SU. The multiple block field indicates whether this is the last block. The total len holds the total amount of data in a message that needs to be hashed; it is updated after each execution and helps identify special cases during padding, which are discussed further in the following section.
Operation of Design: Figure 1.c provides a detailed description of the design's operation. In this architecture, we utilize five Storage Units (SUs) to act as local memory for the five stages: padding and the four round functions. The design features five pipelined stages in the processing phase. The padding stage is considered one pipelined stage, while the remaining four stages constitute the round function.
As a result, each SU serves one message, which can consist of one or more blocks. Moreover, each SU is responsible for managing and scheduling data availability for the message it is handling. This approach allows the pre-processing and processing phases to operate independently. The pre-processing stage proactively provides sufficient data to each SU, ensuring they are always ready to perform the hashing computation. Meanwhile, the processing phase focuses primarily on computing the hash function and requesting the next block if the message consists of multiple blocks. This management and scheduling by the SUs facilitates smooth and parallelized processing of multiple messages with single or multiple blocks.

Fig. 2. Timing chart of the SHA3-512.

The operation of SHA3-512 is depicted in Fig. 2. The CFG block contains a FIFO to continuously receive instructions from the CPU; consequently, in the illustration, CFG can receive multiple instructions in succession. Each instruction is assigned to one SU and is retained until the last block is processed. In the figure, instruction 0 represents a message consisting of three blocks (Inst 0 3 blocks). The maximum capacity of an SU is two blocks in this illustration; the size of an SU depends on the data unit and can be changed to obtain an appropriate size. As a result, during the first RDMA of instruction 0, only the first two message blocks (M0-B0 and M0-B1) are fetched. The last block of instruction 0 is fetched as soon as block 0 is removed from the SU, indicating that block 0 has moved to the padding block in the processing phase. Block 0 of message 0 proceeds through the padding stage and then undergoes round functions 0, 1, 2, and 3, respectively. Since message 0 has more than one block, this is noted in the multiple block field of SU 0. During processing, round function 2 sends a request to prepare the next block (Circle 1 in Fig. 2). Consequently, block 1 of message 0 (M0-B1) is fetched into the padding stage (Circle 2 in Fig. 2). Once the padding of M0-B1 and round function 3 of M0-B0 complete simultaneously, they are XORed with each other, and the result is fed to round function 1 (Circles 3a and 3b in Fig. 2). Processing continues in this pattern until the final block is completed.

3.2 Processing: Padding and Round Function
Sequential Padding Technique: The padding adds extra information to a message to make its length a multiple of r, appending the pattern 0110*1 to the message. We use a sequential processing strategy: if there are N blocks, blocks 0 through (N−2) need no padding, and for the last block (N−1) the pattern 0110*1 is calculated and inserted as shown in Fig. 3.

Table 1. Special cases of padding

Case (bit)   Near-last block (bit0 → bit575)   Last block (bit0 → bit575)
576          xx...xxxx                          01100..01
575          xx...xxx0                          11000..01
574          xx...xx01                          10000..01
573          xx...x011                          00000..01
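The spill-over behaviour captured in Table 1 can be reproduced with a small bit-level model (our own illustration, not the hardware; the bit0-first ordering is an assumption): when fewer than four free bits remain in a 576-bit block, the 0110*1 pattern spills into an extra block.

```python
# Bit-level sketch of the four special padding cases of Table 1.
R = 576

def pad_bits(msg_bits):
    """Append 01 (domain bits), then 1, 0*, 1 so the length is a multiple of R."""
    bits = list(msg_bits) + [0, 1, 1]   # domain bits 01, then first pad bit 1
    while (len(bits) + 1) % R:
        bits.append(0)                  # 0* filler
    bits.append(1)                      # closing 1
    return bits

def last_block_prefix(filled):
    """First five bits of the final block when `filled` bits occupy block 0."""
    padded = pad_bits([0] * filled)
    return padded[R:R + 5]

# Matches the four special cases of Table 1:
assert last_block_prefix(576) == [0, 1, 1, 0, 0]   # the whole 0110*1 spills
assert last_block_prefix(575) == [1, 1, 0, 0, 0]   # near-last block ends 0
assert last_block_prefix(574) == [1, 0, 0, 0, 0]   # near-last block ends 01
assert last_block_prefix(573) == [0, 0, 0, 0, 0]   # near-last block ends 011
```

In all four cases the padded message occupies exactly one extra 576-bit block, consistent with the 110*1, 10*1, and 0*1 forms described in the text.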
Fig. 3. The proposed padding architecture.
According to our sequential processing strategy, four special cases arise, described in Table 1. In the 576-bit case, the block is completely full and is not the last block; another execution with the padding pattern 0110*1 then occupies the entire 576 bits of the following block. Similarly, there are special cases for 575, 574, and 573 bits, where the 0110*1 pattern is split across two blocks: the current block is completed with 0, 01, and 011, while the next block takes the form 110*1, 10*1, and 0*1, respectively. These cases can be expressed as follows, in Verilog-style notation:

Near-last block = data OR (576'b0110...1 << len)  (1)

Last block = {1'b1, 575'b0} OR (576'b0110...1 >> shift)  (2)
Here data refers to the block's data being padded, while len indicates the number of data bits; shift represents the number of bits to right-shift, in the convention where the least significant bit (LSB) is on the right. shift is 0, 1, 2, and 3 for cases 576, 575, 574, and 573, respectively. Figure 3 illustrates the padding architecture, which comprises both the normal case and the special case. The normal-case architecture handles all normal cases as well as the near-last block of the special cases, while the special architecture is dedicated to the last block of the special cases. The normal architecture employs a left barrel shifter, whereas the special architecture uses a right barrel shifter.
Round Function: Figure 4.b illustrates the architecture of the round function. The round function is divided into four pipelined stages, and each stage executes six iterations, for a total of 24 iterations before producing the hash value. Each stage comprises the five sub-functions theta, rho, pi, chi, and iota. The iota step differs slightly between stages, as each stage contains distinct round constants (RC). Specifically, round function 0 contains RC values from 0 to 5,
round function 1 contains RC values from 6 to 11, round function 2 contains RC values from 12 to 17, and round function 3 contains RC values from 18 to 23. These RC values are stored in a read-only memory (ROM), each represented with 7-bit data.

Fig. 4. The round function architecture.
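One plausible reading of the "7-bit data" remark is that each 64-bit Keccak round constant is non-zero only at bit positions 2^j − 1 (j = 0..6), so a stage ROM needs only seven bits per constant and can expand them on the fly. A small sketch of that encoding (our assumption, not the paper's RTL):

```python
# Compressing the 24 Keccak round constants into 7-bit ROM words.
RC = [0x0000000000000001, 0x0000000000008082, 0x800000000000808A,
      0x8000000080008000, 0x000000000000808B, 0x0000000080000001,
      0x8000000080008081, 0x8000000000008009, 0x000000000000008A,
      0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
      0x000000008000808B, 0x800000000000008B, 0x8000000000008089,
      0x8000000000008003, 0x8000000000008002, 0x8000000000000080,
      0x000000000000800A, 0x800000008000000A, 0x8000000080008081,
      0x8000000000008080, 0x0000000080000001, 0x8000000080008008]
POS = [2 ** j - 1 for j in range(7)]        # bit positions 0,1,3,7,15,31,63
MASK = sum(1 << p for p in POS)

def compress(rc64):                          # 64-bit constant -> 7-bit ROM word
    return sum(((rc64 >> p) & 1) << j for j, p in enumerate(POS))

def expand(rc7):                             # 7-bit ROM word -> 64-bit constant
    return sum(((rc7 >> j) & 1) << p for j, p in enumerate(POS))

# Four ROMs of six constants each, matching round functions 0..3:
ROMS = [[compress(rc) for rc in RC[6 * s:6 * s + 6]] for s in range(4)]
assert all(rc & ~MASK == 0 for rc in RC)     # nothing is lost by compression
```

Whether the hardware uses exactly this encoding is not stated in the paper; the sketch only demonstrates that a 7-bit representation is information-preserving.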
4 Results and Evaluation
Our proposed SHA3-512 architecture has been successfully implemented and validated on a Cyclone V, as depicted in Fig. 5. For communication between the CPU (ARM Cortex-A9) and SHA3-512, we utilize the Lightweight HPS-to-FPGA bridge, which offers a low-speed, narrow datapath suitable for configuration purposes. To speed up data retrieval from external synchronous dynamic random access memory (SDRAM), we connect SHA3-512 to the FPGA-to-HPS SDRAM bridge, allowing a direct connection to the SDRAM controller without traversing other components such as the L2 and L3 caches.

Throughput = (Frequency × #Block_size) / #Cycle  (3)

Efficiency = Throughput / Area  (4)

The SHA3-512 design is implemented in the Verilog Hardware Description Language (HDL) and simulated with the Questa Intel Starter FPGA Edition 2021.2 tool. Performance measurements such as FPGA area resources, frequency, throughput, and efficiency were obtained with the Quartus Prime Lite Edition tool, version 21.1. Throughput is calculated with Eq. (3), which gives the result in gigabits per second (Gbps); it is determined by the maximum frequency (Fmax) achieved from Quartus synthesis (Frequency), the number of clock cycles required to generate the hash value (#Cycle), and the size of the input block in bits (#Block_size). For SHA3-512, #Block_size is 576 bits. Efficiency, defined in Eq. (4), expresses the relationship between throughput and area utilization. The units used for calculating efficiency depend on the device being synthesized. For instance, in Virtex-5 (V5) and Virtex-6 (V6)
Fig. 5. The proposed SHA3-512 is integrated into the system.

Table 2. Implementation results and comparison

Ref     Device      Fmax (MHz)   Area            TP (Gbps)   Efficiency
[3]     Virtex-5    328.20       4361 Slices     7.87        1.80 Mbps/Slice
[3]     Virtex-6    401.20       5528 Slices     9.62        1.74 Mbps/Slice
[4]     Virtex-5    312.98       1304 Slices     7.51        5.75 Mbps/Slice
[6]     Virtex-5    41.64        12487 Slices    11.99       0.96 Mbps/Slice
[6]     Virtex-6    57.91        15579 Slices    16.67       1.07 Mbps/Slice
[11]    Virtex-5    273.00       1163 Slices     7.80        6.06 Mbps/Slice
[12]    Virtex-5    289.00       1692 Slices     5.00        2.96 Mbps/Slice
Prop.   Cyclone V   125.57       14608 ALMs      12.05       0.82 Mbps/ALM
devices, efficiency is measured in megabits per second per slice (Mbps/Slice), whereas for Cyclone V devices it is measured in Mbps/ALM (adaptive logic module). Table 2 compares results among studies that fully implement SHA-3, including padding, the round function, and management of round execution where applicable. Our design achieved a frequency of 125.57 MHz and used 14,608 ALMs, resulting in a throughput of 12.05 Gbps. This is significantly higher than the referenced studies: compared to [3] on V5, [3] on V6, [4] on V5, [6] on V5, [11] on V5, and [12], our design achieves throughputs that are 1.53, 1.25, 1.60, 1.01, 1.54, and 2.41 times higher, respectively. Only [6] on V6 reports a higher throughput than ours. Our efficiency is lower than that of previous studies because we aimed at a comprehensive design that processes all cases of multiple messages with single or multiple blocks without software intervention.
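Eqs. (3) and (4) can be sanity-checked against the reported figures (the exact efficiency value is about 0.825 Mbps/ALM; the 0.82 entry appears to be truncated, which is our guess):

```python
# Sanity check of the throughput and efficiency formulas with reported values.
def throughput_gbps(fmax_mhz, block_bits, cycles):
    """Eq. (3): throughput in Gbps from Fmax, block size, and cycle count."""
    return fmax_mhz * 1e6 * block_bits / cycles / 1e9

tp = throughput_gbps(125.57, 576, 6)    # proposed design: 6 cycles per block
eff_mbps_per_alm = tp * 1000 / 14608    # Eq. (4) with area measured in ALMs
```

Plugging in 125.57 MHz, 576 bits, and 6 cycles reproduces the 12.05 Gbps figure in Table 2.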
5 Conclusion
SHA-3 is critical in many applications, such as generating random numbers and producing digital signatures. In this study, our design not only addresses the challenges of handling multiple messages with single or multiple blocks but also incorporates a high-performance pipeline architecture. Additionally, we propose
High-perf. Pipelined Impl. of SHA3-512 for Single/Multiple Message Blocks
297
a padding architecture capable of efficiently handling both short and long messages, fully implemented in hardware. We achieved an impressive throughput of 12.05 Gbps on Cyclone V without the need for specialized FPGA resources or software support during processing. Acknowledgment. This research is funded by University of Science, VNU-HCM under grant number ÐT-VT 2022–03.
References
1. Bertoni, G., Daemen, J., Peeters, M., Van Assche, G.: Keccak. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013. LNCS, vol. 7881, pp. 313–314. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38348-9_19
2. Dworkin, M.: SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions. Federal Inf. Process. Stds. (NIST FIPS). https://doi.org/10.6028/NIST.FIPS.202
3. El Moumni, S., Fettach, M., Tragha, A.: High frequency implementation of cryptographic hash function Keccak-512 on FPGA devices. Int. J. Info. Comp. Secur. 10, 361–373 (2018)
4. Assad, F., Elotmani, F., Fettach, M., Tragha, A.: An optimal hardware implementation of the KECCAK hash function on Virtex-5 FPGA. In: Proceedings of the International Conference on System of Collaborative Big Data, Internet of Things & Security (SysCoBIoTS), Casablanca, Morocco, December 2019, pp. 1–5 (2019)
5. Michail, H.E., Ioannou, L., Voyiatzis, A.G.: Pipelined SHA-3 implementations on FPGA: architecture and performance analysis. In: Proceedings of the Workshop on Cryptography and Security in Computing Systems (CS2), Amsterdam, Netherlands, January 2015, pp. 13–18 (2015)
6. El Moumni, S., Fettach, M., Tragha, A.: High throughput implementation of SHA3 hash algorithm on field programmable gate array (FPGA). Microelectron. J. 93, 104615 (2019)
7. Mestiri, H., Kahri, F., Bedoui, M., Bouallegue, B., Machhout, M.: High throughput pipelined hardware implementation of the KECCAK hash function. In: Proceedings of the International Symposium on Signal, Image, Video and Communication (ISIVC), Tunis, Tunisia, November 2016, pp. 282–286 (2016)
8. Athanasiou, G.S., Makkas, G.-P., Theodoridis, G.: High throughput pipelined FPGA implementation of the new SHA-3 cryptographic hash algorithm. In: Proceedings of the International Symposium on Communication, Control and Signal Processing (ISCCSP), Athens, Greece, May 2014, pp. 538–541 (2014)
9. Ioannou, L., Michail, H.E., Voyiatzis, A.G.: High performance pipelined FPGA implementation of the SHA-3 hash algorithm. In: Proceedings of the Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro, June 2015, pp. 68–71 (2015)
10. Assad, F., Fettach, M., el Otmani, F., Tragha, A.: High-performance FPGA implementation of the secure hash algorithm 3 for single and multi-message processing. Int. J. Electr. Comput. Eng. (IJECE) 12(2), 1324–1333 (2021)
11. Sundal, M., Chaves, R.: Efficient FPGA implementation of the SHA-3 hash function. In: Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Bochum, Germany, July 2017, pp. 86–91 (2017)
12. Kim, D.-S., Lee, S.-H., Shin, K.-W.: A hardware implementation of SHA3 hash processor using Cortex-M0. In: Proceedings of the International Conference on Electronics, Information, and Communication (ICEIC), Auckland, New Zealand, May 2019, pp. 1–4 (2019)
13. Intel FPGA: Cyclone V Hard Processor System Technical Reference Manual. https://www.intel.com/content/www/us/en/docs/programmable/683126/212/hard-processor-system-technical-reference.html
Optimizing ECC Implementations Based on SoC-FPGA with Hardware Scheduling and Full Pipeline Multiplier for IoT Platforms

Tuan-Kiet Tran1,2(B), Tan-Phat Dang1,2, Trong-Thuc Hoang3, Cong-Kha Pham3, and Huu-Thuan Huynh1,2

1 University of Science, Ho Chi Minh City, Vietnam
{trtkiet,dtphat,hhthuan}@hcmus.edu.vn
2 Vietnam National University, Ho Chi Minh City, Vietnam
3 University of Electro-Communications (UEC), Tokyo, Japan
{hoangtt,phamck}@uec.ac.jp
Abstract. In the context of Industry 4.0 and the Internet of Things (IoT), it is crucial to ensure the safety and security of electronic devices. This paper proposes a design using the SoC-FPGA platform to enhance IoT device security. The design combines a powerful ARM processor with customizable IP cores on the FPGA, resulting in high processing speed. The co-processor performs asymmetric encryption using Elliptic Curve Cryptography (ECC) with the SECP256K1 curve. The main operation is point multiplication, with point addition and doubling as secondary operations. The results demonstrate high efficiency, with the ECC core operating at 30 MHz and point addition and doubling taking around 37 microseconds. The point multiplication operation can be processed in approximately 17 ms. This design provides a secure and efficient solution for enhancing IoT device and connection security.
Keywords: ECC · SoC-FPGA · Blockchain · IoT

1 Introduction
The development of IoT is transforming industries globally and revolutionizing how we interact with technology. It is applied in healthcare, agriculture, smart cities, smart homes, and more [1]. Fueled by wireless technologies like 5G, increased device connectivity enables seamless data exchange. However, IoT presents challenges in data security, privacy, interoperability, and standardization. Safeguarding the data transmitted and stored by IoT devices is crucial for user trust and protection against cyber threats. Mendez Mena et al. [2] surveyed existing research on IoT security, covering attack vectors, best practices, standards, and security solutions such as authentication, encryption, access control, and intrusion detection.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 299–309, 2023. https://doi.org/10.1007/978-3-031-46573-4_28
300
T.-K. Tran et al.
One commonly used way to ensure the security of IoT systems is to use encryption algorithms. However, due to limited hardware and computation capability, IoT devices often use simple algorithms to reduce power consumption and data transfer to processing centers. We therefore aim to design an IoT device capable of handling complex algorithmic computations while remaining suited to IoT device characteristics. Thanks to SoC-FPGA, we can deploy IP cores responsible for handling cryptographic algorithms, whether public-key cryptography such as RSA and ECC or symmetric cryptography such as AES [3,4]. In this paper, we propose a system based on SoC-FPGA with low cost, low power consumption, and high performance for handling Elliptic Curve Cryptography (ECC) on the SECP256K1 curve to ensure security. ECC offers several advantages over traditional cryptographic algorithms such as RSA in the context of IoT due to its efficiency and effectiveness in resource-constrained environments. One of the key advantages of ECC in IoT security is its ability to provide strong security with shorter key lengths compared to RSA [5]. The basic operation in ECC is scalar point multiplication, where a point on the curve is multiplied by a scalar; a scalar point multiplication is computed as a series of point additions and point doublings. Various ECC processors, such as [6], have been proposed in the literature, using optimization techniques such as parallelism, pipelining, and resource sharing to achieve high-speed performance for the NIST P-256 prime-field ECC curve. Marzouqi et al. [7] describe the RNS DBC method, a variant of ECC that allows faster computation of scalar multiplication compared to traditional methods.
The article [8] presents an FPGA implementation of an ECC processor over the GF(2^m) field for small embedded applications, which uses optimization techniques such as projective coordinates and the Montgomery ladder algorithm to improve the efficiency of the ECC processor. The article [9] presents the design and implementation of the modular inversion operation for ECC using the Extended Euclidean Algorithm (EEA), describing optimizations such as parallel processing and resource sharing that improve the efficiency of modular inversion on the FPGA. Specifically, we propose a system that employs an ECC IP for the SECP256K1 curve capable of handling the basic operations of point addition, point doubling, and point multiplication. ECC with SECP256K1 offers several advantages for IoT systems: it provides a high level of security with a shorter key, making it suitable for resource-constrained IoT devices with limited computing power and memory; the SECP256K1 curve offers fast computation, making it well suited for real-time applications in IoT systems; and it provides strong resistance against various cryptographic attacks, such as brute-force attacks, factorization attacks, and collision attacks. This enhances the security of IoT systems and helps protect sensitive data transmitted or stored within these systems. The paper is structured as follows: Sect. 2 provides background information on Elliptic Curve Cryptography (ECC), including details on the SECP256K1 curve and scalar point multiplication. Section 3 proposes using a pipelined multiplier to accelerate the modular multiplication operation in ECC computations. We also present the hardware design of the multiplier and the point operation modules, encompassing point addition, point doubling, and point multiplication operations. Section 4 discusses the implementation results, and the paper concludes in Sect. 5.
2 Background

2.1 Elliptic Curve Cryptography
An elliptic curve over a finite prime field GF(p) is defined by the set of points (x, y) satisfying the following equation:

y^2 = x^3 + ax + b  (1)

where a and b satisfy 4a^3 + 27b^2 ≠ 0. All operations are over GF(p). Given two points on the elliptic curve, P1(x1, y1) and P2(x2, y2), point addition and point doubling are defined by the following equations:

x3 = λ^2 − x1 − x2,  y3 = λ(x1 − x3) − y1  (2)

where

λ = (3x1^2 + a)/(2y1) if P1 = P2, else (y2 − y1)/(x2 − x1)  (3)
In particular, if P1 = −P2 = (x2, −y2), then P1 + P2 is called the point at infinity, denoted by Ø.

2.2 SECG Standard for 256-Bit Koblitz Curves

The Koblitz curve SECP256K1 is an example of an elliptic curve in short Weierstraß form. This standard curve is defined over a 256-bit prime field Fp [10]. The parameters are defined as a = 0, b = 7, and p = 2^256 − 2^32 − 977.

2.3 Scalar Point Multiplication
Scalar multiplication of an ECC point is based on point addition and point doubling and can be defined as follows:

kP = P + P + ... + P (k terms)  (4)

Usually, kP is calculated by the double-and-add strategy [11], as shown in Algorithm 1. Scalar point multiplication is known to be the most time-consuming of all operations in Elliptic Curve Cryptography (ECC), and it constitutes a significant portion of high-speed ECC processor designs.
Algorithm 1. Right-to-left Point Multiplication Binary Method
Input: k = Σ_{i=0}^{n−1} k_i·2^i and point P
Output: Q = kP
1: Q ← O, G ← P;
2: for i ← 0 to n − 1 do
3:   Q ← 2Q;
4:   if k_i = 1 then
5:     Q ← Q + G;
6:   end if
7: end for
8: return Q;
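A plain-Python sketch of Eqs. (1)-(3) and the double-and-add loop of Algorithm 1 (written MSB-first here; the variable names and the use of pow(x, -1, p) for modular inversion are ours, not the paper's):

```python
# Affine SECP256K1 point operations and double-and-add scalar multiplication.
P = 2 ** 256 - 2 ** 32 - 977            # SECP256K1 prime
A, B = 0, 7                             # curve parameters
O = None                                # point at infinity

def point_add(p1, p2):
    if p1 is O:
        return p2
    if p2 is O:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return O                        # P1 = -P2: result is the infinity point
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P) % P   # doubling slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P) % P          # addition slope
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)

def scalar_mult(k, pt):
    q = O
    for bit in bin(k)[2:]:              # Q <- 2Q each step, add G on 1-bits
        q = point_add(q, q)
        if bit == "1":
            q = point_add(q, pt)
    return q

# The standard SECP256K1 base point G lies on y^2 = x^3 + 7:
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
assert (G[1] * G[1] - G[0] ** 3 - B) % P == 0
```

The hardware replaces the Python modular inverse and multiplication with dedicated units; this model is only meant to make the group law and the scan-the-bits loop concrete.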
3 Proposed System
The proposed system, shown in Fig. 1, is based on an SoC-FPGA platform. It comprises two main components: the hard processor system (HPS) and the FPGA. The HPS includes the ARM cores and the SDRAM memory shared with the FPGA. The FPGA incorporates an ECC IP core for SECP256K1 curve-based public-key cryptography. Additionally, the system allows custom hardware IP cores to be integrated on the FPGA to enhance processing and computing capabilities. The ECC IP handles scalar point multiplication using Algorithm 1. To enable high-speed data exchange between the IP, SDRAM, and peripheral devices, the ECC IP utilizes DMA. The DMA reads input data from SDRAM (EC points, private key, control information) and writes it to the CSR module as input to the ECC module. After computation, the DMA reads the output data and writes it to SDRAM for storage.

Fig. 1. Full system

On the HPS side, we developed a custom Linux kernel to manage the system under the Linux operating system. Custom drivers were created for smooth communication with the hardware, supporting cryptography application development. Ultimately, our cryptosystem has been successfully embedded in the DE10-Standard board. The ECC IP can be easily integrated into an SoC-FPGA system and controlled by a hard processor such as the ARM to handle the ECDSA scheme. ECDSA is a digital signature scheme that uses elliptic curve cryptography for securing transactions and verifying digital signatures. Our proposed system architecture and ECC IP core provide a secure and efficient solution for performing public-key cryptography and digital signature verification, which are essential for ensuring the security and integrity of data in IoT devices.

3.1 Pipelined Multiplier for High-Speed ECC
Fig. 2. MULT BLOCK.
To achieve high-speed performance, pipelining is essential for breaking the long critical-path delay of a large digit-size field multiplier. In this paper, we propose a pipelined full-precision multiplier that enables low-latency ECC point operations. The field multiplication operation over GF(p) is denoted MULT_BLOCK and is shown in Fig. 2.

The initial stage involves two inputs, a and b, each 256 bits wide. Input b is subdivided into eight segments, each 32 bits wide. Initially, we load b into a dedicated register called b_reg and begin the multiplication by multiplying a with the most significant segment of b_reg (b_reg[255:224]). Upon completion of that multiplication, the value of b_reg is shifted left to expose a new segment. Our specially designed multiplier, MULT_256x32, produces each partial product after just one clock cycle, so after eight clock cycles all eight segments of b have been processed, yielding eight temporary products. These temporary products are stored in registers denoted p_temp, each 288 bits wide. This entire process constitutes the Multiplication phase.

The subsequent step is the Reduction phase, during which the register s_reg accumulates and adds the values from the
304
T.-K. Tran et al.
p_temp registers. The accumulation respects the order from the most significant to the least significant segment. The resulting value in s_reg is the input of the reduction submodule, which reduces s_reg from 512 bits to 256 bits. Upon completion of the reduction, the content of s_reg is left-shifted by 32 bits. This cycle continues until the final p_temp register has been consumed. The result of the MULT_BLOCK module is s_red, 256 bits wide, where s_red = a × b mod n. The pivotal advantage of MULT_BLOCK lies in its two pipeline stages, Multiplication and Reduction, which allow it to compute new partial products for incoming data while concurrently reducing the preceding data.

The Reduction module, illustrated in Fig. 2, reduces the width of the accumulator register s_reg to 256 bits. In ECC, the Barrett reduction algorithm, described extensively in [12], is commonly employed to perform modular arithmetic on large numbers efficiently. When working with the SECP256K1 parameter set, the values of m and k in the Reduction module are set to 1 and 256, respectively. This configuration keeps the hardware implementation of the Barrett reduction algorithm straightforward and the computation efficient.

3.2 Point Generation
The proposed module, referred to as ECC_GENPOINT, plays a crucial role in ECC computation, specifically in calculating the main operation, scalar point multiplication. Figure 1 illustrates the structure of this module. The core component of ECC_GENPOINT is the Point operation, which performs point addition and doubling. The Point Generation sub-module also incorporates a finite state machine (FSM) to manage input data loading and internal data transfer to temporary registers. The functionality of ECC_GENPOINT follows Algorithm 1, while Eq. 5 defines the values of the Operator variable used to configure it. Depending on the Operator signal's value, ECC_GENPOINT can perform three distinct ECC operations: point addition, point doubling, and point multiplication.

Our proposed architecture for the module that computes ECC point addition and point doubling, known as the Point operation, is depicted in Fig. 3. To optimize FPGA utilization and fit the limited hardware resources of the Cyclone V chip on the DE10 board, we reuse a small set of sub-modules scheduled by an FSM. Specifically, we implemented two sub-modules for modular addition-subtraction, one for modular multiplication, and one for modular inversion. The addition-subtraction sub-modules perform modular addition or subtraction based on control signals from the state machine. We employ the MULT_BLOCK described in the previous section for modular multiplication. Additionally, a FIFO stores temporary values required at specific stages of the calculation. The input data for each sub-module is loaded into a register and retained for computation, controlled by either the Add FSM or the Double FSM, depending on the Operation signal. The Point module's operation, outlined in Eq. 5, is
Fig. 3. Point operation

Procedure: Add FSM
1: sum1 ← Py − Qy; sum2 ← Px − Qx
2: inv ← sum2^(-1)
3: produce ← sum1 · inv
4: temp1 ← produce
5: produce ← produce · produce
6: sum1 ← produce − Px
7: sum2 ← sum1 − Qx; Rx ← sum2
8: sum1 ← Px − Rx
9: produce ← temp1 · sum1
10: sum1 ← produce − Py; Ry ← sum1

Procedure: Double FSM
1: sum1 ← Px + Px; sum2 ← Py + Py
2: sum1 ← sum1 + Px
3: produce ← sum1 · Px
4: inv ← sum2^(-1)
5: produce ← produce · inv
6: temp1 ← produce
7: produce ← produce · produce
8: sum2 ← produce − Px
9: sum1 ← sum2 − Px; Rx ← sum1
10: sum2 ← Px − Rx
11: produce ← temp1 · sum2
12: sum2 ← produce − Py; Ry ← sum2
determined by the value of the Operator argument. The Add FSM and Double FSM are the primary components of the Point operation module, overseeing the retrieval of input data and supplying it to the corresponding sub-modules.

        ⎧ P + Q   if Operator = 0
    R = ⎨ 2P      if Operator = 1        (5)
        ⎩ kP      if Operator = 2
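As a sanity check of the Add FSM and Double FSM schedules above, the register sequences can be replayed in Python on a toy field. The modulus p = 11 and the curve y² = x³ + 7 are ours for illustration (the RTL operates on the 256-bit SECP256K1 field); the variables mirror the registers sum1, sum2, inv, produce, and temp1 from the procedures.

```python
# Behavioral trace of the Add FSM / Double FSM register schedules on a toy
# prime field. Points are (x, y) tuples on y^2 = x^3 + 7 mod 11.
P_MOD = 11  # toy modulus for illustration only

def add_fsm(P, Q):
    """Point addition R = P + Q for points with distinct x-coordinates."""
    sum1 = (P[1] - Q[1]) % P_MOD
    sum2 = (P[0] - Q[0]) % P_MOD
    inv = pow(sum2, -1, P_MOD)            # INVERSE_MOD sub-module
    produce = sum1 * inv % P_MOD          # slope lambda
    temp1 = produce                       # parked via the FIFO
    produce = produce * produce % P_MOD   # lambda^2
    sum1 = (produce - P[0]) % P_MOD
    Rx = (sum1 - Q[0]) % P_MOD
    sum1 = (P[0] - Rx) % P_MOD
    produce = temp1 * sum1 % P_MOD
    Ry = (produce - P[1]) % P_MOD
    return Rx, Ry

def double_fsm(P):
    """Point doubling R = 2P on y^2 = x^3 + 7 (curve parameter a = 0)."""
    sum1 = (P[0] + P[0]) % P_MOD
    sum2 = (P[1] + P[1]) % P_MOD
    sum1 = (sum1 + P[0]) % P_MOD          # 3*Px
    produce = sum1 * P[0] % P_MOD         # 3*Px^2
    inv = pow(sum2, -1, P_MOD)
    produce = produce * inv % P_MOD       # slope lambda
    temp1 = produce
    produce = produce * produce % P_MOD   # lambda^2
    sum2 = (produce - P[0]) % P_MOD
    Rx = (sum2 - P[0]) % P_MOD
    sum2 = (P[0] - Rx) % P_MOD
    produce = temp1 * sum2 % P_MOD
    Ry = (produce - P[1]) % P_MOD
    return Rx, Ry
```

Replaying the schedules on known points and checking that each result still satisfies the curve equation confirms the sequencing matches the textbook affine formulas.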
The sequential steps involved in the Add FSM and Double FSM can be observed in the procedures above. Depending on the value of the Operator, the input data for the Point operation module, P(Px, Py) and Q(Qx, Qy), is routed to either the Add FSM or the Double FSM by the DE-MUX, and the MUX drives the resulting output R(Rx, Ry). During their processing steps, the Add FSM and Double FSM modules require input data from the ADD_SUB, INVERSE_MOD, and MULT_BLOCK sub-modules, delivered through the signals sum1, sum2, inv_mod, and produce, respectively. The inputs to these sub-modules, such as a*_add/mult, b*_add/mult, inv_in, and write_data (where * denotes instance 1 or 2), are stored in registers and controlled by the state machine through load signals such as a*_add/mult_load, inv_load, and fifo_read/write. Within both FSM procedures, the temp1 variable represents data stored in and retrieved from the FIFO; using the FIFO minimizes the number of registers in the design while still providing high-speed data access.
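The arithmetic that MULT_BLOCK schedules in hardware, eight 32-bit segments consumed MSB-first followed by Barrett reduction, can be modeled behaviorally in Python. This is a software cross-check of the datapath's arithmetic, not the pipelined RTL; the function and constant names are ours.

```python
# Behavioral model of MULT_BLOCK over the SECP256K1 field: segment-serial
# 256 x 32 multiplication with shift-accumulate, then Barrett reduction of
# the 512-bit accumulator back below the modulus.
SEG_BITS, NUM_SEGS = 32, 8
K = SEG_BITS * NUM_SEGS  # 256

def barrett_reduce(x, n, mu):
    """Barrett reduction: r = x mod n with precomputed mu = 2^(2K) // n."""
    q = ((x >> (K - 1)) * mu) >> (K + 1)  # quotient estimate, never too large
    r = x - q * n
    while r >= n:  # the estimate undershoots by at most a few multiples of n
        r -= n
    return r

def mult_block(a, b, n, mu):
    acc = 0
    for i in reversed(range(NUM_SEGS)):      # most significant segment first
        seg = (b >> (i * SEG_BITS)) & ((1 << SEG_BITS) - 1)
        acc = (acc << SEG_BITS) + a * seg    # MULT_256x32 product, accumulated
    return barrett_reduce(acc, n, mu)        # Reduction phase

# SECP256K1 field prime and its Barrett constant
P256K1 = 2**256 - 2**32 - 977
MU = (1 << (2 * K)) // P256K1
```

Checking `mult_block(a, b, P256K1, MU)` against Python's own `a * b % P256K1` for random 256-bit operands validates both the segment scheduling and the reduction.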
Fig. 4. SignalTap waveforms of ECC IP
4 Experimental Results

4.1 Testing and Verification
Figure 4 depicts the waveforms of the ECC IP core operating on the DE10-Standard board, recorded with the SignalTap Logic Analyzer. Figures 4a and 4b show the waveforms of the ECC IP processing two distinct keys. The input data for point P, namely in_point_Px and in_point_Py, the private key privKey, and the output coordinates out_point_x and out_point_y are written and read between the DMA and SDRAM. The data displayed in Fig. 4 corresponds to the data used in simulation and is compared against the common SECP256K1 test vectors to validate the
correct functioning of the ECC IP. Ultimately, our cryptosystem, comprising the ECC IP core and the DMA mechanism, has been demonstrated to execute the ECC multiplication algorithm accurately and to perform precise read-write data operations through the DMA interfaces, as confirmed by the SignalTap waveforms.

Regarding the synthesis results, when used independently, the ECC IP core consumed approximately 49% of the hardware resources of the Cyclone V chip (5CSXFC6D6F31C6). The complete system illustrated in Fig. 1, encompassing the ECC IP core, DMA interface, FIFOs, and data buses, occupied around 51% of the chip's resources. In terms of operating clock, the ECC IP runs at 30 MHz in our system. At this clock, a scalar multiplication with a key width of up to 256 bits, following Algorithm 1, takes approximately 17 milliseconds, while point addition and point doubling take around 37 microseconds. Furthermore, we implemented the ECC IP on a Stratix 10 board, where the core consumed only about 2% of the hardware resources and operated at a clock frequency of approximately 78 MHz. The ECC IP's processing time is measured by a counter that starts when the ECC IP begins a calculation and stops when it completes; with the same cycle count, a higher frequency therefore decreases the processing time linearly.

In comparison with other published results, our proposed ECC IP for single scalar multiplication demonstrates competitive performance. Operating at 30 MHz and utilizing 22 K ALMs on the Cyclone V chip (equivalent to nearly 11 K Slices), our solution achieves a processing time of approximately 17 milliseconds, comparable to the 15.76 milliseconds at 68.17 MHz reported by Vliegen et al. [13], despite running at less than half the clock frequency.
Although our result does not surpass the outcomes of [7,14], it should be noted that when synthesized for the Stratix 10 board, our ECC IP consumes only 17 K ALMs (approximately 8.5 K Slices), and at a frequency of 78 MHz it attains a processing time of 6.5 milliseconds. While previous works may achieve higher processing throughput by operating at higher frequencies, our result is comparable, and superior at an equivalent frequency level. Furthermore, although our design has a less favorable processing speed than some other studies, its efficient use of cost-effective FPGA hardware makes it highly suitable for implementing ECC encryption on hardware-constrained IoT devices, especially IoT edge devices.
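Because the processing time is derived from a fixed cycle count, the two reported figures can be cross-checked with a one-line scaling calculation:

```python
# ~17 ms at 30 MHz on the Cyclone V implies a fixed cycle count; the same
# count at the Stratix 10's ~78 MHz clock predicts the reported ~6.5 ms.
cycles = 17e-3 * 30e6             # about 510,000 cycles per scalar multiplication
t_stratix_ms = cycles / 78e6 * 1e3
print(f"{t_stratix_ms:.2f} ms")   # close to the 6.5 ms reported above
```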
5 Conclusions
This paper presents the development of a compact, high-performance cryptosystem for IoT systems on SoC-FPGA platforms, capable of efficiently handling cryptographic processing. It involves a self-implemented IP core for Elliptic Curve Cryptography (ECC) with the SECP256K1 curve. A customized
Linux kernel and drivers were developed to effectively control and manage the hardware system. The study evaluates various algorithms and techniques for implementing cryptography on FPGA, emphasizing high-speed processing with custom memory modules, a pipelined architecture, and DMA techniques. Our solution addresses security and reliability concerns associated with malware in IoT systems. It offers advantages such as cost-effectiveness, low power consumption, and high performance, making it suitable for deployment in IoT gateway devices or node controllers.

Acknowledgment. This research is funded by University of Science, VNU-HCM under grant number ÐTVT 2022-04.
References
1. Miraz, M.H., Ali, M., Excell, P.S., Picking, R.: A review on Internet of Things (IoT), Internet of Everything (IoE) and Internet of Nano Things (IoNT). In: Proceedings of Internet Technologies and Applications (ITA), Wrexham, UK, pp. 219–224, September 2015
2. Mena, D.M., Papapanagiotou, I., Yang, B.: Internet of Things: survey on security. Inf. Secur. J. Glob. Perspect. 27(3), 162–182 (2018)
3. Huynh, H.-T., Tran, T.-K., Dang, T.-P., Bui, T.-T.: Security enhancement for IoT systems based on SoC FPGA platforms. In: Proceedings of the International Conference on Recent Advances in Signal Processing, Telecommunications and Computing (SigTelCom), Hanoi, Vietnam, pp. 35–39, August 2020
4. Tran, T.-K., Dang, T.-P., Bui, T.-T., Huynh, H.-T.: A reliable approach to secure IoT systems using cryptosystems based on SoC FPGA platforms. In: Proceedings of the International Symposium on Electrical and Electronics Engineering (ISEE), Ho Chi Minh City, Vietnam, pp. 53–58, April 2021
5. Bansal, M., Gupta, S., Mathur, S.: Comparison of ECC and RSA algorithm with DNA encoding for IoT security. In: Proceedings of the International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, pp. 1340–1343, January 2021
6. Marzouqi, H., Al-Qutayri, M., Salah, K.: An FPGA implementation of NIST 256 prime field ECC processor. In: Proceedings of the IEEE International Conference on Electronics, Circuits and Systems (ICECS), Abu Dhabi, UAE, pp. 493–496, December 2013
7. Marzouqi, H., Al-Qutayri, M., Salah, K., Schinianakis, D., Stouraitis, T.: A high-speed FPGA implementation of an RSD-based ECC processor. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 24(1), 151–164 (2016)
8. Harb, S., Jarrah, M.: FPGA implementation of the ECC over GF(2m) for small embedded applications. ACM Trans. Embed. Comput. Syst. 18(2), 1–19 (2019)
9. Dong, X., Zhang, L., Gao, X.: An efficient FPGA implementation of ECC modular inversion over F256. In: Proceedings of the International Conference on Cryptography, Security and Privacy (ICCSP), Guiyang, China, pp. 29–33, March 2018
10. Certicom Corp.: SEC 2: Recommended Elliptic Curve Domain Parameters, Standards for Efficient Cryptography. https://www.secg.org/sec2-v2.pdf
11. Hankerson, D., Vanstone, S., Menezes, A.: Guide to Elliptic Curve Cryptography, 1st edn. Springer, Berlin, Germany (2003). https://doi.org/10.1007/978-3-642-27739-9_245-2
12. Barrett, P.: Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In: Proceedings of Advances in Cryptology (CRYPTO), pp. 311–323 (1987)
13. Vliegen, J., et al.: A compact FPGA-based architecture for elliptic curve cryptography over prime fields. In: ASAP 2010 - 21st IEEE International Conference on Application-specific Systems, Architectures and Processors, pp. 313–316. IEEE (2010)
14. Hossain, M.S., Kong, Y., Saeedi, E., Vayalil, N.C.: High-performance elliptic curve cryptography processor over NIST prime fields. IET Comput. Digit. Tech. 11(1), 33–42 (2016)
Robust Traffic Sign Detection and Classification Through the Integration of YOLO and Deep Learning Networks

D. Anh Nguyen(1,2)(B), Nhat Thanh Luong(1,2), Tat Hien Le(1,2), Duy Anh Nguyen(1,2), and Hoang Tran Ngoc(3)(B)

1 Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam
[email protected]
2 Viet Nam National University Ho Chi Minh City, Ho Chi Minh City, Vietnam 3 FPT University, Can Tho 94000, Vietnam
[email protected]
Abstract. This paper presents a comparative study on the integration of the YOLOv5 (You Only Look Once) object detection model with three popular deep learning networks, namely MobileNetv2, ResNet50, and VGG19, for robust traffic sign detection and classification. The objective is to evaluate the performance of these networks in terms of accuracy and computational efficiency. The proposed methodology consists of two main stages: traffic sign detection and classification. YOLOv5 is employed for efficient traffic sign detection, utilizing a single convolutional neural network to predict bounding boxes and class probabilities directly. This approach allows for real-time performance while maintaining high accuracy. After the traffic signs are detected, the classification stage utilizes the three deep learning networks to classify them into predefined categories. Experimental evaluations are conducted on a benchmark traffic dataset, and the results show that the integrated approach outperforms individual networks in terms of overall accuracy. MobileNetv2 achieves the fastest processing time, followed by ResNet50 and VGG19. These findings assist in selecting the most suitable network based on specific application requirements.

Keywords: Traffic Sign · Yolov5 · Deep learning · MobileNetv2 · ResNet50 · VGG19
1 Introduction

Traffic sign detection and classification play a critical role in various computer vision applications, including autonomous driving, intelligent transportation systems, and driver assistance systems [1–4]. Accurate and efficient analysis of traffic signs is essential for ensuring road safety and enabling intelligent decision-making by autonomous vehicles. Traditional approaches to traffic sign analysis relied on handcrafted features and rule-based algorithms, which often struggled with variations in appearance, lighting conditions, occlusions, complex backgrounds, and real-time processing requirements [5, 6].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 310–321, 2023. https://doi.org/10.1007/978-3-031-46573-4_29
Robust Traffic Sign Detection and Classification Through the Integration
311
However, in recent years, the advent of deep learning has revolutionized the field, offering more robust and accurate solutions for traffic sign detection and classification [7, 8]. Deep learning, particularly convolutional neural networks (CNNs), has shown remarkable success in various computer vision tasks, including object detection and image classification [9, 10]. CNNs can automatically learn discriminative features from data and make predictions with high accuracy. When applied to traffic sign analysis, deep learning models can leverage large-scale datasets to learn complex representations and capture intricate details, leading to improved performance.

While deep learning models have achieved impressive results in traffic sign analysis, there are still challenges to address. Standalone models may struggle with the efficient detection of traffic signs and accurate classification due to the unique characteristics of traffic sign data. This motivates the integration of YOLO (You Only Look Once), an efficient object detection model, with deep learning models for classification [11–13]. By combining the strengths of these approaches, we can achieve robust traffic sign detection and accurate classification in a unified framework. YOLO offers real-time performance by employing a single neural network to directly predict bounding boxes and class probabilities for multiple objects in an image. Its efficiency and accuracy make it a suitable choice for traffic sign detection, providing the ability to handle multiple signs simultaneously and effectively handle complex scenarios [14, 15].

The primary objective of this paper is to evaluate the performance of integrating YOLOv5 with popular deep learning models, including MobileNetv2, ResNet50, and VGG19, for traffic sign detection and classification. The performance evaluation focuses on precision and computational efficiency, considering the specific requirements of traffic sign analysis.
To achieve these objectives, the paper is structured as follows: Sect. 2 provides an overview of related work. Sections 3 and 4 present the methodology for integrating YOLOv5 with deep learning models. Section 5 presents the experimental results. Section 6 concludes the paper, summarizing the findings, discussing the implications, and suggesting future research directions.
2 Related Work The application of deep learning methods in traffic sign detection and classification has gained significant attention in recent years. Various models, architectures, and techniques have been proposed to improve the accuracy and efficiency of these tasks. Deep learning algorithms, particularly convolutional neural networks (CNNs), have shown great success in various computer vision tasks, including traffic sign detection and classification. The integration of deep learning models with object detection frameworks has emerged as a powerful approach to tackle these tasks effectively. Faster R-CNN and SSD are popular object detection frameworks for traffic sign detection [16]. Faster R-CNN generates accurate bounding box proposals by combining RPN and CNN [17], while SSD achieves real-time performance with multiple convolutional layers [18, 19]. Faster R-CNN, known for its accurate bounding box proposals, has been widely adopted in this field. Studies have reported detection accuracies ranging from 95% to 98% using Faster R-CNN [8]. Ref. [20] introduces a Faster R-CNN model with a ResNet 50 backbone, demonstrating superior detection performance and high
312
D. A. Nguyen et al.
accuracy in traffic sign classification. Ref. [21] proposes a multi-scale attention pyramid network for effective detection of small traffic signs, outperforming existing methods. Ref. [22] discusses the four pillars of small object detection, providing insights relevant to traffic sign detection. Ref. [23] proposes a neural assistance system with a recognition rate above 97%, applicable to ADAS. Lastly, Ref. [24] presents a real-time large-scale traffic sign detection approach using YOLOv3, achieving high mAP above 88%.

Another line of research focuses on improving the detection and classification of small traffic signs, which can be challenging due to their limited visual information. One notable approach is the use of pyramid-based methods, such as Feature Pyramid Networks (FPN), which capture multi-scale features to handle objects of different sizes. These methods enhance the representation of small traffic signs and improve their detection accuracy [25].

Transfer learning has been applied in traffic sign recognition to leverage pre-trained models from large-scale datasets like ImageNet. By fine-tuning these models on traffic sign datasets, researchers have achieved improved accuracy and reduced training time [26]. Fine-tuning pre-trained models on ImageNet has resulted in traffic sign classification accuracies exceeding 98% in some studies [27]. This approach enhances both the efficiency and accuracy of traffic sign classification models.

Moreover, there has been a growing interest in the application of lightweight deep learning architectures for traffic sign detection and classification. These architectures aim to strike a balance between model complexity and performance, making them suitable for resource-constrained environments. MobileNet, ShuffleNet, and EfficientNet are some of the lightweight network architectures that have been explored for traffic sign detection [28–30].
Studies have reported inference times ranging from 30 to 60 ms using MobileNet for traffic sign detection. It is worth mentioning that the integration of YOLO (You Only Look Once) models, such as YOLOv3 [31] and YOLOv4 [15], has also gained attention in the field of traffic sign detection. YOLO models offer real-time performance by performing object detection and classification in a single pass through the network. They have shown promising results in terms of accuracy and speed [12]. Ensemble methods, which combine multiple models or predictions, have also been employed to improve accuracy. YOLOv5, a renowned object detection framework, is integrated with deep learning models to achieve robust and accurate traffic sign detection [11].

Our paper focuses on the integration of YOLOv5 with different convolutional neural network (CNN) approaches, including MobileNetv2, ResNet50, and VGG19, as illustrated in Fig. 1. In this approach, traffic sign images are detected and isolated from input images using YOLOv5; a CNN then classifies each isolated traffic sign image into a specific sign class. The details of each stage are discussed thoroughly in this paper. This integrated approach offers end-to-end training, improved accuracy, and real-time performance, making it highly suitable for practical applications.
Fig. 1. Pipeline of integrating a Yolov5 with a convolution neural network approach.
3 Traffic Sign Detection with YOLOv5 Model

3.1 Data Collection

Training the YOLOv5 model involves two main steps: data collection and annotation, and model training. For traffic sign detection, a diverse and representative dataset of traffic sign images is essential. The dataset should cover various traffic sign categories, lighting conditions, occlusions, and other real-world scenarios. We conducted our experiments using a publicly available dataset from Roboflow [32]. The dataset consists of 2000 annotated images of traffic signs, which were used for training and evaluation purposes. It contains 43 classes of traffic sign categories and variations commonly encountered in real-world scenarios. During the annotation process, bounding boxes are manually labeled around the traffic signs in the images, along with their corresponding class labels. This annotated dataset serves as the training data for the YOLOv5 model. The model training process involves optimizing the network parameters using the annotated dataset. Techniques such as data augmentation, which includes random image transformations, can be applied to increase the model's robustness and improve generalization.

3.2 Yolov5 Structure

The YOLOv5 architecture consists of several key components, including backbone networks, a path aggregation network (PANet), and detection heads, as shown in Fig. 2. These components work together to enable efficient and accurate object detection, including traffic sign detection. The backbone networks in YOLOv5 extract hierarchical features and play a crucial role in capturing visual information. The BottleNeckCSP module improves feature representation by combining different convolutional layers. It enhances the detection performance of YOLOv5, especially for small objects like traffic signs. The SPP module captures multi-scale features by performing pooling operations at different scales. This enables the model to handle objects of varying sizes, such as traffic signs.
By integrating the BottleNeckCSP and SPP modules, YOLOv5 becomes more robust in complex scenarios and achieves accurate traffic sign detection. These components enable YOLOv5 to process multi-scale features effectively, making it suitable for real-time applications with high detection performance.
Fig. 2. Yolov5 structure with traffic data input.
PANet (Path Aggregation Network) is a feature pyramid network that addresses the challenge of capturing multi-scale information in object detection. It combines different techniques, such as concatenation, BottleNeckCSP, 1 × 1 convolutions, Conv 3 × 3 S2, and upsampling, to effectively capture multi-scale features, promote information exchange between different levels, and refine the feature representation for accurate and robust object detection, including traffic sign detection. The concatenation operation combines feature maps from different levels, the BottleNeckCSP module enhances feature fusion, 1 × 1 convolutions reduce dimensionality, Conv 3 × 3 S2 downsamples feature maps, and upsampling increases spatial resolution. These components work together to capture multi-scale features, improve feature representation, and enhance the detection performance of the network, especially for objects with diverse scales like traffic signs.

YOLOv5 utilizes detection heads, associated with specific scales or feature maps, to generate bounding box predictions and class probabilities for detected objects, including traffic signs. These heads process features from different layers and employ anchor boxes to predict bounding box coordinates and class probabilities. Post-processing steps such as thresholding, non-maximum suppression, and confidence score filtering are applied to refine the predictions. With multiple detection heads and anchor boxes, YOLOv5 achieves real-time and accurate traffic sign detection and classification.
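The post-processing steps named above, confidence filtering followed by non-maximum suppression, can be sketched with NumPy. The thresholds below are common defaults, not values taken from the paper:

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, conf_thres=0.25, iou_thres=0.45):
    """Confidence filtering, then greedy non-maximum suppression."""
    keep_mask = scores >= conf_thres          # drop low-confidence detections
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)               # highest confidence first
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thres]  # suppress overlaps
    return boxes[keep], scores[keep]
```

Two heavily overlapping detections of the same sign collapse to the higher-confidence one, while well-separated signs survive.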
4 Traffic Sign Classification with Deep Learning Models

4.1 Data Collection

We utilized the Traffic Sign Classification dataset available on Kaggle [33]. The dataset consists of approximately 58 classes, each containing around 120 annotated traffic sign images, enabling the training and evaluation of traffic sign recognition models. Additionally, approximately 2000 files are available for testing, which can be used to assess the performance of the trained models on unseen data. By incorporating
this dataset into our experiments, we were able to train and evaluate the performance of our deep learning models in accurately classifying traffic signs. The availability of this dataset played an important role in the development and validation of our proposed classification approach. We would like to express our gratitude to the creators of the dataset for providing such a valuable resource for our research (Fig. 3).
Fig. 3. Traffic Signs Dataset.
4.2 Structure of Models: MobileNetv2, ResNet50, VGG19

After processing an input image, YOLOv5 produces bounding boxes that indicate the locations and sizes of detected traffic signs. Along with these bounding boxes, YOLOv5 also provides class probabilities for each detected sign. This output serves as the input for the deep learning models, including MobileNetv2, ResNet50, and VGG19. Using this information, the deep learning models classify the detected traffic signs by assigning
them specific class labels. By integrating YOLOv5 with these deep learning models, accurate traffic sign detection and classification can be achieved.

MobileNetv2 is a compact convolutional neural network tailored for efficient mobile and embedded applications. It uses depthwise separable convolutions and inverted residuals for efficiency, excelling in tasks like traffic sign classification with minimal computational resources. ResNet50 is a robust convolutional neural network with 50 layers, known for its performance in computer vision tasks. It leverages residual connections to avoid the vanishing gradient problem during training, making it effective in intricate image classification tasks, including traffic signs. VGG19 is a popular 19-layer deep convolutional neural network appreciated for its simplicity and effectiveness in image feature extraction. With its small filter sizes (3 × 3) and depth, it is adept at capturing intricate details, making it suitable for high-accuracy image classification tasks like traffic sign recognition.

In terms of traffic sign detection, the choice between these models depends on the specific requirements of the application. MobileNetv2 is suitable for real-time processing and resource-constrained environments, while ResNet50 and VGG19 offer higher accuracy at the cost of increased computational complexity. The selection should consider the trade-off between accuracy, computational efficiency, and available resources.
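The hand-off between the two stages in Fig. 1 amounts to cropping each detected box, resizing it to the classifier's input resolution, and normalizing. A dependency-free sketch using nearest-neighbour resizing (the 224 × 224 size matches common Keras backbones; the function name is ours):

```python
import numpy as np

def crop_for_classifier(frame, box, size=224):
    """Crop a detected sign from the frame and prepare a (1, size, size, 3)
    float tensor for a CNN classifier such as MobileNetv2."""
    x1, y1, x2, y2 = (int(v) for v in box)
    crop = frame[y1:y2, x1:x2]
    # nearest-neighbour resize via integer index maps
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    resized = crop[rows][:, cols]
    return resized.astype(np.float32)[None] / 255.0  # batch axis, scale to [0, 1]
```

The returned tensor has the batch-first layout expected by TensorFlow/Keras models, so it can be passed straight to a classifier's `predict` call.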
5 Experimental Results

The proposed models were trained using a dataset of 11,000 images drawn from the two datasets described above. Training was conducted on a computer running Ubuntu 20.04, equipped with an Intel i7 3.4 GHz CPU, an Nvidia GTX 1650 8 GB GPU, and 32 GB of RAM. The algorithm was implemented in Python 3.10 using the TensorFlow 2.12.0 and Keras 2.12.0 libraries. These resources provided the necessary computational power and software environment for training and evaluating the models.

5.1 Traffic Signs Detection with Yolov5 Results

Figure 4 illustrates the progression of YOLOv5's performance over 20 epochs during the training phase. The results are visually represented, showcasing the model's accuracy and improvement throughout the training process. The figure displays the values of key metrics such as train/box_loss, train/obj_loss, train/cls_loss, metrics/precision, metrics/recall, metrics/mAP_0.5, and metrics/mAP_0.5:0.95, providing a comprehensive overview of the model's performance at each epoch. This graphical representation allows for a more intuitive understanding of the training progression and highlights any trends or patterns observed during training.

Table 1 shows that the precision of YOLOv5 at the best epoch is 0.8228, indicating that around 82.28% of the predicted positive instances are true positives, ensuring that the detected traffic signs are indeed present in the image. Additionally, the mAP_0.5 score achieved at the best epoch is 0.818 (with mAP_0.5:0.95 at 0.62476), reflecting the overall accuracy and localization performance of the model and indicating its proficiency in identifying traffic signs with good precision and recall. These metrics indicate that the YOLOv5 model trained up to the best epoch has achieved impressive
Robust Traffic Sign Detection and Classification Through the Integration
317
performance in traffic sign detection, exhibiting both high precision and a respectable mAP_0.5 score. Based on these observations, we can conclude that the YOLOv5 model demonstrates promising performance in traffic sign detection (Fig. 5). The model shows improvements in accuracy, efficiency, and the ability to detect traffic signs at different IoU thresholds. Further analysis and evaluation on larger datasets and real-world scenarios will provide more comprehensive insights into the model’s performance and potential areas of improvement.
Fig. 4. Progression of YOLOv5’s performance over 20 epochs.
Table 1. Validation Loss of Models

Best epoch  train/box_loss  train/obj_loss  train/cls_loss  metrics/precision  metrics/recall  metrics/mAP_0.5  metrics/mAP_0.5:0.95
18          0.018569        0.0084          0.02266         0.8228             0.74244         0.818            0.62476
5.2 Traffic Signs Classification with CNN Model Results

The evaluation of the CNN model can be done using various metrics such as precision, recall, accuracy, and F1-score. These metrics provide insights into the model’s performance in terms of correctly identifying positive samples, correctly classifying actual positives, overall classification accuracy, and the harmonic mean of precision and recall.

Precision is the proportion of true positives out of all the predicted positives. It indicates how many of the predicted positive samples are correct.

precision = true positives / (true positives + false positives)
318
D. A. Nguyen et al.
Fig. 5. Illustrates the progression of YOLOv5’s performance over 20 epochs.
Recall, also known as sensitivity or true positive rate, measures the proportion of actual positive samples that are correctly classified by the model.

recall = true positives / (true positives + false negatives)
Accuracy represents the proportion of correctly classified samples, considering both true positives and true negatives, out of all the samples.

accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives)
The f1_score is the harmonic mean of precision and recall, providing a balanced measure of the model’s performance on both precision and recall.

f1_score = (2 × precision × recall) / (precision + recall)
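The four metrics above reduce to simple arithmetic over the confusion-matrix counts. A minimal sketch (the function and variable names are ours, not from the paper):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy, and F1 from confusion counts."""
    precision = tp / (tp + fp)                    # predicted positives that are correct
    recall = tp / (tp + fn)                       # actual positives that were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)    # all correct / all samples
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1_score": f1}

m = classification_metrics(tp=80, fp=20, fn=20, tn=80)
print(m)  # all four metrics come out to approximately 0.8 for these counts
```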
Table 2. Validation Loss of Models

Network        Loss    Accuracy  val_loss  val_accuracy  f1_score
MobileNet-v2   0.116   0.968     0.16      0.958         0.96
ResNet50       0.8104  0.7705    0.7858    0.7697        0.77
VGG19          0.101   0.980     0.131     0.985         0.98
Table 2 presents the loss, accuracy, validation loss (val_loss), validation accuracy (val_accuracy), and f1_score for three different networks: MobileNet-v2, ResNet50, and
VGG19. Among the three models, VGG19 exhibits the highest accuracy of 0.980, making it the most accurate model for traffic sign classification. On the other hand, MobileNet-v2 stands out as the lightest model with the fewest parameters. It achieves a loss of 0.116 and an accuracy of 0.968, offering a good trade-off between model complexity and performance. ResNet50, while competitive, falls behind with a loss of 0.8104 and an accuracy of 0.7705. Overall, VGG19 excels in accuracy, while MobileNet-v2 offers a lightweight solution.
Fig. 6. Training and Validation of MobileNet-V2, ResNet50, and VGG19.
To draw a more comprehensive conclusion about the model’s accuracy, it is important to consider both the training accuracy and the validation accuracy. The validation accuracy serves as an indicator of how well the model generalizes to new, unseen data. If the training accuracy is significantly higher than the validation accuracy, it could suggest overfitting, where the model may have memorized the training data instead of learning generalizable patterns. Based on the training results depicted in Fig. 6, it is evident that the model’s accuracy steadily improves over the course of 10 epochs. The increasing trend of the accuracy metric demonstrates the effectiveness of the training process in enhancing the model’s ability to correctly classify and detect objects.
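The overfitting check described above, comparing training accuracy against validation accuracy, can be automated as a simple gap test. A minimal sketch; the 0.05 threshold is an arbitrary illustrative choice, not a value from the paper:

```python
def overfitting_gap(train_acc, val_acc, threshold=0.05):
    """Flag possible overfitting when training accuracy exceeds
    validation accuracy by more than `threshold`."""
    gap = train_acc - val_acc
    return gap, gap > threshold

# Accuracy / val_accuracy pairs from Table 2:
print(overfitting_gap(0.968, 0.958))    # MobileNet-v2: small gap, generalizes well
print(overfitting_gap(0.7705, 0.7697))  # ResNet50: small gap as well
```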
6 Conclusion

In conclusion, we presented a comparative study on the integration of YOLOv5 with popular deep learning models (MobileNetv2, ResNet50, and VGG19) for robust traffic sign detection and classification. The objective was to evaluate the performance of these networks in terms of accuracy and computational efficiency. The results of the experimental evaluations showed that the integrated approach outperformed individual networks in terms of overall accuracy. YOLOv5 demonstrated promising performance in traffic sign detection, achieving high precision and a respectable mAP_0.5 score. The integration with deep learning models for traffic sign classification further improved the accuracy of the system. Among the deep learning models, VGG19 exhibited the highest accuracy, while MobileNetv2 offered a lightweight solution with a good trade-off between model complexity and performance. These findings provide insights for selecting the most suitable network based on specific application requirements.

The proposed integrated approach has practical applications in various computer vision tasks, including autonomous driving, intelligent transportation systems, and driver assistance systems. Accurate and efficient traffic sign detection and classification are crucial for ensuring road safety and enabling intelligent decision-making by autonomous vehicles. Future research directions could focus on expanding the evaluation to larger datasets and real-world scenarios to gain more comprehensive insights into the models’ performance. Additionally, exploring other lightweight deep learning architectures and techniques could further enhance the accuracy and efficiency of traffic sign detection and classification systems.

Acknowledgements. We acknowledge the support of time and facilities from Ho Chi Minh City University of Technology (HCMUT), Viet Nam National University Ho Chi Minh City (VNU-HCM), and FPT University Can Tho City for this study.
OPC-UA/MQTT-Based Multi M2M Protocol Architecture for Digital Twin Systems

Le Phuong Nam(1,2), Tran Ngoc Cat(1,2), Diep Tran Nam(1,2), Nguyen Van Trong(1,2), Trong Nhan Le(1,2), and Cuong Pham-Quoc(1,2)(B)

(1) Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam
[email protected]
(2) Vietnam National University - Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam
Abstract. The rapid evolution of Industry 4.0 provides a vast potential to change how the globalization of manufacturing and consumption of goods and services occurs in the global markets. Based on the Internet of Things and Cyber-Physical Systems, the application of Digital Twin (DT) has grown exponentially in intelligent manufacturing. It has the potential to replicate everything in the physical world in the digital space and provide engineers with feedback from the virtual world. This paper proposes a communication architecture combining MQTT (Message Queuing Telemetry Transport) and OPC-UA (Open Platform Communications Unified Architecture) to support real-time data exchange in a DT application. While MQTT is a lightweight open messaging protocol that provides Machine-to-Machine (M2M), or Internet of Things (IoT), connectivity on top of TCP/IP, OPC-UA delivers a mechanism to expose thousands of data points, providing rich representations of a physical object. By combining MQTT and OPC-UA, we present a complete architecture for an intelligent manufacturing application where humans can monitor and control machines virtually. Our solution is applied to an open-source JetMax robot arm powered by Jetson Nano, which supports deep learning and computer vision, for performance validation.

Keywords: Industry 4.0 · Cyber Physical Systems · Internet of Things · Digital Twin Technology · MQTT · OPC-UA
1 Introduction
In the age of Industry 4.0, the intelligent factory is formed by integrating the latest advanced technologies, including the Internet of Things (IoT) and Cyber-Physical Systems (CPS). The IoT forms an interconnected network of machines, communication protocols, and intelligent sensing systems [15]. Meanwhile, CPSs are described as intelligent systems integrating artificial intelligence (AI) and machine learning to analyze data and drive automated processes in industrial factories. This type of integration not only links the physical components of the production system together but also connects digital, abstract, and virtual components into a single system called a DT, which allows for frequently changing demands from the markets and optimization of the as-designed and as-built products.

A Digital Twin (DT) is considered among the technologies that help realize the promises of smart factories to model and control typically complex manufacturing systems [7]. The first benefit of adopting a DT is the ability to monitor the real-time status of machines from various IoT devices [16], which continuously collect data from machines and factory environments. These data are then analyzed for intelligent scheduling, predictive maintenance, logistics, and decision-making [11]. Secondly, some applications of Virtual Reality (VR) wearables in the smart factory are described as “being able to tie together environmental conditions, inventory levels, process state, assembly error data, utilization, and throughput metrics in a context-dependent manner (where you look or walk)” [14]. This immersive sensory experience lets users augment their natural senses with real-time data from across any location or point in time [6] to give unobstructed awareness of factory status.

To support different communication strategies in a DT application, OPC-UA is used to exchange large payloads for 3D visualization in a VR glass, and MQTT is used to provide remote monitoring and control for portable devices, such as mobile phones or tablets. Based on OPC-UA and MQTT, this paper presents a new architecture for a smart manufacturing context, where local users are able to have a 3D virtual view of the system, which is synchronized with the real one. Meanwhile, remote users can also monitor and control the system through a dashboard on mobile phones through an MQTT connection.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 322–338, 2023. https://doi.org/10.1007/978-3-031-46573-4_30
Thus, the difficulties of multi-protocol device integration, intercommunication, and data uploading to the cloud can be tackled by OPC-UA and MQTT [9]. The main contributions of the paper can be summarized as follows:

– A multi M2M protocol architecture based on OPC-UA and MQTT for DT applications.
– A prototype implementation of the architecture on the JetMax robot arm and Oculus VR glass.
– A performance comparison between MQTT and OPC-UA for the optimized choice of M2M communications in a DT application.

The rest of the paper is organized as follows. Section 2 presents related DT systems in the literature. Our proposed multi M2M protocol architecture is presented in Sect. 3. Section 4 describes the implementation of our system on the JetMax robot arm and Oculus glass. Our experiments for system validation and performance evaluation are shown in Sect. 5. Finally, the paper ends with conclusions and future works in Sect. 6.
2 Related Digital Twin Systems
Digital Twin (DT) technology has been widely adopted in the manufacturing industry to enhance all aspects of the product manufacturing process and the overall process lifecycle. In this area, the work of Rosen et al. [13] highlights the potential of DT in developing a computer system that efficiently monitors every step of production using a modular approach. Specifically, the authors propose a modular Smart Manufacturing approach [2] in which independent modules automatically perform high-level tasks without human intervention. These modules can make decisions among many alternative actions and respond to failures or unexpected events without disrupting the operation of other modules. To enable such a system, designers need to allow modules to access highly factual information that accurately reflects the current state of the process and related products. This behavior can be achieved using faithful virtual copies of physical entities. DT also facilitates continuous communication between the system and physical assets in specific contexts.

However, it is essential to note that the work of Rosen et al. [13] can simplify DTs by treating them as realistic models or simulations that can communicate seamlessly with their physical counterparts. It is essential not to confuse DTs with mere simulations or symbolic representations generated by augmented or virtual reality applications [5]. What distinguishes a DT from a mere model or symbolic representation is the integration of intelligence and the continuous data exchange between the physical model and its virtual counterpart. Furthermore, DTs must be developed based on knowledge provided by human experts and actual data gathered from current and past systems [3]. Such data are needed to describe the behavior of physical twins and derive solutions applicable to natural systems [1], [55]. DTs are specialized simulations explicitly designed for their intended purpose, evolving with the actual system throughout its entire lifecycle.
In a similar vein to Rosen et al.’s [13] work, Qi and Tao [12] have demonstrated the benefits of leveraging Digital Twins (DTs) in their Big Data-Driven Smart Manufacturing (BDD-SM) approach. BDD-SM exploits sensors and the Internet of Things (IoT) to collect and transmit large volumes of data. These data are then processed using AI applications and big data analytics deployed on the cloud. This approach enables real-time process monitoring, failure detection, and identification of optimal solutions. Additionally, DT technology facilitates the establishment of real-time, bidirectional mappings between physical objects and their digital representations, enabling an intelligent, predictive, and prescriptive approach. This approach involves targeted monitoring, optimization, and self-healing actions.

Similarly, in [4], a comparable framework is proposed, where the DT model consists of five core components: the physical space (PS), the virtual space (VS), sensors, integration technologies, and analytics. Sensors enable seamless real-time communication between the physical and virtual spaces, facilitated by integration technologies encompassing communication interfaces and security measures. The exchanged data are processed using analytic techniques, which utilize simulation results to compute prescriptions and recommendations.
Furthermore, Liu et al. [8] proposed a method that integrates MQTT, OPC-UA, and DT to improve the efficiency of manufacturing processes in the context of Industry 4.0. They demonstrated the benefits of leveraging DT models and real-time data exchange supported by MQTT and OPC-UA to achieve real-time monitoring, predictive maintenance, and control adaptation in smart manufacturing systems.

In summary, the integration of MQTT, OPC-UA, and DT and their applications in the intelligent industry have been explored in various studies. These studies highlight the potential of combining these technologies to improve data interoperability, real-time monitoring, predictive maintenance, and overall efficiency in intelligent manufacturing, asset management, production, smart grids, and Industry 4.0.
3 Multi-M2M Protocol Architecture Using OPC-UA and MQTT
Since OPC-UA employs a client-server architecture, it is challenging to extend control capabilities, i.e., to control multiple OPC-UA servers or devices simultaneously. In this work, we propose an architecture containing a DataCenter, whose role is to gather all information and control methods from the other OPC-UA servers. Besides the DataCenter, our architecture consists of multiple Industrial devices, Local users, Cloud servers, and multiple Remote users. Figure 1 depicts the proposed architecture.

3.1 Industrial Devices

As the name suggests, Industrial devices might be all kinds of devices in an industrial context, for example, sensors, actuators, robotic arms, conveyor belts, laser engraving machines, etc. These devices might be controlled by PLC controllers, computers, or even embedded computers, each of which can run an OPC-UA server to provide insight into the device's status and manipulate it to complete a specific task.

3.2 DataCenter

Because each controller runs its own OPC-UA server, it is very inconvenient for factory workers to look at the status of all devices at once, as they would have to connect to each OPC-UA server separately to gather the information. The DataCenter therefore gathers the information and control methods of all OPC-UA servers in the local network and provides a single point of access for users.
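A minimal sketch of this aggregation idea: mirror the nodes of each per-device server under a device-specific prefix so that clients see one flat namespace. This is a pure-Python illustration under our own naming; a real implementation would use an OPC-UA stack (e.g. python-asyncua) and live server connections rather than dictionaries.

```python
class DataCenter:
    """Aggregates nodes from many per-device OPC-UA servers into one namespace."""

    def __init__(self):
        self.nodes = {}

    def register_server(self, device, node_values):
        # Mirror every node of this device's server under "device/..."
        for path, value in node_values.items():
            self.nodes[f"{device}/{path}"] = value

    def read(self, path):
        # One read interface for all devices, instead of one connection each.
        return self.nodes[path]

dc = DataCenter()
dc.register_server("jetmax", {"servo1/angle": 90, "camera/fps": 30})
dc.register_server("conveyor", {"speed": 0.5})
print(dc.read("jetmax/servo1/angle"))  # -> 90
print(sorted(dc.nodes))                # one flat namespace for all devices
```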
Fig. 1. Our proposed architecture with the DataCenter acting as the center of all information.
3.3 Local Users

A local user could be a factory worker or a manager in the same network. They can use applications that act as an OPC-UA client and establish a direct connection to the DataCenter. Furthermore, since OPC-UA provides fast and secure data exchange, we can develop real-time data monitoring and control applications like a DT.

3.4 Cloud Server

Cloud servers contain multiple cloud services for further data processing. For example, Microsoft Azure, Azure DT, and Amazon Web Services (AWS) cloud services provide various functionalities, including data analysis, storage, visualization, machine learning, and even a DT at the whole-factory level. Furthermore,
the connection between the cloud server and the DataCenter can be established by protocols such as TCP/IP, UDP, HTTPS, REST API, MQTT, OPC-UA, etc.

3.5 Remote Users
By bringing data and processing to the cloud, remote users with user-specific devices can access them globally using an internet connection. Since data travels long distances on the internet, transmission delays are inevitable. The remote user’s applications may not deliver real-time monitoring but instead offer summaries, statistical reports, or simple control, which is not time-critical for the manufacturing process.
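Since remote access carries non-time-critical summaries rather than raw real-time streams, a compact JSON payload over MQTT works well. A sketch under our own assumptions: the topic hierarchy and field names below are illustrative, and actual publishing would go through an MQTT client library such as paho-mqtt.

```python
import json
import time

def status_payload(device, state):
    """Build the JSON summary a DataCenter might publish over MQTT."""
    return json.dumps({
        "device": device,
        "timestamp": int(time.time()),  # seconds since epoch
        "state": state,                 # e.g. current coordinates, last error
    })

topic = "factory/line1/jetmax/status"  # hypothetical topic hierarchy
payload = status_payload("jetmax", {"x": 120, "y": 45, "z": 80})
print(topic, payload)
# A real client would then call something like: client.publish(topic, payload, qos=1)
```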
4 System Implementation on JetMax and VR Oculus

In this section, we introduce the implementation of our proposed architecture. The entire system is developed on embedded boards and cloud services.

4.1 Control Module
In our system, the Control Module includes the JetMax robot arm connected to an OPC-UA server running on a Jetson Nano. JetMax is a robot arm developed by Hiwonder, integrated with the Jetson Nano board and controlled by ROS. It provides image recognition, data processing, and precise control capabilities, allowing users to perform various tasks. JetMax has a flexible design, with four mechanical parts controlled by servo motors. It also has a high-resolution camera and image processing capabilities, enabling users to perform object recognition, classification, monitoring, and control tasks. In addition, JetMax is integrated with open-source ROS software and programs, allowing users to customize and develop artificial intelligence applications. In reality, robot arms in industrial environments provide significant benefits such as high precision, consistency in repetitive tasks, increased productivity, and the ability to work continuously in hazardous environments, thus making the workplace safer for human workers.

Besides, the Jetson Nano is a development board developed by NVIDIA, designed for artificial intelligence (AI) and deep learning. Despite its compact size, the Jetson Nano is equipped with a 64-bit ARM Cortex-A57 processor and an NVIDIA Maxwell GPU co-processor with 128 CUDA cores. The Jetson Nano also has high-speed image and video processing capabilities, making it ideal for developing image recognition, classification, monitoring, and control applications in fields such as autonomous driving, robotics, healthcare, and security. The Jetson Nano is an excellent choice for developers, students, and enthusiasts interested in AI and deep learning technology. In this system, the Jetson Nano acts as an OPC-UA server, managing methods and data nodes and providing user device control capabilities via the OPC-UA protocol.
Fig. 2. Server Block.
Because the Control Module consists of the two components mentioned above, we have implemented the system with two main functional blocks: the Server and JetMax Control. First is the Server block, depicted in Fig. 2, where the OPC-UA Server manages the nodes and methods configured to communicate with the DataCenter. Next, the OPC-UA methods are where control signals sent from the DataCenter are executed. Here, the methods act based on the received signal and then perform the corresponding pub/sub/call service commands provided by ROS. Finally, the OPC-UA nodes, illustrated in Fig. 3, contain the JetMax robot arm’s status information and listen for updates from the DataCenter. In addition, the status will be continuously updated through JetMax Control. The directory structure of the nodes and methods in the OPC-UA Server is described as follows. In particular, the system is separated into two operating streams: a data reading stream, where nodes constantly update data, and a control stream, where control signals are generated through methods. This design makes it easy to monitor and control the system, as data sent and received by the server will not conflict with each other.

The next component is the JetMax Control block, presented in Fig. 4. The main task of this block is to control the robot arm, which directly interacts with the IO of a device. In addition, this block also updates the state of the
Fig. 3. Nodes and methods in the OPC-UA Server.
Fig. 4. JetMax Control Block.
arm through topics, listens to topics for direct device control commands, and provides services for the Server to use. Based on the designed architecture above, the operation of the system proceeds in two streams as follows:

– Control stream (through OPC-UA methods): When the DataCenter calls a method on the server → the methods perform the corresponding pub/sub or ROS service calls → the topics are listened to by JetMax Control, which drives the robot arm to perform the corresponding actions.
330
L. P. Nam et al.
– Data reading stream (through OPC-UA nodes): JetMax Control continuously reads data on the robot arm’s state (coordinates, images, servos, etc.) and updates it into ROS topics → the server listens to these topics, observes changes, and updates them to the nodes of the OPC-UA server so that the DataCenter can monitor them.

4.2 DataCenter Module
In this system, each component is an independent server, so the architecture needs a feature to request from and control all servers in the local network. For that reason, the component named DataCenter is proposed. All nodes (servers) only work when they receive a request or someone asks them to do something; they cannot do anything alone and simply do what we require. So, we need a scenario to control all servers. One advantage of this approach is that we can change the procedures directly and modify them in one place: the DataCenter. There are two main functions of the DataCenter:

– Control all activities of all servers in this system. The DataCenter monitors all data, sends requests, receives data or signals, and synchronizes all activities in a process with scenarios (which server acts first, what the next step is, etc.). Of course, we can change those scenarios at any time and from anywhere.
– From outside of this system (or outside of this local network), the DataCenter may have another role of providing data to client devices and an API (Application Programming Interface) for calls from other devices.

Due to the limited resources and devices, our DataCenter will only control one OPC-UA server. The block diagram of our implementation of the DataCenter is shown in Fig. 5.
All the information and control methods that these Sub-OPC-UA Clients collect will then gather in one single Sub-OPC-UA Server, which provides a single point of data access for users using OPC-UA Client. Furthermore, we can take advance of this data centralization to publish data to the cloud and utilize multiple functionalities such as machine learning algorithms, data storage, monitoring, visualization, etc. When running, DataCenter will be able to make an exact copy of the node structure of the OPC-UA server that it is connecting to, on condition that the node structure only contains folder, variable, and method nodes. Variable nodes
Fig. 5. Block Diagram of DataCenter.
on the DataCenter can reflect the actual value on the OPC-UA server. In contrast, method nodes act as “wrapper” methods that, when called, find the corresponding method on the OPC-UA server and call that method directly through the Sub-OPC-UA Client.

4.3 Virtual Reality Application
To control all related devices, we need to design an application that connects to the center of data, the DataCenter. The application must implement an OPC-UA client to connect to the OPC-UA server running on the DataCenter for faster information exchange. Based on the quick and stable data exchange OPC-UA offers, we can also implement a 3D virtual replica of the device in the virtual environment. This way, we can observe the current status of the actual device through a virtual instance and control it using the application. Virtual Reality applications attract interest and provide more insight to users than standard methods.

To implement the VR application, we use the Unity Engine. Unity is a free game engine used by game developers, artists, architects, automotive designers, filmmakers, and more to create and operate interactive, real-time 3D (RT3D) content. It is considered the world’s leading platform for creating and operating RT3D content. Unity has also been used to develop DT applications in fields like automotive, architecture, manufacturing, healthcare, and more.

Our VR application is built into an Android Package Kit (APK) file to load on the Oculus Quest 2, a virtual reality (VR) headset. It is the successor of the Oculus Quest device, developed by Meta. Oculus enables users to enjoy the virtual reality environment with features like voice commands, fitness tracking, eye tracking, etc. Meta also releases the Oculus Integration Asset for free on the Unity Store, allowing developers to develop their VR applications much faster and more easily. This is also the main asset that we use to build our application.
L. P. Nam et al.
For developing an OPC-UA client, we use the Opc.UaFx.Client library, downloaded from the NuGet repository, to integrate into our software. This OPC-UA client connects to the DataCenter, collects data, and controls the devices by calling methods. In detail, our VR application is designed with the blocks depicted in Fig. 6. The myopc block is the OPC-UA client that handles connecting to the DataCenter and subscribing to value changes. The three intermediate blocks, UI_manage, Toggle_Update_AI, and InputHand, handle user interaction and invoke the necessary functions from the myopc block. Finally, the User Interface block stores all the user interface components, such as buttons, hands, and a 3D model of our robotic arm that continuously updates with the physical device when connected to the DataCenter.

Upon starting the application, the user enters the VR environment and sees the 3D model of the robotic arm with some buttons to interact with. To connect to the DataCenter, the user must enter the URL address of the OPC-UA server currently running on the DataCenter, then click the Connect button. After clicking the Connect button, a connection between the OPC-UA client in the VR application and the OPC-UA server in the DataCenter is formed. The OPC-UA client then browses the data structure and saves the necessary information, such as node IDs, values, method names, and variable node names. On nodes of type variable, the OPC-UA client performs a client subscription (not to be confused with OPC-UA PubSub in Part 14 of the OPC-UA specification) so that it is notified whenever these variables change and can update the user interface or the 3D model. Also, saving node information in the application saves a lot of time and increases performance when accessing these nodes in the app. After successfully connecting to the DataCenter, the control menu appears fully with all the buttons.
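The browse-cache-subscribe behaviour described above can be illustrated with a small, library-free Python sketch (the class names are ours; a real client would use OPC-UA monitored items through a library such as Opc.UaFx.Client):

```python
class VariableNode:
    """Server-side variable node that notifies subscribers when its value changes."""
    def __init__(self, node_id, value):
        self.node_id = node_id
        self._value = value
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def set_value(self, value):
        self._value = value
        for cb in self._subscribers:
            cb(self.node_id, value)


class ClientCache:
    """On connect, the client browses once, caches node info, then subscribes
    so later updates arrive as notifications instead of repeated reads."""
    def __init__(self):
        self.values = {}

    def on_change(self, node_id, value):
        self.values[node_id] = value  # e.g. refresh the UI or the 3D model here


cache = ClientCache()
joint_angle = VariableNode("ns=2;i=101", 0.0)  # hypothetical node ID
joint_angle.subscribe(cache.on_change)
joint_angle.set_value(42.5)  # a server-side change propagates to the client cache
```

The cached copy is what the 3D model reads, which is why local lookups stay fast after the initial browse.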
The user can also use the Oculus controllers to send OPC-UA method requests and control the robotic arm. The current status of the real robotic arm is always reflected in the application and the 3D model; this 3D model is thus a Digital Shadow of the actual robotic arm. Furthermore, the user can call other functions on the menu, such as "Go Home" to reset the robotic arm to its original position, or "Increase Speed" and "Decrease Speed" to change the moving speed of the robotic arm when controlling it with the Oculus controllers. The application also allows the user to activate and use the three AI functions that the robotic arm supports: semi-automatic waste classification, block stacking, and color tracking. To ensure the security and integrity of data exchange, we encrypt traffic with the Sign and Encrypt message security mode and the Basic256Sha256 security policy offered by the Opc.UaFx.Client library.
5 Experiments
In this section, we present our experiments to validate and estimate the acceleration ability of the above system. First, the experimental setup is introduced.
OPC-UA/MQTT-Based Multi M2M Protocol Architecture
Then, we summarize the results of our experiments to illustrate the goals of our work.

5.1 Experimental Setup
To provide a general insight into the performance of OPC-UA and MQTT, we conduct a simple experiment to record the Round Trip Time (RTT) when using OPC-UA and MQTT on two relatively similar architectures. We record 1000 RTT values for each protocol when transmitting each packet with a pre-defined payload size ranging from 2 bytes to 128 KB. Furthermore, we repeat the above experiment in two scenarios: a normal network condition and a high-load network condition. To simulate the high-load network condition, we use the third-party software Packet Sender [10] to continuously send anonymous UDP packets from one computer to the other. The details of the experimental setup for each protocol are described below.

OPC-UA Evaluation Setup. The architecture used for OPC-UA performance evaluation consists of 3 machines, as shown in Fig. 7.

– Client: connects to the DataCenter and calls a method to retrieve the value of the corresponding node.
– DataCenter: a device that contains one sub-client and one sub-server.
– Server: where the data nodes are stored, to be returned to the DataCenter when accessed.

The OPC-UA connection does not currently use any authentication or encryption. From the Client, we connect to the sub-server of the DataCenter (marked as T1) and call a method to retrieve the value of the corresponding node in the Server. Specifically, in the DataCenter, after the sub-server receives the request from the Client, it executes a function corresponding to the called method. This function returns the value obtained by the sub-client from the corresponding node in the Server. Finally, the retrieved node value is returned to the Client (marked as T2). The Round Trip Time (RTT) for the entire send-receive process is calculated by Eq. 1.

RTT_OPC-UA = T2 − T1    (1)

MQTT Evaluation Setup. The architecture used for MQTT performance evaluation consists of 3 machines, as shown in Fig. 8. Clients 1 and 2 are MQTT clients, while the middle machine is an MQTT broker.
The MQTT connection is anonymous and uses QoS 0. This architecture is similar to the DataCenter architecture used in the OPC-UA setup. The broker has two topics: v1/devices/me/telemetry0 and v1/devices/me/telemetry1. Client 1 publishes a packet with the pre-defined payload size described above to topic telemetry0. Client 2 subscribes to topic telemetry0, then receives the packet and broadcasts an acknowledgment packet back to telemetry1, which is the topic that
Client 1 subscribed to. This scheme creates a complete "round trip" of data. The RTT is calculated on Client 1 from the time it starts publishing a packet with the pre-defined payload size (T1) until it receives the acknowledgment packet (T2), as depicted in Eq. 2.

RTT_MQTT = T2 − T1    (2)
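The way T1 and T2 bracket the publish-acknowledge cycle can be mimicked in-process, with queues standing in for the two topics; this only illustrates the timing logic of Eq. 2, not a real MQTT client.

```python
import queue
import threading
import time

telemetry0 = queue.Queue()  # stands in for topic v1/devices/me/telemetry0
telemetry1 = queue.Queue()  # stands in for topic v1/devices/me/telemetry1

def client2():
    # Client 2: "subscribed" to telemetry0; echoes an acknowledgment packet
    # back on telemetry1, completing the round trip.
    telemetry0.get()
    telemetry1.put(b"ack")

threading.Thread(target=client2, daemon=True).start()

payload = bytes(2048)      # one of the pre-defined payload sizes (here 2 KB)
t1 = time.perf_counter()   # T1: Client 1 starts publishing
telemetry0.put(payload)
ack = telemetry1.get()     # blocks until the acknowledgment arrives
t2 = time.perf_counter()   # T2: acknowledgment received
rtt_mqtt = t2 - t1         # Eq. 2: RTT_MQTT = T2 - T1
```

In the real setup the two `put`/`get` pairs are MQTT publish and subscribe operations crossing the broker machine, so `rtt_mqtt` also includes network and broker latency.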
5.2 Performance Evaluation
After experimenting 1000 times by transmitting packets with predetermined sizes ranging from 2 bytes to 128 KB for each protocol, Fig. 9 and 10 show the boxplot charts representing the recorded RTT values of MQTT and OPC-UA in two network conditions: normal network and high network load, respectively.
Fig. 6. Block diagram of the implementation of VR app.
Fig. 7. The architecture used for OPC-UA performance evaluation.
Based on the obtained results, we can observe that the RTTs of MQTT and OPC-UA are proportional to the packet size, and the RTT in the case of a high-load network is higher compared to the RTT in normal conditions. Specifically, in Fig. 9, OPC-UA has a better average RTT than MQTT in typical network conditions. The mean RTT for OPC-UA ranges from 83ms to
Fig. 8. The architecture used for MQTT performance evaluation.
Fig. 9. The Boxplot charts of OPC-UA and MQTT under normal network conditions.
Fig. 10. The Boxplot charts of OPC-UA and MQTT under high network load conditions.
156ms, while MQTT has a higher mean RTT, from 120ms to 239ms. Additionally, the data points of OPC-UA are more concentrated around the mean (with a smaller standard deviation), so the box sizes are small. On the other hand, the data points of MQTT are more scattered from the mean, have bigger box sizes, and show more outliers. Furthermore, when transferring a payload of size 128 KB, MQTT shows a maximum RTT of up to 1609ms, while OPC-UA only needs a maximum of 689ms to transfer the packet. This indicates that the OPC-UA protocol provides more stability and is much faster than MQTT for speed-oriented applications.

In Fig. 10, under high-load network conditions, when the packet sizes are small (ranging from 2 to 512 bytes), the data points of OPC-UA become more scattered and have more outliers compared to MQTT. However, when the packet size is larger (2 KB and above), the data of OPC-UA becomes more concentrated and superior to MQTT. Observing the average RTT, OPC-UA shows better values than MQTT, ranging from 314ms to 895ms compared to 348ms to 2055ms for MQTT. Based on this result, we can conclude that OPC-UA is a stable and fast protocol for transmitting larger packets in high-load network conditions.

In the experiment, the outliers of OPC-UA do not exceed 4 s because, in the OPC-UA implementation in Python, the timeout for a request is set to 4 s. If a response exceeds the set timeout, OPC-UA discards the packet. However, this situation is rare and only occurred when transferring the 128 KB packet, with the likelihood of discarding being less than 1%. On the other hand, MQTT is more lenient in terms of timeouts because it only uses QoS 0. Therefore, the outliers of MQTT are farther out and more dispersed (at the peak, for the 128 KB packet, the maximum RTT of MQTT reaches up to 9985ms). MQTT demonstrates good data transmission capability for large packets and high network loads, but at the expense of increased delay.
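The summary quantities discussed here (mean, standard deviation, and boxplot outliers) can be reproduced from a list of recorded RTTs with the standard library; the 1.5×IQR outlier rule used below is the usual boxplot convention and an assumption about how the charts were drawn.

```python
import statistics

def summarize_rtt(rtts):
    """Mean, standard deviation, and boxplot outliers (1.5*IQR rule) of RTT samples."""
    mean = statistics.mean(rtts)
    stdev = statistics.stdev(rtts)
    q1, _, q3 = statistics.quantiles(rtts, n=4)   # quartiles of the sample
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [r for r in rtts if r < low or r > high]
    return mean, stdev, outliers

# Hypothetical sample in ms: mostly ~100 ms RTTs plus one slow 689 ms request.
sample = [98, 101, 99, 103, 100, 97, 102, 100, 99, 689]
mean, stdev, outliers = summarize_rtt(sample)
```

A single slow request like the 689 ms one both inflates the mean and shows up as an isolated boxplot outlier, which is exactly the pattern visible in the 128 KB measurements.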
In contrast, OPC-UA has stricter requirements for speed and response time, and thus sends faster than MQTT across the various packet sizes. However, due to limitations in the OPC-UA implementation in Python, OPC-UA discards packets when the response exceeds 4 s.
6 Conclusion
In this paper, we proposed a hybrid communication protocol for a DT application in intelligent manufacturing. To support a vast amount of data integration between a physical and a virtual machine in both directions, we use the OPC-UA protocol. A DataCenter is proposed to play the role of both OPC-UA client and OPC-UA server for data exchange between industrial machines and VR applications. Multi-threading lets the DataCenter handle several OPC-UA clients simultaneously to gather information from various devices. Then, an application implemented on the Oculus Quest 2 provides a virtual view of the whole system by connecting to the OPC-UA server of the DataCenter. Finally, the lightweight MQTT protocol is used to support remote tracking or controlling from portable devices, since only a small payload is required for this activity; sensory data is published to a cloud server for this purpose. Future work will extend this architecture to a multi-directional robot (e.g., using Mecanum wheels) for an intelligent warehouse application.
Acknowledgement. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for supporting this study.
References

1. Boschert, S., Rosen, R.: Digital twin—the simulation aspect. In: Hehenberger, P., Bradley, D. (eds.) Mechatronic Futures, pp. 59–74. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32156-1_5
2. Davis, J., Edgar, T., Porter, J., Bernaden, J., Sarli, M.: Smart manufacturing, manufacturing intelligence and demand-dynamic performance. Comput. Chem. Eng. 47, 145–156 (2012). https://doi.org/10.1016/j.compchemeng.2012.06.037
3. Gabor, T., Belzner, L., Kiermeier, M., Beck, M.T., Neitz, A.: A simulation-based architecture for smart cyber-physical systems. In: 2016 IEEE International Conference on Autonomic Computing (ICAC), pp. 374–379 (2016). https://doi.org/10.1109/ICAC.2016.29
4. Grieves, M.: Digital twin: manufacturing excellence through virtual factory replication. White Paper 1(2014), 1–7 (2014)
5. Hribernik, K., Wuest, T., Thoben, K.-D.: Towards product avatars representing middle-of-life information for improving design, development and manufacturing processes. In: Kovács, G.L., Kochan, D. (eds.) NEW PROLAMAT 2013. IAICT, vol. 411, pp. 85–96. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41329-2_10
6. Laaki, H., Miche, Y., Tammi, K.: Prototyping a digital twin for real time remote control over mobile networks: application of remote surgery. IEEE Access 7, 20325–20336 (2019). https://doi.org/10.1109/ACCESS.2019.2897018
7. Leng, J., Wang, D., Shen, W., Li, X., Liu, Q., Chen, X.: Digital twins-based smart manufacturing system design in industry 4.0: a review. J. Manuf. Syst. 60, 119–137 (2021). https://doi.org/10.1016/j.jmsy.2021.05.011
8. Liu, M., Fang, S., Dong, H., Xu, C.: Review of digital twin about concepts, technologies, and industrial applications. J. Manuf. Syst. 58, 346–361 (2021). https://doi.org/10.1016/j.jmsy.2020.06.017
9. Ludbrook, F., Michalikova, K.F., Musova, Z., Suler, P.: Business models for sustainable innovation in industry 4.0: smart manufacturing processes, digitalization of production systems, and data-driven decision making. J. Self-Govern. Manag. Econ. 7(3), 21–26 (2019)
10. NagleCode, L.: Packet Sender. https://packetsender.com. Accessed 3 June 2023
11. Pech, M., Vrchota, J., Bednář, J.: Predictive maintenance and intelligent sensors in smart factory. Sensors 21(4), 1470 (2021)
12. Qi, Q., Tao, F.: Digital twin and big data towards smart manufacturing and industry 4.0: 360 degree comparison. IEEE Access 6, 3585–3593 (2018). https://doi.org/10.1109/ACCESS.2018.2793265
13. Rosen, R., von Wichert, G., Lo, G., Bettenhausen, K.D.: About the importance of autonomy and digital twins for the future of manufacturing. IFAC-PapersOnLine 48(3), 567–572 (2015). https://doi.org/10.1016/j.ifacol.2015.06.141
14. Weber, A.: The reality of augmented reality. https://www.assemblymag.com/articles/94979-the-reality-of-augmented-reality. Accessed 3 June 2023
15. Zhou, K., Liu, T., Zhou, L.: Industry 4.0: towards future industrial opportunities and challenges. In: 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 2147–2152 (2015). https://doi.org/10.1109/FSKD.2015.7382284
16. Zhu, Z., Liu, C., Xu, X.: Visualisation of the digital twin data in manufacturing by using augmented reality. Procedia CIRP 81, 898–903 (2019). https://doi.org/10.1016/j.procir.2019.03.223
Real-Time Singing Performance Improvement Through Pitch Correction Using Apache Kafka Stream Processing

Khoi Bui1,2 and Trong-Hop Do1,2(B)

1 Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam
[email protected], [email protected]
2 Vietnam National University, Ho Chi Minh City, Vietnam
Abstract. Singing is an indispensable art in our spiritual life. However, expressing emotions through songs requires the singer to control pitch accurately; failing to do so compromises the quality of the performance, which makes singing "out of tune" a big problem in this field. This study presents a system that corrects raw vocal pitch based on the original true pitch of the singer. By applying TD-PSOLA to the notes extracted using the time-domain pYIN method, pitches can be corrected without changing the vocal characteristics. At the same time, the system is built on Kafka's data streaming architecture to adapt to big data challenges.
Keywords: Kafka streaming · Audio streaming · Pitch correction · Pitch shifting

1 Introduction
Singing out of tune is a recurring phenomenon in performances, defined as a deviation of musical pitch, whether too high or too low. This occurrence tends to make the voice unattractive and lowers the stage quality. In contemporary times, there is a range of real-time pitch correction systems, and Antares was one of the most prominent platforms. However, Antares has some drawbacks related to its process and requirements: it demands local processing and hardware, as well as musical knowledge. Furthermore, the development of online music streaming services has caused a data explosion, which has led to new big data technology, specifically audio streaming. Therefore, developing systems that adjust pitch in real time based on audio streaming technology is extremely crucial.

In this study, the real-time pitch correction system based on the singer's performance using the Kafka framework will be introduced in great detail. In the pitch correction system, the voice that needs to be in-tuned is pitch corrected by modifying its fundamental frequency based on the original fundamental frequency of the singer's performance. By default, the raw voice and

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 339–347, 2023. https://doi.org/10.1007/978-3-031-46573-4_31
K. Bui and T.-H. Do
the singer's voice will synchronize in time note by note. However, changing the fundamental frequency would normally cause the vocals to become morphed, so TD-PSOLA is used to change the pitch while keeping the nuance of the vocals. The data throughput, real-time demand, and size diversity make building this system a big data task. To meet those demands, this system uses Apache Kafka as a middle layer of real-time data pipelines.
2 Related Work
Antares Audio Technologies [1] first introduced the ground-breaking Auto-Tune pitch correcting plug-in in 1997. It corrects the pitch of vocals and other solo instruments, in real time, without distortion or artifacts, while preserving all of the expressive nuances of the original performance. Auto-Tune has since established itself as the worldwide standard in professional pitch correction. Today, it is used daily by tens of thousands of audio professionals around the world to save studio editing time, ease the frustration of endless retakes, save an otherwise once-in-a-lifetime performance, or create what has become the signature vocal effect of our time.

In 2020, Wager et al. [12] introduced a data-driven approach to automatic pitch correction of solo singing performances. The relationship between the respective spectrograms of the singing and the accompaniment was used to predict note-wise pitch-shift amounts using a CNN model with a GRU layer trained on the Intonation dataset [11].
3 Pitch Shifting and Pitch Correction

3.1 Pitch Shifting with TD-PSOLA Technique
The human voice is formed by two components: the vocal cords and the vocal tract. The vocal cords (characterized by fundamental frequencies) decide the pitch of the vocal, while the vocal tract decides the vocal characteristic. Pitch shifting is changing the pitch of a sound in real time. If we change the pitch of a vocal signal by modifying frequencies, we will transpose the formants as well as the pitch, thus altering the vocal characteristic. For example, when transposing to a higher pitch, the frequencies rise and virtually shrink the vocal tract of the singer, making them sound like a "chipmunk". Similarly, lowering the pitch will make the resonant frequencies go down and virtually stretch the singer's vocal tract as if it were larger, making the sound feel unnatural [2].

In order to be consistent with the human vocal characteristics, the TD-PSOLA (Time-Domain PSOLA) algorithm was used. PSOLA [4] refers to a family of signal processing techniques that are used to perform time-scale and pitch-scale modification of speech. Time-domain TD-PSOLA is the most popular PSOLA technique and also the most popular of all time/pitch-scaling techniques. TD-PSOLA works pitch-synchronously, which means there is one analysis window per
pitch period. The signal is windowed with a Hanning window, generally extending over two pitch periods. The duration of a signal can be shortened or lengthened by duplicating or removing frames. The period of a signal can be changed by tightening or loosening the distance between those segments. Overlap-add is then used to construct a waveform from those modified pitch period windows. The change in period results in a change in frequency, from which the pitch of the signal is shifted. However, the pitch modification will necessarily change the duration as a by-product of moving the frames closer or further apart. For pitch shifting without modifying the signal's duration, both of the above modifications are done at once, by a segment-by-segment calculation of the mapping function [10]. Because the filter response is represented in the time domain as its impulse response (the pitch periods) - not in a form that we could easily modify - TD-PSOLA can only modify fundamental frequencies and duration; it cannot modify the vocal tract filter.
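The frame duplication and re-spacing described above can be sketched in plain Python. This is a deliberately simplified illustration that assumes one constant, known pitch period and uniform pitch marks; the function name and the pulse-train test signal are ours, not the paper's implementation, which handles time-varying pitch.

```python
import math

def td_psola_shift(signal, period, ratio):
    """Simplified TD-PSOLA: shift the pitch of `signal` by `ratio` (>1 raises
    it) without changing its duration, assuming a constant pitch period
    (in samples)."""
    out = [0.0] * len(signal)
    frame_len = 2 * period  # one analysis window spans two pitch periods
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]  # Hanning window
    new_period = period / ratio  # tighter spacing of pitch marks -> higher pitch
    t = 0.0
    while t < len(signal):
        center = int(t)
        # Map the synthesis mark to the nearest analysis mark; duplicating or
        # skipping analysis frames is what preserves the overall duration.
        mark = round(center / period) * period
        for n in range(frame_len):
            src, dst = mark - period + n, center - period + n
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += window[n] * signal[src]
        t += new_period
    return out

# A glottal-pulse-like train with period 100 samples; shifting by a ratio of 2
# should yield pulses every ~50 samples while keeping the total length.
pulses = [1.0 if n % 100 == 0 else 0.0 for n in range(1000)]
out = td_psola_shift(pulses, period=100, ratio=2.0)
peaks = [i for i, v in enumerate(out) if v > 0.5]  # pulse locations after shifting
```

The pulse-train input mirrors the impulse-like glottal excitation for which TD-PSOLA is designed; each output pulse is a windowed copy of an analysis frame placed at the new, tighter mark spacing.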
3.2 Estimating Fundamental Frequencies with PYIN Algorithm
The fundamental frequency (F0) of a periodic signal is the inverse of its period, which may be defined as the smallest positive member of the infinite set of time shifts that leave the signal invariant. Mathematically, this can be written as the equation below.

min { T > 0 : x(t + T) = x(t) }    (1)
The purpose of the YIN algorithm [5] is to find a solution to Eq. 1 above. As mentioned in [8], the YIN algorithm is based on the intuition that, in a signal x_i, i = 1, ..., 2W, the difference

d_t(τ) = Σ_{j=1}^{W} (x_j − x_{j+τ})²,    (2)

will be small if the signal is approximately periodic with fundamental period τ = 1/f_0. The difference can be obtained by first calculating the auto-correlation function (ACF)

r_t(τ) = Σ_{j=t+1}^{t+W} x_j x_{j+τ},    (3)

from which Eq. 2 can be calculated as

d_t(τ) = r_t(0) + r_{t+τ}(0) − 2r_t(τ).    (4)

Next, the difference is normalized by obtaining the cumulative mean normalized difference function d′_t(τ). Then the dip in the difference function that corresponds to the fundamental period can be found by picking the smallest period τ for which d′_t has a local minimum and d′_t(τ) < s, for a fixed threshold s.
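These steps can be sketched in a compact, dependency-free Python function. It computes d_t directly from Eq. 2 (rather than via the ACF of Eqs. 3-4), normalizes it, and picks the first dip below the threshold; the function name and parameter values are illustrative.

```python
import math

def yin_period(x, tau_max, threshold=0.1):
    """Estimate the fundamental period of x: Eq. 2 difference function,
    cumulative mean normalization, then the first local minimum below the
    fixed threshold s."""
    W = len(x) - tau_max  # window length so that every x[j + tau] exists
    # Eq. 2: difference function d_t(tau)
    d = [sum((x[j] - x[j + tau]) ** 2 for j in range(W))
         for tau in range(tau_max + 1)]
    # cumulative mean normalized difference d'_t(tau)
    dn = [1.0]
    cum = 0.0
    for tau in range(1, tau_max + 1):
        cum += d[tau]
        dn.append(d[tau] * tau / cum if cum else 1.0)
    # smallest period whose normalized difference is a dip below the threshold
    for tau in range(2, tau_max):
        if dn[tau] < threshold and dn[tau] <= dn[tau - 1] and dn[tau] <= dn[tau + 1]:
            return tau
    return min(range(1, tau_max + 1), key=lambda tau: dn[tau])

# A pure tone with period 100 samples should be detected exactly.
tone = [math.sin(2 * math.pi * n / 100) for n in range(600)]
period = yin_period(tone, tau_max=150)
```

The fixed threshold `s` is exactly the quantity that PYIN, described next, replaces with a distribution of thresholds.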
The probabilistic YIN (PYIN) algorithm is a variant of the well-known YIN technique for estimating fundamental frequency. In some cases, the YIN algorithm fails to extract the pitch correctly due to its variable nature. Instead of having a constant threshold, PYIN defines a distribution of thresholds governed by a density function. Thus, for each value of the threshold, we get an estimate. These probabilities are then used as observations in a hidden Markov model, which is decoded with the Viterbi algorithm for maximum likelihood sequence estimation.
3.3 Pitch Correction Process
We assume that all of the singer's tracks are performed at the correct pitches and that the raw vocals are synchronous with the singer's vocals note by note. First, the singer's track is separated into the singer's vocal and the backing track using Spleeter [6], a source separation library with pretrained models, written in Python and using TensorFlow. Then, the PYIN method - a note-wise algorithm to estimate fundamental frequencies - is used to extract true notes from the singer's vocal and raw notes from the raw vocal. Next, using TD-PSOLA, each raw note's pitch is shifted by the ratio by which it differs from the corresponding true note's pitch to obtain the tuned vocal. Finally, the output performance is achieved by mixing the tuned vocal with the backing track. The processing pipeline is summarized in Fig. 1.
Fig. 1. Pitch correction pipeline.
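The note-by-note correction step reduces to computing, for each aligned note pair, the ratio between the singer's true F0 and the raw F0; that ratio is what TD-PSOLA is asked to apply. A small sketch with hypothetical F0 values and function names:

```python
def note_shift_ratios(raw_notes, true_notes):
    """For each aligned (raw, true) note pair, the pitch-shift ratio that
    TD-PSOLA must apply is true_f0 / raw_f0 (1.0 means already in tune)."""
    return [true_f0 / raw_f0 for raw_f0, true_f0 in zip(raw_notes, true_notes)]

# Hypothetical F0 values (Hz) for three synchronized notes: the first is flat,
# the second is in tune, the third is sharp.
raw_f0s  = [430.0, 494.0, 540.0]
true_f0s = [440.0, 494.0, 523.3]
ratios = note_shift_ratios(raw_f0s, true_f0s)
```

A ratio above 1 raises a flat note, a ratio below 1 lowers a sharp one, and notes already in tune pass through unchanged.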
4 Data Streaming with Kafka
This section presents the basic concepts of the Kafka data streaming system and describes the architecture of the pitch correction system.
4.1 Big Data Challenges
Big data can be defined by the "5Vs": Velocity, Volume, Value, Variety, and Veracity. In the audio industry, vast amounts of music data are produced each day. This increased data volume is generated by subscribers of music streaming and other audio processing platforms. As consumers listen to music more via streaming platforms than any other format, this information is highly valuable and is increasingly directing the industry. With an increased number of music streaming services, data can be highly unorganized and difficult to process. Therefore, creating a system that can adapt to all of those problems quickly and cost-effectively is a very big challenge.
4.2 Kafka Architecture and Design Principles
Kafka [7] is an open-source, highly distributed streaming platform. Originally built by engineers at LinkedIn and now part of the Apache Software Foundation, Kafka is a reliable, resilient, and scalable system that supports streaming events and applications. Kafka is an environment where users can publish a large number of messages into the system and consume those messages through a subscription, in real time. This is why Kafka has become popular and plays its role in the big data ecosystem.
Fig. 2. Kafka architecture
The overall architecture of Kafka is shown in Fig. 2. To understand the Kafka framework, we must be aware of some terminology. A stream of messages of a particular type is defined by a topic. A producer can publish messages to a topic. The published messages are then stored at a set of servers called brokers. A consumer can subscribe to one or more topics from the brokers and consume the subscribed messages by pulling data from the brokers.
4.3 System Architecture
Based on the basic architecture of a streaming data application using Kafka, the system architecture is designed to include two components: Client and Server. Each component is itself both a producer and a consumer. The Client receives the input audio file and sends it to the Server through Topic A. The Server then receives the input audio file and performs the processing steps (described in Sect. 3.3). Finally, the output audio file is sent back to the Client by the Server through Topic B. At the same time, the output audio file is also stored in the database as a resource for future development. An overview of the system architecture is shown in Fig. 3.
Fig. 3. System architecture.
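As a rough illustration of this Client → Topic A → Server → Topic B round trip, the broker can be modeled with in-memory queues; real code would use a Kafka client library, and all names and payloads here are placeholders.

```python
import queue

class Broker:
    """Toy stand-in for the Kafka broker: one queue per topic."""
    def __init__(self):
        self.topics = {}

    def publish(self, topic, message):
        self.topics.setdefault(topic, queue.Queue()).put(message)

    def consume(self, topic):
        return self.topics.setdefault(topic, queue.Queue()).get()

broker = Broker()

# Client side (producer role): encode the input audio and send it to Topic A.
broker.publish("topic_a", b"raw-vocal-wav-bytes")

# Server side (consumer + producer roles): read from Topic A, run the pitch
# correction pipeline (stubbed here), and publish the result to Topic B.
raw = broker.consume("topic_a")
tuned = b"tuned:" + raw  # placeholder for the processing steps of Sect. 3.3
broker.publish("topic_b", tuned)

# Client side (consumer role): receive the corrected audio from Topic B.
result = broker.consume("topic_b")
```

The point of the two named topics is that each component only ever talks to the broker, which is what lets the Client and Server run on different machines and scale independently.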
5 Implementation
The implementation consists of three parts. The first is building an interactive application for the Client. Next is setting up the Kafka environment. Finally, the pitch correction unit is built. All parts are implemented in Python and can be executed on computer clusters.
5.1 Audio Formatting
For the convenience of encoding and processing, input and output audio files are in the WAV file format by default. Each file is then converted to an array with a sampling rate of 44100 Hz - the most common rate for musical audio - using Librosa [9].
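The paper performs this decoding with Librosa; purely as an illustration of the WAV-to-array step, the standard library `wave` module can round-trip a 44100 Hz file (the buffer and sample values here are synthetic):

```python
import io
import struct
import wave

# Write one second of 16-bit mono silence at 44100 Hz into an in-memory WAV.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)           # 16-bit samples
    w.setframerate(44100)
    w.writeframes(struct.pack("<44100h", *([0] * 44100)))

# Read it back and convert the raw frames into a list of integer samples.
buf.seek(0)
with wave.open(buf, "rb") as r:
    rate = r.getframerate()
    samples = list(struct.unpack("<%dh" % r.getnframes(),
                                 r.readframes(r.getnframes())))
```

Librosa performs the same decoding but additionally resamples and normalizes the samples to floats, which is why it is the convenient choice in the actual system.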
5.2 Kafka Setup
For simulation purposes, Kafka is installed on just a single computer. Because of the size of the WAV file format, the broker is configured with replica.fetch.max.bytes and message.max.bytes both set to 1e8 bytes, similar to fetch.message.max.bytes on the Consumer.
5.3 Building Client Application
For ease of use, a simple application has been built on the Streamlit platform (https://streamlit.io/). This application allows users to upload and listen back to the raw WAV file and the singer's performance WAV file, then encode and send them to the Server. Afterwards, it decodes the tuned audio returned from the Server into WAV format, allowing users to listen to it online or download it. Details of the interface are shown in Fig. 4.
5.4 Building Pitch Correction Unit
To separate the singer's performance with Spleeter, a 2-stem pre-trained model is used to separate the original track into vocals and accompaniment. To speed up this process, the TensorFlow GPU backend is used for the STFT transform. Thanks to Vampy (https://www.vamp-plugins.org/vampy.html) - a wrapper for the Vamp audio analysis plugin API [3] - the note-wise PYIN algorithm is easily implemented in Python with the parameters listed in Table 1. Finally, for TD-PSOLA, a Python function was built based on Wager's implementation [12]. To be more intuitive, Fig. 5 shows the results of the pitch correction process on a short performance of the song "Nang tho", performed by Hoang Dung. The raw vocals were created by detuning Hoang Dung's vocals.
Fig. 4. System user interface.
Table 1. Parameters of PYIN implementation in Python.

plugin key         "pyin:pyin"
output             'notes'
threshdistr        0.15
onsetsensitivity   0
lowampsuppression  0.1
prunethresh        0.05
Fig. 5. An example result of Pitch correction process.
6 Conclusion
This study focuses on correcting raw vocal pitches to the correct tone using the singer's pitches measured from the original song. When adjusting the pitch, the vocal characteristic must be preserved; this was achieved by applying TD-PSOLA to notes extracted with the PYIN technique. In general, the research achieved the goal of assisting users in improving their vocals through the pitch correction system. Furthermore, we plan to move beyond this by creating a dataset for Vietnamese song pitch correction tasks based on deep learning methods and experimenting with them. One of the key features of this study is that the vocal is processed in real time through Kafka streaming, which makes the pitch correction system easy to adapt to big data problems. In the future, adding features that allow users to choose backing tracks and record them in real time is the next step for expanding this research.
References

1. Antares Audio Technologies: Auto-Tune Live owner's manual. https://www.antarestech.com/mediafiles/documentation records/10 Auto-Tune Live Manual.pdf (2016). Accessed 16 July 2022
2. Bastien, P.: Pitch shifting and voice transformation techniques. https://dsp-book.narod.ru/Pitch shifting.pdf. Accessed 16 July 2022
3. Cannam, C., Landone, C., Sandler, M.: Sonic Visualiser: an open source application for viewing, analysing, and annotating music audio files, pp. 1467–1468 (2010). https://doi.org/10.1145/1873951.1874248
4. Charpentier, F., Stella, M.: Diphone synthesis using an overlap-add technique for speech waveforms concatenation. In: ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 11, pp. 2015–2018 (1986)
5. de Cheveigné, A., Kawahara, H.: YIN, a fundamental frequency estimator for speech and music. J. Acoust. Soc. Am. 111(4), 1917–1930 (2002). https://doi.org/10.1121/1.1458024
6. Hennequin, R., Khlif, A., Voituret, F., Moussallam, M.: Spleeter: a fast and efficient music source separation tool with pre-trained models. J. Open Source Softw. 5(50), 2154 (2020). https://doi.org/10.21105/joss.02154
7. Kreps, J.: Kafka: a distributed messaging system for log processing (2011)
8. Mauch, M., Dixon, S.: pYIN: a fundamental frequency estimator using probabilistic threshold distributions. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 659–663 (2014). https://doi.org/10.1109/ICASSP.2014.6853678
9. McFee, B., et al.: librosa/librosa: 0.9.2, June 2022. https://doi.org/10.5281/zenodo.6759664
10. Taylor, P.: Text-to-Speech Synthesis. Cambridge University Press, Cambridge (2009). https://doi.org/10.1017/CBO9780511816338
11. Wager, S., et al.: Intonation: a dataset of quality vocal performances refined by spectral clustering on pitch congruence. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 476–480 (2019). https://doi.org/10.1109/ICASSP.2019.8683554
12. Wager, S., Tzanetakis, G., Wang, C.I., Kim, M.: Deep Autotuner: a pitch correcting network for singing performances (2020). https://doi.org/10.48550/ARXIV.2002.05511
An Implementation of Human-Robot Interaction Using Machine Learning Based on Embedded Computer

Thanh-Truc Tran1,2, Thanh Vo-Minh1,2, and Kien T. Pham1,2(B)

1 School of Electrical Engineering, International University, Ho Chi Minh City 700000, Vietnam
[email protected]
2 Vietnam National University, Ho Chi Minh City 700000, Vietnam
Abstract. Communication and interaction between humans and robots is a promising research direction with applications in many fields. Machine learning based emotion and pose recognition is currently one of the primary methods targeted by such robots, which are capable of engaging with and responding to humans across a range of scenarios. In this study, a prototype companion robot has been designed, and a human-robot interaction system that recognizes human emotions and behavior using machine learning has been developed and implemented on this robot; the robot generates appropriate responses based on the recognition results. The recognition models, built with convolutional neural networks, are deployed on a Jetson Nano embedded computer, which is well suited to a companion robot. The combination of emotion and pose recognition is successfully used to control several of the robot's actions and responses. The experimental findings of this research show the potential of embedded artificial intelligence for robot design, especially in the field of human-robot interaction. Keywords: human-robot interaction · facial emotion recognition · pose recognition · embedded computer
1 Introduction
As robots' place in society grows and diversifies, the need for more effective communication between humans and robots is greater than ever. In this research, a human-robot interaction system based on human emotion and pose is built in order to enhance human-robot collaboration. The primary objective of human-robot interaction (HRI) is to equip robots with all the abilities required to engage in a conversation with humans. Since humans communicate both verbally and non-verbally, it is necessary for these social robots to interact with people in both ways. Nonverbal cues like facial expressions can be utilized to convey one's mood during a conversation. Another form of nonverbal © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 348–359, 2023. https://doi.org/10.1007/978-3-031-46573-4_32
Electrical Engineering
349
communication is gesture - a physical movement used to convey ideas in conjunction with speaking. This paper provides a method of developing a robot system that can effectively interact with humans through their emotional expressions or actions. Expressions carry a wealth of information about human behavior and emotion, which makes emotional information important in human communication. The robotic system must accurately and instantly capture the human pose to collaborate effectively with people. Using a camera, the robot can track the motion of the human operator and further discern human intention.
2 Related Work
A lot of research is now being done on Facial Emotion Recognition (FER), and numerous models have been suggested. In particular, a 2020 review study by S. Li and W. Deng gives insight into the status quo of facial emotion recognition based on deep learning techniques [1]. On the FER-2013 dataset, Pramerdorfer and Kampel used convolutional neural networks to achieve the state-of-the-art accuracy of 75.2% [2]. The authors employ an ensemble of CNNs with VGG, Inception and ResNet architectures. However, an ensemble learning algorithm is not an effective way to deploy the model on edge devices because of its large model size. Recently, Pascual et al. proposed a lightweight FER system on the Jetson Nano edge device that achieved 69.87% on the FER-2013 test set [3]. Lately, academics have carried out a large amount of research into sensing technologies and techniques for modeling and identifying human actions [4]. Deep learning techniques that automatically extract pertinent features have been used to recognize activities. Yu, Moirangthem and Lee developed a continuous timescale long-short term memory (LSTM) model to solve the problem of recognizing human intentions [5]. The shallow network topology of the LSTM network, however, prevents it from accurately representing the complex characteristics of sequential data [6]. In this project, we build a model combining convolutional neural networks (CNN) and LSTM to enhance the LSTM network's capability for modeling sequential data. Many human-robot interaction systems have been constructed using various hardware implementations and models to meet their authors' needs. A model for FER employing transfer learning that combines the KNN and CNN algorithms was presented by Wahab et al.; deploying it on a Raspberry Pi allowed the authors to run their model in real-world scenarios [7].
An LSTM neural network is presented in [8] to identify human intention with a UR5 robot and a KinectV2 depth camera in a human-robot collaboration scenario. Our research creates a human-robot interaction system based on both human emotions and poses on the Jetson Nano Development Kit.
3 Methodology
3.1 System Overview
This section presents a comprehensive technical description of how this project was developed. Figure 1 shows the main stages of the program. First, the input frame is captured by a webcam. This image is pre-processed by either detecting the face area or making a pose estimation. These pre-processed outputs subsequently go through a CNN model to classify the emotion or pose; this step is performed on the Jetson Nano. The predictions from the machine learning models are then sent to both the Arduino board and the tablet application. The Arduino is responsible for controlling the robot's direction. The robot face is an Android application, which takes the human emotion predictions and changes the robot's reaction according to the results. With the help of the built-in speech recognition function in App Inventor, the tablet robot face app can also recognize voice and answer some simple questions.
Fig. 1. System overview
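The predictions dispatched from the Jetson Nano to the Arduino and the tablet have to be serialized somehow; the paper does not specify a wire format, so the compact ASCII framing below is purely an illustrative assumption:

```python
def encode_prediction(emotion: str, pose: str) -> bytes:
    """Pack one (emotion, pose) prediction pair into a newline-terminated
    ASCII frame, e.g. b'E:happy;P:wave_left\n'.  The frame layout and the
    label strings are illustrative assumptions, not the paper's protocol."""
    emotions = {"angry", "disgust", "fear", "happy", "neutral", "sad", "surprised"}
    poses = {"forward", "backward", "wave_left", "wave_right"}
    if emotion not in emotions or pose not in poses:
        raise ValueError("unknown label")
    return f"E:{emotion};P:{pose}\n".encode("ascii")

# In the real system such a frame would be written to the Arduino over a
# serial link (e.g. ser.write(encode_prediction(...)) with pyserial).
```

A fixed, newline-terminated frame keeps the Arduino-side parsing trivial (read until '\n', split on ';').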
3.2 Pre-processing
Facial Emotion Recognition. For the emotion recognition task, the first step is to detect only the facial part of the whole image; this is performed by the Haar-Cascade detector [9]. This approach can satisfy the rigorous criteria of
instantaneous, real-world situations. The face area is thereafter cropped and saved as a region of interest. To standardize the training and testing phases, this image is then resized to 224 × 224 pixels. Pose Recognition. The initial stage in the action recognition task is to use the MediaPipe library for human posture detection in the image. The body landmarks are predicted and tracked using the pose estimation technique. All the landmark points recorded by MediaPipe are subsequently saved as a text file and go through a variety of pre-processing stages, including label encoding, data splitting, one-hot encoding and segmentation. The training and testing data for 4 different poses are captured with 6000 frames per pose, split into the ratio 80/20.
3.3 Extracting Features and Classification
Facial Emotion Recognition. The FER model uses transfer learning with EfficientNet. The experiment was conducted on the FER-2013 database. The 2013 Facial Expression Recognition dataset (FER-2013) [10] includes photos of faces displaying six universal facial expressions - angry, disgust, fear, happy, sad, surprise - supplemented with neutral. A collection of 35,887 gray-scale 48 × 48-pixel photos is grouped according to the various emotions. The transfer learning technique is used to build the FER system in order to address the data shortage issue: when a FER model is trained directly on a small-scale dataset, over-fitting makes the model less generic and unable to execute FER tasks in real-world environments [11]. Transfer learning reuses weights obtained by training the model on a larger database; this study uses an EfficientNet model pre-trained on ImageNet data. EfficientNet uses an efficient method for scaling up models called the compound coefficient - it scales each dimension uniformly with a fixed set of scaling coefficients [12]. We chose EfficientNet as the base model for its light weight while still achieving relatively high accuracy compared to other Keras application models [13]. In this work, EfficientNetB3 is used as the base model and fully connected layers are added on top of it. In that upper part, the original output layer is replaced with 4,096- and 1,024-unit fully connected layers, respectively, and a Softmax output layer of 7 classes corresponding to the seven emotions. Half of the layers in EfficientNetB3 are frozen, and the rest of the network is kept trainable. Adam is used as the optimizer with a learning rate of 0.01 and nonnegative weighting parameters β1 = 0.9 and β2 = 0.999. The training process uses a batch size of 128 for 100 epochs. Table 1 shows the detailed architecture of our proposed model using EfficientNetB3 as the base model for feature extraction. Pose Recognition. Initially, the input data (the position values of each landmark received by MediaPipe) is segmented by taking every 30 time-steps per sample. Each sample is then normalized to improve the training speed and accuracy.
Table 1. Keras Summary of the proposed network using EfficientNet

Layer (type)      Output shape               Param #
Efficientnet-B3   (None, None, None, 1536)   10783528
Flatten           (None, 1536)               0
Dense             (None, 4096)               6295552
Dropout           (None, 4096)               0
Dense             (None, 1024)               4195328
Dropout           (None, 1024)               0
Dense             (None, 7)                  7175

Total parameters: 21,281,583
Trainable parameters: 20,335,445
Non-trainable parameters: 946,138
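The 7-class Softmax output at the bottom of Table 1 turns the final Dense layer's logits into a probability distribution over the seven emotions. As a quick NumPy sketch of that last step (the logit values are made up for illustration):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis: subtracting the
    maximum before exponentiating avoids overflow without changing
    the result."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Seven logits -> probabilities over the seven FER-2013 emotions.
probs = softmax(np.array([2.0, -1.0, 0.5, 3.0, 0.0, 1.0, -2.0]))
```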
In order to improve the model's learning, four bidirectional LSTM layers, each with 32 neurons and a Glorot Uniform kernel initializer, are inserted between dropout layers. Convolutional layers, used to extract spatial characteristics, are then applied: the first CNN layer has 64 neurons and the following one has 128 neurons. A Max-Pooling layer downsamples the data between the first and second CNN layers, and a Global Average Pooling (GAP) layer turns the multi-dimensional feature maps into one-dimensional feature vectors. Finally, the extracted information passes through three fully connected layers with 64, 128, and 64 neurons, respectively. Dropout layers are also interleaved between the above layers to restrict overfitting. The result is returned via a layer with a Softmax activation function. Figure 2 depicts the diagram of the LSTM-CNN model for pose recognition.
Fig. 2. Diagram of LSTM-CNN model used in pose recognition
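The 30-time-step segmentation feeding this model can be sketched in NumPy. The feature width of 132 corresponds to MediaPipe Pose's 33 landmarks × (x, y, z, visibility); the per-window min-max normalization here is an assumption for illustration, since the paper does not state its exact normalization scheme:

```python
import numpy as np

def segment_landmarks(frames: np.ndarray, window: int = 30) -> np.ndarray:
    """Split a (num_frames, num_features) landmark stream into
    non-overlapping windows of `window` time-steps, then min-max
    normalize each window into [0, 1]."""
    n = (frames.shape[0] // window) * window          # drop the ragged tail
    windows = frames[:n].reshape(-1, window, frames.shape[1])
    lo = windows.min(axis=(1, 2), keepdims=True)
    hi = windows.max(axis=(1, 2), keepdims=True)
    return (windows - lo) / np.maximum(hi - lo, 1e-8)

# 6000 recorded frames per pose, 33 landmarks x 4 values each.
stream = np.random.rand(6000, 33 * 4)
samples = segment_landmarks(stream)   # 200 samples of shape (30, 132)
```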
3.4 Robot Design and Hardware Implementation
Robot Design. The companion robot built for this research is controlled using an Arduino micro-controller. The core processing unit (the Jetson Nano) first classifies the expression and action. The predictions from the Jetson Nano are then sent to the Arduino to control the robot's motion and the robot's facial expression. Table 2 summarizes the motions of the car corresponding to the model predictions.

Table 2. Summary of car control

Pose Prediction            Emotion Prediction          Robot Motion
Push both hands forward    Happy, Neutral, Surprised   Forward
                           Angry, Sad, Fear, Disgust   Stop
Wave left hand             Happy, Neutral, Surprised   Left
                           Angry, Sad, Fear, Disgust   Stop
Wave right hand            Happy, Neutral, Surprised   Right
                           Angry, Sad, Fear, Disgust   Stop
Hold both hands over head  Happy, Neutral, Surprised   Backward
                           Angry, Sad, Fear, Disgust   Stop
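The control rule in Table 2 - move only when the detected emotion is positive or neutral, stop otherwise - reduces to a small lookup. The label strings below are illustrative, not the exact identifiers used in the implementation:

```python
# Pose labels -> motion commands, following Table 2 (names are assumed).
POSE_TO_MOTION = {
    "push_both_hands_forward": "forward",
    "wave_left_hand": "left",
    "wave_right_hand": "right",
    "hold_both_hands_over_head": "backward",
}
# Emotions that allow motion; angry/sad/fear/disgust always mean "stop".
GO_EMOTIONS = {"happy", "neutral", "surprised"}

def robot_motion(pose: str, emotion: str) -> str:
    """Return the Table 2 motion command for one prediction pair."""
    if emotion not in GO_EMOTIONS:
        return "stop"
    return POSE_TO_MOTION.get(pose, "stop")
```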
Nvidia Jetson Nano Embedded Computer. The Nvidia Jetson Nano boots into the Ubuntu operating system. Python is used as the main programming language on the Jetson, and libraries for computer vision projects, including OpenCV and TensorFlow, are installed. However, Keras models are not supported directly on this embedded computer, so they need to be converted into the ONNX model format. The ONNX model is then imported to perform inference using TensorRT. Emotion and pose recognition runs on the Jetson Nano embedded computer in real time, and the classified results are displayed on the screen. Robot Face Design. The robot face uses a Masstel Tab 10A tablet running an application programmed with MIT App Inventor to display the robot's reaction in each circumstance. The emotion predictions from the Jetson Nano are sent to the application through Bluetooth. Graphics Interchange Format (GIF) images are designed to animate the robot's expressions. Using Speech Recognition and Text to Speech - the MIT App Inventor built-in tools - the app is also made to recognize short sentences of fewer than 10 words and to answer some programmed commands. System Control. The companion robot built for this research is controlled using an Arduino micro-controller. The core processing unit (the Jetson Nano) first classifies the expression and action from a video stream. The resulting predictions are then sent to the micro-controller board, which uses them to control the robot. The DC motors and stepper motors receive signals from the micro-controller by means
Fig. 3. Block diagram of system control
of triggering the correct motor and moving the robot in the desired direction. Figure 3 represents the robot control block diagram. The robot built for this research is equipped with 2 DC motors controlling the 2 back wheels, which move the robot forward and backward. The first stepper motor controls the robot's front-wheel steering angle. The other two allow the robot either to shake its head or to nod. The whole system is powered by a 12 V DC lead-acid battery, with a DC-DC step-down voltage regulator producing the 5 V supply for the Arduino. Figure 4 shows the system schematic.
Fig. 4. Robot Schematic
Figure 5 depicts the companion robot design, with the Jetson Nano as the main core and a Logitech C270 webcam with HD 720p resolution for real-time video capture. The tablet works as the robot face, displaying three different expressions: happy, sad and angry. Three stepper motors control the robot head to shake, to nod and to set the robot's steering angle, respectively. Two DC motors on
the back wheels drive the robot forward and backward. An Arduino is in charge of controlling the robot's motion according to the predictions received from the Jetson Nano.
Fig. 5. Robot Design
4 Results and Discussion
4.1 Facial Emotion Recognition
The FER model developed in this work was tested with the 7,178 test images from FER-2013 and achieved an accuracy of 72%. This accuracy is higher than the 69.87% result of Pascual et al. [3]. Our model takes 150 ms to produce real-time prediction results on the Jetson Nano, as demonstrated in Fig. 6.
Fig. 6. Emotion detection on Jetson Nano
The confusion matrix of the model is presented in Fig. 7. The matrix indicates that anger, fear and neutral are apt to be misclassified as sadness and vice
versa. This is mostly caused by the raw data of FER-2013, in which we can clearly observe similarities between the expressions of these sets of feelings.
Fig. 7. Confusion matrix of EfficientNet model
The performance measures of the proposed model can be calculated from the confusion matrix and are expressed in Table 3.

Table 3. Performance measures of the model

          Recall   Precision   F1 Score
Angry     0.64     0.67        0.65
Disgust   0.61     0.77        0.68
Fear      0.51     0.63        0.56
Happy     0.90     0.89        0.89
Neutral   0.71     0.66        0.69
Sad       0.64     0.59        0.62
Surprise  0.83     0.79        0.81
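The per-class measures in Table 3 follow directly from the confusion matrix in Fig. 7. With rows as true labels and columns as predictions (a common convention; the paper does not state its orientation), they can be computed as:

```python
import numpy as np

def per_class_scores(cm: np.ndarray):
    """Recall, precision and F1 per class from a confusion matrix whose
    rows are true labels and columns are predicted labels."""
    tp = np.diag(cm).astype(float)       # correct predictions per class
    recall = tp / cm.sum(axis=1)         # TP / (TP + FN), row-wise
    precision = tp / cm.sum(axis=0)      # TP / (TP + FP), column-wise
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Toy 2-class example: 8/10 and 15/20 samples classified correctly.
cm = np.array([[8, 2],
               [5, 15]])
recall, precision, f1 = per_class_scores(cm)
```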
4.2 Pose Recognition
The pose recognition model reaches 100% accuracy on our own dataset and succeeds in classifying the four activities created to correspond to the four commands - Left, Right, Forward and Backward. Figures 8a and 8b demonstrate pose classification when a human waves the left and right hand, respectively; Fig. 8c, of a human pushing both hands forward, is interpreted as the command to go forward, and Fig. 8d, of a human holding both hands over the head, is perceived as the command to go backward.
Fig. 8. Real-time action recognition
Fig. 9. Robot motion controlled by human emotion and pose recognition
4.3 Robot Motion
When the Jetson Nano emotion and pose recognition application captures a human waving the left hand with a happy expression (Fig. 9a), it turns the car
from the reference point (Fig. 9b) to the left, as shown in Fig. 9c. When the human changes to waving the right hand with the same emotion (Fig. 9d), the system recognizes this and the car turns right, back to the reference point (Fig. 9e); continuing to wave the right hand spins the car to the right side, as in Fig. 9f. The car, however, stops immediately when the human facial expression changes to angry, sad, disgust or fear, and the tablet robot face changes to sad afterwards to express sympathy.
5 Conclusion
A system of human emotion and pose recognition for human-robot interaction on an embedded computer was proposed in this paper. For the facial emotion recognition task, the system used transfer learning of EfficientNet, and the model was implemented on the Jetson Nano embedded computer. For posture recognition, an LSTM combined with a CNN model was proposed to identify four human poses, and the model classifies gestures from sequential data accurately. The combination of both emotion and pose recognition was implemented successfully on the Jetson Nano embedded computer. The predicted emotion and pose recognition results are finally sent to the companion robot for its responsive motions, with the robot's facial expressions shown on a tablet. However, the recognition accuracy and speed of the EfficientNet model are still low; therefore, more data needs to be collected and contributed to the current dataset, and more research has to be carried out to improve the performance of this model. The experimental findings of this research show the potential applications of embedded artificial intelligence for human-robot interaction.
References
1. Li, S., Deng, W.: Deep facial expression recognition: a survey. IEEE Trans. Affect. Comput. 13(3), 1195–1215 (2022)
2. Pramerdorfer, C., Kampel, M.: Facial expression recognition using convolutional neural networks: state of the art. arXiv preprint arXiv:1612.02903 (2016)
3. Pascual, A.M., et al.: Light-FER: a lightweight facial emotion recognition system on edge devices. Sensors 22, 9524 (2022)
4. Chen, L., Nugent, C.D., Wang, H.: A knowledge-driven approach to activity recognition in smart homes. IEEE Trans. Knowl. Data Eng. 24(6), 961–974 (2012)
5. Yu, Z., Moirangthem, D.S., Lee, M.: Continuous timescale long-short term memory neural network for human intent understanding. Front. Neurorobot. 11, 42 (2017)
6. Sagheer, A., Kotb, M.: Time series forecasting of petroleum production using deep LSTM recurrent networks. Neurocomputing 323, 203–213 (2019)
7. Ab Wahab, M.N., Nazir, A., Zhen Ren, A.T., Mohd Noor, M.H., Akbar, M.F., Mohamed, A.S.: EfficientNet-Lite and hybrid CNN-KNN implementation for facial expression recognition on Raspberry Pi. IEEE Access 9, 134065–134080 (2021)
8. Yan, L., Gao, X., Zhang, X., Chang, S.: Human-robot collaboration by intention recognition using deep LSTM neural network. In: 2019 IEEE 8th International Conference on Fluid Power and Mechatronics (FPM) (2019)
9. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2001)
10. Goodfellow, I.J., et al.: Challenges in representation learning: a report on three machine learning contests. Neural Netw. 64, 59–63 (2015)
11. Hawkins, D.M.: The problem of overfitting. J. Chem. Inf. Comput. Sci. 44(1), 1–12 (2003)
12. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. PMLR (2019)
13. Keras Applications. https://keras.io/api/applications/
DIKO: A Two-Stage Hybrid Network for Knee Osteoarthritis Diagnosis Using Deep Learning

Trung Hieu Phan1, Thiet Su Nguyen2, Trung Tuan Nguyen3, Tan Loc Le1, Duc Trung Mai1, and Thanh Tho Quan1(B)

1 Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), VNU-HCMC, Ho Chi Minh City, Vietnam
[email protected]
2 Ho Chi Minh City University of Science (HCMUS), VNU-HCMC, Ho Chi Minh City, Vietnam
3 University of Information Technology (UIT), VNU-HCMC, Ho Chi Minh City, Vietnam
Abstract. Knee osteoarthritis is one of the most common joint diseases. Many studies have explored automated diagnosis using artificial intelligence, but their results are unsatisfactory when tested on real-world data from Vietnamese patients. Therefore, in this article, we propose a high-performing model based on our datasets, namely DIKO. Unlike other methods that diagnose directly from the original images, DIKO uses a new approach that comprises two stages: (1) identifying the ROI (Region of Interest) in X-ray images before classification, and (2) using a multi-column CNN model combining the InceptionV3 and DenseNet201 architectures to extract features from knee images effectively. In experiments using real-world knee X-ray datasets of Vietnamese patients, DIKO outperformed previously published methods and shows promise for real-world applications, especially when integrated with IoT systems. This will enable doctors and caregivers to monitor patient health in real time and take timely actions. Keywords: hybrid model · detection · classification · knee osteoarthritis

1 Introduction
Knee osteoarthritis is a prevalent condition among middle-aged and elderly individuals, causing pain that hinders mobility, daily activities, and the overall quality of life. Cui et al. [1] conducted a survey across 6 continents and estimated the prevalence of knee osteoarthritis to be 19.2% in Asia, 13.4% in Europe, 15.8% in North America, 4.1% in South America, 3.1% in Oceania, and 21.0% in Africa. In Vietnam, Nguyen et al. [2] conducted an X-ray-based study, revealing a knee osteoarthritis prevalence of 34.2%, which increased with age: 8% in the 40–49 © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 360–369, 2023. https://doi.org/10.1007/978-3-031-46573-4_33
A Two-Stage Hybrid Network for Knee Osteoarthritis Diagnosis
361
age group, 30% in the 50–59 age group, and 61.1% in those over 60 years old. These findings highlight the global prevalence of knee osteoarthritis. While various imaging-based diagnostic methods are available, X-ray is commonly used due to its affordability and accessibility. With the recent advancement of Artificial Intelligence (AI) techniques, many studies have considered assisting doctors with intelligent systems, especially in the diagnosis process on X-ray images. Ali et al. [4] presented the benefits of artificial intelligence in healthcare. Applying artificial intelligence to the diagnosis of knee osteoarthritis would benefit the detection, prevention, and treatment of the disease. Currently, there are many studies on the use of artificial intelligence for this problem, following two main approaches: machine learning [5,6] and deep learning [8,9]. However, these methods face various challenges when applied to real-world datasets, including limited and imbalanced labeled data, varied image quality, and the physical characteristics of different regions. In particular, when tested on real Vietnamese patient datasets, the existing works suffered poor performance, probably due to specific latent features of Vietnamese cases that have not been effectively analyzed and captured by public datasets. To address these challenges, we propose a new approach called DIKO (DenseNet - Inception for Knee Osteoarthritis). Unlike other works that utilized a one-shot end-to-end classification, our approach involves two stages. In the first stage, we accurately identify the Region of Interest (see Sect. 2) on images captured by the X-ray machines. In the second, the identified ROI is further processed by a hybrid architecture of Inception [19] and DenseNet [21] for feature extraction and classification. The contributions of our study include: 1.
Proposing the DIKO model, a novel two-stage approach with a combined architecture that outperforms related studies on the problem of knee osteoarthritis diagnosis, especially for Vietnamese patients. 2. Introducing real-world datasets of knee X-ray images of Vietnamese people, which have been processed and enriched for effective training. The rest of this paper is organized into the following sections. In Sect. 2, we describe knee osteoarthritis and its stages. In Sect. 3, we survey related works. Our proposed model is presented in Sect. 4. In Sect. 5, we present experimental results based on the constructed and augmented dataset and compare them with other models. Finally, Sect. 6 is our conclusion and future work.
2 Knee Osteoarthritis and Region of Interest
Knee osteoarthritis (KOA) is a degenerative joint disorder that impacts the articular cartilage, synovium, and subchondral bone of the knee joint. It is prevalent among older adults and causes pain, stiffness, and disability as the cartilage that cushions the bones in the knee joint wears down over time. The KL classification system [3] outlines five grades of knee osteoarthritis (Fig. 1).
– Grade 0: Healthy knee.
– Grade 1: Minor osteophytes or doubtful presence of osteophytes.
– Grade 2: Visible osteophytes, slight joint space narrowing.
– Grade 3: Noticeable joint space narrowing, subchondral sclerosis, many osteophytes, possible bone deformation.
– Grade 4: Severe joint space narrowing (possibly even complete joint space loss), subchondral sclerosis, large osteophytes, and visible bone deformity.
Fig. 1. Example of KL classification system, severity increases from left to right
The Region of Interest (ROI) of the knee is the central area of the knee joint. In diagnosis, doctors usually focus on this area because it is where degenerative and injury conditions occur. Figure 2 shows an example of the knee's ROI, including the shin bone (tibia) and the thigh bone (femur). Based on the distance between these two bones and other observed conditions, doctors can diagnose the disease.
Fig. 2. ROI of the knee X-ray image
In Vietnam, the current diagnostic process for knee osteoarthritis relies on manual evaluations conducted by medical professionals to identify the ROI and ascertain the presence or absence of the condition. Considering this situation, our research aims to revolutionize this approach by introducing the DIKO model. DIKO is designed to automatically locate the ROI on X-ray images and provide accurate diagnoses for knee osteoarthritis, thereby offering a reliable alternative to manual evaluations. This advancement has the potential to significantly enhance the diagnostic accuracy of KOA while also reducing costs and saving valuable time.
3 Related Works
Various research groups worldwide have utilized CNN models to diagnose osteoarthritis. The results of such proposed solutions have been shown to be more
effective than traditional machine learning methods [14]. Bayramoglu et al. [7] employed a combination of features extracted by BoneFinder [16] and features from a TinyCNN network to make predictions. Wang et al. [8] used the VGG-16 [17] network to locate the center of the knee joint, and ResNet-50 [18] to classify KOA severity. Tiulpin et al. [10] used the Siamese [24] network with input pairs of the lateral and medial to evaluate the disease severity. An alternative to using end-to-end CNN models for classification involves using CNNs to extract features from X-ray images and combining them with additional features for improved classification, which has shown promising results. In [11], CNN-extracted features were normalized and selected through PCA before using SVMs to determine disease severity. Gu et al. [12] proposed a new method of using U-Net [15] to determine joint space narrowing in combination with the VGG-16 network output to create a feature vector for the Random Forest classifier. In recent years, attention-based networks have become increasingly popular in medical image processing. G´orriz et al. [9] combined Convolutional-Attention with VGG-16 to enable the model to learn from the Regions of Interest without extraction in advance. In addition to Convolutional-based Attention mechanisms, Self-Attention mechanisms, derived from the Transformer model [22] in Natural Language Processing, have been successfully applied in Computer Vision, outperforming CNNs when sufficient data is available. Alshareef et al. [13] used the ViT [23] model to diagnose the degree of knee joint degeneration. However, due to the lack of medical data, this model was not very effective.
4 DIKO Architecture
Our approach to diagnosing knee osteoarthritis consists of two stages. In the first stage, we utilize the YOLOv5 model for ROI identification in X-ray images. This state-of-the-art object detection algorithm enables accurate and efficient detection of the knee's ROI, which is crucial for the subsequent diagnosis. In the second stage, our hybrid model is employed to generate the final diagnosis result, showcasing exceptional performance. By combining these two stages, we are able to offer an efficient and accurate method for diagnosing knee osteoarthritis using X-ray images. The overall architecture of our model is illustrated in Fig. 3.
4.1 Stage 1: ROI Identification
The ROI of the knee is where most of the symptoms of knee osteoarthritis manifest (see Sect. 2). To prevent the model from learning noise and unnecessary information from the X-ray images, the ROI needs to be extracted in advance. For accurate ROI identification, we employed the YOLOv5 model (https://github.com/ultralytics/yolov5). YOLOv5, introduced by Glenn Jocher in 2020, builds on the previous YOLO architecture, incorporating optimization strategies such as mosaic data augmentation, random affine transforms, and auto-anchor using a genetic algorithm [26].
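Once YOLOv5 returns a bounding box for the knee joint, stage 1 ends by cropping that box and resizing the crop for the stage-2 classifier. A dependency-free nearest-neighbour sketch is shown below (the box coordinates are illustrative; a production pipeline would normally use OpenCV or torchvision for the resize):

```python
import numpy as np

def crop_and_resize(image: np.ndarray, box, size: int = 224) -> np.ndarray:
    """Crop (x1, y1, x2, y2) from an (H, W[, C]) image and resize the
    crop to (size, size) with nearest-neighbour sampling."""
    x1, y1, x2, y2 = box
    roi = image[y1:y2, x1:x2]
    # Nearest-neighbour index grids along each spatial axis.
    rows = np.linspace(0, roi.shape[0] - 1, size).round().astype(int)
    cols = np.linspace(0, roi.shape[1] - 1, size).round().astype(int)
    return roi[np.ix_(rows, cols)]

# A fake 512x512 X-ray with an illustrative detection box.
xray = np.random.rand(512, 512)
roi = crop_and_resize(xray, (100, 120, 380, 400))  # shape (224, 224)
```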
Fig. 3. Overall architecture of DIKO
According to the experimental results of Horvat et al. [26], YOLOv5 surpasses previous models in computation speed and accuracy. We trained the YOLOv5 model on our labeled knee image dataset (Sect. 5.1). The final model was used to extract the ROI and resize the images to 224 × 224 as inputs for the following step.
4.2 Stage 2: ROI Analysis
We combined DenseNet201 [21] and InceptionV3 [19] in our knee osteoarthritis diagnosis approach for improved accuracy in medical image analysis. DenseNet201 is known for its efficient parameter usage, which makes it easier to scale up the model while maintaining high accuracy. InceptionV3, on the other hand, uses multiple parallel convolutional layers with different kernel sizes to extract diverse features from images. Our experimental results showed that combining these two models led to higher accuracy than using a single model alone. The overall architecture of our model is illustrated in Fig. 3. It consists of two branches, each associated with a backbone network. Global Average Pooling is applied to each feature map to extract one feature per channel, as it has been found to be more effective than Max Pooling (Tiulpin et al. [10]). The resulting feature vectors are then passed through a fully connected layer with ReLU activation, reducing their dimensionality and generating higher-level features. These feature vectors are concatenated and passed through additional fully connected layers to facilitate interactions between the features from both networks. Disease classification is performed using either sigmoid or softmax activation. The binary cross-entropy loss function is used with this architecture to optimize the model's performance:

L_BCE = -(1/N) * sum_{i=1}^{N} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]
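The fusion head described above (Global Average Pooling per backbone, then concatenation of the pooled vectors) reduces to a few array operations. The sketch below uses random arrays as stand-ins for the two backbones' feature maps; the spatial sizes are illustrative, and 1920 and 2048 are the usual final channel counts of DenseNet201 and InceptionV3, which the paper itself does not state:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for backbone outputs, shaped (batch, height, width, channels).
feat_dense = rng.normal(size=(4, 7, 7, 1920))  # DenseNet201-like feature map
feat_incep = rng.normal(size=(4, 5, 5, 2048))  # InceptionV3-like feature map

def global_average_pool(x):
    """Average each channel over the spatial dimensions."""
    return x.mean(axis=(1, 2))

# One pooled vector per backbone, then concatenation of the two branches.
v1 = global_average_pool(feat_dense)   # (4, 1920)
v2 = global_average_pool(feat_incep)   # (4, 2048)
fused = np.concatenate([v1, v2], axis=1)
print(fused.shape)  # (4, 3968)
```

The fully connected layers and the sigmoid/softmax output would then operate on `fused`.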
A Two-Stage Hybrid Network for Knee Osteoarthritis Diagnosis
where N is the number of samples, y_i is the true label for sample i, and ŷ_i is the predicted probability for sample i. Furthermore, DenseNet201 and InceptionV3 are both effective feature extractors. Experimental results have demonstrated that combining these two networks produces superior features for the diagnosis of knee osteoarthritis.
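The loss above can be implemented directly; this is a generic NumPy version (the clipping constant `eps` is our addition to avoid log(0), not something specified in the paper):

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over N samples, as in the equation above."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # keep log() finite
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(round(float(bce_loss(y_true, y_pred)), 4))  # 0.1976
```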
5 Experiment

5.1 Datasets
We would like to introduce two datasets: “Knee X-ray Yes/No” and “Knee's ROI Identification”. These datasets include posterior-anterior (PA) and lateral (LAT) X-ray images of the knee from patients at a hospital in Vietnam. Each image in the first dataset has been evaluated by medical professionals to identify the presence or absence of symptoms of osteoarthritis. The second dataset was annotated with bounding boxes to facilitate ROI identification. These datasets serve as valuable resources for researchers and healthcare professionals seeking to advance their knowledge of knee joint pathologies and related conditions.

The dataset “Knee X-ray Yes/No” contains 895 images, of which 151 are marked as “No”, indicating the absence of knee osteoarthritis, and 744 are marked as “Yes”, indicating the presence of symptoms. To our knowledge, this is the first dataset in Vietnam specifically designed for knee osteoarthritis. It holds great potential for the development of machine learning models that can assist in the early detection and diagnosis of knee osteoarthritis, particularly for patients from Vietnam.
Fig. 4. Examples of images from our datasets
On the other hand, “Knee's ROI Identification” is another dataset, comprising 1268 images, each annotated with a bounding box identifying the ROI (Fig. 4). This dataset is also the first of its kind in Vietnam aimed at knee ROI identification. We aspire to use it to develop advanced knee ROI identification models that can assist in diagnosing knee joint diseases for Vietnamese patients.
Table 1. Results of the YOLOv5 model on the “Knee's ROI Identification” dataset

mAP@0.5   mAP@0.5:0.95   Time per epoch (s)
0.995     0.73           18

5.2 Results
ROI Identification. In our study, we partitioned the “Knee's ROI Identification” dataset into three subsets: 1108 images for training, 106 images for validation, and 54 images for testing. We trained our model on an Nvidia Tesla T4 GPU with 12 GB of memory. The evaluation results on the test set are presented in Table 1. Our findings indicate that the model achieved a near-perfect accuracy level at an IoU threshold of 0.5 and required a relatively low training time. This highlights the effectiveness of our model in knee ROI identification and its potential application in the diagnosis of knee joint diseases.

Augmentation to Address the Imbalanced Dataset. The “Knee X-ray Yes/No” dataset was collected from Vietnamese patients, resulting in more images with symptoms than without, i.e., imbalanced data. According to Buda et al. [25], data imbalance can have a negative impact on model performance. To address this, we used data augmentation techniques to oversample the non-symptomatic images. The techniques included horizontal flipping and rotating the flipped images by 20° in both the clockwise and counterclockwise directions. Details on the number of images before and after balancing can be found in Table 2.

Table 2. Comparison of rebalanced and imbalanced datasets

                     No    Yes   Overall
Before rebalancing   151   744   895
After rebalancing    755   744   1499

Table 3. Performance of the model before and after data rebalancing

Model           Accuracy   F1-Score
DIKO (before)   85.45%     0.836
DIKO (after)    88.74%     0.878
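The oversampling recipe for the minority class (horizontal flip, plus ±20° rotations of the flipped image) can be sketched as below. Everything here is our own illustration: the rotation is a simple nearest-neighbor implementation rather than whatever library the authors used, and the three stated transformations yield four images per original (604 from the 151 “No” images), so the exact recipe behind the 755 images in Table 2 is not fully specified:

```python
import numpy as np

def hflip(img):
    """Horizontal flip (mirror the columns)."""
    return img[:, ::-1]

def rotate(img, degrees):
    """Nearest-neighbor rotation about the image centre (no external deps)."""
    h, w = img.shape[:2]
    theta = np.deg2rad(degrees)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, find the source pixel.
    sy = cy + (ys - cy) * np.cos(theta) - (xs - cx) * np.sin(theta)
    sx = cx + (ys - cy) * np.sin(theta) + (xs - cx) * np.cos(theta)
    sy = np.clip(np.rint(sy), 0, h - 1).astype(int)
    sx = np.clip(np.rint(sx), 0, w - 1).astype(int)
    return img[sy, sx]

def oversample(minority_images):
    """Each original yields a flip and two rotated flips (+/-20 degrees)."""
    out = list(minority_images)
    for img in minority_images:
        f = hflip(img)
        out += [f, rotate(f, 20), rotate(f, -20)]
    return out

imgs = [np.eye(8) for _ in range(151)]  # 151 "No" images, as in Table 2
print(len(oversample(imgs)))  # 604
```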
According to Table 3, the model trained on augmented data outperformed the model trained on non-augmented data in terms of accuracy. Moreover, the F1-score of the augmented model improved, primarily due to more accurate predictions in the non-symptomatic class, which had limited data. This finding highlights the effectiveness of data augmentation in reducing the impact of data imbalance and improving the performance of machine learning models in knee osteoarthritis diagnosis.

Table 4. Comparison of the proposed model and other popular CNN architectures

Model                    Accuracy   F1-Score
VGG-16 [17]              71.82%     0.706
ResNet50 [18]            74.19%     0.73
InceptionV3 [19]         81.21%     0.794
InceptionResNetV2 [20]   80.91%     0.804
DenseNet201 [21]         83.03%     0.824
DIKO (ours)              88.74%     0.878
Comparison with Other Architectures. The results presented in Table 4 demonstrate that our model is significantly more effective than other popular CNN architectures. Additionally, architectures with more layers outperform those with fewer layers. Moreover, architectures that can combine features from different kernel sizes, such as InceptionV3 [19] and InceptionResNetV2 [20], also produce better results than models that use only a fixed kernel size. Our model achieved promising results by combining features extracted from two architectures, InceptionV3 and DenseNet201. This highlights the benefits of leveraging the strengths of multiple deep learning models for improved performance in medical image analysis, particularly in tasks such as knee osteoarthritis diagnosis, where accuracy is crucial.
6 Conclusion
In this study, we introduced two datasets of X-ray images for knee osteoarthritis diagnosis in Vietnamese patients, “Knee X-ray Yes/No” and “Knee's ROI Identification”. We proposed an automatic diagnosis method that involves two stages: using YOLOv5 to identify the ROI in images captured by the X-ray machine, and diagnosing knee osteoarthritis using the Multicolumn-CNN architecture. This new architecture combines two highly effective CNN architectures, InceptionV3 and DenseNet201. Our model demonstrated superior performance compared to other modern CNN architectures, with an accuracy of 85.45% on the original dataset. We also provided a data rebalancing method using data augmentation techniques. Training DIKO on the rebalanced dataset improved the accuracy to 88.74% and increased the F1-score from 0.836 to 0.878. In the future, we will create a dataset with various levels of knee osteoarthritis severity and collect more data on other knee joint diseases. Additionally, we will enhance and extend our model to incorporate information from IoT devices to increase accuracy and provide reliable and timely diagnoses for patients. Overall, our findings highlight the effectiveness of our proposed method and demonstrate its potential for real-world clinical applications.

Acknowledgments. This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number IZVSZ2.203310.
References

1. Cui, A., Li, H., Wang, D., Zhong, J., Chen, Y., Lu, H.: Global, regional prevalence, incidence and risk factors of knee osteoarthritis in population-based studies. EClinicalMedicine 29–30 (2020)
2. Ho-Pham, L.T., Lai, T.Q., Mai, L.D., Doan, M.C., Pham, H.N., Nguyen, T.V.: Prevalence of radiographic osteoarthritis of the knee and its relationship to self-reported pain. PLoS ONE 9(4), e94563 (2014)
3. Kellgren, J.H., Lawrence, J.S.: Radiological assessment of osteo-arthrosis. Ann. Rheum. Dis. 16(4), 494–502 (1957)
4. Ali, O., Abdelbaki, W., Shrestha, A., Elbasi, E., Alryalat, M.A.A., Dwivedi, Y.K.: A systematic literature review of artificial intelligence in the healthcare sector: benefits, challenges, methodologies, and functionalities. Innov. Knowl. 8(1), 100333 (2023)
5. Wahyuningrum, R.T., Anifah, L., Purnama, I.K., Purnomo, M.H.: A novel hybrid of S2DPCA and SVM for knee osteoarthritis classification. In: International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA), pp. 1–5. IEEE (2016)
6. Aprilliani, U., Rustam, Z.: Osteoarthritis disease prediction based on random forest. In: International Conference on Advanced Computer Science and Information Systems (ICACSIS), pp. 237–240. IEEE (2018)
7. Bayramoglu, N., Nieminen, M.T., Saarakkala, S.: A lightweight CNN and joint shape-joint space (JS2) descriptor for radiological osteoarthritis detection. In: Papież, B.W., Namburete, A.I.L., Yaqub, M., Noble, J.A. (eds.) MIUA 2020. CCIS, vol. 1248, pp. 331–345. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52791-4_26
8. Wang, Y., Li, S., Zhao, B., Zhang, J., Yang, Y., Li, B.: A ResNet-based approach for accurate radiographic diagnosis of knee osteoarthritis. CAAI Trans. Intell. Technol. 7(3), 512–521 (2022)
9. Górriz, M., Antony, J., McGuinness, K., Giró-i-Nieto, X., O'Connor, N.E.: Assessing knee OA severity with CNN attention-based end-to-end architectures. In: Proceedings of the International Conference on Medical Imaging with Deep Learning, pp. 197–214. PMLR (2019)
10. Tiulpin, A., Thevenot, J., Rahtu, E., Lehenkari, P., Saarakkala, S.: Automatic knee osteoarthritis diagnosis from plain radiographs: a deep learning-based approach. Sci. Rep. 8, 1727 (2018)
11. Ahmed, S.M., Mstafa, R.J.: Identifying severity grading of knee osteoarthritis from X-ray images using an efficient mixture of deep learning and machine learning models. Diagnostics 12(12), 2939 (2022)
12. Gu, H., et al.: Knee arthritis severity measurement using deep learning: a publicly available algorithm with a multi-institutional validation showing radiologist-level performance. arXiv preprint arXiv:2203.08914 (2022)
13. Alshareef, E.A., et al.: Knee osteoarthritis severity grading using vision transformer. Intell. Fuzzy Syst. 43(6), 8303–8313 (2022)
14. Abedin, J., et al.: Predicting knee osteoarthritis severity: comparative modeling based on patient's data and plain X-ray images. Sci. Rep. 9, 5761 (2019)
15. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
16. Lindner, C., et al.: Fully automatic segmentation of the proximal femur using random forest regression voting. IEEE Trans. Med. Imaging 32(8), 1462–1472 (2013)
17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR), pp. 1–14 (2015)
18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE (2016)
19. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826. IEEE (2016)
20. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pp. 4278–4284. ACM (2016)
21. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. IEEE (2017)
22. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS), vol. 30, pp. 5998–6008 (2017)
23. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021)
24. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: Proceedings of 2005 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 539–546. IEEE (2005)
25. Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018)
26. Horvat, M., Jelečević, L., Gledec, G.: A comparative study of YOLOv5 models performance for image localization and classification. In: Proceedings of the Central European Conference on Information and Intelligent Systems (CECIIS), pp. 349–356 (2022)
Shallow Convolutional Neural Network Configurations for Skin Disease Diagnosis

Ngoc Huynh Pham1, Hai Thanh Nguyen2(B), and Tai Tan Phan2

1 FPT University, FPT Polytechnic, Can Tho, Vietnam
[email protected]
2 Can Tho University, Can Tho, Vietnam
[email protected], [email protected]
Abstract. Because skin cancer incidence is usually relatively low, skin diseases do not receive enough attention, and most patients are already at a late stage when admitted to the hospital, when much damage has been done and treatment is difficult. Some diseases are so similar that it is hard to distinguish between them with the naked eye. Currently, artificial intelligence techniques are being vigorously applied to support medical imaging diagnosis and have achieved many successes with deep learning in image recognition. Deep architectures are complex and heavy, while shallow architectures can also deliver good performance with appropriate configurations. This study investigates configurations of shallow convolutional neural networks for binary classification tasks to support skin disease diagnosis. Our work focuses on studying and evaluating the effectiveness of simple, high-accuracy architectures for identifying skin diseases from images. Experiments on eight considered skin diseases reveal that shallow architectures can perform better on small image sizes (32 × 32) than on larger ones (128 × 128), with more than 0.75 accuracy on all considered diseases.

Keywords: diagnostics · shallow convolutional neural network configurations · skin diseases

1 Introduction
Biologically, skin cancer [1] is a phenomenon in which epidermal cells in the body grow out of order and abnormally, leading to tumor formation. If the tumor is made up of malignant cells, it is cancerous. The most common types of skin cancer are basal cell carcinoma (about 80%), squamous cell carcinoma (about 16%), and melanoma, the most dangerous type but with the lowest incidence (about 4%) [2]. The malignancy of skin cancer is ranked among the lowest among cancers because it has a relatively slow growth rate, so the mortality rate is lower than for other forms of cancer. However, if not treated in time, it can leave many severe consequences for the health and aesthetics of the patient.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 370–381, 2023. https://doi.org/10.1007/978-3-031-46573-4_34
Applying Artificial Intelligence (AI) in general, and deep learning (DL) in particular, to diagnose skin diseases based on images is a new approach to support doctors. In addition, Convolutional Neural Networks (CNNs) are widely implemented in numerous medical applications. Many studies have focused on very deep architectures, e.g., ResNet152 [3] with 152 weight layers for images of 224 × 224 pixels, but in some cases a shallow architecture also brings promising results. Our study evaluates the efficiency of shallow CNN architectures with only a few convolutional layers on image-based skin disease classification tasks. We provide a detailed analysis of shallow CNNs, covering the number of convolutional layers, the number of filters, and the input size. The results show that with small images such as 32 × 32, the classification performance can exceed 0.75 on all considered diseases.
2 Related Work
Applying deep learning methods to diagnosis is a new topic in the medical field. Skin diseases are common and easily seen by the naked eye; however, clearly distinguishing between skin diseases to find appropriate and timely treatment is challenging. In [4], the authors reviewed 45 studies on identifying skin diseases with deep learning, analyzing them from several perspectives: disease type, dataset, data processing technology, data augmentation technology, the model for skin disease image recognition, deep learning framework, evaluation indicators, and model performance. They also summarized disease diagnosis and treatment methods in two groups, traditional and machine learning, and concluded that deep learning-based skin disease recognition outperforms doctors and other computer-aided methods in diagnosing skin diseases, with the best recognition efficiency obtained by multi-model deep learning.

The authors in [5] reviewed the state of the art in computer-aided diagnosis systems and examined recent practices in the different steps of such systems. Among the machine learning techniques currently used for skin cancer diagnosis, the support vector machine (SVM) is the most prominent, and the diagnostic accuracy of these systems lies between 60% and 97%. In short, they argue that new researchers need standard procedures and publicly available datasets to fight this deadly disease together. Research [6] proposed a computer-aided diagnosis (CAD) system for diagnosing the most common skin lesions, such as psoriasis, acne, eczema, melanoma, and benign lesions. The experiments were performed on 1800 images and achieved 83% accuracy for 6-class classification with an SVM using a quadratic kernel.
The work [7] used relative color histogram analysis to evaluate skin lesion discrimination based on color characteristics computed in different regions of skin lesions in dermoscopy images. This technique was evaluated with different training set sizes drawn from 113 malignant melanoma and 113 benign dysplastic nevi images. The experiments achieved a recognition rate of 87.7% for malignant melanoma and 74.9% for benign dysplastic nevi. The paper [8] investigated the performance of deep learning-based approaches to classifying skin diseases from color digital images. The authors applied recent network models such as U-Net, Inception Version-3 (InceptionV3), Inception and Residual Network (InceptionResNetV2), VGGNet, and Residual Network (ResNet). The evaluation of the results obtained from these network models shows that it is possible to diagnose automatically from digital images with accuracy from 74% (U-Net) to 80% (ResNet). The authors argue that further research is needed to develop a new model that combines the advantages of many different network models to achieve optimal accuracy. In study [9], the authors used a dual-stage approach that effectively combined computer vision and machine learning on evaluated histopathological attributes to identify the disease accurately. The model was trained and tested on six diseases: psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. The accuracy achieved after training and testing is up to 95%. The work in [10] presented a co-attention fusion network for skin cancer diagnosis and reached 76.8% on the seven-point checklist dataset.
3 Methods

3.1 Data Set Description
The samples used in our experiments are taken from the training dataset of ISIC 2019, a large dataset with 25,331 skin lesion images. Skin samples in JPG format are classified into eight disease categories: dermatofibroma (DF), vascular lesion (VASC), squamous cell carcinoma (SCC), actinic keratosis (AK), benign keratosis (BKL), basal cell carcinoma (BCC), melanoma (MEL), and melanocytic nevus (NV); the sample numbers of each class are detailed in Table 1. For the binary classification tasks, we label the disease to be diagnosed as 1 and the remaining diseases as 0; the result returns the probability of the diagnosed disease. The hold-out method is used to divide the dataset (25,331 samples) randomly in a 9:1 ratio, of which 90% (22,797 samples) is used for training and the remaining 10% (2,534 samples) for testing.
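The labeling and splitting scheme above can be sketched as follows; indices stand in for images, and the seed and helper names are our own:

```python
import random

# 9:1 hold-out split of the 25,331 ISIC 2019 samples.
samples = list(range(25331))
random.Random(42).shuffle(samples)  # seed is arbitrary, for reproducibility

n_train = int(len(samples) * 0.9)        # 22,797 training samples
train, test = samples[:n_train], samples[n_train:]
print(len(train), len(test))  # 22797 2534

# Binary labels for one one-vs-rest task, e.g. melanoma (MEL) vs. the rest:
# label 1 for the disease being diagnosed, 0 for all other classes.
def binary_label(disease, target="MEL"):
    return 1 if disease == target else 0
```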
3.2 Architectures for Skin Diseases Classification
Table 1. Number of samples for each class in the ISIC 2019 dataset (https://challenge.isic-archive.com/)

Diseases                         Number of samples
Dermatofibroma - DF              239
Vascular lesion - VASC           253
Squamous cell carcinoma - SCC    628
Actinic keratosis - AK           867
Benign keratosis - BKL           2624
Basal cell carcinoma - BCC       3323
Melanoma - MEL                   4522
Melanocytic nevus - NV           12875
Total                            25331

Fig. 1. Architecture of the system to diagnose eight types of skin diseases

Figure 1 illustrates the proposed workflow for skin disease classification. The input consists of a set of color images of the skin surface with different resolutions. These images can be pre-processed by resizing to fit the trained model. We observe that a training model with single-class outputs is extensible and has great potential in medical image classification. Specifically, when new diseases need to be investigated, a single-class model does not require discarding the trained models; we only need to train additional models for the newly arising diseases, whereas with a multi-class model we would have to re-train from the beginning. This also makes the system components independent and saves training time. Furthermore, the output of a multi-class model is not specific: we do not see the disease probability for each disease. On the other hand, our experiments have shown that diagnostic results with a multi-class model are much lower than with single-class models. For these reasons, this study uses training models with single-class outputs instead of multi-class outputs, i.e., running eight groups of models (with the best results) for the eight diseases.

During the model training phase, we investigate different hyperparameters, including the convolutional neural network's main types of hyperparameters, such as the number of convolutional layers and the number of filters at each convolutional layer. The input size of the model is also a hyperparameter. Thus, within the scope of this study, we evaluate and experiment with different input image sizes, numbers of convolutional layers, and numbers of filters per convolutional layer. Model evaluation uses the hold-out method and the accuracy metric to measure the predictive performance of the disease classification model.
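The one-vs-rest setup described above can be illustrated with stub classifiers. Everything here (the stub probabilities, the `diagnose` helper) is hypothetical; it only shows how eight independent binary models each yield a per-disease probability, and how a new disease would add one more model without retraining the others:

```python
# One binary model per disease, each returning the probability that the
# image shows that disease. Real models are replaced by constant stubs here.
DISEASES = ["DF", "VASC", "SCC", "AK", "BKL", "BCC", "MEL", "NV"]

def make_stub_model(p):
    return lambda image: p  # a real model would actually score the image

models = {d: make_stub_model(p)
          for d, p in zip(DISEASES, [0.1, 0.05, 0.2, 0.3, 0.15, 0.4, 0.85, 0.6])}

def diagnose(image):
    """Return per-disease probabilities; adding a new disease only requires
    training one more binary model, not retraining the existing ones."""
    return {d: m(image) for d, m in models.items()}

probs = diagnose(None)
best = max(probs, key=probs.get)
print(best, probs[best])  # MEL 0.85
```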
Fig. 2. An illustration of a considered CNN architecture.
The convolutional neural network architectures are tested according to Formula 1 (notation follows https://cs231n.github.io/convolutional-networks/), where CONV is a convolutional layer, ReLU is the rectified linear unit activation function, POOL is a pooling operation, FC indicates a fully connected layer, and “*” denotes repetition:

INPUT → [[CONV → ReLU] * N → POOL] * M → [FC → ReLU] * K → FC    (1)

In the experiments, we set M = 1 and K = 1 and treat N = 1, 2 as a hyperparameter, investigated along with the number of filters (64, 128, 256) and the INPUT size (32, 64, 128). Thus, 18 architectures can be evaluated for each disease, and with eight diseases there are 144 diagnostic cases for comparison. Some other hyperparameters are held constant across all convolutional neural network configurations: batch size 128, the Adam optimizer with its default learning rate of 0.001, a maximum of 100 epochs, and an epoch patience of 5 (training stops if results do not improve).

Figure 2 illustrates an architecture with an input of 128 × 128, two convolutional layers with 128 filters (stride 1, fixed throughout the experiments), and one 2 × 2 max pooling layer. Before training the model, all data can go through the image preprocessing step to reach a size of 128 × 128 pixels. The first convolutional layer uses 128 filters of size 3 × 3 and a ReLU activation function. The architecture is built with two convolutional layers for more detailed object extraction, and the 2 × 2 max-pooling layer is used after the second convolutional layer. In addition, the model uses a fully connected layer and the sigmoid function to give a single output. The proposed algorithm mainly focuses on binary classifiers that classify different skin images for fast and accurate detection of skin diseases.
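Formula 1 can be expanded mechanically into a layer sequence. The generator below is our own illustration (the layer labels are informal strings, not Keras/PyTorch objects), shown for the Fig. 2 configuration (N = 2, M = 1, K = 1, 128 filters):

```python
def build_layer_plan(n=2, m=1, k=1, filters=128, input_size=128):
    """Expand Formula 1, INPUT -> [[CONV -> ReLU]*N -> POOL]*M
    -> [FC -> ReLU]*K -> FC, into an ordered list of layer names."""
    plan = [f"INPUT {input_size}x{input_size}x3"]
    for _ in range(m):
        for _ in range(n):
            plan += [f"CONV 3x3x{filters}", "ReLU"]
        plan.append("MAXPOOL 2x2")
    for _ in range(k):
        plan += ["FC", "ReLU"]
    plan.append("FC -> Sigmoid")
    return plan

# The architecture of Fig. 2: two conv layers of 128 filters, one pooling layer.
plan = build_layer_plan(n=2, m=1, k=1, filters=128, input_size=128)
for layer in plan:
    print(layer)
```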
4 Experimental Results

4.1 Performance with Different Input Image Sizes
The results of architectures with input sizes of 32 × 32, 64 × 64, and 128 × 128 on the eight binary classification tasks are shown in Table 2. As observed, the architecture using a 32 × 32 input image has the highest average accuracy (91.5%); the average accuracy decreases to 90.8% when the input image size is increased to 64 × 64, and is lowest (90.6%) when the input size is increased further to 128 × 128. The highest accuracy is 99.2%, with an almost negligible difference when resizing the input image. The lowest accuracy, 56.1%, was reached by the architecture using 128 × 128 input images. On the other hand, the architecture using a 32 × 32 input has the lowest standard deviation (0.104), which increases for the 64 × 64 input (0.111) and is highest for the 128 × 128 input (0.120); this also supports the assertion that a shallow model trained on small images can give more stable accuracy.

Table 2. Experimental results in accuracy of different input sizes on the test set

Input size   Average   Max     Min     Standard Deviation (SD)
32 × 32      0.915     0.991   0.579   0.104
64 × 64      0.908     0.992   0.612   0.111
128 × 128    0.906     0.992   0.561   0.120

4.2 Evaluation on Configurations of Shallow Convolutional Neural Networks on the Input Size of 32 × 32
Figure 3a shows that the average accuracy also improves when increasing the number of layers. Figure 3b reveals that when increasing the number of filters (from 64 to 256), all three measures, average, maximum, and minimum accuracy, increase. Specifically, the average accuracy increased from 90.5% to 90.8%, the maximum accuracy from 98.9% to 99.1%, and the minimum accuracy from 74.2% to 75.5%. As observed from Fig. 3c, when increasing the number of filters (from 64 to 256), the average and minimum accuracy tend to decrease (from 91.1% to 88.7% for average accuracy and from 77.1% to 57.9% for minimum accuracy). In contrast, the maximum accuracy tends to increase (from 98.9% to 99.1%), but not significantly. Furthermore, the higher the number of layers, the more the average and minimum accuracy tend to increase (from 90.5% to 91.1% for average accuracy and from 74.2% to 77.1% for minimum accuracy), as shown in Fig. 3d; notably, the maximum accuracy remains 98.9% when changing the number of layers. According to Fig. 3e, with the 32 × 32 input and 128 filters, as the number of layers increases, the average and maximum accuracy tend to increase (from 90.6% to 90.7% and from 99% to 99.1%, respectively), while the minimum accuracy tends to decrease (from 74.6% to 74.3%). Finally, Fig. 3f exhibits the architecture with a 32 × 32 input and 256 filters: the higher the number of layers, the more the average and minimum accuracy tend to decrease (from 90.8% to 88.7% for average accuracy and from 75.5% to 57.9% for minimum accuracy); notably, the maximum accuracy tends to remain constant.

4.3 Investigation on the Configurations with Input Size of 64 × 64

Figure 4a shows that when increasing the number of layers, both the average and minimum accuracy decrease. Specifically, the average accuracy decreased from 90.1% to 89.6%, the minimum accuracy decreased from 72.5% to 61.2%, and the maximum accuracy remained unchanged at 99.2%. In addition, the experiments in Fig. 4b show that, when increasing the number of filters (from 64 to 256), the maximum and minimum accuracy increase, but the average accuracy is unstable; the average accuracy does not depend on the change in the number of filters in this case. Another observation, in Fig. 4c, shows that when increasing the number of filters, all three measures, average, maximum, and minimum accuracy, are unstable; we cannot find a dependence of the accuracy on the number of filters used. In Fig. 4d, as the number of layers increases, the average and minimum accuracy decrease, while the maximum accuracy increases. According to Fig. 4e, with the 64 × 64 input and 128 filters, as the number of layers increases, the average accuracy tends to increase (from 89.8% to 90.1%), while, remarkably, the minimum and maximum accuracy tend to decrease. Furthermore, Fig. 4f shows that, with the 64 × 64 input and 256 filters, the higher the number of layers, the more all three measures tend to decrease.

4.4 Configurations on the Input Size of 128 × 128
Experimenting with the models using input size 128 × 128 and surveying the changes in the number of CNN layers, we obtained the results shown in Fig. 5a. We found that the trend of accuracy dependence on the number of layers is completely similar to the model using input size 64 × 64: when increasing the number of layers, the average and minimum accuracy decrease, but the maximum accuracy is unchanged. Similarly to the model using input size 64 × 64, as exhibited in Fig. 5b and Fig. 5c, the accuracy does not depend on the change in the number of filters in this case. However, adding more layers can decrease the performance (Fig. 5d and Fig. 5e). Furthermore, the results in Fig. 5f show that, with the 128 × 128 input and 256 filters, the higher the number of layers, the more the average and minimum accuracy tend to decrease; only the maximum accuracy tends to increase slightly.

Fig. 3. The performance of various configurations on the image input size of 32 × 32

Fig. 4. The comparison of various configurations on the image input size of 64 × 64

Table 3. Comparison of architectures in average accuracy

             64 filters         128 filters        256 filters
Input size   1 layer  2 layers  1 layer  2 layers  1 layer  2 layers
32 × 32      0.905    0.911     0.906    0.907     0.908    0.887
64 × 64      0.901    0.897     0.898    0.901     0.904    0.890
128 × 128    0.895    0.884     0.891    0.900     0.895    0.883

4.5 Performance on Various Diseases
Table 3 reveals that the architecture with 32 × 32 input size, using two convolutional layers (with 64 filters per layer), can give the highest average accuracy (91.1%) compared to all the tested architectures. With an input size of 128 × 128, we obtain the best with two convolutional layers, while for 64 × 64, the best one is one convolutional layer but with 256 filters. We use this architecture to experimentally investigate the accuracy of different diseases, as shown in Fig. 6. This figure shows that Dermatofibroma disease achieves the highest average accuracy (99.9%), and the NV disease training model gave the lowest average accuracy (77.1%).
Fig. 5. The comparison of various configurations (input size of 128 × 128)
Fig. 6. Comparison of accuracy between disease prediction models
5 Conclusion
Our study investigated the efficiency of shallow convolutional neural network configurations on skin disease images. As shown by the experiments, the architecture using 64 filters gives the highest average accuracy across diseases (91.1%). We experimented with only a few layers; with such a shallow architecture, if the input image size is increased, the image carries more detail than the network can process, so the accuracy can hardly improve further. Therefore, the input image size of 32 × 32 gives the highest accuracy compared to architectures using other input image sizes. From the above analysis of the surveyed data, we find that a shallow architecture with the hyperparameters input image size 32 × 32, two convolutional layers, and 64 filters per convolutional layer gives the highest accuracy in practice. This size is many times smaller than the original images but gives acceptable results. The unequal sample sizes of the subclasses indicate that some diseases may not have enough features to distinguish them from others. In addition, for datasets with large input images, for example 250 × 250, reducing them to 32 × 32 preserves only coarse features; small features are lost during downscaling, which can lead to incorrect identification. In the future, it is possible to continue experimenting with hyperparameters to choose the optimal value for each one and thereby improve the architectural accuracy and diagnostic quality.
Design an Indoor Positioning System Using ESP32 Ultra-Wide Band Module

Ton Nhat Nam Ho, Van Su Tran, and Ngoc Truong Minh Nguyen(B)

School of Electrical Engineering (SEE), International University, VNU-HCMC, Ho Chi Minh City, Vietnam
[email protected]
Abstract. In recent years, Indoor Positioning Systems have attracted considerable attention in many end-user applications such as medical healthcare, logistics and warehousing, smart buildings, the military, etc. In particular, indoor tracking often requires highly accurate localization. However, positioning precision is limited by various obstacles within the environment, especially Non-Line-Of-Sight conditions, which cause signal dispersion and occlusions. The proliferation of RF technologies such as RFID, WiFi, Bluetooth, and Zigbee enables users to detect an object's location within a particular range. Consequently, in the scope of this paper, a short-range technology system is developed for indoor positioning. The system tracks an object equipped with a small tag built on the ESP32 Ultra-Wide Band module. For the exact localization of the object, the method measures the travel time of the radio wave between the object and at least two receivers (called bilateration). Also, the model is validated through a good agreement between simulation and experimental results.

Keywords: Bilateration Positioning · ESP32 Ultra-Wide Band Module · Indoor Positioning System · Localization Systems · Object Tracking
1 Introduction
Positioning systems have permeated all facets of human life [1,2], among which satellite navigation is the most sophisticated positioning technology [3]. Nevertheless, these technologies are more suitable for wide outdoor environments, because indoor building materials block the signal and result in poor reception. That is why the Indoor Positioning System (IPS) came about. This kind of system relies on equipment that locates pinpointed devices using radio waves, magnetic fields, acoustic signals, sunlight, or other sensory data [4,5]. Nowadays, the use of IPS is becoming prominent across various end-user industries. The worldwide demand for indoor location is expected to grow at a Compound Annual Growth Rate (CAGR) of 14.2% during the forecast period

c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 382–393, 2023.
https://doi.org/10.1007/978-3-031-46573-4_35
ESP32 UWB Module Indoor Positioning System
between 2022 and 2032, reaching an amount of US$ 28,500.5 million in 2032, according to a report from Future Market Insights (FMI). The demand for tracking objects, people, and pathways has become prominent. In addition, the adoption of Internet of Things (IoT) devices is anticipated to bolster the growth of IPS in the forthcoming years. Moreover, retailers have been helped by the deployment of indoor technology to enhance their user experience (UX) as well as to provide proper navigation to objects or areas. Indoor location solutions bring favorable circumstances for corporations to connect customers inside large indoor spaces with their brands, products, and partners, providing them with diverse ways to increase customer relationships and sales [6,7]. Concretely, an IPS usually consists of two separate elements: anchors and position tags. An anchor is an electronic device placed at an important, specific location, while tags are carried by people or objects. An anchor either actively locates the mobile device and tag, or provides a location context for the device to determine its own position. The technologies used today are RF technologies including RFID, WiFi, Bluetooth, Zigbee, and UWB [8]. Based on RF signals, the position of a target is detected using radio signals transmitted from a transmitter to a receiver. Among them, Ultra-WideBand (UWB) has been proposed to provide low-cost yet accurate positioning within a few centimeters, which is used frequently in industrial environments [9–11]. Furthermore, indoor environments potentially include complex paths and obstacles, which can cause signal variation and noise problems. Under Non-Line-Of-Sight (NLOS) conditions, UWB devices can still provide location data with high accuracy, scalability, and reliability even when the environment contains obstructions.
This work concentrates on developing a UWB-based IPS by applying a method that measures the Time of Flight (ToF) between the object and the receivers. The novelty of this research comes from the fact that only two UWB anchors are set up as framework landmarks at two corners of an indoor zone. After the configuration is built up, the mobile indoor gadget starts to link to the anchors. Empirical results show that the UWB-bilateration location assessment can jointly achieve the accuracy and reliability of an IPS. The paper is organized as follows: Sect. 2 describes the theory related to the topic; Sect. 3 presents the research methodology of the system; the simulation and experimental results, as well as some discussions, come in Sect. 4; lastly, Sect. 5 concludes the paper.
2 Literature Review

2.1 What is Ultra-Wide Band?
UWB is a technology that can operate across a significant portion of the radio spectrum at extraordinarily low energy levels for short-range, high-bandwidth communications. Due to its broad bandwidth, a UWB signal has several frequency components, which increases the probability that it may pass through barriers.
T. N. N. Ho et al.
UWB offers a number of advantages, including the ability to operate in an unlicensed free spectrum, with a bandwidth greater than that of other positioning techniques, ranging from 3.1 to 10.6 GHz. The capacity of the UWB system to operate over low Signal-to-Noise Ratio (SNR) communication channels provides immunity from multi-path fading, and UWB signals transmit at low average power due to their short pulses (see Fig. 1). With UWB technology, applications for indoor positioning can seamlessly integrate accuracy, scalability, and reliability [12].
Fig. 1. A comparison between Indoor Positioning Technologies [13].
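The 3.1–10.6 GHz band quoted above makes the "ultra-wide" label concrete. A short check, using the common FCC-style definition that a signal is UWB when its fractional bandwidth exceeds 0.2 or its absolute bandwidth exceeds 500 MHz (that definition is our addition, not stated in the paper):

```python
# Fractional bandwidth of the 3.1-10.6 GHz UWB band. By the common
# FCC-style definition, UWB means fractional bandwidth > 0.2 or
# absolute bandwidth > 500 MHz.

F_LOW, F_HIGH = 3.1e9, 10.6e9           # band edges in Hz

bandwidth = F_HIGH - F_LOW              # absolute bandwidth: 7.5 GHz
center = (F_HIGH + F_LOW) / 2           # center frequency: 6.85 GHz
fractional_bw = bandwidth / center      # far above the 0.2 threshold

print(bandwidth / 1e9, center / 1e9, round(fractional_bw, 2))
```

The fractional bandwidth comes out above 1, several times the 0.2 threshold, which is why UWB pulses are so short and why ToF-based ranging can resolve distances to a few centimeters.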
2.2 Ranging Methods Using UWB
There are five UWB-based techniques applied to localization [14–16]:

1. Time Difference of Arrival (TDoA): The anchors receive data packets from the tag. The distance estimate and, eventually, the computation to find the item depend on the difference in the time of receipt by the anchors.
2. Angle of Arrival (AoA) Triangulation: The AoA measurement is directly calculated from the delay of arrival at each antenna element. Triangulation is the method of locating a point by forming triangles to it from other known points.
3. Two-Way Ranging (TWR): Utilizes two signal transmission delays that naturally occur to calculate the distance between two stations. The two delays are the processing delay of acknowledgment within a wireless device and the signal propagation delay between two wireless devices. The present study employs this technique.
4. Asymmetric Double-Sided Two-Way Ranging (ADS-TWR): Differs from the approach above in that the responses from the two stations are not synchronized, meaning that one station does not wait for the other station to respond before sending its own.
5. RSS based on Trilateration: The target point obtains the RSS of three separate known-position Access Points (APs, often WiFi routers) and uses a wireless signal transmission loss model. The RSS is then translated into the distances between the target and the associated APs.
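The arithmetic behind the TWR technique in item 3 can be sketched in a few lines; the timing values below are illustrative, not measurements from the system.

```python
# Minimal sketch of Two-Way Ranging (TWR): the initiator measures the
# round-trip time of a poll/response exchange; subtracting the responder's
# known processing delay leaves exactly two one-way flights of the signal.
# All numbers here are illustrative.

C = 299_792_458.0  # speed of light, m/s

def twr_distance(t_round, t_reply):
    """Distance from round-trip time minus the responder's reply delay.

    t_round: initiator time from sending the poll to receiving the response (s)
    t_reply: responder processing delay between receive and transmit (s)
    """
    tof = (t_round - t_reply) / 2.0   # one-way Time of Flight
    return C * tof

# A 10 m separation gives a one-way flight of roughly 33 ns; with a 1 us
# reply delay the formula recovers the separation.
t_flight = 10.0 / C
print(twr_distance(2 * t_flight + 1e-6, 1e-6))
```

Because the reply delay (microseconds) dwarfs the flight time (tens of nanoseconds), precise knowledge of `t_reply` is essential, which is exactly the clock-synchronization burden that ADS-TWR (item 4) relaxes.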
3 Methodology

3.1 System Overview
Building Prototype. The proposed IPS is composed of three distinct ESP32 UWB modules (see Fig. 2). Each module is based on the DW1000 and the ESP32. The unit supports up to 850 kbps of data rate and an 80 m point-to-point distance (LOS). They serve as two Anchors hooked up to the power supply and a mobile Tag connected to the WiFi in order to provide location data to a personal computer (see Fig. 3).
Fig. 2. An ESP32 DW1000 UWB module.
Fig. 3. System overview with two Anchors and a Tag.
While changing position, the indoor tag keeps estimating its relative location through computations involving the two anchor modules. More concretely, the mobile
point is computed precisely and instantly from the intersection of the two UWB antennas' ranging circles. As mentioned earlier, the TWR technique and the ToF convention are used to calculate the time taken for waves to travel from the two sources (Anchor 1 and Anchor 2) to the object and back (see Eq. 1). From this information, combined with basic mathematics and physics, we can derive the distance of the object from these sources. The research methodology is verified to work well on a 2D map, a scheme called bilateration [17] (see Fig. 4).

distance D_{1,2} = (speed of light × time_{1,2}) / 2    (1)
Fig. 4. Scenario of UWB-bilateration location computation.
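The bilateration scenario of Fig. 4 reduces to intersecting two circles. A minimal sketch, assuming Anchor 1 at the origin and Anchor 2 on the x-axis (our coordinate convention) and keeping the solution in front of the anchor baseline:

```python
import math

# Bilateration sketch: with Anchor 1 at (0, 0) and Anchor 2 at (d, 0), the
# tag position follows from intersecting the two ranging circles. A 2D map
# yields two mirror solutions; we keep the one with y >= 0 (in front of the
# anchor baseline). The coordinate convention is an assumption of this sketch.

def bilaterate(d, r1, r2):
    """Tag (x, y) from anchor spacing d and ranges r1, r2 to Anchors 1 and 2."""
    x = (r1**2 - r2**2 + d**2) / (2 * d)
    y2 = r1**2 - x**2
    if y2 < 0:
        raise ValueError("ranges are inconsistent with the anchor spacing")
    return x, math.sqrt(y2)

# Anchors 2.0 m apart; tag 2.06 m from each (the starting position used in
# the experiments of Sect. 4): the tag sits midway, 1.8 m from the baseline.
x, y = bilaterate(2.0, 2.06, 2.06)
print(round(x, 2), round(y, 2))
```

Note the mirror ambiguity: with only two anchors, a tag behind the baseline produces the same ranges, which is why the anchors are mounted at the edge of the covered area.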
Software Tools. The ESP32 modules give users the ability to determine the location of a tagged object. This can be audited with the use of a Graphical Interface Display (GID) created with the Unity Compiler.

Hardware Setup. The two main objectives are setting up the modules (Anchors and Tag) and then testing them.

Setup the Anchors. Each Anchor is connected to a fixed or mobile power supply. This is done by attaching wires to the 5V and GND pins of the module, or simply by connecting a Micro-USB cable to the module instead.

Setup the Tag. First, we connect the Tag to the local WiFi. After identifying the IPv4 address, we create a connection between the computer and the ESP32 Tag module. We copy the computer's IPv4 address and paste it back into the Tag as in Fig. 5.
Fig. 5. Setup WiFi and IPv4 for the Tag connection.
3.2 Device Connections
There are three objectives in this step: the first is setting up the Tag, the second is initiating the Anchors, and the last is building a GID.

Setting up the Anchors. The Anchors' responsibility is to recognize UWB pulses emitted by UWB Tags and transmit them to the location server. A set of Anchors must be mounted above the area to build the location infrastructure in order to cover the area with an IPS. We determine the device address for each of the two Anchors and place them in opposite corners of the room. Therefore, we must set two different addresses for the first and the second Anchor, as in Fig. 6.
Fig. 6. Setting up two Anchors in the system.
Then, the ESP32 UWB modules come with six distinct modes for transferring data at varied rates. The data rate and the pulse response of the device can be altered through the mode definition. Also, selecting the mode for the Anchors should be done in exactly the same way as for the Tags. Since we want the device to give accurate and long-range data, we choose to set them to the mode "LONGDATA RANGE ACCURACY" (see Fig. 7).
Fig. 7. Setting up the mode for the Anchors.
Setting up the Tag. In this step, we customize the Tag's functionality. The function of the device is to emit UWB pulses that are recognized by the Anchors and forwarded to a location server. Next, we set the address for the Tag so the computer will be able to recognize it when it is connected. We chose to set the Tag with a
different name, such as "7D:00:22:EA:82:60:3B:9C", as the Tag needs an address distinct from those of the anchors. This prevents errors from occurring while the device is in use. Finally, we configure the mode of the device. It is necessary that the Tags and the Anchors be configured in the same mode. As a result, the Tag is set similarly to the Anchors (see Fig. 8).
Fig. 8. Setting up the mode for the Tag.
Building the GID. Lastly, we construct a program utilizing the Unity Compiler to deliver a graphical interface for the user to interact with. The GID displays three primary components of the system: a virtual map with two Anchors (two purple dots), a Tag (an orange dot), and a button to link to the Tag. To begin, we make a frame that consists of a gray background representing the floor. After that, we construct the connect button so the Tag can be located and the communication can be established (see Fig. 9).
Fig. 9. Representation of the GID.
4 Simulation and Experimental Results
In this section, we begin to verify three processes: transmitting, collecting, and displaying ranging data.
4.1 Transmitting Ranging Data
The main goal is to make the Tag able to transmit the data to the Anchors (see Fig. 10).

– Step 1: Give the Tag the basic required function of the localization system, which is the ability to connect and communicate with the Anchors.
– Step 2: Begin to send the data every second over the UDP server.
– Step 3: Start connecting the Tag with the Anchors. When they connect, the communication routine will print out a notification and start generating location data.
Fig. 10. The Tag starts to transmit data.
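The periodic UDP reporting in the steps above can be sketched as follows; the server address, the JSON payload format, and the helper names (`make_report`, `tag_loop`) are illustrative assumptions of this sketch — the actual firmware runs on the ESP32 and defines its own payload.

```python
import json
import socket
import time

# Sketch of the Tag-side loop: serialize one ranging report and push it to
# the listening computer over UDP once per second. The IPv4 address, port,
# and message fields are hypothetical.

SERVER = ("192.168.1.10", 8080)   # hypothetical computer IPv4 and port

def make_report(tag_id, ranges):
    """Serialize one ranging report (format assumed for illustration)."""
    return json.dumps({"tag": tag_id, "ranges": ranges}).encode("utf-8")

def tag_loop(sock, reports, server=SERVER):
    """Send one report per second, mirroring Step 2 of the procedure."""
    for ranges in reports:
        sock.sendto(make_report("7D:00:22:EA:82:60:3B:9C", ranges), server)
        time.sleep(1.0)

# Usage: tag_loop(socket.socket(socket.AF_INET, socket.SOCK_DGRAM),
#                 [{"anchor1": 1.78, "anchor2": 2.40}])
```

UDP is a natural fit here: a lost report is simply superseded by the next one-second update, so no retransmission logic is needed.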
4.2 Collecting Ranging Data
The main purpose is to make the Anchors receive the data transmitted by the Tags, then calculate the distance between each device (see Fig. 11).
Fig. 11. The Anchors collect ranging data.
– Step 1: Install the Arduino-DW1000 library, which offers the functionality essential to use the DW1000 chips and modules.
– Step 2: Begin to test the code to provide the general information of the Anchor, such as the distance to the Tag and notifications when connecting or deleting a Tag.
– Step 3: Start connecting the Anchors with the Tag.
4.3 Simulation and Experimental Testing
To start the comparison, we first set up a 3 × 4 m room in the simulation, containing a blue box representing the mobile Tag and the two Anchors shown as red boxes placed at the two top corners of the room. In reality, the mobile Tag is attached to a person, and the two Anchors connect to a personal computer for location data tracking. The base system is situated on a table at a height of 70 cm from the ground. Anchor1 is placed to the right and Anchor2 to the left of the room. Concretely, the distance between the two Anchors is 2.0 m, and they are positioned symmetrically with respect to the computer. At first, the tester stands at a distance of 1.8 m from the receiving system. This means that the Tag is equally distant, about 2.06 m, from the two Anchors, by the Pythagorean formula. Figure 12 illustrates the configuration for comparison testing.
Fig. 12. Simulation (left) and experiment (right) configuration.
In case 1, the tester moves one step (exactly 0.6 m) to the right of the original position. The Tag transmits and the two Anchors collect ranging data continuously. The distances between the Tag and the two Anchors are displayed repeatedly on both the computer system and the GID. By calculation, the distances from the Tag to Anchor1 and Anchor2 are about 1.84 m and 2.41 m, respectively. Meanwhile, the results displayed on the data receiving system indicate that the user is currently on the right side of the system, on average 1.78 m from Anchor1 and 2.40 m from Anchor2 (see Fig. 13). The errors between the real-world and measured values are 0.06 m for Anchor1 and 0.01 m for Anchor2. The computed ranges between the Tag and the Anchors shown on the Serial Monitor agree well with reality. In the second case, the tester moves three steps to the opposite side. In more detail, the user stands 1.5 m to the left compared with the case-1 position. Again, by a simple computation, the real distances between the Tag and Anchors 1 and 2 are 2.62 m and 1.80 m, respectively. As a result, the new ranging data shown on the Serial Monitor indicates that the Tag has moved closer to Anchor2 (about 1.71 m) than to Anchor1 (about 2.67 m) on average (see Fig. 14). The differences between real and measured data are 0.09 m for Anchor1 and 0.05 m for Anchor2.
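The reference distances quoted in both cases follow from the Pythagorean formula. A small check (the coordinate convention is ours) reproduces them:

```python
import math

# Cross-check of the reported geometry: anchors 2.0 m apart, the tester's
# walking line 1.8 m in front of the anchor baseline, starting midway
# between the anchors. Offsets are measured from that starting point
# (positive = towards Anchor1 on the right).

ANCHOR_HALF_SPACING = 1.0   # half of the 2.0 m anchor separation
DEPTH = 1.8                 # tester's distance from the anchor baseline

def ranges(offset):
    """(distance to Anchor1 on the right, distance to Anchor2 on the left)."""
    d1 = math.hypot(ANCHOR_HALF_SPACING - offset, DEPTH)
    d2 = math.hypot(ANCHOR_HALF_SPACING + offset, DEPTH)
    return round(d1, 2), round(d2, 2)

print(ranges(0.0))    # starting position: about 2.06 m to each anchor
print(ranges(0.6))    # case 1, one step right: about 1.84 m and 2.41 m
print(ranges(-0.9))   # case 2, net 0.9 m left: about 2.62 m and 1.80 m
```

Each computed pair matches the paper's hand-calculated references, so the reported measurement errors (0.01 m to 0.09 m) are attributable to the ranging hardware rather than to the geometry.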
Fig. 13. Ranging results on the Serial Monitor and on the GID in case 1.
Once again, the ranging data show that the measured device location agrees well with the practical position of the Tag. In addition, the mean error distance is less than 10 cm. This justifies the high accuracy of the system.
Fig. 14. Ranging results on the Serial Monitor and on the GID in case 2.
Lastly, in order to verify that the IPS works with only two anchors, we also tested with a third anchor in the second case. This additional anchor is placed at the opposite corner of the room, 0.5 m from the walls in both dimensions, like the previous two. From a quick calculation, the distance from the third anchor to the Tag is 2.33 m. Concurrently, the obtained ranging data is roughly 2.31 m. The ranging data for the third anchor, as well as the whole configuration shown on the Serial Monitor, are presented in Fig. 15. Note that the anchor number is just an alias name, which explains the switching between Anchor1 and Anchor2 in Fig. 14 and Fig. 15.
Fig. 15. Ranging results on the Serial Monitor and on the GID in case 2 using three anchors.
5 Conclusion
In this paper, we design and implement an application that can determine the indoor location of an object by using three ESP32 DW1000 UWB modules. The modules are used for transmitting and receiving the ranging data in order to obtain the location of the person or object carrying a Tag and then send it to the User Datagram Protocol (UDP) server. Additionally, a user interface connected to the server and built with the Unity Compiler shows the base GID and the location of the Tags on the screen. The system works normally without meeting any major challenges, as the base purpose of the system is to recognize the location data of the Tag and display it in the application. The scenario at the moment is a rather small and simple area with few obstacles, for instance, chairs, bags, etc. In future work, we will test the IPS in a room with more complex obstructions, such as a warehouse or a working office.
References

1. Yasuhiro, K., Hiroshi, H., Kenji, S.: Positioning system using PHS and a radio beacon for logistics. In: 2008 IEEE International Conference on Automation and Logistics (2008). https://doi.org/10.1109/ICAL.2008.4636126
2. Lee, C.K.M., Ip, C.M., Taezoon, P., Chung, S.Y.: A bluetooth location-based indoor positioning system for asset tracking in warehouse. In: 2019 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM) (2019). https://doi.org/10.1109/IEEM44572.2019.8978639
3. Satellite Navigation. https://satellite-navigation.springeropen.com/
4. Luka, B., Mladen, T.: Overview of indoor positioning system technologies. In: 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) (2018). https://doi.org/10.23919/MIPRO.2018.8400090
5. Pooyan, S.F., Amirhossein, F., Javad, R., Alizera, B.: A survey on indoor positioning systems for IoT-based applications. IEEE Internet Things J. 9(10), 7680–7699 (2022). https://doi.org/10.1109/JIOT.2022.3149048
6. Pavel, P., Sven, C., Joaquin, T.-S., Elena, S.L., Jari, N.: Collaborative indoor positioning systems: a systematic review. Sensors 21(3), 1002 (2021). https://doi.org/10.3390/s21031002
7. Bo, Y., Tomohiko, M., Naoki, S.: Auto-tracking wireless power transfer system with focused-beam phased array. IEEE Trans. Microw. Theory Tech. 71(5), 2299–2306 (2023). https://doi.org/10.1109/TMTT.2022.3222179
8. Shanghui, D., Wenjie, Z., Li, X., Jingmin, Y.: RRIFLoc: radio robust image fingerprint indoor localization algorithm based on deep residual networks. IEEE Sens. J. 23(3), 3233–3242 (2023). https://doi.org/10.1109/JSEN.2022.3226303
9. Abubakar, S., et al.: Compact base station antenna based on image theory for UWB/5G RTLS embraced smart parking of driverless cars. IEEE Access 7, 180898–180909 (2019). https://doi.org/10.1109/ACCESS.2019.2959130
10. Rattiya, K., Supakit, K., Danai, T., Chuwong, P.: Switched beam multi-element circular array antenna schemes for 2D single-anchor indoor positioning applications. IEEE Access 9, 58882–58892 (2021). https://doi.org/10.1109/ACCESS.2021.3072951
11. Hui, Z., et al.: A dynamic window-based UWB-odometer fusion approach for indoor positioning. IEEE Sens. J. 23(3), 2922–2931 (2023). https://doi.org/10.1109/JSEN.2022.3228789
12. Yoshikawa, M., Mito, S., Kanasugi, H.: Indoor spatial-environment measurement using ultra-wideband positioning system. In: IEEE Sensors, pp. 01–04 (2022). https://doi.org/10.1109/SENSORS52175.2022.9967251
13. Jayakhanth, K., AbdelGhani, K., Somaya, A., Abdulla, K.A.: Indoor positioning and wayfinding systems: a survey. Hum.-Centric Comput. Inf. Sci. 10(1), 1–41 (2020). https://doi.org/10.1186/s13673-020-00222-0
14.
Taavi, L., Sander, U., Muhammad, M.A., Yannick, L.M.: Active-passive two-way ranging using UWB. In: 2020 14th International Conference on Signal Processing and Communication Systems (ICSPCS) (2020). https://doi.org/10.1109/ICSPCS50536.2020.9309999
15. Haneda, K., Takizawa, K.I., Takada, J.I., Dashti, M., Vainikainen, P.: Performance evaluation of threshold-based UWB ranging methods. In: 2009 3rd European Conference on Antennas and Propagation (2009). ISSN: 2164-3342
16. Jerome, H.: Ranging and positioning with UWB. In: UWB Technology - New Insights and Developments (2022). https://doi.org/10.5772/intechopen.109750
17. Chian C., H., River, L.: Real-time indoor positioning system based on RFID heron-bilateration location estimation and IMU inertial-navigation location estimation. In: 2015 IEEE 39th Annual Computer Software and Applications Conference, vol. 3, pp. 481–486 (2015). https://doi.org/10.1109/COMPSAC.2015.317
Towards a Smart Parking System with the Jetson Xavier Edge Computing Platform

Cuong Pham-Quoc1,2(B) and Tam Bang1,2

1 Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Viet Nam
2 Vietnam National University - Ho Chi Minh City (VNU-HCM), Thu Duc, Ho Chi Minh City, Vietnam
{cuongpham,bnbaotam}@hcmut.edu.vn
Abstract. Smart parking systems are becoming increasingly popular in smart cities due to their numerous benefits. Unlike traditional systems that require drivers to spend a lot of time searching for parking spots, smart parking systems use a combination of edge computing platforms, cloud services, and user applications based on videos and sensor data. This paper presents our system design and implementation of smart parking. Our proposed architecture uses edge computing to process most workloads in video processing, which overcomes network bandwidth obstacles. We deployed our prototype at our institution campus using Jetson Xavier boards for testing. Our experimental results show that we achieve video processing performance at the edge side of up to 30 FPS. We developed two AI models that can recognize vehicle license plates and manage parking slots. We used certified datasets for training and testing, and the models offer an accuracy of up to 99.6%.

Keywords: Jetson Nano · Edge computing platform · Smart parking

1 Introduction
Smart parking systems offer numerous advantages transforming the traditional parking experience, benefiting drivers and parking lot operators. One significant advantage is the reduced search time for parking spaces, leading to less congestion and improved traffic flow. Along with these obvious advantages, smart parking systems also help reduce CO2 emissions [18], one of the most critical issues for big cities. The real-time availability information and guidance these systems provide enable drivers to locate and reserve vacant parking spots quickly, eliminating the frustration of circling around aimlessly. Another advantage of smart parking is the optimization of parking space utilization. Operators can efficiently manage parking resources and maximize revenue generation by using sensors and data analytics. These systems allow operators to monitor occupancy levels, identify parking patterns, and make informed decisions regarding pricing and allocation.

c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 394–402, 2023.
https://doi.org/10.1007/978-3-031-46573-4_36
Smart Parking with Jetson Nano
A study conducted by the National Center for Sustainable Transportation found that smart parking systems can improve parking space utilization [9]. This efficient use of parking infrastructure maximizes revenue and reduces the need for additional parking spaces, which is crucial in congested urban areas.

While smart parking systems offer numerous advantages, we must address some technology challenges for successful implementation [12]. One of the primary challenges is the need for a robust and reliable communication infrastructure. Smart parking systems rely on real-time data exchange between sensors, devices, and servers to provide accurate information to drivers and operators. However, ensuring seamless connectivity and network reliability in all areas, including underground parking lots or remote locations, can be a technological hurdle.

Another challenge is the accurate detection and monitoring of parking space occupancy. Smart parking systems employ various technologies such as sensors, cameras, or license plate recognition to determine the availability of parking spots. However, these technologies may face difficulties in accurately detecting smaller vehicles, motorcycles, or unconventional parking situations [16]. Ensuring high detection accuracy across diverse scenarios remains a technological challenge that needs to be overcome.

Integration and interoperability are also significant challenges in smart parking systems. Different components, such as parking meters, payment systems, and mobile applications, must seamlessly work together to provide a cohesive and user-friendly experience. Achieving interoperability among various hardware and software components from different vendors requires standardization and compatibility protocols.

In this paper, we present our proposed system based on edge computing platforms, more precisely, the Jetson Xavier board, for a smart parking system.
The proposed system consists of three main layers: (i) the edge layer, where input devices (cameras for our first prototype) are attached and AI models are deployed to process data before sending it to the cloud; (ii) the cloud layer, where databases and services are built for managing both parking areas and users; and (iii) the user layer, which provides interfaces so that users can utilize the proposed system. The main contributions of our paper are threefold.

1. We propose an edge-computing architecture for smart parking where edge devices do most of the workload of data processing;
2. We present our prototype system with the Jetson Xavier edge computing platform; and
3. We summarize our experimental results regarding performance and accuracy for future study referencing.

The rest of the paper is organized as follows. Section 2 introduces related work proposed in the literature. The proposed system architecture is discussed in Sect. 3. We introduce our first prototype version with the Jetson Xavier edge computing platform and related results in Sect. 4. Finally, Sect. 5 concludes our paper.
C. Pham-Quoc and T. Bang

2 Related Work
The development of effective smart parking systems at a reduced cost is now possible thanks to the rapid advancement in internet, communication, and information technology. Researchers have created alternative ways to manage parking lots using various methodologies and sensors. They can be divided into two main groups: entrance gate control and parking lot management. However, the entrance gate control approach suffers from many limitations compared to the other; for example, it cannot help drivers find suitable slots in the parking area [6]. In the second type of management, many approaches focus on mono-slot management to keep track of every individual slot, while some more recent approaches take multiple slots into account [11]. In the former method, mono-slot-based systems deploy a network of wireless sensors, one attached to each slot, to detect parked cars. Studies in this type of management focus on energy efficiency [17,19], using TinyOS [3,20,24]. The latter method mainly uses AI models and image processing to keep track of multiple parking slots concurrently [1,2,5,10,21–23]. Although the systems handling multiple slots by image processing and AI models use modern techniques, all collected images and videos are processed at server machines or cloud services. This approach requires a massive amount of storage and communication bandwidth. This paper presents our system to manage multiple slots with AI models while the main data-processing workload is done by edge computing devices. Therefore, our system does not require high communication bandwidth and storage capacity.
3 System Design

Figure 1 depicts our proposed edge-based smart parking system architecture. The architecture consists of three main layers: edge, cloud, and user. These layers are connected through wireless connections.

3.1 Edge Layer
This is the primary layer for collecting and pre-processing data from cameras and sensors. The layer comprises edge computing platforms that host computational models and input devices. The computational models can be AI models, encoders/decoders, and security functionalities. The ultimate purpose of this layer is to extract parameter information from input devices and forward it to the cloud layer. The two mandatory data types that should be collected are vehicle information (e.g., license plates or positions) and parking slot information (available or occupied). Unlike other systems where the entire collected data is transferred directly to the cloud, our edge layer only forwards essential data to the cloud layer, improving system performance and reducing the required communication bandwidth and volume of storage.
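The paper does not specify the wire format of the forwarded data; as a purely illustrative sketch (the function and all field names are our own assumptions), the essential payload the edge layer sends upstream might look like:

```python
import json

def build_edge_message(camera_id, plate, slot_id, status):
    """Package only the essential extracted data for the cloud layer."""
    assert status in ("available", "occupied")
    return json.dumps({
        "camera": camera_id,   # which edge camera produced the detection
        "plate": plate,        # recognized license plate (if any)
        "slot": slot_id,
        "status": status,      # parking slot state
    })

msg = build_edge_message("cam-01", "51A-123.45", 7, "occupied")
```

Forwarding a few hundred bytes per event instead of a raw video stream is what keeps the bandwidth and storage requirements low.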
Smart Parking with Jetson Nano

Fig. 1. The 3-layer architecture of the edge-based smart parking system
3.2 Cloud Layer

This layer is responsible for storing data extracted by the edge layer for further processing. It contains at least two databases: one holds users' information, such as accounts and bookings, and another stores parking slots' statuses, such as available, booked, and occupied. On top of these databases, several services can be built for different purposes. The two primary services are the booking and routing services. The booking service enables users to reserve slots for specific intervals, while the routing service assists drivers in quickly reaching their booked parking spaces using a map. These services interact directly with users through the user interface deployed on their portable devices, following the software-as-a-service cloud computing model.

3.3 User Layer
When it comes to interacting with users, this layer is crucial. Depending on the cloud layer's services, this layer offers various graphical user interfaces (GUIs) for portable devices. Since the focus here is on the smart parking system, it's essential that this layer provides two specific GUIs: booking and routing. As mentioned earlier, the purpose of these two services is to make it easier for users
to book parking spots and navigate to them. To communicate with the services on the cloud, user applications will use wireless internet access.
4 Experiments

In this section, we first present our first prototype setup for testing the proposed architecture with the Jetson Xavier edge computing platform. The prototype is built with two AI models for recognizing parking slots' status (empty or occupied) and license plates from videos captured in real time by cameras. The user application allows drivers to reserve a place and guides them to the booked slots.

4.1 Prototype Setup
Figure 2 summarizes our prototype setup for the smart parking system at HCMUT. The prototype implementation targets one parking lot on our main campus and follows the three-layer architecture proposed above.

Edge Layer: our parking lot management system utilizes two cameras to recognize license plates and manage parking slots. We use the Jetson Xavier platform [15] as the edge computing board to ensure efficient data processing. On this board, we deploy two AI models, Yolo-tiny and Yolov4 [4], for recognizing license plates and managing parking slots, respectively. These models were trained on the COCO dataset [14] for places and a specialized dataset for Vietnamese license plates [7]. The power source for the board and cameras is currently the electricity grid, but battery power can also be utilized. A wireless internet connection establishes communication between the board and the cloud services.

Cloud Layer: we utilize the Heroku cloud services provider to host our system's databases and services. To handle our databases for users' information and parking slots' statuses, we use the PostgreSQL open-source relational DBMS. On top of these databases, we have built the two primary services of our first prototype: the booking and routing services. The booking service enables users to reserve a slot for a fixed interval, while the routing service, utilizing Google Maps, guides the driver from their current location to the booked place.

User Layer: as mentioned above, this layer provides user interfaces to interact with the system. Along with the two primary services, booking and routing, the system allows users to register, retrieve their booking and parking history, modify bookings, and log in/out of the system. Currently, we only provide Android-based applications.
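The prototype keeps users and slot statuses in PostgreSQL on Heroku, but the actual schema is not published. The following stand-in using Python's built-in sqlite3 (all table and column names are assumptions) sketches the booking flow the text describes:

```python
# Illustrative sketch only: an assumed schema for the users/slots/bookings
# data, with sqlite3 standing in for the prototype's Heroku PostgreSQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE slots (id INTEGER PRIMARY KEY, status TEXT DEFAULT 'available');
CREATE TABLE bookings (
    user_id INTEGER REFERENCES users(id),
    slot_id INTEGER REFERENCES slots(id),
    start_ts TEXT, end_ts TEXT
);
""")

# Book slot 1 for user 1 over a fixed interval, then mark the slot as booked.
db.execute("INSERT INTO users VALUES (1, 'driver')")
db.execute("INSERT INTO slots (id) VALUES (1)")
db.execute("INSERT INTO bookings VALUES (1, 1, '09:00', '11:00')")
db.execute("UPDATE slots SET status = 'booked' WHERE id = 1")
status = db.execute("SELECT status FROM slots WHERE id = 1").fetchone()[0]
```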
Fig. 2. The implementation of our first prototype smart parking system at HCMUT
4.2 Experimental Results

This section introduces our experimental results regarding the AI models' accuracy and the entire system's response time for different use cases.

AI Models Evaluation: regarding license plates, due to the simplicity of this requirement, the Yolo-tiny model recognizes all car license plates. We use the following parameters to evaluate our AI model managing the parking slots and compare it with other models:

– True positive rate (TPR): the fraction of total samples that are actual slots and are recognized as slots;
– False positive rate (FPR): the fraction of total samples that are not slots but are recognized as slots;
– True negative rate (TNR): the fraction of total samples that are not slots and are not recognized as slots;
– False negative rate (FNR): the fraction of total samples that are actual slots but are not recognized as slots;
– Precision: the true positive prediction rate, calculated as Precision = TPR / (TPR + FPR);
– Recall: the actual positive rate, calculated as Recall = TPR / (TPR + FNR);
– F1-score: the overall performance of the model, estimated as F1 = 2 × (Precision × Recall) / (Precision + Recall).
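As a quick consistency check, the three derived metrics can be recomputed from the raw rates; plugging in the Yolov4 row of Table 1 reproduces the printed values:

```python
# Recomputing Precision, Recall, and F1 from the paper's rate definitions
# (TPR, FPR, FNR are fractions of the total samples).
def precision(tpr, fpr):
    return tpr / (tpr + fpr)

def recall(tpr, fnr):
    return tpr / (tpr + fnr)

def f1_score(p, r):
    return 2 * p * r / (p + r)

# Values for the Yolov4 (ours) model reported in Table 1.
p = precision(0.893, 0.003)   # ~0.996
r = recall(0.893, 0.017)      # ~0.982
f1 = f1_score(p, r)           # ~0.989
```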
Table 1 presents these values for three models: SSD-MobileNet [8], RetinaNet [13], and Yolov4 (ours). The table shows that our deployed model performs similarly in Precision, Recall, and F1-score. These results validate our trained Yolov4 model for managing parking slots.

Table 1. AI models accuracy comparison

Model         | TPR   | FPR   | TNR   | FNR   | Precision | Recall | F1-score
SSD-MobileNet | 0.897 | 0.006 | 0.083 | 0.013 | 0.992     | 0.985  | 0.988
RetinaNet     | 0.903 | 0.010 | 0.080 | 0.006 | 0.989     | 0.992  | 0.990
Yolov4 (ours) | 0.893 | 0.003 | 0.087 | 0.017 | 0.996     | 0.982  | 0.989
Performance Evaluation: along with the accuracy analysis above, we also compare the processing speed of the Jetson Xavier board with two other edge computing platforms, the Jetson Nano and Raspberry Pi 4, in terms of frames per second (FPS). Table 2 compares the three edge platforms when deploying the Yolov4-based models for both license plate recognition and parking slot management. As shown in the table, our system deployed on the Jetson Xavier edge computing platform can process captured videos in real time, with up to 30 FPS for license plate recognition and 27 FPS for slot management.

Table 2. Edge computing performance comparison (frames per second)

Platform       | License plate | Parking slot
Jetson Nano    | 17-21         | 12-16
Raspberry Pi 4 | 0.71-1.34     | 0.98-2.47
Jetson Xavier  | 21-24         | 27-30
We then evaluate the response time of our cloud-based booking service. Our testing database comprises 1,886 users making 9,068 booking requests for 50 parking slots. Figure 3 illustrates the response time for booking requests. As shown in the figure, more than 57% of requests are answered in less than 1 s; only 0.2% of requests need more than 3 s.
Fig. 3. Response time comparison for the booking service
5 Conclusion

Smart parking systems, in high demand for smart cities, offer several advantages over traditional approaches, which require drivers to spend considerable time searching for and reaching parking spots. This paper presents the design and implementation of our smart parking solution. It combines edge computing platforms, cloud services, and user applications that utilize video and sensor data. The proposed architecture capitalizes on the processing capabilities of edge computing to handle the majority of video processing tasks, thereby overcoming network bandwidth limitations. We developed an initial prototype using Jetson Xavier boards to test our concept, which we deployed on our institution's campus. The experimental results demonstrate that we achieved edge-side video processing performance of up to 30 frames per second (FPS). We also developed two AI models, one for license plate recognition and the other for parking slot management. By employing certified datasets for training and testing, these models achieve an accuracy of up to 99.6%.

Acknowledgement. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for supporting this study.
References

1. Amato, G., Carrara, F., Falchi, F., Gennaro, C., Meghini, C., Vairo, C.: Deep learning for decentralized parking lot occupancy detection. Expert Syst. Appl. 72, 327–334 (2017)
2. Baroffio, L., Bondi, L., Cesana, M., Redondi, A.E., Tagliasacchi, M.: A visual sensor network for parking lot occupancy detection in smart cities. In: 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT), pp. 745–750. IEEE (2015)
3. Benson, J.P., et al.: Car-park management using wireless sensor networks. In: Proceedings 2006 31st IEEE Conference on Local Computer Networks, pp. 588–595. IEEE (2006)
4. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
5. Bong, D., Ting, K., Lai, K.: Integrated approach in the design of car park occupancy information system (COINS). IAENG Int. J. Comput. Sci. 35(1), 8 (2008)
6. Chinrungrueng, J., Dumnin, S., Pongthornseri, R.: iParking: a parking management framework. In: 2011 11th International Conference on ITS Telecommunications, pp. 63–68. IEEE (2011)
7. Forum, C.V.: Data for car's license plate (2022). https://thigiacmaytinh.com/tainguyen-xu-ly-anh/tong-hop-data-xu-ly-anh/. Visited on 10 Jun 2023
8. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications (2017)
9. Kalašová, A., Čulík, K., Poliak, M., Otáhalová, Z.: Smart parking applications and its efficiency. Sustainability 13(11), 6031 (2021)
10. Kamble, S.J., Kounte, M.R.: Machine learning approach on traffic congestion monitoring system in internet of vehicles. Procedia Comput. Sci. 171, 2235–2241 (2020)
11. Karbab, E., Djenouri, D., Boulkaboul, S., Bagula, A.: Car park management with networked wireless sensors and active RFID. In: 2015 IEEE International Conference on Electro/Information Technology (EIT), pp. 373–378. IEEE (2015)
12. Khalid, M., Wang, K., Aslam, N., Cao, Y., Ahmad, N., Khan, M.K.: From smart parking towards autonomous valet parking: a survey, challenges and future works. J. Netw. Comput. Appl. 175, 102935 (2021)
13. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
14. Lin, T.Y., et al.: Microsoft COCO: common objects in context (2015)
15. NVIDIA: Jetson developer kits. https://developer.nvidia.com/embedded/jetsondeveloper-kits. Visited on 10 Jun 2023
16. Rodić, L.D., Perković, T., Škiljo, M., Šolić, P.: Privacy leakage of LoRaWAN smart parking occupancy sensors. Futur. Gener. Comput. Syst. 138, 142–159 (2023)
17. Soua, R., Minet, P.: A survey on energy efficient techniques in wireless sensor networks. In: 2011 4th Joint IFIP Wireless and Mobile Networking Conference (WMNC 2011), pp. 1–9. IEEE (2011)
18. Surpris, G., Liu, D., Vincenzi, D.: How much can a smart parking system save you? Ergon. Design 22(4), 15–20 (2014)
19. Suryady, Z., Sinniah, G.R., Haseeb, S., Siddique, M.T., Ezani, M.F.M.: Rapid development of smart parking system with cloud-based platforms. In: The 5th International Conference on Information and Communication Technology for the Muslim World (ICT4M), pp. 1–6. IEEE (2014)
20. Tang, V.W., Zheng, Y., Cao, J.: An intelligent car park management system based on wireless sensor networks. In: 2006 First International Symposium on Pervasive Computing and Applications, pp. 65–70. IEEE (2006)
21. Vinay, A., et al.: Face recognition using VLAD and its variants. In: Proceedings of the Sixth International Conference on Computer and Communication Technology 2015, pp. 233–238 (2015)
22. Wei, L., Hong-ying, D.: Real-time road congestion detection based on image texture analysis. Procedia Eng. 137, 196–201 (2016)
23. Yass, A.A., Yasin, N.M., Zaidan, B.B., Zeiden, A.: New design for intelligent parking system using the principles of management information system and image detection system. In: Proceedings of the 2009 International Conference on Computer Engineering and Applications, Manila, Philippines, vol. 68, pp. 360–364. CiteSeer (2011)
24. Yee, H.C., Rahayu, Y.: Monitoring parking space availability via Zigbee technology. Int. J. Future Comput. Commun. 3(6), 377 (2014)

AlPicoSoC: A Low-Power RISC-V Based System on Chip for Edge Devices with a Deep Learning Accelerator

Thai Ngo, Tran Ngoc Thinh, and Huynh Phuc Nghi

Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam
Vietnam National University - Ho Chi Minh City (VNU-HCM), Thu Duc, Ho Chi Minh City, Vietnam
{thai.ngo2992001,tnthinh,nghihp}@hcmut.edu.vn
Abstract. In recent years, the integration of Artificial Intelligence (AI) into Internet of Things (IoT) devices has gained significant importance due to its necessity in serving human-centric applications. However, these devices impose stringent hardware requirements in terms of energy, area, and efficient computing capacity. This paper presents an IoT System-on-Chip (SoC) based on the RISC-V architecture, named AlPicoSoC, equipped with a Deep Neural Network (DNN) accelerator known as Alpha Accelerator. Alpha Accelerator is specifically designed to accelerate inference tasks on Deep Learning (DL) models, providing layer-level computation in each working cycle while ensuring minimal resource utilization and power consumption. AlPicoSoC is implemented and evaluated on an Ultra96-V2 FPGA board. Experimental results obtained using the MNIST dataset demonstrate the system's high accuracy, achieving up to 97.69%. Significantly, the AlPicoSoC system outperforms the original PicoSoC system by a remarkable factor of over 1679.47 times, while its resource utilization and energy consumption increase only marginally, by factors of just over 8.26 and 1.21, respectively.

Keywords: FPGA · IoT · edge device · hardware accelerator · CNN

1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 403–413, 2023. https://doi.org/10.1007/978-3-031-46573-4_37

Nowadays, the integration of IoT and AI has reached an advanced level to cater to human requirements. Notable applications include recording, collecting, classifying, and processing data related to images, sounds, and other environmental factors. AI plays a pivotal role in extracting important information from the data obtained at IoT nodes.

Deep Learning (DL) models used in AI applications often have complex architectures and include hundreds of millions of parameters, requiring enormous computing power. One potential approach to address this challenge involves leveraging cloud servers, where AI models are executed on the server side [1]. This approach faces a significant drawback, as it involves substantial data transfer between edge devices and cloud servers, leading to extensive time and energy consumption. As a result, it is unsuitable for real-time low-power applications. Alternatively, another solution involves employing DL models directly on edge devices [1]. This solution effectively addresses the aforementioned problem. However, it introduces a new challenge in the form of limited area and power constraints at the edge devices. Therefore, hardware implementations must ensure efficient DL model computations while minimizing power consumption.

To address the aforementioned challenges, this paper presents a novel system designed to efficiently perform computations on deep models while satisfying area and energy requirements for edge devices. The key contributions of this paper are outlined below:

– Introduction of an accelerator specifically designed to speed up the computational efficiency during the inference process of DL models.
– Development of a low-cost and low-power system-on-chip (SoC) utilizing a RISC-V core integrated with the proposed accelerator, tailored for edge devices.
– Conducting experiments and evaluating the system's performance on FPGA hardware, employing a Convolutional Neural Network (CNN) model trained on the MNIST dataset.

The remainder of this paper is organized as follows. Section 2 provides information on related works in the field. Section 3 describes the system's architecture and the specifics of the proposed accelerator. In Sect. 4, we present the experimental setup and the corresponding results. Finally, Sect. 5 concludes the paper.
2 Related Work

Several studies have been conducted on hardware design using ASIC and FPGA technologies for edge devices to support efficient computation on deep learning models. For ASIC-related research, Lee et al. [2] propose an architecture that combines parallelism with Variable Weighted Bit Accuracy computation. Ando et al. [4] propose using in-memory computing techniques and binary neural networks. However, this system only supports acceleration for Artificial Neural Networks (ANN). Shin et al. [5] propose a heterogeneous architecture to support acceleration for CNN and Multi-Layer Perceptron (MLP) – Recurrent Neural Network (RNN). However, when running the model, part of the system is left unused, resulting in a waste of resources. Eyeriss [6] is an architecture that greatly supports the computation of CNNs by combining parallelization and reuse of model coefficients. For FPGA-related research, Zhang et al. [9] propose a roofline-model-based method to accelerate CNN models. Nguyen et al. [3] propose using Stochastic Computing to improve the power consumption and area of the whole system.
AlPicoSoC

However, the accuracy of the model when running on this system is still not sufficient to meet the requirements of the DL model. In [10], Qiu et al. propose a solution for the limited bandwidth problem during the computation process of the DL model. Overall, the biggest problem with the above methods is that resource and energy costs remain relatively high, making them unsuitable for some special IoT applications. To solve this problem, we propose an architecture based on RISC-V with an integrated accelerator. We chose RISC-V because its modularity is suitable for designing energy- and resource-efficient systems. The accelerator is responsible for performing all calculations of the DL model to ensure both computational efficiency and resource-energy saving. In addition, computational acceleration methods such as pipelining, as well as appropriate dataflows for each layer type, are also applied to the accelerator.
3 Proposed Architecture

The overall architecture of the proposed system is shown in Fig. 1. PicoSoC is a simple PicoRV32 design, which can be used as a turn-key solution for simple control tasks in ASIC and FPGA designs [7]. Our system, named AlPicoSoC, is based on PicoSoC with several modifications to the instruction memory and an accelerator built into it. The system includes a RISC-V PicoRV32 core, a UART peripheral, an instruction memory IMEM (the original SPI Flash Memory replaced with BRAM-based memory), a data memory DMEM, a DNN accelerator called Alpha Accelerator, and a small controller unit that accompanies the accelerator.
Fig. 1. Overall architecture of AlPicoSoC: RISC-V-based SoC with AI accelerator
3.1 The Architectural Design of Alpha Accelerator

Alpha Accelerator is designed to speed up the per-layer computation task during the inference process of DL models. It has full support for reduced models
based on the 8-bit Integer Quantization technique. In addition, Alpha Accelerator can read data, calculate, and write back results using a 3-stage pipeline: RDATA, COMPS, and WBACK. The details of these stages are presented in the next subsection.
Fig. 2. Overall architecture of Alpha Accelerator
Figure 2 illustrates the architecture of Alpha Accelerator. The functional units of Alpha Accelerator can be divided into two groups: computation units and data read/write units. The first group contains the Processing Matrix, the Accumulation Matrix, and the Element-Wise Unit. The second group includes the Input Buffer, the Bias and Partial-Sum Buffer, the Weight Buffer, and the Output Buffer.

The Processing Matrix is one of the most important blocks of the accelerator. Its task is to perform MAC calculations to generate partial sums and full sums for the Fully-Connected and Convolution Layers. In the case of the Convolution Layer, the Processing Matrix exploits the 3D structure of the convolution between the Input Feature Map and the Kernel: values can be calculated simultaneously along all three data dimensions, reducing both the number of partial sums needed to obtain the final result and the number of times these partial sums are stored and reloaded. For the Fully-Connected Layer, the calculation only extends to the second data dimension, so exploiting this 3D structure does not help much; instead, it allows more data to be computed along the second dimension. The Processing Matrix contains N3 Processing Arrays, each performing the 2D-level calculations within the 3D-level calculations of the Processing Matrix. Each Processing Array contains N2 Processing Units, each performing the 1D-level calculations within the 2D-level calculations of the Processing Array. Each Processing Unit includes N1 pairs of Look-Up Table (LUT) and Multiply-Accumulate (MAC) units, which use the Bit-Serial computation technique [2]. Each computation pair can process N0 input values and N0 weight values per working cycle. In summary, in each working cycle, the Processing Matrix can compute N0 × N1 × N2 × N3 MAC operations simultaneously.
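A quick sanity check of the hierarchy above: with the configuration used later in the evaluation (N0 = N1 = N2 = N3 = 3), the matrix performs 81 MAC operations per working cycle:

```python
def macs_per_cycle(n0, n1, n2, n3):
    """MACs per working cycle: N3 Processing Arrays x N2 Processing Units
    x N1 LUT/MAC pairs x N0 values handled by each pair."""
    return n0 * n1 * n2 * n3

total = macs_per_cycle(3, 3, 3, 3)  # configuration used in Sect. 4 -> 81
```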
After performing the MAC calculations, the Processing Matrix returns partial sums of the output. These partial sums can be linked together as parts of several complete results. The task of the Accumulation Matrix is therefore to add the results from the Processing Matrix together, producing larger partial sums that each belong to only one full result of the layer.

The Element-Wise Unit performs the monomial operations of a layer. These include the quantization and activation function tasks in the Fully-Connected and Convolution Layers, and the compare or divide operations in the Pooling Layer.

All Buffer blocks in Alpha Accelerator are used to temporarily store data read from or written to external memory. These buffers are usually relatively small, sized to hold the data needed for one computation cycle.

3.2 Pipelining in Alpha Accelerator
As mentioned above, Alpha Accelerator implements a 3-stage pipeline. The goal of the pipeline mechanism is to further improve the throughput of the accelerator when performing layer calculations. Figure 3 depicts the Alpha Accelerator resource allocation for each pipeline stage.
Fig. 3. The pipeline in Alpha Accelerator
The first stage is RDATA. The task of this stage is to read the data needed for the computation from the external memory, temporarily store it in the buffer, and actually store it in the internal registers. The reason the data needs to be
cached temporarily is to avoid reloading a value into the registers if it is already in the buffer.

The second stage is COMPS. This stage performs the parallel MAC calculations using the Bit-Serial technique. The resulting data is temporarily stored in the Accumulation Matrix registers.

The third stage is WBACK. At this stage, the values obtained in the previous stage either pass through the Element-Wise Unit for unary calculations (compare, activation function, etc.) and are then cached temporarily, or are stored directly into the cache. These values are then written back to external memory. Additionally, with full support for the quantized model, the final rounding of data for storage is performed at this stage, just before the activation function is calculated.

Note that the MAC calculations of the COMPS stage are not needed in the Pooling Layer. Instead, the Pooling Layer only performs unary calculations such as comparison (Max Pooling) or averaging (Average Pooling). Therefore, for the Pooling Layer in particular, the calculation process is pipelined over only two stages, RDATA and WBACK.

3.3 Workload Mapping
Depending on the layer type, the calculation process follows different computation models. In the Convolution Layer, computational tasks play an extremely important role, overwhelming memory-related tasks. In this layer, Alpha Accelerator shows its capabilities because, in essence, it is designed to speed up computational tasks rather than memory-related tasks. In terms of the computational model utilized, the Convolution Layer employs Kernel-Based Computation, which leverages kernel reuse and exploits the locality of the Input Feature Map during the convolution operation.

In contrast to the Convolution Layer, the Fully-Connected Layer places greater emphasis on memory-related tasks rather than computational tasks. During this stage, Alpha Accelerator requires fewer computations, allowing it to reduce computation power by activating only a subset of processing units while putting the remaining units into sleep mode. In terms of the computational model utilized, the Fully-Connected Layer employs Input-Based Computation, which capitalizes on the reuse of the Input Activation to perform calculations efficiently.

Similar to the Fully-Connected Layer, memory-related tasks dominate computational tasks in the Pooling Layer. Furthermore, none of the MAC processing units need to be active during this layer, allowing Alpha Accelerator to put all units into sleep mode. In terms of the computational model utilized, the Pooling Layer shares similarities with the Convolution Layer, as it can exploit the locality of the Input Feature Map when the stride is smaller than the patch size.
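The per-layer mapping above can be summarized in a small sketch. The dataflow names (Kernel-Based, Input-Based) come from the paper; the dict/function encoding and the pooling dataflow label are our own illustrative reading:

```python
# Compact restatement of the workload mapping described in Sect. 3.3
# (encoding is illustrative, not the accelerator's actual interface).
WORKLOAD_MAPPING = {
    "conv": {"dataflow": "kernel-based",   "mac_units": "all active"},
    "fc":   {"dataflow": "input-based",    "mac_units": "subset active"},
    "pool": {"dataflow": "locality-based", "mac_units": "all asleep"},
}

def pipeline_stages(layer_type):
    # Pooling needs no MACs, so it skips the COMPS stage (Sect. 3.2).
    if layer_type == "pool":
        return ("RDATA", "WBACK")
    return ("RDATA", "COMPS", "WBACK")
```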
4 Evaluation

4.1 Setup
The entire AlPicoSoC system is implemented utilizing FPGA technology to facilitate performance evaluation. Alpha Accelerator is implemented with the configuration N0 = N1 = N2 = N3 = 3, specifically optimized for the 3 × 3 kernel size in the Convolution Layer. The experimental setup employs the Ultra96-V2 Board, which is an ARM-based FPGA featuring the Xilinx Zynq UltraScale+ MPSoC chip. The model used for evaluation is a simple CNN on the MNIST dataset, described in Table 1. TensorFlow Lite is used to train the model and compare the inference results with the model deployed on the FPGA hardware.

Table 1. Summary information about the evaluation model

Layer | Type            | Input        | Kernel/Weight   | Output       | Param #
0     | Conv2D + ReLU   | 28 × 28 × 1  | 3 × 3 × 1 × 32  | 26 × 26 × 32 | 320
1     | MaxPooling2D    | 26 × 26 × 32 | -               | 13 × 13 × 32 | 0
2     | Conv2D + ReLU   | 13 × 13 × 32 | 3 × 3 × 32 × 32 | 11 × 11 × 32 | 9248
3     | MaxPooling2D    | 11 × 11 × 32 | -               | 5 × 5 × 32   | 0
4     | Flatten         | 5 × 5 × 32   | -               | 800          | 0
6     | Fully-Connected | 800          | 800 × 10        | 10           | 8010
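The shapes and parameter counts in Table 1 follow from the standard rules for valid 3 × 3 convolutions and 2 × 2 max pooling; a small re-derivation (the helper functions are our own) reproduces every entry:

```python
# Re-deriving Table 1's output shapes and parameter counts.
def conv2d(h, w, c_in, c_out, k=3):
    params = k * k * c_in * c_out + c_out     # weights + biases
    return (h - k + 1, w - k + 1, c_out), params

def maxpool2d(h, w, c, p=2):
    return (h // p, w // p, c), 0             # pooling has no parameters

shape, p0 = conv2d(28, 28, 1, 32)             # (26, 26, 32), 320 params
shape, _  = maxpool2d(*shape)                 # (13, 13, 32)
shape, p2 = conv2d(*shape, 32)                # (11, 11, 32), 9248 params
shape, _  = maxpool2d(*shape)                 # (5, 5, 32)
flat = shape[0] * shape[1] * shape[2]         # 800
p_fc = flat * 10 + 10                         # 8010 params
```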
4.2 Result
The accuracy of all model types is evaluated using the MNIST test set, comprising 10,000 data samples. Table 2 shows the accuracy results of the original 32-bit Floating-Point model trained in software, compared to the 8-bit Integer quantized model implemented in both software and FPGA hardware. Notably, the quantized model exhibits a slight decrease of 0.01% in accuracy compared to the original model. This can be attributed to the relative simplicity of the test model. Furthermore, when running on FPGA hardware, the quantized model shows a reduction in accuracy of 0.98% compared to its software counterpart. This attenuation is caused by differences in some quantization parameters when moved down to the FPGA.

Table 2. Comparison accuracy between MNIST model types

Model Type                     | Accuracy
32-bit Floating-Point Software | 98.68%
8-bit Integer Software         | 98.67%
8-bit Integer Hardware         | 97.69%

Table 3 presents detailed information regarding resource utilization, while Table 4 provides insights into the power consumption of the PicoSoC and AlPicoSoC systems when deployed on the Ultra96-V2 FPGA. Regarding resource utilization, both systems exhibit modest resource consumption, with PicoSoC consistently utilizing less than 6% of resources (except for
BRAM storage), and AlPicoSoC utilizing less than 50%. Notably, the total number of FPGA Configurable Logic Blocks (CLBs) in AlPicoSoC is approximately 8.26 times greater than that of PicoSoC. In terms of power consumption, AlPicoSoC consumes only 0.328 W, which is 1.21 times higher than PicoSoC's power consumption of 0.272 W.

Table 3. FPGA resource utilization

Resource | Available | Used (PicoSoC) | Used (AlPicoSoC) | Utilization (PicoSoC) | Utilization (AlPicoSoC)
LUT      | 70,560    | 2,485          | 21,555           | 3.52%                 | 30.55%
FF       | 141,120   | 1,115          | 11,940           | 0.79%                 | 8.46%
BRAM     | 216       | 90             | 90               | 41.67%                | 41.67%
DSP      | 360       | 4              | 0                | 1.11%                 | 0%
CLB      | 8,820     | 503            | 4,153            | 5.70%                 | 47.09%
Table 4. FPGA power consumption report PicoSoC AlPicoSoC Power Dynamic 0.050 W 0.106 W Device Static 0.222 W 0.222 W Total 0.272 W 0.328 W Comparison
1x
1.21x
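The hardware-versus-software accuracy gap above comes down to quantization-parameter rounding. As a rough illustration of the standard 8-bit affine scheme (q = round(x/s) + z; the scale and zero-point below are hypothetical values, not the ones actually used in AlPicoSoC):

```python
# 8-bit affine quantization sketch: q = round(x / s) + z, clamped to int8.
# Scale s and zero-point z here are illustrative placeholders.
def quantize(x, s, z):
    q = round(x / s) + z
    return max(-128, min(127, q))            # saturate to the int8 range

def dequantize(q, s, z):
    return s * (q - z)

s, z = 1.2 / 127, 0                          # symmetric scale covering [-1.2, 1.2]
for x in (-0.51, 0.0, 0.25, 1.2):
    err = abs(dequantize(quantize(x, s, z), s, z) - x)
    assert err <= s / 2 + 1e-12              # rounding error is at most half a step
```

Each weight is reproduced only to within half a quantization step, so when a scale or zero-point computed on the FPGA differs slightly from the software value, the per-weight errors accumulate into the small accuracy drop reported in Table 2.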
To expand the scope of testing, we also ran the model on the ARM Cortex-A53 MPCore of the Ultra96-V2 Processing System. The outcomes of comparing the execution time of the MNIST model across 10 inference runs are presented in Table 5. Remarkably, the AlPicoSoC system demonstrates a substantial acceleration of the MNIST model, achieving a speed improvement of up to 1679.47 times over the PicoSoC system and 61.93 times over the Ultra96-V2 Processor System. The results clearly indicate that incorporating the Alpha Accelerator brings remarkable acceleration to the AlPicoSoC system while simultaneously ensuring efficient utilization of resources and energy.

Table 5. Latency comparison for 10 inference tests between the Ultra96-V2 PS, PicoSoC, and AlPicoSoC

                               | Ultra96-V2 PS | PicoSoC   | AlPicoSoC
Latency                        | 2.5454 s      | 69.0263 s | 0.0411 s
Comparison (vs. PicoSoC)       | 27.11x        | 1x        | 1679.47x
Comparison (vs. Ultra96-V2 PS) | 1x            | 0.04x     | 61.93x
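The headline comparison factors quoted above are plain ratios of the measured values in Tables 3, 4, and 5; a quick cross-check:

```python
# Cross-check of the comparison factors in Sect. 4.2 against the raw measurements.
speedup = 69.0263 / 0.0411        # PicoSoC latency / AlPicoSoC latency (Table 5)
clb_ratio = 4153 / 503            # AlPicoSoC CLBs / PicoSoC CLBs (Table 3)
power_ratio = 0.328 / 0.272       # AlPicoSoC power / PicoSoC power (Table 4)

assert round(speedup, 2) == 1679.47
assert round(clb_ratio, 2) == 8.26
assert round(power_ratio, 2) == 1.21
```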
The FPGA evaluation results, along with those of previous studies, are presented in Table 6. Our work exhibits the smallest resource utilization and energy consumption, surpassing the other works by more than 2 times in resources and 5 times in energy. However, in terms of raw performance and computational efficiency, our work achieves relatively lower values. This outcome is a direct consequence of our system's initial design focus on optimizing resource and energy parameters: the accelerator is included to enhance computational speed while adhering to the principle of minimal resource utilization and energy consumption. Consequently, our system is particularly well suited for IoT applications that demand low power consumption while still meeting compute-time requirements.

Table 6. Comparison with other FPGA implementations

                 | Chakradhar et al. [8] | Zhang et al. [9] | Gong et al. [11] | Lo et al. [12] | This work
DL Model         | 3 CONV Layers         | AlexNet          | LeNet-5          | LeNet-5        | 2 CONV Layers
CNN Size         | 0.52 GOP              | 1.33 GOP         | 0.005 GOP        | 0.005 GOP      | 0.003 GOP
Precision        | 48-bit fixed          | 32-bit float     | 16-bit fixed     | 4-bit fixed    | 8-bit fixed
Platform         | Virtex5 SX240T        | Virtex7 VX485T   | Zynq Z-7020      | Virtex VCU128  | Zynq ZU3EG
Frequency        | 120 MHz               | 100 MHz          | 200 MHz          | 400 MHz        | 100 MHz
Power            | 14.00 W               | 18.61 W          | 2.15 W           | 10.75 W        | 0.33 W
LUT              | –                     | 186,251          | 38,136           | 122,198        | 21,555
FF               | –                     | 205,704          | 42,618           | 170,221        | 11,940
DSP              | –                     | 2,240            | 205              | 2,175          | 0
BRAM             | –                     | 1,024            | 242              | 382            | 90
Performance      | 16 GOPS               | 61.62 GOPS       | 76.48 GOPS       | 445.6 GOPS     | 0.65 GOPS
Power Efficiency | 1.14 GOPS/W           | 3.31 GOPS/W      | 35.57 GOPS/W     | 213 GOPS/W     | 1.97 GOPS/W
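The power-efficiency row of Table 6 is simply throughput divided by power; for instance:

```python
# Power efficiency (GOPS/W) = performance (GOPS) / power (W), per Table 6.
rows = {
    "Chakradhar et al. [8]": (16.0, 14.00, 1.14),
    "Zhang et al. [9]":      (61.62, 18.61, 3.31),
    "Gong et al. [11]":      (76.48, 2.15, 35.57),
    "This work":             (0.65, 0.33, 1.97),
}
for name, (gops, watts, reported) in rows.items():
    assert round(gops / watts, 2) == reported
```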
5 Conclusion
In this paper, we propose AlPicoSoC, a RISC-V-based System-on-Chip designed for IoT nodes, which incorporates a DNN accelerator built on the bit-serial computation technique. The experimental results highlight the capability of AlPicoSoC to enhance the computational performance of DL models while keeping energy and resource utilization minimal. Specifically, AlPicoSoC demonstrates a speed improvement of up to 1679.47x in DL model inference at the cost of 8.26x the resources and 1.21x the power consumption of the PicoSoC system. In future work, we aim to further enhance the efficiency of AlPicoSoC, focusing on optimizing area utilization and extending its computational support beyond CNNs to other network architectures such as ANNs and RNNs.

Acknowledgement. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for supporting this study.
References

1. Capra, M., Bussolino, B., Marchisio, A., Masera, G., Martina, M., Shafique, M.: Hardware and software optimizations for accelerating deep neural networks: survey of current trends, challenges, and the road ahead. IEEE Access 8, 225134–225180 (2020). https://doi.org/10.1109/access.2020.3039858
2. Lee, J., Kim, C., Kang, S., Shin, D., Kim, S., Yoo, H.-J.: UNPU: an energy-efficient deep neural network accelerator with fully variable weight bit precision. IEEE J. Solid-State Circuits 54, 173–185 (2019). https://doi.org/10.1109/jssc.2018.2865489
3. Nguyen, D.-A., Ho, H.-H., Bui, D.-H., Tran, X.-T.: An efficient hardware implementation of artificial neural network based on stochastic computing. In: 2018 5th NAFOSTED Conference on Information and Computer Science (NICS) (2018). https://doi.org/10.1109/nics.2018.8606843
4. Ando, K., et al.: BRein memory: a single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W. IEEE J. Solid-State Circuits 53, 983–994 (2018). https://doi.org/10.1109/jssc.2017.2778702
5. Shin, D., Lee, J., Lee, J., Lee, J., Yoo, H.-J.: DNPU: an energy-efficient deep-learning processor with heterogeneous multi-core architecture. IEEE Micro 38, 85–93 (2018). https://doi.org/10.1109/mm.2018.053631145
6. Chen, Y.-H., Krishna, T., Emer, J.S., et al.: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52, 127–138 (2017)
7. YosysHQ: PicoSoC - a simple example SoC using PicoRV32. In: GitHub. https://github.com/YosysHQ/picorv32/tree/master/picosoc. Accessed 1 Jun 2023
8. Chakradhar, S., Sankaradas, M., Jakkula, V., Cadambi, S.: A dynamically configurable coprocessor for convolutional neural networks. In: Proceedings of the 37th Annual International Symposium on Computer Architecture (2010). https://doi.org/10.1145/1815961.1815993
9. Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., Cong, J.: Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2015). https://doi.org/10.1145/2684746.2689060
10. Qiu, J., et al.: Going deeper with embedded FPGA platform for convolutional neural network. In: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2016). https://doi.org/10.1145/2847263.2847265
11. Gong, L., Wang, C., Li, X., Chen, H., Zhou, X.: MALOC: a fully pipelined FPGA accelerator for convolutional neural networks with all layers mapped on chip. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 37, 2601–2612 (2018). https://doi.org/10.1109/tcad.2018.2857078
12. Lo, C.Y., Sham, C.-W.: Energy efficient fixed-point inference system of convolutional neural network. In: 2020 IEEE 63rd International Midwest Symposium on Circuits and Systems (MWSCAS) (2020). https://doi.org/10.1109/mwscas48704.2020.9184436
13. Shawahna, A., Sait, S.M., El-Maleh, A.: FPGA-based accelerators of deep learning networks for learning and classification: a review. IEEE Access 7, 7823–7859 (2019). https://doi.org/10.1109/access.2018.2890150
A Transparent Scalable E-Voting Protocol Based on Open Vote Network Protocol and Zk-STARKs

Ngan Nguyen 1,2 and Khuong Nguyen-An 1,2 (B)

1 Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam
{ngan.nguyen1911667,nakhuong}@hcmut.edu.vn
2 Vietnam National University Ho Chi Minh City, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
Abstract. Electronic voting (e-voting) has become increasingly popular as an approach to achieve novel security requirements compared to paper voting. Among the technologies used to implement e-voting, blockchain has gained significant attention due to its distinctive properties. This paper will assess the feasibility of applying cutting-edge technologies such as smart contracts, zero-knowledge rollup, etc., to tackle some widely concerned limitations of blockchain-based electoral systems, namely scalability and transparency.

Keywords: electronic voting · Open Vote Network · blockchain · zero-knowledge rollup · zk-STARK

1 Introduction
In recent years, the rise of electronic voting has brought many beneficial properties, such as ballot secrecy, transparency, verifiability, etc., to elections. While deemed impossible to achieve through paper voting, these properties have been present in many electronic voting systems worldwide. For example, the Helios web-based voting system [1] has claimed to achieve transparency and individual verifiability thanks to using a bulletin board - a public broadcast channel to distribute information among participants. However, most implementations of bulletin boards are centralized, i.e., the majority of decision-making power lies in the hands of a few authority bodies. This gives rise to the demand for a secure bulletin board, which, while maintaining its fundamental functionalities, still retains transparency. One of the first attempts to realize such a channel using blockchain is [9], in which McCorry et al. utilized blockchain as the bulletin board for the Open Vote Network protocol - a voting protocol developed by Hao et al. [7].

However, most blockchains must trade off their scalability for decentralization and security. This phenomenon is captured by the "scalability trilemma" - an infamous problem blockchain designers face. In efforts to overcome the scalability trilemma, many scaling solutions (including Layer-1 and Layer-2 solutions) have been proposed. Though achieving impressive results, Layer-1 scaling solutions require alterations to the blockchain's inner workings, which is impossible for application-level developers to deploy. Therefore, many blockchain-based applications choose Layer-2 solutions to improve scalability while retaining the benefits of the Layer-1 blockchain. Layer-2 scaling, in its essence, outsources computation or communication complexity to some third party outside of the blockchain. In [10], Seifelnasr et al. adapted the core idea of Layer-2 scaling to their blockchain-based Open Vote Network implementation by off-loading the vote-tallying computation onto an administrator node. However, their system cannot preserve dispute-freeness - a security property claimed by [7]. Inspired by [10], ElSheikh et al. [5] took a similar route, which includes using smart contracts and a Layer-2 scaling solution, i.e., zero-knowledge rollup. While their system retains most of the fundamental security properties of the Open Vote Network, it loses part of its transparency due to the trusted setup required by zk-SNARK. Recently, zk-STARK has been considered a competitor to zk-SNARK, as it requires no trusted setup to generate and verify proofs while offering practical scalability. In this study, we will explore the viability of employing zk-STARKs to achieve transparency and scalability for a blockchain-based e-voting system based on the Open Vote Network protocol.

The rest of this paper is organized as follows. Section 2 presents some background knowledge about relevant cryptographic primitives, blockchain technology, and electronic voting (e-voting). Then, in Sect. 3, we give a brief survey on e-voting. After that, we devote Sect. 4 to describing the detailed design of our proposed protocol and our system, and Sect. 5 to evaluating our system against different batch sizes. Finally, we conclude this paper and discuss some future works in Sect. 6.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 414–427, 2023. https://doi.org/10.1007/978-3-031-46573-4_38
2 Preliminaries

2.1 Requirements for Voting Protocols
There are various requirements a voting protocol needs to meet to be useful in general. Among them, those most relevant to this study are briefly described below.

– Universal verifiability. It can be proven to any party, including a casual observer, that all valid votes are accounted for in the final tally.
– Dispute-freeness. Any interested party can confirm that participants comply with the protocol at any phase. This characteristic extends universal verifiability from the tallying phase to any phase. A dispute-free voting protocol does not require interactive zero-knowledge proofs or dispute-resolution mechanisms. Instead of relying on active misbehavior detection, dispute-free protocols enforce integrity by employing preventative mechanisms.
– Self-tallying. Any interested (casual) third party can carry out the tallying phase. This characteristic can be seen as another way to increase universal verifiability, albeit differently.
– Perfect ballot secrecy. Knowledge of a partial tally (besides the final tally computed at the end of the election) can only be obtained via collusion of all remaining voters.

2.2 Open Vote Network
Open Vote Network is a homomorphic e-voting protocol proposed by Hao et al. in [7]. The protocol builds on the work of Kiayias and Yung [8] and is claimed to achieve overwhelming efficiency while preserving all the essential security properties of [8]. Before the protocol starts, a public cyclic group G of prime order q is chosen so that the Decision Diffie-Hellman (DDH) problem in G is intractable. Let n be the total number of eligible voters and g be a generator of G that is accessible to all n voters. Each voter V_i randomly picks a secret value x_i ∈ Z_q. We assume that a vote v_i can only take values in {−1, 1}. The protocol has two main phases:

– Phase 1: Every voter V_i broadcasts vk_i = g^{x_i} and a Schnorr proof of knowledge for x_i. The proof is made non-interactive using the Fiat-Shamir heuristic. At the end of this phase, each voter computes their blinding key

Y_i := g^{y_i} = \prod_{j=1}^{i-1} g^{x_j} \Big/ \prod_{j=i+1}^{n} g^{x_j}.
– Phase 2: Each voter V_i broadcasts an ElGamal-encrypted vote c_i = Y_i^{x_i} g^{v_i} and a zero-knowledge proof of partial knowledge to prove that v_i ∈ {−1, 1} without revealing the actual value of v_i. The proof is generated using the Cramer, Damgård and Schoenmakers (CDS) technique [4, Sec. 2.6] and made non-interactive using the Fiat-Shamir heuristic.

After the election has concluded, any interested party can count the ballots by computing

g^t = \prod_i Y_i^{x_i} g^{v_i},

where t equals the difference between the numbers of "yes" (v_i = 1) and "no" (v_i = −1) votes. Since |t| is small (−n ≤ t ≤ n), it can be recovered through brute-forcing or the well-known baby-step giant-step algorithm.

2.3 Proof of Validity of Ballot (CDS Proof)
Based on the technique described in [3], Cramer et al. have constructed an efficient proof system for proving an encrypted vote’s validity, which is specified in Sect. 2.6 of [4]. Figure 1 illustrates the interactive proof system for proving the validity of an encrypted ballot broadcasted by voter Vi during phase 2 of our protocol.
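As a concrete (toy) illustration of the two phases described in Sect. 2.2, with the validity proofs omitted, the following sketch runs a full vote-and-tally round in a tiny group (illustrative only; far too small to be secure):

```python
# Toy Open Vote Network round: p = 23 is a safe prime, g = 4 generates the
# order-11 subgroup. Real deployments need a cryptographically large DDH group.
import random

p, q, g = 23, 11, 4
votes = [1, -1, 1]                               # plain votes v_i in {-1, 1}
n = len(votes)
xs = [random.randrange(1, q) for _ in range(n)]  # secret keys x_i

# Phase 1: publish voting keys vk_i = g^{x_i}.
vks = [pow(g, x, p) for x in xs]

def blinding_key(i):
    """Y_i = prod_{j<i} g^{x_j} / prod_{j>i} g^{x_j} (mod p)."""
    num = den = 1
    for j in range(i):
        num = num * vks[j] % p
    for j in range(i + 1, n):
        den = den * vks[j] % p
    return num * pow(den, -1, p) % p

# Phase 2: publish encrypted votes c_i = Y_i^{x_i} * g^{v_i}.
cs = [pow(blinding_key(i), xs[i], p) * pow(g, votes[i], p) % p for i in range(n)]

# Tallying: prod_i c_i = g^t with t = #yes - #no; brute-force the small exponent.
total = 1
for c in cs:
    total = total * c % p
t = next(t for t in range(-n, n + 1) if pow(g, t, p) == total)
assert t == sum(votes) == 1                      # 2 "yes", 1 "no"
```

The cancellation Σ_i x_i y_i = 0 is what makes Π_i Y_i^{x_i} vanish, leaving only g^t; for realistic n, t would be recovered with baby-step giant-step instead of the linear scan above.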
Fig. 1. Encryption and proof of validity of ballot
In Fig. 1, vk_i and c_i correspond to the voting key and encrypted vote of V_i, respectively. Since the voting key has already been published in Phase 1, only the encrypted vote is broadcast during Phase 2. We apply the Fiat-Shamir transformation to convert this proof system into a non-interactive one. As suggested in Sect. 3 of [4], the challenge h_i is computed by voter V_i as H(ID_i, vk_i, c_i, a_i1, b_i1, a_i2, b_i2), where ID_i is a unique public string used to identify V_i and H is a cryptographic hash function. In our protocol, ID_i equals i; thus, h_i = H(i, vk_i, c_i, a_i1, b_i1, a_i2, b_i2). In total, a voter V_i needs to submit the encrypted vote c_i along with the CDS proof (a_i1, b_i1, a_i2, b_i2, r_i1, r_i2, d_i1, d_i2) during Phase 2 to securely cast their vote.
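One possible instantiation of this challenge computation is sketched below (SHA-256 and the fixed-width big-endian encoding are our assumptions here; the actual implementation may hash and serialize differently):

```python
# Fiat-Shamir challenge h_i = H(i, vk_i, c_i, a_i1, b_i1, a_i2, b_i2) mod q.
# Hash choice and serialization are illustrative assumptions.
import hashlib

def challenge(i, vk, c, a1, b1, a2, b2, q):
    data = b"".join(v.to_bytes(32, "big") for v in (i, vk, c, a1, b1, a2, b2))
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

h = challenge(1, 4, 9, 3, 6, 2, 8, 11)
assert 0 <= h < 11
```

Because anyone can recompute h_i from the public transcript, the verifier needs no interaction with the prover.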
2.4 Zero-Knowledge Rollup
Zero-knowledge rollup is a solution that utilizes zero-knowledge proofs to improve blockchain's scalability while preserving decentralization and security. This solution is deployed on Layer 2 - i.e., a separate network that relies on the blockchain for security. The principal concept behind zero-knowledge rollup is that transactions can be processed off-chain in batches, while computational integrity is enforced by zero-knowledge proofs (referred to as validity proofs in Fig. 2¹). Thanks to the ability of zero-knowledge proofs to "compress computations", executing proof verification over transactions on the blockchain is more cost-effective.

¹ Source: https://albaronventures.com/ethereum-layer-2-ecosystem.
Fig. 2. High-level view of the zero-knowledge rollup.
Zk-SNARK and zk-STARK are among the most popular categories of practical zero-knowledge proof systems used in rollups. A brief comparison between these two types of constructions is provided in Table 1.

Table 1. Comparison between zk-STARK and zk-SNARK

                                      | zk-STARK                           | zk-SNARK
Prover complexity                     | O(n · poly-log(n))                 | O(n · log(n))
Verifier complexity                   | O(poly-log(n))                     | ∼O(1)
Communication complexity (proof size) | O(poly-log(n))                     | ∼O(1)
Trusted setup                         | No                                 | Yes
Post-quantum security                 | Yes                                | No
Cryptographic assumption(s)           | Collision-resistant hash functions | Discrete logarithm problem, bilinear pairing
Table 1 shows that both zk-SNARK and zk-STARK are fully scalable. However, zk-SNARK is not transparent because a trusted setup phase is required to generate random public data for the protocol. This process, if not carried out properly, can create a security vulnerability in the system, e.g., CVE-2019-7167 of Zcash². On the other hand, one of the significant advantages of zk-STARK is its ability to function without any trusted setup.
3 Related Works
Homomorphic e-voting systems follow a privacy-focused approach - i.e., the values of votes are not disclosed throughout the election. However, talliers are still able to count the ballots at the end. This characteristic is captured by a security requirement called ballot secrecy. Many homomorphic e-voting schemes in the literature have laid the foundation for developing practical homomorphic e-voting systems. In this study, we have inherited the design of the Open Vote Network [7] as the theoretical framework for our protocol. Before the Open Vote Network (2010), other homomorphic e-voting protocols existed - notably Kiayias-Yung's protocol (2002) [8] and Groth's protocol (2004) [6]. Kiayias-Yung's protocol satisfies privacy, fairness, universal verifiability, corrective fault tolerance, dispute-freeness, self-tallying, and perfect ballot secrecy [8, Theorem 4, Sec. 4]. Although capable of achieving impressive properties, Kiayias-Yung's protocol has several limitations, especially regarding efficiency. Groth addresses these limitations in [6]. Groth's contribution to Kiayias-Yung's work realizes a new protocol that preserves all essential properties of the prior protocol with improved efficiency. However, unlike Kiayias-Yung's, Groth's protocol requires each voter to update the election's state with their vote based on the latest state, forcing voters to operate sequentially. Overall, Groth's protocol is self-tallying, dispute-free, and has perfect ballot secrecy [6, Theorem 1, Sec. 2.3]. To resolve the efficiency issue of [6,8], Hao et al. proposed the Open Vote Network - a voting protocol with a significantly lighter computational load than its predecessors. A theoretical comparison between the three voting protocols in terms of computational complexity is presented in Table 2. Aside from its impressive efficiency boost, the Open Vote Network has been proven to satisfy the same three distinctive properties claimed by the other two protocols, i.e., self-tallying, dispute-freeness, and perfect ballot secrecy [7, Sec. 3].

² Source: https://nvd.nist.gov/vuln/detail/cve-2019-7167.

Table 2. Comparison between [6–8] regarding the number of computations performed by each voter.

Protocol          | Number of exponentiations | Number of knowledge proofs for exponent | Number of knowledge proofs for equality | Number of knowledge proofs for 1-of-k ᵃ
Kiayias-Yung      | 2n + 2                    | n + 1                                   | n                                       | 1
Groth             | 4                         | 2                                       | 1                                       | 1
Open Vote Network | 2                         | 1                                       | 0                                       | 1

ᵃ Proof that a value is in a set, e.g., v_i ∈ {−1, 1} in Phase 2 of the Open Vote Network.
Web-based e-voting systems are e-voting systems built upon the Internet. One of the most famous examples of this category is Helios [1] - a homomorphic web-based e-voting system. For participants to publish data during an election, many homomorphic voting schemes require the existence of bulletin boards. To maintain ballot secrecy, the bulletin board of Helios is managed by a server. The problem with this approach is that electoral fraud might occur if Helios' server is compromised. For instance, the server can impersonate a voter and cast a vote on their behalf [1, Sec. 5.2].
As an alternative to centralized web servers, blockchains have been utilized to realize secure bulletin boards with minimal trust, hence the emergence of blockchain-based e-voting systems. In [9], McCorry et al. conducted feasibility research on blockchain-based e-voting by constructing a protocol based on the Open Vote Network and Ethereum smart contracts. Due to the scalability problem of Ethereum, the resulting system can only support about 40 to 50 voters in an election. Similarly, Seifelnasr et al. also implemented the Open Vote Network using smart contracts, but vote tallying is performed off-chain by an untrusted administrator [10]. Notably, dispute-freeness is sacrificed to enforce the integrity of the tally result. In a different direction, ElSheikh et al. [5] opted for SNARK-based zero-knowledge rollup to enhance scalability while preserving dispute-freeness. Although able to retain three fundamental properties of [7] with improved scalability, the system loses part of its transparency due to the requirement for a trusted setup. This is where zk-STARK enters the picture.
4 Our Protocol
Our protocol consists of five phases, in which Phases 2-4 are when the election actually takes place.
Fig. 3. Five phases of our protocol
4.1 Phase 1: Smart Contract Deployment
In this phase, the aggregator prepares a set of parameters (T1, T2, T3, T4, T5, g, F0, root_elg), where:

– T1, T2, T3, T4, T5: the block heights that mark the end of each of the five subsequent phases. If transactions related to a phase are not submitted within the time window of said phase, they are reverted.
– root_elg: the root of the Merkle tree built from the ordered list of distinct voting keys of eligible voters.
– g: a generator of G, where G is a cyclic group of prime order q in which the Decisional Diffie-Hellman (DDH) problem is intractable.
– F0: the exact amount of deposit required to join the election.
The aggregator deploys the main smart contract and invokes the setUp method with said parameters and a collateral deposit. The election begins if the submitted parameters pass all verifications and the aggregator sends the correct deposit amount.
Algorithm 1. Pseudocode for the setUp method of the smart contract

Inputs: T1, T2, T3, T4, T5, g, F0
  Assert Sender.Address = Self.Aggregator
  Assert Transaction.Value = F0
  Assert Block.Number < T1 < T2 < T3 < T4 < T5
  Assert Self.Guard = 0
  Store T1, T2, T3, T4, T5, root_elg, g, F0
  Self.Deposit[Sender.Address] := 1
  Self.Guard := 1        ▷ Update guard attribute when the phase has run successfully
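The T_k parameters partition the election into strict block-height windows, which the Assert Block.Number guards enforce; conceptually (the heights and helper below are illustrative, not part of the contract):

```python
# Toy model of the Block.Number guards: a phase-k transaction is valid only
# while T_{k-1} < block < T_k. Heights are hypothetical.
T = [0, 100, 200, 300, 400, 500]     # deployment height, then T1..T5

def phase_open(block, k):
    """True iff phase k's window (T[k-1], T[k]) is open at height `block`."""
    return T[k - 1] < block < T[k]

assert phase_open(150, 2)            # registration validation still open
assert not phase_open(250, 2)        # too late: the transaction is reverted
```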
4.2 Phase 2.1: Registration Validation
An eligible voter V registers to join the election by submitting (address, vk, m, s) to the aggregator, where:

– address: the Ethereum address of V.
– vk: the voting key of V. Voter V is eligible if and only if vk is an element of E.
– m: a Merkle proof of membership to prove that vk ∈ E.
– s: a Schnorr signature for (vk, address(V)) to prove knowledge of log_g(vk). This signature must be computed using the secret key corresponding to vk.
The aggregator verifies m and s. If the Merkle proof and Schnorr signature are valid, V is marked as partially registered. After a period of time, the aggregator invokes the registerVoters method of the main smart contract with (VK_par, S_par, A_par, π_register), where:

– VK_par, S_par, A_par: ordered lists of voting keys, Schnorr signatures, and Ethereum addresses of partially registered voters, respectively.
– π_register: a STARK proof that the registerVoter routine is executed correctly by the aggregator for all partially registered voters.
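A minimal sketch of the Merkle membership check behind m is given below (SHA-256 and the duplicate-last-node padding are our assumptions; the tree in the openvote library may differ in both respects):

```python
# Merkle proof of membership for a voting key: recompute the path from H(vk)
# up to root_elg. Hashing and padding details are illustrative.
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def _next_level(level):
    if len(level) % 2:                   # pad odd levels by duplicating the last node
        level = level + [level[-1]]
    return [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def merkle_root(leaves):
    level = [H(x) for x in leaves]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]

def merkle_proof(leaves, index):
    proof, level = [], [H(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        proof.append((level[index ^ 1], index % 2))   # (sibling, 1 if we are the right child)
        level = _next_level(level)
        index //= 2
    return proof

def verify_membership(root, leaf, proof):
    node = H(leaf)
    for sibling, we_are_right in proof:
        node = H(sibling + node) if we_are_right else H(node + sibling)
    return node == root

keys = [b"vk0", b"vk1", b"vk2", b"vk3"]
root = merkle_root(keys)
assert verify_membership(root, b"vk2", merkle_proof(keys, 2))
```

The proof is logarithmic in the number of eligible voters, which is what keeps per-registration verification cheap for the aggregator.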
Algorithm 2. Pseudocode for the registerVoters method of the smart contract

Inputs: VK_par, S_par, A_par, π_register
  Assert Sender.Address = Self.Aggregator
  Assert T1 < Block.Number < T2
  Assert Self.Guard = 1
  Assert verifyRegisterVoter(root_elg, VK_par, S_par, A_par, π_register) = 1
  for i in 1 ... length(VK_par) do
    VotingKey := VK_par[i]
    Address := A_par[i]
    Self.TempKeys[Address] := VotingKey        ▷ Mark voter as partially registered
  end for
  Self.Guard := 2
4.3 Phase 2.2: Registration Confirmation
To confirm the registration made in the previous phase, each voter must send a collateral deposit to the registerConfirm method. Voters who submit the correct deposit amount in time are marked as registered.

Algorithm 3. Pseudocode for the registerConfirm method of the smart contract

  Assert T2 < Block.Number < T3
  Assert Self.Guard = 2
  Assert Self.TempKeys[Sender.Address] ≠ Null
  VotingKey := Self.TempKeys[Sender.Address]
  Assert Self.Addresses[VotingKey] = Null        ▷ Prohibit registering two different Ethereum addresses under one key
  Assert Transaction.Value = F0
  Self.Deposit[Sender.Address] := 1        ▷ Mark voter as registered
  Self.VotingKeys.append(VotingKey)
  Self.Addresses[VotingKey] := Sender.Address
  Self.VoterIndex := Self.VoterIndex + 1
  Emit AssignIndexEvent(Sender.Address, Self.VoterIndex)        ▷ Assign an index to the registered voter
4.4 Phase 3: Casting Votes
Let V_i denote the registered voter who was assigned the index i by the main smart contract in the previous phase. During this phase, a registered voter V_i submits (c_i, vb_i) to the aggregator, where:

– c_i: the encrypted vote of V_i such that c_i = Y_i^{x_i} g^{v_i}, in which Y_i, x_i and v_i (v_i ∈ {−1, 1}) are the blinding key, secret key, and plain vote of V_i, respectively;
– vb_i: the proof of validity of ballot (CDS proof) for c_i.
The aggregator checks whether V_i is a registered voter and (c_i, vb_i) is valid. If so, the aggregator accepts the encrypted vote. After a period of time, the aggregator invokes the castVotes method of the main smart contract with (C_reg, VB'_reg, O, π_cast), where:

– C_reg: the ordered list of encrypted votes cast by registered voters, C_reg = (c_i : vk_i ∈ VK_reg), wherein VK_reg is the ordered list of voting keys of registered voters, which equals the Self.VotingKeys list stored on the main smart contract;
– VB'_reg: the ordered list of truncated CDS proofs submitted by registered voters along with their ballots, VB'_reg = ((a_i1, b_i1, a_i2, b_i2) : (a_i1, b_i1, a_i2, b_i2, r_i1, r_i2, d_i1, d_i2) ∈ VB_reg), wherein VB_reg is the ordered list of CDS proofs submitted by registered voters;
– O: an ordered list of elements in STARK's base field indicating the validity of ballot submissions made by registered voters; O_i equals the identity element if and only if V_i's submitted ballot is valid;
– π_cast: a STARK proof that the castVote routine is executed correctly by the aggregator for all registered voters.
Algorithm 4. Pseudocode for the castVotes method of the smart contract

Inputs: C_reg, VB'_reg, O, π_cast
  Assert Sender.Address = Self.Aggregator
  Assert T3 < Block.Number < T4
  Assert Self.Guard = 2
  Assert length(C_reg) = length(Self.VotingKeys)
  Assert length(C_reg) = length(VB'_reg) = length(O)
  Assert verifyCastVote(Self.VotingKeys, C_reg, VB'_reg, O, π_cast) = 1
  Valid := 1
  for i in 1 ... length(O) do
    if IsNotIdentity(O[i]) then
      Valid := 0
      Address := Self.Addresses[Self.VotingKeys[i]]
      Self.Deposit[Address] := 0        ▷ Seize deposit of a voter who registered but did not cast a valid ballot
    end if
  end for
  Assert Valid = 1
  Store C_reg
  Self.Guard := 3
4.5 Phase 4: Tallying Votes
After Phase 3 ends, the aggregator runs the voteTallying routine to tally the cast ballots. Then, the aggregator invokes the voteTallying method of the main smart contract with (r, π_tally), where:

– r: the tallying result output by the voteTallying routine,

r = \frac{1}{2}\left[\log_g\left(\prod_{c \in C_{reg}} c\right) + n\right],

wherein n is the number of registered voters;
– π_tally: a STARK proof that the voteTallying routine is executed correctly.
Algorithm 5. Pseudocode for the voteTallying method of the smart contract

Inputs: r, π_tally
  Assert Sender.Address = Self.Aggregator
  Assert T4 < Block.Number < T5
  Assert Self.Guard = 3
  Assert verifyVoteTallying(C_reg, r, π_tally) = 1
  Store r
  Self.Guard := 4
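The formula for r above simply converts the recovered exponent t = (#yes − #no) into the number of "yes" votes, since #yes = (t + n)/2; a quick numeric check:

```python
# r = (t + n) / 2 maps the exponent t = yes - no to the "yes" count.
# t + n = 2 * yes is always even, so the division is exact.
def yes_count(t, n):
    assert (t + n) % 2 == 0 and -n <= t <= n
    return (t + n) // 2

assert yes_count(1, 5) == 3      # 3 yes, 2 no  -> t = 1
assert yes_count(-5, 5) == 0     # unanimous no -> t = -5
assert yes_count(5, 5) == 5      # unanimous yes
```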
4.6 Phase 5: Refunding
Finally, each election participant can call the refund method to claim their deposit if it has not been seized by the smart contract due to misbehavior.
Algorithm 6. Pseudocode for the refund method of the smart contract

  Assert Block.Number > T5
  Assert Self.Deposit[Sender.Address] = 1
  Self.Deposit[Sender.Address] := 0
  Sender.Address.transfer(F0)
To evaluate our protocol, we implemented a proof of concept³, which includes a library called openvote containing the logic of our STARK-verifiable routines and a Solidity project called openvote-contracts containing all smart contracts used by our protocol.

³ https://github.com/catp3rson/E-Voting-System.
5 Evaluation

5.1 Theoretical Discussion
Table 3 shows a theoretical comparison between our protocol and those of [9], [10], and [5] regarding the functional and security requirements of interest. Because our protocol employs zk-STARK to enforce computational integrity, any misbehavior can be automatically detected without needing a dispute phase like that of [10]. Hence, dispute-freeness is retained. Unlike zk-SNARK, zk-STARK does not require trusted setups, which enables us to achieve scalability via zero-knowledge rollup without sacrificing transparency.

Table 3. Theoretical comparison between blockchain-based implementations of the Open Vote Network

Requirement            | Our protocol | McCorry [9] | Seifelnasr [10] | ElSheikh [5]
Dispute-freeness       | Yes          | Yes         | No              | Yes
Transparency           | Yes          | Yes         | Yes             | No
Scalability            | Yes          | No          | Yes             | Yes
Self-tallying          | Yes          | Yes         | Yes             | Yes
Perfect ballot secrecy | Yes          | Yes         | Yes             | Yes

5.2 Concrete Performance
In this section, we evaluate the three most resource-intensive STARK-verifiable routines of our system, namely merkle, schnorr, and cds (corresponding to the three modules of the same names in our openvote library). The presented results are measured for five batch sizes: 8, 16, 32, 64, and 128.
Fig. 4. Verification times with respect to batch size
Based on Fig. 4, it can be observed that the execution time of the STARK verifier is almost always less than that of the naïve verifier, which means we achieved time compression for all batch sizes through the use of zk-STARKs.
Fig. 5. STARK proof size concerning batch size
According to Fig. 5, proof size as a function of batch size grows polylogarithmically, as proved in [2]. Let PS(n) denote the proof size corresponding to a batch size of n. If n is large enough, PS(n) is always smaller than n · PS(1). Therefore, processing voters' data in batches on an aggregator node is more efficient than letting voters handle their data separately. This way, the overall communication complexity of our protocol can be significantly reduced.
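To see why batching wins, compare a polylogarithmic proof-size model against n separate proofs (the constants below are hypothetical illustrations, not measurements from our system):

```python
import math

# Hypothetical model PS(n) = a + b * log2(n)^2 (in KB); a and b are
# illustrative constants, not measured values.
a, b = 40.0, 6.0
def PS(n):
    return a + b * math.log2(n) ** 2

for n in (8, 16, 32, 64, 128):
    assert PS(n) < n * PS(1)     # one batched proof beats n individual proofs
```

Since PS grows polylogarithmically while n · PS(1) grows linearly, the gap only widens as the electorate gets larger.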
6 Conclusion
In this paper, we proposed a detailed design for a transparent and scalable e-voting system that incorporates the Open Vote Network protocol, zk-STARKs, and Ethereum smart contracts. A proof of concept was developed, and its concrete performance measured, to justify the feasibility of our proposed system. Regarding the design of our e-voting system, we plan to add more features to enhance its core qualities:

– Reinforcing security by adding mechanisms to report the aggregator to an on-chain authority for off-chain misbehaviors, e.g., blacklisting voters, abortive attacks, etc.
– Improving decentralization by allowing an election to have multiple aggregators.
– Reducing on-chain cost by minimizing the size of call data, especially public inputs, because they are stored on the blockchain for data availability.

Regarding performance, the current evaluation data has pointed out a few elements that need adjustment, such as the unnecessary use of zk-STARKs for tally-result validation, the inconsistent time rate of the verifier, etc. As a final point, we will consider building a more user-friendly interface for our system to make interacting with it easier for non-technical users.
Acknowledgment. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, for supporting this study.
References

1. Adida, B.: Helios: web-based open-audit voting. USENIX Security Symposium 17, 335–348 (2008)
2. Ben-Sasson, E., Bentov, I., Horesh, Y., Riabzev, M.: Scalable, transparent, and post-quantum secure computational integrity. Cryptology ePrint Archive (2018)
3. Cramer, R., Damgård, I., Schoenmakers, B.: Proofs of partial knowledge and simplified design of witness hiding protocols. In: Desmedt, Y.G. (ed.) CRYPTO 1994. LNCS, vol. 839, pp. 174–187. Springer, Heidelberg (1994). https://doi.org/10.1007/3-540-48658-5_19
4. Cramer, R., Gennaro, R., Schoenmakers, B.: A secure and optimally efficient multi-authority election scheme. Eur. Trans. Telecommun. 8(5), 481–490 (1997)
5. ElSheikh, M., Youssef, A.M.: Dispute-free scalable open vote network using zk-SNARKs. arXiv preprint arXiv:2203.03363 (2022)
6. Groth, J.: Efficient maximal privacy in boardroom voting and anonymous broadcast. In: Juels, A. (ed.) FC 2004. LNCS, vol. 3110, pp. 90–104. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-27809-2_10
7. Hao, F., Ryan, P.Y.A., Zieliński, P.: Anonymous voting by two-round public discussion. IET Inf. Secur. 4(2), 62–67 (2010)
8. Kiayias, A., Yung, M.: Self-tallying elections and perfect ballot secrecy. In: Naccache, D., Paillier, P. (eds.) PKC 2002. LNCS, vol. 2274, pp. 141–158. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45664-3_10
9. McCorry, P., Shahandashti, S.F., Hao, F.: A smart contract for boardroom voting with maximum voter privacy. In: Kiayias, A. (ed.) FC 2017. LNCS, vol. 10322, pp. 357–375. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70972-7_20
10. Seifelnasr, M., Galal, H.S., Youssef, A.M.: Scalable open-vote network on Ethereum. In: Bernhard, M., et al. (eds.) FC 2020. LNCS, vol. 12063, pp. 436–450. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-54455-3_31
DarkMDE: Excavating Synthetic Images for Nighttime Depth Estimation Using Cross-Domain Supervision

Thai Tran Trung, Huy Le Xuan, Minh Huy Vu Nguyen, Hiep Nguyen The, Nhat Huy Tran Hoang, and Duc Dung Nguyen(B)

Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, Ho Chi Minh City, Vietnam
{thai.tran241002,huy.lexuan2k,huy.vu.cse.9,hiep.nguyena113872,huy.tranacc13042,nddung}@hcmut.edu.vn
Abstract. Self-supervised monocular depth estimation (MDE) has recently gained significant attention, demonstrating remarkable results, particularly in daytime scenarios. However, MDE on nighttime images remains challenging due to the sensitivity of the photometric loss to noise and undiffused light illumination. In this paper, we propose a simple but highly effective representation using a novel Generative Adversarial Network (GAN). Specifically, we utilize the GAN to generate synthetic nighttime images from daytime images for network inference of depth. Additionally, we leverage daytime images to compute the photometric error, benefiting from the reliable performance of monocular depth estimation in the day domain. We further introduce a pseudo-depth constraint derived from the day image, ensuring depth consistency between paired daytime and nighttime images. Extensive experiments conducted on the Oxford RobotCar dataset validate the effectiveness of our approach, yielding competitive results compared to state-of-the-art methods.

Keywords: Monocular depth estimation · nighttime · self-supervised

1 Introduction
Monocular depth estimation is an important task in computer vision with a wide range of applications, such as 3D structure reconstruction, autonomous driving, and robotics. In supervised learning, LiDAR ground truth is often used to learn the mapping from a color image to a pixel-wise depth map. However, acquiring quality depth data for training is very expensive and time-consuming. Hence, self-supervised learning is a suitable approach, in which a depth map is used to reconstruct the target view from consecutive frames. This approach has achieved impressive results in recent years on well-lit datasets such as KITTI [2]. However, in an adverse environment such as night, the assumptions of a static scene and Lambertian surfaces no longer hold, leading to infinite-depth holes in the produced depth maps (Fig. 1).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 187, pp. 428–438, 2023. https://doi.org/10.1007/978-3-031-46573-4_39
Fig. 1. The problem of applying the photometric loss to night images. Car lights and glare break the static-world assumption of monocular depth estimation. This leads to errors when warping between consecutive frames based on pose estimation, and to the appearance of holes in the depth map.
Recently, many attempts have been made on this problem [6,9,11,13–15,17,18], most of which use domain adaptation to reduce the gap between the day and night domains. [13,15,17] adapt the model to the nighttime domain with guidance from the daytime domain in an adversarial manner. However, adversarial learning usually causes instability in the training process. Some other methods incorporate a Generative Adversarial Network into the training procedure [9,17]. Unfortunately, combining a GAN with a monocular depth estimation task takes a lot of resources as well as time to train. [14,18] focus on improving the quality of the input image for the model. Leveraging night data generated with CycleGAN [20], [6] trained a model that is robust in both the day and night domains. We observed that nighttime images suffer from many obstacles, such as underexposed/overexposed regions and noise (Fig. 1), making it difficult to learn structural information through the photometric constraint. In this paper, to reduce the effect of the adverse night environment, we propose a simple yet effective framework with the help of cross-domain supervision. Our framework utilizes the Monodepth2 architecture [4], which includes a Pose network and a Depth network. Our model directly learns features in the night domain; however, we do not apply the photometric constraint to nighttime images. Instead, we utilize the reference images in the day domain for optimization. First, we apply CycleGAN [20] to daytime photos to generate paired nighttime ones. The fake night images are fed to the Depth estimation network to learn the RGB-to-depth mapping. Since it is easier to learn in a clean environment, daytime photos are used for the Pose network for a better estimate of the camera translation between frames. We further use the depth map predicted from nighttime images to warp daytime ones when computing the photometric error between consecutive frames.
Moreover, we strengthen the learning result via a depth consistency loss between the paired day and night images. This approach ensures that features of the night domain can still be learned, yet
bypasses the lighting obstacles of the night domain to increase the accuracy of the photometric constraints. Our contributions can be summarized as follows:
– We propose a simple learning framework for the problem of nighttime monocular depth estimation. It utilizes synthetic nighttime images with the supervision of the paired daytime images.
– We mitigate the domain gap between day and night with a cross-domain supervision strategy, where daytime images are used for pose estimation and to compute the photometric loss based on the nighttime depth map. Furthermore, we introduce a pseudo-depth constraint derived from the day image, which helps ensure depth consistency between the paired daytime and nighttime images.
– We conducted extensive experiments on the Oxford RobotCar dataset to validate the effectiveness of our approach, demonstrating its competitive performance compared to state-of-the-art techniques.
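The cross-domain data flow described above can be sketched as follows. All "networks" here are hypothetical stand-ins (simple array operations), not the actual CycleGAN or Monodepth2 models; the sketch only shows how the synthetic night image, the night depth prediction, the day pseudo-depth, and the depth-consistency term fit together.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8  # toy image size

def cyclegan_day2night(day_img):
    """Stand-in for the day-to-night CycleGAN generator (crude darkening)."""
    return np.clip(day_img * 0.3, 0.0, 1.0)

def depth_net_night(night_img):
    """Stand-in for the nighttime depth network (fake per-pixel depth)."""
    return 1.0 + night_img.mean(axis=-1)

def depth_net_day(day_img):
    """Stand-in for the frozen daytime depth network giving pseudo-depth."""
    return 1.0 + day_img.mean(axis=-1) * 0.9

day_t = rng.random((H, W, 3))        # target daytime frame
night_t = cyclegan_day2night(day_t)  # paired synthetic nighttime frame
D_night = depth_net_night(night_t)   # depth predicted from the night image
D_pseudo = depth_net_day(day_t)      # pseudo-depth from the day image

# Depth-consistency term L_dc between the paired predictions.
L_dc = np.abs(D_night - D_pseudo).mean()
print(round(float(L_dc), 4))
```

The pose estimate and photometric warping (computed on the clean daytime frames) would plug into this flow in the same way, as described in Sect. 3.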
2 Related Works

2.1 Self-supervised Depth Estimation
Recently, various studies have been conducted on the topic of depth estimation. The first unsupervised depth estimation method was introduced as Monodepth [3], which relied on depth constraints derived from the known relative positioning of a stereo camera setup. [1,8,10,12,16] applied this method and achieved successful results. Following that concept, Shu et al. [19] extended it to consecutive images instead of stereo pairs. They introduced a pose estimation network that estimates the transformation matrix between the camera positions of two scenes, compensating for the lack of explicit camera position information. Monodepth2 [4] handles the problem of moving objects, which violates the assumption of self-supervised training, using a min-reprojection loss and a masking scheme to filter out unchanged pixels.

2.2 Nighttime Self-supervised Depth Estimation
Depth estimation performs well on daytime images, producing noticeable results. However, performing depth estimation in nighttime scenarios is challenging due to undiffused light illumination and noise. To handle these challenges, domain adaptation techniques have been applied in depth space [6,9,15,17]. These approaches assume that the model can discover domain-invariant features, such as structural information, from two different domains. However, scenes captured at different times often change, which can mislead the model when learning common features. Obtaining image pairs of similar scenes is also challenging in practice. Some methods employ Generative Adversarial Networks (GANs) to convert daytime images to nighttime or vice versa, creating paired images that can enhance pattern recognition [5,6,9,17].
3 Method
In this section, we present our DarkMDE framework in detail (illustrated in Fig. 2). Our framework contains a Pose network Φpose, a Depth estimation network for the night Φnight, and a Depth estimation network for the day Φday, for which we use a version pretrained on daytime images.
Fig. 2. Our training framework. Φcycle denotes the CycleGAN network that transfers the daytime domain to the nighttime domain. The cross-domain supervision module includes a Pose network Φpose, which uses the consecutive daytime images (I_t^d, I_{t-1}^d) and (I_t^d, I_{t+1}^d) to obtain the transformation matrices T_{t→t-1} and T_{t→t+1}. The Nighttime depth network predicts depth from the generated nighttime image; the depth map is then used in the cross-domain supervision module to project the source to the target frame for the photometric loss. The Daytime depth network predicts depth from the paired daytime image to ensure depth consistency in the scene via L_dc.
3.1 Self-supervised Training
We follow the unsupervised learning approach proposed in [4,19]. We consider it as a view-synthesis constraint on consecutive frames, where the target frame I_t is reconstructed from the viewpoint of the source image I_s, the corresponding depth D_t, and the relative pose T_{t→s}. Let p_s and p_t denote corresponding pixels in the source image I_s and the target image I_t, respectively. The relationship between the pixels of the two frames is expressed as follows:

p_s ∼ K T_{t→s} D_t K^{-1} p_t    (1)
Here, ∼ represents homogeneous equivalence, and T_{t→s} is the transformation matrix from the camera at the target image to the camera at the source image, predicted by the Pose network Φpose:

T_{t→s} = Φpose(I_s, I_t)    (2)
The output depth map D_t is estimated by the depth prediction network Φdepth:

D_t = Φdepth(I_t)    (3)
The target image can be reconstructed from the source image using the 2D coordinates of the projected depths D_t in I_s, produced by proj(), and the sampling operator ⟨·⟩:

Î_t = I_s⟨proj(D_t, T_{t→s}, K)⟩    (4)
In this approach, the intrinsic matrix K, the source image I_s, and the target image I_t are known. The quality of the reconstructed image is evaluated by the pixel-wise photometric error:

pe(Î_t, I_t) = (α/2)(1 − SSIM(Î_t, I_t)) + (1 − α)‖Î_t − I_t‖_1    (5)
Monodepth2 [4] recommends the min-photometric error to address the occlusion problem:

L_pe = min_{k∈{−1,+1}} pe(I_t, I_{t+k})    (6)
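Equations (5) and (6) can be sketched together as follows. Two simplifications are assumed for brevity: SSIM is computed globally over the image rather than in local windows, and the minimum in Eq. (6) is taken over scalar image-level errors rather than per pixel as in [4]. The weight α = 0.85 is the value commonly used in Monodepth-style losses, not stated in this paper.

```python
import numpy as np

ALPHA = 0.85  # assumed weight, as in Monodepth-style losses

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Simplified SSIM over the whole image (no local windows)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2*mx*my + c1) * (2*cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def pe(rec, target):
    """Photometric error of Eq. (5): weighted SSIM dissimilarity plus L1."""
    l1 = np.abs(rec - target).mean()
    return ALPHA / 2 * (1 - ssim_global(rec, target)) + (1 - ALPHA) * l1

rng = np.random.default_rng(0)
target = rng.random((8, 8))
rec_prev = target + 0.05 * rng.standard_normal((8, 8))  # good reconstruction
rec_next = target + 0.20 * rng.standard_normal((8, 8))  # occluded/noisy one

# Min-reprojection of Eq. (6): keep the smaller error over k in {-1, +1}
# (per image here; per pixel in the original formulation).
L_pe = min(pe(rec_prev, target), pe(rec_next, target))
print(round(float(L_pe), 4))
```

Taking the minimum lets the well-reconstructed neighbour dominate, so a frame occluded in one direction does not inflate the loss.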
Binary auto-masking [4] is used to filter out cases where the camera is stationary or an object is moving in the same direction as the camera:

M_auto = [ min_{k∈{−1,+1}} pe(I_t, I_{t+k→t})