Lecture Notes on Data Engineering and Communications Technologies 188
Nhu-Ngoc Dao Tran Ngoc Thinh Ngoc Thanh Nguyen Editors
Intelligence of Things: Technologies and Applications The Second International Conference on Intelligence of Things (ICIT 2023), Ho Chi Minh City, Vietnam, October 25–27, 2023, Proceedings, Volume 2
Lecture Notes on Data Engineering and Communications Technologies, Volume 188
Series Editor: Fatos Xhafa, Technical University of Catalonia, Barcelona, Spain
The aim of the book series is to present cutting edge engineering approaches to data technologies and communications. It will publish latest advances on the engineering task of building and deploying distributed, scalable and reliable data infrastructures and communication systems. The series will have a prominent applied focus on data technologies and communications with aim to promote the bridging from fundamental research on data science and networking to data engineering and communications that lead to industry products, business knowledge and standardisation. Indexed by SCOPUS, INSPEC, EI Compendex. All books published in the series are submitted for consideration in Web of Science.
Nhu-Ngoc Dao · Tran Ngoc Thinh · Ngoc Thanh Nguyen Editors
Intelligence of Things: Technologies and Applications The Second International Conference on Intelligence of Things (ICIT 2023), Ho Chi Minh City, Vietnam, October 25–27, 2023, Proceedings, Volume 2
Editors Nhu-Ngoc Dao Sejong University Seoul, Korea (Republic of) Ngoc Thanh Nguyen Wroclaw University of Science and Technology Wrocław, Poland
Tran Ngoc Thinh Vietnam National University Ho Chi Minh City (VNU-HCM) Ho Chi Minh City University of Technology (HCMUT) Ho Chi Minh City, Vietnam
ISSN 2367-4512 ISSN 2367-4520 (electronic) Lecture Notes on Data Engineering and Communications Technologies ISBN 978-3-031-46748-6 ISBN 978-3-031-46749-3 (eBook) https://doi.org/10.1007/978-3-031-46749-3 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.
Preface
This volume contains the proceedings of the Second International Conference on Intelligence of Things (ICIT 2023), hosted by Ho Chi Minh City University of Technology (HCMUT) in Ho Chi Minh City, Vietnam, October 25–27, 2023. The conference was co-organized by Ho Chi Minh City University of Technology (HCMUT), Hanoi University of Mining and Geology (HUMG), Vietnam National University of Agriculture (VNUA), Ho Chi Minh City Open University, and Quy Nhon University, Vietnam. In recent years, we have witnessed important changes and innovations that the Internet of things (IoT) enables for emerging digital transformations in human life. Continuing impressive successes of the IoT paradigms, things now require an intelligent ability while connecting to the Internet. To this end, the integration of artificial intelligence (AI) technologies into the IoT infrastructure has been considered a promising solution, which defines the next generation of the IoT, i.e., the intelligence of things (AIoT). The AIoT is expected to achieve more efficient IoT operations in manifolds such as flexible adaptation to environmental changes, optimal tradeoff decisions among various resources and constraints, and friendly human–machine interactions. In this regard, the ICIT 2023 was held to gather scholars who address the current state of technology and the outcome of ongoing research in the area of AIoT. The organizing committee received 159 submissions from 15 countries. Each paper was reviewed by at least two members of the program committee (PC) and external reviewers. Finally, we selected 71 best papers for oral presentation and publication. We would like to express our thanks to the keynote speakers: Ngoc Thanh Nguyen from Wroclaw University of Science and Technology, Poland, Emanuel Popovici from University College Cork, Ireland, and Koichiro Ishibashi from the University of ElectroCommunications, Japan, for their world-class plenary speeches. Many people contributed toward the success of the conference. First, we would like to recognize the work of the PC co-chairs for taking good care of the organization of the reviewing process, an essential stage in ensuring the high quality of the accepted papers. In addition, we would like to thank the PC members for performing their reviewing work with diligence. We thank the local organizing committee chairs, publicity chair, multimedia chair, and technical support chair for their fantastic work before and during the conference. Finally, we cordially thank all the authors, presenters, and delegates for their valuable contributions to this successful event. The conference would not have been possible without their support. Our special thanks are also due to Springer for publishing the proceedings and to all the other sponsors for their kind support.
Finally, we hope that ICIT 2023 contributed significantly to the academic excellence of the field and will lead to the even greater success of ICIT events in the future. October 2023
Nhu-Ngoc Dao Tran Ngoc Thinh Ngoc Thanh Nguyen
Organization
Organizing Committee

Honorary Chairs

Mai Thanh Phong – HCMC University of Technology, Vietnam
Thanh Hai Tran – Hanoi University of Mining and Geology, Vietnam
Thanh Thuy Nguyen – Vietnam National University, Vietnam
Nguyen Minh Ha – Ho Chi Minh City Open University, Vietnam
Do Ngoc My – Quy Nhon University, Vietnam
Nguyen Thi Lan – Vietnam National University of Agriculture, Vietnam
General Chairs

Nhu-Ngoc Dao – Sejong University, South Korea
Quang-Dung Pham – Vietnam National University of Agriculture, Vietnam
Hong Anh Le – Hanoi University of Mining and Geology, Vietnam
Tran Vu Pham – HCMC University of Technology, Vietnam
Koichiro Ishibashi – The University of Electro-Communications, Japan
Truong Hoang Vinh – Ho Chi Minh City Open University, Vietnam
Program Chairs

Takayuki Okatani – Tohoku University, Japan
Pham Quoc Cuong – HCMC University of Technology, Vietnam
Tran Ngoc Thinh – HCMC University of Technology, Vietnam
Ing-Chao Lin – National Cheng Kung University, Taiwan
Shin Nakakima – National Institution of Informatics, Japan
Le Xuan Vinh – Quy Nhon University, Vietnam
Steering Committee

Ngoc Thanh Nguyen (Chair) – Wroclaw University of Science and Technology, Poland
Hoang Pham – Rutgers University, USA
Sungrae Cho – Chung-Ang University, South Korea
Hyeonjoon Moon – Sejong University, South Korea
Jiming Chen – Zhejiang University, China
Dosam Hwang – Yeungnam University, South Korea
Gottfried Vossen – Muenster University, Germany
Manuel Nunez – Universidad Complutense de Madrid, Spain
Torsten Braun – University of Bern, Switzerland
Schahram Dustdar – Vienna University of Technology, Austria
Local Organizing Chairs

Pham Hoang Anh – HCMC University of Technology, Vietnam
Le Trong Nhan – HCMC University of Technology, Vietnam
Phan Tran Minh Khue – Ho Chi Minh City Open University, Vietnam
Publication Chairs

Vo Nguyen Quoc Bao – Posts and Telecommunications Institute of Technology, Vietnam
Laihyuk Park – Seoul National University of Science & Technology, South Korea
Ho Van Lam – Quy Nhon University, Vietnam
Finance Chair

Nguyen Cao Tri – HCMC University of Technology, Vietnam
Publicity Chairs

Tran Trung Hieu – University of Stuttgart, Germany
Phu Huu Phung – University of Dayton, USA
Woongsoo Na – Kongju National University, South Korea
Trong-Hop Do – University of Information Technology VNU-HCM, Vietnam
Track Chairs

Quan Thanh Tho – HCMC University of Technology, Vietnam
Ngoc Thanh Dinh – Soongsil University, South Korea
Tran Minh Quang – HCMC University of Technology, Vietnam
Nguyen Huu Phat – Hanoi University of Science and Technology, Vietnam
Mai Dung Nguyen – Hanoi University of Mining and Geology, Vietnam
Le Tuan Ho – Quy Nhon University, Vietnam
Pham Trung Kien – HCMC International University, Vietnam
Khac-Hoai Nam Bui – Viettel Cyberspace Center, Vietnam
Thuy Nguyen – RMIT University, Vietnam
Hong Phan Thi Thu – FPT University, Vietnam
Van-Phuc Hoang – Le Quy Don Technical University, Vietnam
Kien Nguyen – Chiba University, Japan
Program Committee Daisuke Ishii Takako Nakatani Kozo Okano Kazuhiko Hamamoto Tran Van Hoai Hiroshi Ishii Cong-Kha Pham Man Van Minh Nguyen Shigenori Tomiyama Le Thanh Sach Quan Thanh Tho Minh-Triet Tran Pham Tran Vu Nguyen Duc Dung Nguyen An Khuong Nguyen-Tran Huu-Nguyen Nguyen Le Duy Lai Van Sinh Nguyen
Tuan Duy Anh Nguyen Jae Young Hur Minh Son Nguyen Truong Tuan Anh Tran Minh Quang Vo Thi Ngoc Chau Le Thanh Van Hoa Dam Surin Kittitornkun Tran Manh Ha Luong Vuong Nguyen Tri-Hai Nguyen Nguyen Tien Dat Duong Huu Thanh Le Viet Tuan Denis Hamad Fadi Dornaika Thongchai Surinwarangkoon
Kittikhun Meethongjan Daphne Teck Ching Lai Meirambek Zhaparov Minh Ngoc Dinh Tsuchiya Takeshi Thuy Nguyen Trang T. T. Do Ha X. Dang Nguyen Trong Kuong Nguyen Hoang Huy Quang-Dung Pham
Nguyen Doan Dong Tran Duc Quynh Phan Thi Thu Hong Nguyen Huu Du Nguyen Van Hanh Hirokazu Doi John Edgar S. Anthony Long Tan Le Tran Vinh Duc Le Duc Hung
Contents
AIoT Services and Applications

Investigating Ensemble Learning Methods for Predicting Water Quality Index . . . . . 3
Huu-Du Nguyen and Thi-Thu-Hong Phan

Age-Invariant Face Recognition Based on Self-Supervised Learning . . . . . 13
Minh Le Quang, Mi Ton Nu Quyen, Nguyen Nguyen Lam, Trung Nguyen Quoc, and Vinh Truong Hoang

Detection of Kidney Stone Based on Super Resolution Techniques and YOLOv7 Under Limited Training Samples . . . . . 23
Minh Tai Pham Nguyen, Viet Tuan Le, Huu Thanh Duong, and Vinh Truong Hoang

Hardware-Based Lane Detection System Architecture for Autonomous Vehicles . . . . . 34
Duc Khai Lam, Pham Thien Long Dinh, and Thi Ngoc Diem Nguyen

Video Classification Based on the Behaviors of Children in Pre-school Through Surveillance Cameras . . . . . 45
Tran Gia The Nguyen, Pham Phuc Tinh Do, Dinh Duy Ngoc Cao, Huu Minh Tam Nguyen, Huynh Truong Ngo, and Trong-Hop Do

Land Subsidence Susceptibility Mapping Using Machine Learning in the Google Earth Engine Platform . . . . . 55
Van Anh Tran, Thanh Dong Khuc, Trung Khien Ha, Hong Hanh Tran, Thanh Nghi Le, Thi Thanh Hoa Pham, Dung Nguyen, Hong Anh Le, and Quoc Dinh Nguyen

Building an AI-Powered IoT App for Fall Detection Using Yolov8 Approach . . . . . 65
Phu Hoang Ng, An Mai, and Huong Nguyen

Seam Puckering Level Classification Using AIoT Technology . . . . . 75
Duc Dang Khoi Nguyen, Tan Duy Le, An Mai, Thi Van Khanh Nguyen, Song Thanh Quynh Le, Duc Duy Nguyen, and Kha-Tu Huynh
Classification of Pneumonia on Chest X-ray Images Using Transfer Learning . . . . . 85
Nguyen Thai-Nghe, Nguyen Minh Hong, Pham Thi Bich Nhu, and Nguyen Thanh Hai

Bayesian Approach for Static Object Detection and Localization in Unmanned Ground Vehicles . . . . . 94
Luan Cong Doan, Hoa Thi Tran, and Dung Nguyen
Diabetic Retinopathy Diagnosis Leveraging Densely Connected Convolutional Networks and Explanation Technique . . . . . . . . . . . . . . . . . . . . . . . . 105 Ngoc Huynh Pham and Hai Thanh Nguyen Deep Learning Approach for Inundation Area Detection Using Sentinel Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 Giang Tran, Hoa T. Tran, Huong Tran, Long Hoang Nguyen, Hong Anh Le, and Dung Nguyen Classification of Raisin Grains Based on Ensemble Learning Techniques in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 Nguyen Huu Hai, Nguyen Xuan Thao, Tran Duc Quynh, Pham Quang Dung, Nguyen Doan Dong, Tran Trung Hieu, and Hoang Thi Huong An Effective Deep Learning Model for Detecting Plant Diseases Using a Natural Dataset for the Agricultural IoT System . . . . . . . . . . . . . . . . . . . . . . . . . . 136 Tu-Anh Nguyen, Trong-Minh Hoang, and Duc-Minh Tran Real-Time Air Quality Monitoring System Using Fog Computing Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 Tan Duy Le, Nguyen Binh Nguyen Le, Nhat Minh Quang Truong, Huynh Phuong Thanh Nguyen, and Kha-Tu Huynh An Intelligent Computing Method for Scheduling Projects with Normally Distributed Activity Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 Nguyen Hai Thanh Security and Privacy An Improved Hardware Architecture of Ethereum Blockchain Hashing System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 Duc Khai Lam, Quoc Linh Phan, Quoc Truong Nguyen, and Van Quang Tran
CSS-EM: A Comprehensive, Secured and Sharable Education Management System for Schools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 Van Duy Tran, Thi Hong Tran, Shingo Ata, Duc Khai Lam, and Hoai Luan Pham A High-Speed Barret-Based Modular Multiplication with Bit-Correction for the CRYSTAL-KYBER Cryptosystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 Trong-Hung Nguyen, Cong-Kha Pham, and Trong-Thuc Hoang Securing Digital Futures: Exploring Decentralised Systems and Blockchain for Enhanced Identity Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 Hoang Viet Anh Le, Quoc Duy Nam Nguyen, Thi Hong Tran, and Tadashi Nakano Enhancing Blockchain Interoperability Through Sidechain Integration and Valid-Time-Key Data Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 Tuan-Dung Tran, Kiet Anh Vo, Nguyen Binh Thuc Tram, Ngan Nguyen Bui Kim, Phan The Duy, and Van-Hau Pham An IoT Attack Detection Framework Leveraging Graph Neural Networks . . . . . 225 Iram Bibi, Tanir Ozcelebi, and Nirvana Meratnia Network Attack Detection on IoT Devices Using 2D-CNN Models . . . . . . . . . . . 237 Duc-Minh Ngo, Dominic Lightbody, Andriy Temko, Cuong Pham-Quoc, Ngoc-Thinh Tran, Colin C. Murphy, and Emanuel Popovici EVB - Electronic Voting System Based on Blockchain Technology . . . . . . . . . . . 248 Hoang-Long Duong, Minh-Chau Nguyen-Ngoc, Thu Nguyen, Khoa Tan Vo, Tu-Anh Nguyen-Hoang, and Ngoc-Thanh Dinh ZUni: The Application of Blockchain Technology in Validating and Securing Educational Credentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258 Minh-Chau Nguyen-Ngoc, Thao-Vinh Tran, Thu Nguyen, Khoa Tan Vo, Tu-Anh Nguyen-Hoang, and Ngoc-Thanh Dinh SDN-Based Cyber Deception Deployment for Proactive Defense Strategy Using Honey of Things and Cyber Threat Intelligence . . . . . . . . . . . . . . . . . . . . . . 269 Nghi Hoang Khoa, Hien Do Hoang, Khoa Ngo-Khanh, Phan The Duy, and Van-Hau Pham Intermed: An Efficient Interoperable Blockchain Protocol for Electronic Health Record Transferring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279 Dung Truong-Viet, Hong Tai Tran, and Khuong Nguyen-An
A Blockchain-Based IoT System for Secure Attendance Management . . . . . . . . . 294 Vo-Trung-Dung Huynh An IDS-Based DNN Model Deployed on the Edge Network to Detect Industrial IoT Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307 Trong-Minh Hoang, Thanh-Tra Nguyen, Tuan-Anh Pham, and Van-Nhan Nguyen Design of a Secure Firmware Over-The-Air for Internet of Things Systems . . . . 320 Duc-Hung Le and Van-Nhi Nguyen An IDS-Based DNN Utilized Linear Discriminant Analysis Method to Detect IoT Attacks in Edge Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332 Minh-Hoang Nguyen, Van-Nhan Nguyen, Nam-Hoang Nguyen, Sinh-Cong Lam, and Trong-Minh Hoang Applying the Distributed Ledger Technology for the Product Origin Traceability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342 Ho Anh Thi and Nguyen Duc Thai Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
AIoT Services and Applications
Investigating Ensemble Learning Methods for Predicting Water Quality Index
Huu-Du Nguyen(1) and Thi-Thu-Hong Phan(2)(B)
(1) School of Applied Mathematics and Informatics, Hanoi University of Science and Technology, Hanoi, Vietnam
(2) Artificial Intelligence Department, FPT University, Danang, Vietnam
[email protected]
Abstract. Water Quality Index (WQI) is a universal indicator for assessing the quality of water from different sources. It converts several physical, chemical, and biological parameters in water into a single value that can be used to represent water quality. In the calculation of WQI, the traditional method has several limitations. Recently, machine learning techniques have emerged as an effective method in water resources management. This study aims to investigate the performance of ensemble tree-based algorithms for predicting the WQI. The experiments are conducted based on the data collected from the An Kim Hai system, an important irrigation system in the north of Vietnam, with different ensemble learning algorithms, including Bagging, Random Forest, Extra Trees, AdaBoost, XGBoost, and Decision Tree. The obtained results show that the use of the Random Forest algorithm leads to the best performance in predicting the WQI value from the system.

Keywords: Water quality index · Ensemble learning · Machine learning · Bagging method · Boosting method

H. Du Nguyen and T. Thu Hong Phan—Contributed equally to this research.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 188, pp. 3–12, 2023. https://doi.org/10.1007/978-3-031-46749-3_1

1 Introduction
Water pollution has negative effects on many aspects of human life, leading to unexpected problems in human health, reducing agricultural and industrial output, and degrading the environment. Therefore, evaluating the quality of water is an important task in water resource management. A proper evaluation of water quality is key to helping managers plan water use efficiently or come up with solutions to improve water quality in a timely manner. The quality of water is determined by a number of parameters. It can be physical parameters like temperature, turbidity, and suspended solids; it can be chemical parameters like pH, dissolved oxygen (DO), biochemical oxygen demand (BOD), and chemical oxygen demand (COD); it can be biological parameters like coliforms and bacteria; or it can be pesticides and trace variables. That is to say, there are many parameters that need to be determined to
evaluate the quality of water. Even if these parameters are specified, it is not straightforward to give a threshold for each parameter. This makes the water quality assessment difficult. The Water Quality Index (WQI) is then introduced to solve the problem. By integrating several quantitative and intensive parameters into a single value, the index provides an effective tool to describe the general health or status of a water body. A typical WQI value is between 0 and 100, where the larger the WQI value, the better the water quality. In the literature, there is no unique formula to calculate the WQI; instead, this formula varies from country to country. Since the first introduction in [1], to date, over 20 WQI models have been developed by states or organizations worldwide. A review of WQI models and their use for assessing surface water quality was conducted in [18]. However, the calculation of a WQI can be separated into four main stages: selecting the parameters, generating sub-indices for each parameter, computing the parameter weighting values, and aggregating sub-indices for the final overall WQI. This calculation, in general, is considered lengthy and requires significant time and effort. Moreover, many studies have revealed recently that the WQI model produced remarkable uncertainty in its modeling process. The generation of model uncertainty can come from the sub-index functions, the weight generation methods, and the structure of the aggregation function [12]; namely, the uncertainty in WQI models is often related to various stages of a model and its processes. As a consequence, the WQI model may not reflect accurate water quality attributes. In recent years, machine learning (ML) methods have been used widely as an alternative approach for reducing the model uncertainty and for computationally efficient and accurate estimation of the WQI. These techniques have been proven to have the power for modeling complex non-linear behaviors in water-resource research [15]. Further details of a variety of machine learning models and their use for predicting the WQI are available at [2,9,11,16]. As discussed in [7], each ML algorithm has its strengths and shortcomings. Among many ML algorithms, ensemble tree-based algorithms are considered potentially useful for predicting WQI. In machine learning, ensemble learning is a powerful technique, highly effective in many fields. Several ensemble techniques have been applied for the water quality assessment, see, for example, [8,10], and [6]. In this study, we aim to carry out a comparative study of the ensemble tree-based methods for WQI prediction. To be more specific, we investigate the performance of several ensemble learning models, including Random Forest (RF), Extra Trees, XGBoost, Bagging, and AdaBoost along with decision tree (DT), to determine the best-performing models in estimating the WQI values. The use of these several methods is to find out which ensemble model is suitable for the WQI estimation problem. The experiments are conducted based on a dataset collected from the An Kim Hai system, an irrigation system in the north of Vietnam.
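To make the traditional four-stage calculation concrete, the sketch below shows a generic weighted-aggregation WQI: each parameter is mapped to a 0–100 sub-index by interpolation between breakpoints, and the sub-indices are combined with a weighted sum. The breakpoints and weights are purely illustrative placeholders and are not the values prescribed by the VN_WQI or any other official WQI model.

```python
import numpy as np

# Illustrative sub-index breakpoints: (parameter value, sub-index score).
# These numbers are placeholders, not the thresholds of any official WQI model.
SUB_INDEX_TABLE = {
    "DO":   [(0, 0), (4, 25), (6, 75), (8, 100)],     # mg/L, higher is better
    "BOD5": [(3, 100), (6, 75), (15, 25), (30, 0)],   # mg/L, lower is better
    "pH":   [(5.5, 25), (7.0, 100), (8.5, 25)],       # best near neutral
}

# Illustrative weights for the aggregation stage (they sum to 1).
WEIGHTS = {"DO": 0.4, "BOD5": 0.4, "pH": 0.2}

def sub_index(param: str, value: float) -> float:
    """Stage 2: convert a raw measurement into a 0-100 sub-index by interpolation."""
    xs, ys = zip(*sorted(SUB_INDEX_TABLE[param]))   # breakpoints sorted by value
    return float(np.interp(value, xs, ys))

def wqi(sample: dict) -> float:
    """Stages 3-4: weight each sub-index and aggregate into a single 0-100 score."""
    return sum(WEIGHTS[p] * sub_index(p, v) for p, v in sample.items())

print(wqi({"DO": 6.2, "BOD5": 9.6, "pH": 6.9}))     # e.g. around 72 with these placeholders
```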
The rest of the paper is organized as follows. Section 2 briefly describes the methods applied in the study. The experiments and obtained results are presented in Sect. 3. Some concluding remarks are given in Sect. 4.
2 Methodology
In this section, we present briefly the ensemble machine learning algorithms applied to calculate the WQI value in the study.

2.1 Decision Tree
Decision Tree (DT) refers to a supervised learning algorithm in machine learning. It creates a model that can be used to predict the class (i.e., classification) or value of the target variable (i.e., regression) by learning simple decision rules inferred from data. A DT model is built by recursively splitting the training samples using the features from the data that work best for the specific task. It consists of two basic elements, namely nodes and branches. At each node, a feature is evaluated to split the data in the training process or to make a specific data point follow a certain path through the tree when making a prediction. The predictions of a category or a numerical value are made at the leaf nodes, the final nodes of the tree. The split in each node is done relying on certain metrics, like the Gini index or the entropy for categorical decision trees, or the residual or mean squared error for regression trees. The DT algorithm is simple to understand and interpret; however, it is quite prone to overfitting to the training data and can be sensitive to outliers. It is also considered a weak learner, as a single decision tree normally does not make great predictions. This suggests combining multiple trees to introduce stronger ensemble models.

2.2 Bagging (Bootstrapped Aggregating)
Bagging belongs to ensemble machine learning methods that build a large number of models in order to improve the stability and accuracy of predictions. Each model is built on a random set of bootstrapped samples from the training dataset with replacement. These models are trained independently and in parallel, and then their outputs are averaged to get the final result for regression tasks and the most frequently predicted by the individual models for classification tasks. To create a bagging model, several base machine learning algorithms can be used such as support vector machines, decision trees, and neural networks. It is well documented that bagging can reduce prediction error, being the most effective for decision trees. This algorithm also reduces the risk of overfitting by training each individual model on a different sample of the data.
2.3 Random Forest
Random Forest (RF) is a popular ensemble method that combines many randomized decision trees and aggregates their predictions by averaging. Each decision tree in the forest is created as follows. Firstly, RF takes randomly n samples from the original dataset using the Bootstrapping technique. Once a data sample is selected, it is not removed from the original data set, but it is retained. The step is repeated until there are n data points. In this way, the new dataset of n data samples may contain duplicate data samples. Then, at each node, only a random subset of features is considered when splitting the data. This helps to decrease overfitting and increase the diversity of the trees in the forest. To split the data at each node, RF is based on the feature that provides the most considerable reduction of the impurity such as Gini impurity or entropy measure. The process of building a tree is repeated for multiple trees using a different random subset of data samples and features. Finally, the prediction results are then aggregated from all decision trees through a majority vote for classification problems, or by averaging the outputs for regression problems.

2.4 Extra Trees
Like the RF algorithm, the extra trees (ET) algorithm [5] creates many decision trees, but the sampling for each tree is random, without replacement. This creates a dataset for each tree with unique samples. A specific number of features, from the total set of features, are also selected randomly for each tree. The most important and unique characteristic of extra trees is the random selection of a splitting value for a feature. Instead of calculating a locally optimal value using Gini or entropy to split the data, the algorithm randomly selects a split value. This makes the trees diversified and uncorrelated. It is considered an extension of the RF algorithm and is sometimes referred to as an “extremely randomized” version of RF. The ET algorithm is fast and efficient, and its randomization of feature and data subsets for each tree reduces overfitting, making it a good choice for large datasets with a high number of features. Additionally, it has been shown to be highly effective for both regression and classification tasks.

2.5 Adaptive Boosting
Adaptive Boosting (AdaBoost) [4] is one of the first boosting algorithms used as an ensemble method in machine learning. Unlike the RF algorithm where trees are built independently of each other, AdaBoost adds a new tree to complement already built ones. More specifically, boosting refers to a machine-learning technique that creates ensemble members sequentially. The newest member is created to compensate for the instances incorrectly labeled by the previous learners. Following this idea, when building the AdaBoost algorithm, a variety of classifiers are constructed sequentially by focusing the underlying learning algorithm on
those training data that have been misclassified by previous classifiers. That is, AdaBoost combines multiple “weak classifiers” into a single “strong classifier”. AdaBoost has exhibited outstanding performance on several problems thanks to its ability to handle noisy data, deal with non-linearly separable data, and adapt to changes in the distribution of the training data over time. The algorithm is also computationally efficient, as it only requires a small number of weak learners to achieve good performance. However, it is sensitive to outliers and can be vulnerable to overfitting in some cases.

2.6 XGBoost (eXtreme Gradient Boosting)
eXtreme Gradient Boosting (XGBoost) [3] is an implementation of a generalized gradient boosting algorithm, a variation on boosting that represents the learning problem as gradient descent on some arbitrary differentiable loss function that measures the performance of the model on the training data. As discussed in [17], XGBoost is considered to perform better than other tree-boosting methods due to a number of reasons, including (1) the introduction of the regularised loss function, (2) the scale down of the weights of each new tree by a given constant which reduces an influence of a single tree on the final score, and (3) the columnsampling which works in a similar way as random forests. As a result, several advantages of XGBoost can be mentioned like the ability to handle missing data, support for both linear and tree-based models, and efficient implementation of parallel and distributed computing for faster training times. It also includes a variety of hyperparameters for fine-tuning the model, such as the learning rate, the number of trees, and the size of each tree. The algorithm has been shown to achieve state-of-the-art performance on many machine learning benchmarks and is widely used in industry and academia.
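As a concrete illustration of how the six models described above can be instantiated, the sketch below builds them with scikit-learn and the xgboost package. The hyperparameters (number of trees, learning rate, etc.) are not reported in this paper, so the values shown are common defaults and should be treated as placeholders, not the tuned settings used in the experiments.

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              ExtraTreesRegressor, AdaBoostRegressor)
from xgboost import XGBRegressor

def build_models(random_state: int = 42) -> dict:
    """Return the six tree-based regressors compared in this study.

    Hyperparameters are illustrative defaults, not the values used by the authors.
    """
    return {
        "DT": DecisionTreeRegressor(random_state=random_state),
        # BaggingRegressor uses a decision tree as its base estimator by default.
        "Bagging": BaggingRegressor(n_estimators=100, random_state=random_state),
        "RF": RandomForestRegressor(n_estimators=100, random_state=random_state),
        "ET": ExtraTreesRegressor(n_estimators=100, random_state=random_state),
        "AdaBoost": AdaBoostRegressor(n_estimators=100, random_state=random_state),
        "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1,
                                random_state=random_state),
    }
```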
3 Experiments and Results

3.1 Data Description
In order to evaluate and compare the performance of ensemble learning methods, we use a dataset collected from An Kim Hai, an important irrigation system in the north of Vietnam. It is about 36,570 hectares, responsible for irrigating and draining 15,946 hectares of cultivation and supplying Hai Phong City with about 80% of the domestic water. The dataset has been collected for 12 consecutive years from 2007 to 2020. It consists of a total of 657 samples, and each sample contains information on 10 water parameters, including pH, Turbidity (Turb), Temperature, Total Suspended Solids (TSS), Dissolved Oxygen (DO), Biological Oxygen Demand (BOD5), Nitrogen (NH₄⁺-N), Phosphates (PO₄³⁻-P), Chemical Oxygen Demand (COD), and Total Coliform (TC). These are the most popularly used parameters to calculate the WQI in Vietnam, called VN WQI, according to the technical guideline for calculation and publication of the VN WQI issued by the Ministry of Natural Resources and Environment (MONRE) of Vietnam [13]. The descriptive statistics of the dataset have been presented in Table 1.
Table 1. Descriptive statistics of water quality parameters in the dataset

Variables     Units     Min     Max        Median   Mean    STD
pH            pH unit   5.1     7.9        6.9      6.8     0.4
Turb          NTU       5.4     642        96       114.6   67.5
TSS           mg/L      2.3     1247.8     97       110.4   116.8
NH₄⁺-N        mg/L      0.004   37         0.9      2.2     4.7
PO₄³⁻-P       mg/L      0.01    5.5        0.1      0.5     0.9
DO            mg/L      1.5     8.5        6.2      5.3     1.9
COD           mg/L      4.8     580        20.7     29.3    35.4
BOD5          mg/L      2.0     130.6      9.6      13.2    12.2
TC            MPN       45      5400000    5100     50003   332220
Temperature   °C        24.5    36.0       27.8     27.5    1.1

3.2 Metrics to Evaluate the Performance of ML Models
In this study, five popular metrics are applied to evaluate the performance of the machine learning models. Given a set of predicted values $y_1, \ldots, y_n$ and the actual values $x_1, \ldots, x_n$, these metrics are defined as follows.

i. Similarity measures the percentage of similar values between the predicted values y and the actual values x by the formula
$$\mathrm{Sim}(y, x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{1 + \frac{|y_i - x_i|}{\max(x) - \min(x)}}.$$

ii. R is the correlation coefficient between two variables y and x. This indicator makes it possible to assess the quality of a forecasting model.

iii. Mean Absolute Error (MAE) represents the average difference between the predicted values y and the actual ones, x. That is
$$\mathrm{MAE}(y, x) = \frac{1}{n}\sum_{i=1}^{n}|y_i - x_i|.$$

iv. Root Mean Square Error (RMSE) measures the average squared difference between the forecast values y and the respective true values x. It is computed by the formula
$$\mathrm{RMSE}(y, x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - x_i)^2}.$$

v. Nash Sutcliffe efficiency (NSE) is used to evaluate the predictive ability of hydrological models. The NSE values range from $-\infty$ to 1, with higher values meaning a better fit between observed and forecast values [14]:
$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}(x_i - y_i)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2},$$
where $\bar{x}$ denotes the mean of the actual values.
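A direct NumPy implementation of these metrics is sketched below; it assumes that the predictions y and the ground-truth values x are one-dimensional arrays, and the correlation coefficient R is taken from np.corrcoef.

```python
import numpy as np

def similarity(y, x):
    """Sim(y, x): mean of 1 / (1 + |y_i - x_i| / (max(x) - min(x)))."""
    scale = np.max(x) - np.min(x)
    return float(np.mean(1.0 / (1.0 + np.abs(y - x) / scale)))

def mae(y, x):
    """Mean absolute error between predictions and observations."""
    return float(np.mean(np.abs(y - x)))

def rmse(y, x):
    """Root mean square error between predictions and observations."""
    return float(np.sqrt(np.mean((y - x) ** 2)))

def nse(y, x):
    """Nash-Sutcliffe efficiency: 1 - SSE / total variance of the observations."""
    return float(1.0 - np.sum((x - y) ** 2) / np.sum((x - np.mean(x)) ** 2))

def r_coef(y, x):
    """Pearson correlation coefficient R between predictions and observations."""
    return float(np.corrcoef(y, x)[0, 1])
```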
These metrics can be divided roughly into two types of different meanings. The first type is with MAE and RMSE, implying that the smaller these values, the better the machine learning models. The second type consists of NSE, R, and Similarity, meaning that the larger these values are, the more efficient the machine learning models are.

3.3 The Performance of Ensemble Methods on Predicting the WQI
For the experiment, we first split the data into the training set and the testing set in the ratio 2:1. That is, about 67% of the data is to train the machine learning models while the remaining 33% of the data is to test the performance of the trained models. The input data of 10 water parameters from the An Kim Hai system are used to train all six models. After training, these models are used to compute the WQI with the input from the testing set. Then, we compare the predicted WQI values with the true WQI values based on the five metrics mentioned above. The obtained results presented in Table 2 show that, in general, the ensemble tree-based models perform well in predicting the WQI. For example, between the predicted WQI and the true WQI, the correlation coefficients are all greater than 85%, and the Similarities are all greater than 90% for all the models applied in the study. Among these models, RF produces the best performance with the largest values of Similarity (0.942), R (0.9), and NSE (0.8), and the smallest values of RMSE (11.58) and MAE (6.72). The use of XGBoost also leads to similar performance as RF, since the values of all five metrics given by XGBoost are close to the ones given by RF. Meanwhile, AdaBoost proved to be the least effective among all the models. In addition, Table 2 shows the consistency of all five metrics. For example, for RF, the observed metrics are either the largest (Similarity, R, and NSE) or the smallest (MAE and RMSE); for XGBoost, all the metrics are the second best, etc.

Table 2. Performance comparison of ensemble learning methods in predicting WQI

Method      MAE     RMSE    NSE    R      Similarity
DT          8.81    13.31   0.74   0.86   0.924
Bagging     7.5     12.5    0.78   0.88   0.935
RF          6.72    11.58   0.80   0.90   0.942
ET          9.12    13.00   0.75   0.87   0.92
AdaBoost    10.05   13.33   0.74   0.86   0.911
XGBoost     7.1     12      0.78   0.89   0.938
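Putting the pieces together, the short loop below sketches the evaluation protocol described above: a 2:1 train/test split, fitting each model on the 10 input parameters, and scoring the predictions with the five metrics. It reuses the build_models and metric helpers sketched earlier; X and y stand for the feature matrix and the corresponding WQI targets loaded from the (non-public) An Kim Hai dataset, and the random seed is a placeholder since the paper does not specify one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: replace with the 657 x 10 parameter matrix and WQI targets.
X = np.random.rand(657, 10)
y = np.random.rand(657) * 100

# 2:1 split, i.e. roughly 67% training and 33% testing data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

for name, model in build_models().items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name:9s} MAE={mae(pred, y_test):6.2f}  RMSE={rmse(pred, y_test):6.2f}  "
          f"NSE={nse(pred, y_test):5.2f}  R={r_coef(pred, y_test):5.2f}  "
          f"Sim={similarity(pred, y_test):.3f}")
```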
To visualize the obtained results, we present in Figs. 1 and 2 the difference between the predicted WQI values and the true WQI values for the “worst” algorithm, i.e. AdaBoost, and the “best” one, i.e. RF, in the study. It can be seen that, for the case of the RF algorithm (Fig. 2), the predicted values are quite close to the true values, while these values are quite far apart in the case of AdaBoost (Fig. 1).
Fig. 1. True value versus AdaBoost (predicted WQI over the test samples)
Fig. 2. True value versus Random Forest (predicted WQI over the test samples)
4 Conclusions
In this study, we have investigated the performance of machine learning methods in predicting WQI. In particular, we have focused on ensemble tree-based models, including RF, ET, Bagging, XGBoost, and AdaBoost. The experiments have been conducted relying on a dataset collected from the An Kim Hai irrigation system in Vietnam. The obtained results show that these ensemble models can estimate the WQI value accurately, with the use of the RF and XGBoost algorithms leading to the best performance. That means the ensemble models can be considered as an alternative method to calculate the WQI value, avoiding the uncertainty and the requirement of computing many sub-indices in the traditional method. This finding could be useful for environmental engineers choosing an effective model for WQI estimation.

Acknowledgement. The authors would like to thank Associate Professor Bui Quoc Lap (Faculty of Chemistry & Environment, ThuyLoi University, Hanoi, Vietnam) for sharing the dataset to conduct this study.
References 1. Brown, R.M., McClelland, N.I., Deininger, R.A., Tozer, R.G.: A water quality index-do we dare. Water Sewage Works 117(10) (1970) 2. Bui, D.T., Khosravi, K., Tiefenbacher, J., Nguyen, H., Kazakis, N.: Improving prediction of water quality indices using novel hybrid machine-learning algorithms. Sci. Total Environ. 721, 137612 (2020) 3. Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016) 4. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997) 5. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63, 3–42 (2006) 6. Khan, M.S.I., Islam, N., Uddin, J., Islam, S., Nasir, M.K.: Water quality prediction and classification based on principal component regression and gradient boosting classifier approach. J. King Saud Univ.-Comput. Inform. Sci. 34(8), 4773–4781 (2022) 7. Khoi, D.N., Quan, N.T., Linh, D.Q., Nhi, P.T.T., Thuy, N.T.D.: Using machine learning models for predicting the water quality index in the la Buong river, Vietnam. Water 14(10), 1552 (2022) 8. Khullar, S., Singh, N.: Machine learning techniques in river water quality modelling: a research travelogue. Water Supply 21(1), 1–13 (2021) 9. Kouadri, S., Elbeltagi, A., Islam, A.R.M.T., Kateb, S.: Performance of machine learning methods in predicting water quality index based on irregular data set: application on illizi region (algerian southeast). Appl Water Sci. 11(12), 190 (2021) 10. Lap, B.Q., et al.: Predicting water quality index (wqi) by feature selection and machine learning: a case study of an kim hai irrigation system. Ecological Inform. 74 101991 (2023) 11. Leong, W.C., Bahadori, A., Zhang, J., Ahmad, Z.: Prediction of water quality index (WQI) using support vector machine (SVM) and least square-support vector machine (ls-svm). Int. J. River Basin Manage. 19(2), 149–156 (2021) 12. Lumb, A., Sharma, T., Bibeault, J.F., Klawunn, P.: A comparative study of USA and Canadian water quality index models. Water Qual Expo Health 3, 203–216 (2011) 13. MONRE: River flow forecasting through conceptual models part i-a discussion of principles (2019) 14. Nash, J.E., Sutcliffe, J.V.: River flow forecasting through conceptual models part i-a discussion of principles. J. Hydrol. 10(3), 282–290 (1970) 15. Nearing, G.S., et al.: What role does hydrological science play in the age of machine learning? Water Resources Res. 57(3), e2020WR028091 (2021) 16. Othman, F., et al.: Efficient river water quality index prediction considering minimal number of inputs variables. Eng. Appl. Comput. Fluid Mech. 14(1), 751–763 (2020) 17. Pan, B.: Application of xgboost algorithm in hourly pm2. 5 concentration prediction. In: IOP Conference Series: Earth and Environmental Science, vol. 113, p. 012127. IOP publishing (2018) 18. Uddin, M.G., Nash, S., Olbert, A.I.: A review of water quality index models and their use for assessing surface water quality. Ecol. Ind. 122, 107218 (2021)
Age-Invariant Face Recognition Based on Self-Supervised Learning
Minh Le Quang(1), Mi Ton Nu Quyen(1), Nguyen Nguyen Lam(1), Trung Nguyen Quoc(1), and Vinh Truong Hoang(2)(B)
(1) FPT University, Ho Chi Minh City, Vietnam
{minhlqse151488,mitnqse150215,nguyennlse151345,trungnq46}@fpt.edu.vn
(2) Ho Chi Minh City Open University, Ho Chi Minh City, Vietnam
[email protected]
Abstract. Face recognition across aging has grown into an extremely prominent and challenging task in the field of facial recognition in recent times. Many researchers have made improvements to this field, but there is still a massive gap to fill in. This study explores the effectiveness of Self-Supervised Learning (SSL), specifically the Bootstrap Your Own Latent (BYOL) technique, to improve age-invariant facial recognition models. The experimental results demonstrate that this method greatly enhances the model’s performance, achieving accuracy gains of more than 5%, even on challenging datasets such as FGNET. These findings highlight the potential of SSL methods such as BYOL in advancing the field of age-invariant facial recognition.

Keywords: Aging model · Self-Supervised Learning · Age-Invariant · Face recognition · Deep learning

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 188, pp. 13–22, 2023. https://doi.org/10.1007/978-3-031-46749-3_2

1 Introduction
The identification of human faces is a tough and critical task in the field of computer vision, because facial recognition has significantly aided the success of many essential occupations, including law enforcement, criminal justice, and national security. However, parameter variations such as photographic angle, expression, brightness, and aging make identifying an individual challenging. Among them, aging is, in reality, a very complicated process. Aging is different for everyone. Furthermore, it is heavily influenced by variables such as regional area, lifestyle, eating behaviors, cosmetics, physical and emotional health, and so on. These variations pose numerous challenges, but handling them is extremely valuable in real applications such as finding missing children, issuing passports, tracing criminals, etc. As a result, biometrics research has focused on facial identification, particularly age-invariant face recognition, in recent years [15]. Self-supervised learning is a machine learning technique [11] that has gained attention in recent years. It involves training a model to learn the internal features of the data without explicit supervision by automatically generating labels.
This approach can improve models significantly, as it has strong self-training capabilities. Age-invariant face recognition is basically divided into two categories: Age-Invariant Face Recognition (AIFR) and Age-Invariant Face Verification (AIFV). We will only consider the first type in this article. Although many researchers have worked on different angles and techniques to improve the accuracy of this system, we recognize that there is still a lot of room to grow with modern technology. Our age-invariant face recognition system in this paper applies self-supervised learning technology and StyleGAN2 in deep learning. Deep learning has become very popular nowadays, but self-supervised learning is a new technology, and this is the main contribution of this paper. Face recognition, face alignment, facial feature extraction, and softmax are the four steps in AIFR. This paper explores the combination of self-supervised learning with AIFR, which is a novel direction in research. Our work investigates the effectiveness of this approach for improving face recognition models, and it has the potential to open up new avenues for future research in this field. The rest of this paper is organized as follows. Section 2 presents the related works. Next, Sect. 3 introduces the proposed method. Finally, the experimental results and conclusion are given in Sects. 4 and 5, respectively.
2 Related Works
This section elucidates the literature review conducted in the concerned field. In order to undertake the research, a thorough examination of numerous relevant works was carried out, with a particular emphasis on two principal areas: age-invariant face recognition and self-supervised learning techniques applied in face recognition.

2.1 Age-Invariant Face Recognition
Initially, in order to obtain a comprehensive understanding of the domain of age-invariant face recognition, we conducted an analysis of the research article entitled “Age Invariant Face Recognition Methods: A Review” [2]. Paper [2] provided us with extensive insight into the approaches utilized within this area. The article mentions many methods, such as generative methods, discriminative methods, and methods using convolutional neural networks (CNNs). Generative methods (aging simulation) include age-invariant face detection methods based on facial geometry recognition [3] and methods using Local Pattern Selection (LPS) [9], a new feature descriptor used in a two-level learning model. The works [4] propose a 2D/3D face aging pattern space that enables the generation of a facial image that matches a target face image prior to recognition, and employ a 2D face aging model to simulate facial aging and achieve face recognition across aging. Discriminative approaches for facial recognition utilize facial components that are classified into two categories: age-invariant factors and aging factors.
To identify these factors, a hidden factor analysis model, described in [6], develops a linear model that separates the two factors into distinct subspaces, followed by the computation of cosine distance between the identity components of gallery and probe samples for facial recognition. However, assuming independence between identity factors and age in [6] is an oversimplification, as the extent of facial changes due to aging can vary considerably among individuals. To overcome this issue, [8] introduces an updated HFA model that accounts for correlated changing factors during facial recognition. Convolutional neural networks (CNNs) have emerged as a highly popular technique for Computer Vision applications, with a notable emphasis on face recognition. In particular, CNNs have been found to be the most optimal technique for face recognition models, including age-invariant face recognition. Researchers have employed various types of CNNs in their studies, resulting in promising outcomes. To address the challenges posed by the nonlinear and smooth transformation of aging in face images, several neural network models have been proposed. For instance, the coupled auto-encoder network (CAN), which bridges two autoencoders with two shallow neural networks, was introduced in [14]. The LFCNN model [13] employs a well-designed CNN model to acquire age-invariant deep features. The AE-CNN model [16] is specifically trained to separate age-related variations from identity-specific traits, while the OE-CNN model [12] disintegrates deep facial models into a pair of orthogonal components, one for age-specific features and the other for authenticity-specific features. In addition to these models, other techniques, such as the usage of Stacked Autoencoder Deep Neural Networks (a form of unsupervised artificial neural network), have been presented in recent publications [1]. Zhao et al. [15] proposed a new deep learning model called the Age-Invariant Model (AIM) for identifying faces in uncontrolled settings. This model has three main innovations: it combines face synthesis and identification across generations to improve performance, it can modify the age of a face while maintaining its identity without requiring paired data or true age information, and it employs novel training strategies to explicitly separate age variation from face representation. Our research also involves the application of convolutional neural networks; however, we have further enhanced the model with the most recent technology in the field - self-supervised learning.

2.2 Self-Supervised Learning in Face Recognition
Self-supervised learning is a recently popularized keyword in the field of machine learning, and it has not been extensively studied in the context of age-invariant face recognition. Nevertheless, it is widely recognized that self-supervised learning has significant potential to be applied in this field. Initially, we investigated the potential of self-supervised learning through some related works in the field of face recognition.
The authors presented the FRGAN, a self-supervised technique that creates frontal face pictures from non-frontal photos by rotating them to their original posture using reconstruction and adversarial losses, in research on pose invariant face recognition [10]. Additionally, the FRGAN utilized a data augmentation technique called Random Swap to enhance performance by swapping significant face areas between the input picture and its reconstructed counterpart in order to create more realistic synthetic images. Furthermore, Lin et al. [11] used SSL to improve the embedding capability of the facial recognition model and maximize the similarities between the embeddings of each picture and its duplicate in both the source and target domains. This was done to reduce the model’s degradation from the original domain to the destination domain.
3 Methods

3.1 Data Augmentation
The dataset employed in the face recognition task is the normalized dataset, specifically FGNet (see Fig. 1). Initially, the original images within the dataset are cropped to extract only the face portion. Next, the background is eliminated and the image is adjusted for optimal quality. The image is then resized to 256 × 256 to facilitate the face detection procedure.
Fig. 1. Some examples of images that have been aligned in the FGNet dataset.
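A minimal sketch of such a detect-align-resize step is shown below, assuming OpenCV, dlib's frontal face detector, and the publicly available 68-landmark shape predictor file (this is the same Dlib 68-point landmark detection mentioned later in Sect. 4.1). The authors' exact cropping and background-removal procedure is not specified, so the eye-levelling alignment here is only one common choice.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Standard dlib 68-landmark model, assumed to be downloaded separately.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def align_face(image_bgr: np.ndarray, size: int = 256) -> np.ndarray:
    """Detect one face, rotate it so the eyes are horizontal, then crop and resize."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    face = detector(gray, 1)[0]                      # assume one face per image
    pts = predictor(gray, face)
    left_eye = np.mean([(pts.part(i).x, pts.part(i).y) for i in range(36, 42)], axis=0)
    right_eye = np.mean([(pts.part(i).x, pts.part(i).y) for i in range(42, 48)], axis=0)
    # Rotation that levels the eyes, centred between them.
    angle = np.degrees(np.arctan2(right_eye[1] - left_eye[1], right_eye[0] - left_eye[0]))
    center = (float((left_eye[0] + right_eye[0]) / 2), float((left_eye[1] + right_eye[1]) / 2))
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image_bgr, M, image_bgr.shape[1::-1])
    crop = rotated[face.top():face.bottom(), face.left():face.right()]
    return cv2.resize(crop, (size, size))
```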
As a preliminary step, we conducted image pre-processing to eliminate any undesired noise and improve image contrast. Subsequently, upon conducting an age analysis of the dataset, we observed the absence of certain age segments or their inadequacy in terms of image availability. In order to enhance the quality of the model, we applied the Custom Structure Preservation (CUSP) [5] technique to incorporate additional images into the dataset (see Fig. 2).
Fig. 2. Example of a woman’s photo after using CUSP in the dataset (inputs at ages 26, 18, and 43, age-progressed to target ages between 20 and 60)
3.2 Age-Invariant Face Recognition Models
The principal objective of the study is to investigate the practical implementation of Self-supervised Learning in an age-invariant face recognition model. To serve as a baseline, a simplistic training model is utilized, which is a pre-trained CNN that has undergone fine-tuning (as shown in Fig. 3). The CNN is used to capture resilient and efficient age-invariant characteristics from the test image, which are then utilized to categorize it as belonging to one of the individuals.
Fig. 3. Workflow of the age-invariant face recognition framework
3.3 Self-Supervised Learning
In this phase of our research, we have implemented an SSL model using Bootstrap Your Own Latent (BYOL) [7] methodology to develop image representations.
BYOL is a self-supervised method that leverages the interaction between two neural networks, called the online network (ON) and the target network (TN). BYOL trains the ON to predict the TN’s representation of the same image under a different augmented view, which facilitates the construction of robust image representations. The main objective of BYOL is to obtain the representation yθ that can then be used for downstream tasks (see Fig. 4). The online network is characterized by a set of weights θ, and the target network by weights ξ. Both networks consist of an encoder and a projector; the online network additionally has a predictor. The target network serves as the estimation target for training the online one, and its parameters ξ are an exponential moving average (EMA) of the online parameters θ.

Fig. 4. BYOL’s architecture.

The BYOL approach aims to minimize the similarity loss between qθ(zθ) and sg(zξ), where θ denotes the trained weights, ξ refers to an exponential moving average of θ, and sg denotes the stop-gradient operation. After the training process is completed, all components except the encoder fθ are removed, and its output is used as the resulting image representation.
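The core of this training step can be summarised in a few lines of PyTorch, as sketched below. Here `online.predict` (encoder, projector, and predictor) and `target.project` (encoder and projector) are assumed wrapper methods standing in for the networks described above; the backbone, augmentations, and the EMA rate tau are placeholders, since the exact settings are not reported in this paper.

```python
import torch
import torch.nn.functional as F

def byol_loss(q_online: torch.Tensor, z_target: torch.Tensor) -> torch.Tensor:
    """Normalized MSE used by BYOL: 2 - 2 * cosine similarity between the online
    prediction q_theta(z_theta) and the stop-gradient target projection sg(z_xi)."""
    q = F.normalize(q_online, dim=-1)
    z = F.normalize(z_target.detach(), dim=-1)   # sg(.): no gradient flows to the target
    return (2.0 - 2.0 * (q * z).sum(dim=-1)).mean()

def byol_step(x, aug1, aug2, online, target, optimizer, tau=0.996):
    """One BYOL step: symmetric loss over two augmented views, then an EMA update
    of the target weights xi <- tau * xi + (1 - tau) * theta."""
    v1, v2 = aug1(x), aug2(x)
    loss = byol_loss(online.predict(v1), target.project(v2)) + \
           byol_loss(online.predict(v2), target.project(v1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t.data.mul_(tau).add_(p_o.data, alpha=1.0 - tau)
    return loss.item()
```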
4 Experimental Results

4.1 Experiment Setup
Our study was conducted on two datasets, namely FGNet and CACD. We apply Dlib 68-point landmark detection to detect landmarks and align the faces in both datasets. This algorithm detects 68 facial landmarks around the eyes, nose, and mouth; the alignment is then applied using the obtained coordinates. We initially conducted experiments on the FGNet dataset and observed that it consisted of only 1002 images, which were divided into 82 classes. However, upon further examination, we found that certain age groups had very few or
no images. Therefore, we followed the steps outlined in Sect. 3.1 to address this issue. We divided the training and test sets into different ratios to perform experiments and determined that the most stable ratio was 7:3. We use the transfer learning technique from a pre-trained ResNet18 that is available via PyTorch. Upon training with this pre-trained convolutional neural network (ResNet18) model, we achieved a baseline accuracy of 51.3% on the test set and 69.1% on the validation set. Moreover, we added the CUSP-augmented images to the training dataset rather than the test set, resulting in an increase in the number of training images from 670 to 2,857, with each class having around 30 images.

Table 1. Dataset after using CUSP

Dataset   Ratio   Original train set   Train set with GAN
FGNet     7:3     670 images           2857 images
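A minimal sketch of the baseline transfer-learning setup described above is given below, assuming the torchvision ResNet18 ImageNet weights and the 82 FGNet identity classes; the optimiser choice and learning rate are not reported in the paper and are placeholders here.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 82                                   # identities in the FGNet split
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)   # replace the classifier head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder hyperparameters

def train_epoch(loader, device="cuda"):
    """One fine-tuning pass over batches of 256x256 aligned face crops."""
    model.to(device).train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()
```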
Then re-training the model achieved an accuracy of 57.9% on the test set. Table 1 shows the number of images in the training set after using CUSP to generate images.

4.2 Experimental Results
Following the baseline accuracy results, we applied self-supervised learning to the model to evaluate its potential improvement over a low-accuracy model. The results revealed that the model exhibited an increase in accuracy on the test set, reaching 59.3% after the application of self-supervised learning (see Fig. 5 and Fig. 6). Moreover, the accuracy of the validation set also increased to 75.3%.
Fig. 5. Comparison of steady-state results on FGNet with different training configurations: (a) FGNET, (b) GAN on FGNET, (c) SSL on FGNET
We applied the same self-supervised learning method to the CACD dataset. Due to the large size of the CACD dataset, which consists of 163,446 images across 2,000 classes, there are ample images available for both training and testing sets. Therefore, there is no need to utilize CUSP to generate additional images.
Fig. 6. Comparison of loss on the train set of the CACD dataset: (a) baseline loss, (b) loss after using SSL.
Fig. 7. Comparison of loss on the validation set of the CACD dataset: (a) baseline loss, (b) loss after using SSL.
On the CACD dataset, the accuracy of the test set increased by 0.7% and the validation set increased by 1% (see Fig. 7). We obtained the results for both experiments, as shown in Table 2.
Table 2. Results on the two datasets

Dataset | Test set (Baseline) | Test set (BYOL) | Validation set (Baseline) | Validation set (BYOL)
FGNet   | 51.3                | 59.3            | 69.1                      | 75.3
CACD    | 77.2                | 77.9            | 75.7                      | 76.7

5 Conclusion
The experimental results demonstrate the potential of self-supervised learning for age-invariant face recognition. The technique is effective even on low-accuracy models and improved the model's performance significantly. Notably, it yielded promising results on the challenging FGNet dataset, with an accuracy increase of 8% on the test set and 6.2% on the validation set (the corresponding gains on the CACD dataset are 0.7% and 1%). We anticipate even more remarkable results when the approach is applied to more powerful models. The findings of this study contribute to the development of age-invariant face recognition models through the integration of self-supervised learning. Moreover, the methods outlined in this research can be adapted and implemented in other related studies to enhance their models.
Detection of Kidney Stone Based on Super Resolution Techniques and YOLOv7 Under Limited Training Samples

Minh Tai Pham Nguyen1, Viet Tuan Le2, Huu Thanh Duong2, and Vinh Truong Hoang2(B)

1 Faculty of Advanced Program, Ho Chi Minh City Open University, Ho Chi Minh City, Vietnam
[email protected]
2 Faculty of Information Technology, Ho Chi Minh City Open University, Ho Chi Minh City, Vietnam
[email protected], [email protected], [email protected]
Abstract. In machine vision, detecting and locating objects have always been challenging tasks, especially in medical imaging, due to the lack of data and the unique features of such images. Several studies applied CNN-based models for the detection task, but the methods were complex or impracticable for real-time use, while others used YOLO models that were lighter but performed poorly. In this study, we combine Super Resolution techniques, including EDSR, FSRCNN, LapSRN, ESPCN, ESRGAN, and GFPGAN, with a re-designed YOLOv7 architecture for kidney stone detection on KUB X-ray images. As a result, our proposed YOLOv7 outperformed the YOLOv7 base version with a sensitivity of 87.6%, precision of 92.2%, F1 score of 89.8%, and mAP50 of 91.2%, and each of the Super Resolution methods improved the model's precision and sensitivity considerably, with the highest precision reaching 97.3% and sensitivity 91.7% on upscaled images compared to the non-upscaled images. Consequently, the re-designed YOLOv7 and Super Resolution methods are proposed to address the detection problem in diagnosing kidney stone disease.

Keywords: YOLOv7 · Super Resolution · Kidney stone detection

1 Introduction
Kidney stone disease, or renal calculi, is one of the oldest and most common diseases known to medicine. It affects people of all ages, genders, and races. A recent study by Zhu Wang et al. [18] indicated that kidney stone cases are increasing worldwide, with 5.8% of Chinese adults affected, meaning about 1 in 17 adults currently has the disease. Generally, the incidence of renal calculi is rising every year.
To address this problem, medical imaging is used as the first step of examination. Many imaging modalities have become common for diagnosis, such as X-ray imaging and computed tomography (CT). Most recent studies work on CT images because of their high quality. However, CT imaging is usually expensive and requires a higher radiation dose than X-ray imaging. For example, in the study of K. Fujii et al. [6], recording a Kidney-Ureter-Bladder (KUB) image with CT required a radiation dose ranging from 8 to 34 mGy, while an X-ray image needed only 2.47 mGy, as indicated in the study of George S. Panayiotakis et al. [10]. Thus, X-ray imaging can be considered the better solution in terms of cost and harm compared to CT. Nevertheless, X-ray imaging tends to produce false positives, and the image quality is often insufficient to detect and classify abnormalities. Thus, we propose several Super Resolution (SR) methods that can be used to enhance the resolution of X-ray images. Moreover, X-ray images can only be interpreted correctly by experienced urologists or radiologists. Consequently, diagnoses from less experienced emergency physicians can be inaccurate, leading to delayed treatment, higher medical costs, or even increased radiation doses. Therefore, we propose an object detection method that can be implemented in a Computer Aided Diagnosis (CAD) system to achieve better performance in detecting small kidney stones and that can be deployed for real-time tasks to assist the diagnosis. The rest of the article is organized as follows: Sect. 2 provides related works, Sect. 3 describes the methodologies used in this research, Sect. 4 presents and evaluates the results, and Sect. 5 concludes the article.
2 Related Works
Several studies have been carried out to find an ideal solution for detecting kidney stones in medical images. Early work aimed to classify whether an image contained abnormalities or not, using machine learning algorithms and deep learning networks, such as the study of Aksakalli et al. [1]. However, the performance did not reach the desired level, so studies using more advanced deep learning models were carried out. For instance, in the work of Daniel C. Elton et al. [5], the authors performed a segmentation task with 3D U-Net [3] before detecting kidney stones. CT images were used as the input for the 3D U-Net to segment kidneys, together with denoising methods. After that, a CNN model was used for classifying kidney stones. The authors concluded that CNN-based models gave better results than models with manually built features. However, their work faced the problems of a large amount of noise and failed detections due to plaques in the image. Another work, by Ahmet Furkan Bayram et al. [2], proposed the state-of-the-art one-stage detector YOLOv7 [14] for detecting kidney disease in CT images. They experimented with both YOLOv7 and YOLOv7-tiny, with results of 84.9% and 70.3% mAP50, respectively, and concluded that the best model had been integrated into an assisted diagnosis system. However, CT imaging is an expensive imaging method, and the deployed model's performance is not significant. In the work
of Yi-Yang Liu et al. [9], CLAHE was applied for X-ray image pre-processing, and they achieved remarkable results with a sensitivity of 96.4%. Nonetheless, they used two models for detection: the first was a Mask R-CNN model used to detect and mask the spine and pelvis bones, removing the regions around the kidneys that lead to false positives, and the second was a CNN-based model applied to detect kidney stones. Thus, their solution was not feasible for real-time detection.
3 Methodology
3.1 Dataset
This private dataset consists of 162 KUB images, with labels provided by expert radiologists and urologists. However, the dataset contained corrupted images, non-KUB images, and a small number of CT images. After selection, only 152 KUB X-ray images were used. Figure 1 shows several sample images from the dataset.
Fig. 1. Several sample images from the dataset.
Because of the limited data, our goal is to evaluate the model with as many samples as possible. Therefore, 100 KUB images were sampled randomly from the dataset for the test set. The training set thus consisted of 52 images, of which 24 images were split off as the validation set. To mitigate the lack of training data, data augmentation techniques were applied, comprising horizontal flipping, rotation by 10 degrees, and cropping. As a result, our training set was augmented up to 240 images, while 24 and 100 images were used for the validation and test sets, respectively. All sets were independent.
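As a rough illustration, these augmentations could be expressed with torchvision as in the sketch below; only the operation types (horizontal flip, 10-degree rotation, cropping) come from the text, while the image size, crop scale, and probabilities are assumptions.

from torchvision import transforms

# Assumed sizes and probabilities; only the operation types follow the text.
train_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),                      # rotate within +/- 10 degrees
    transforms.RandomResizedCrop(size=640, scale=(0.8, 1.0)),   # random cropping
    transforms.ToTensor(),
])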
3.2 Convolutional Block Attention Module
CBAM is an attention mechanism proposed by Woo et al. in [19]. It consists of two main parts: a Channel Attention Module (CAM) and a Spatial Attention Module (SAM). With the help of the SAM, the network focuses on the pixel regions that contain the object.
Fig. 2. Convolutional Block Attention Module (CBAM): channel attention followed by spatial attention.
The CAM, in turn, handles the allocation of attention across the feature-map channels. Allocating attention along these two dimensions enhances the performance of the network. The structure of CBAM is shown in Fig. 2. The CAM processing can be described as follows. The feature map F is used as the input of the module and passed through a global max-pooling layer and a global average-pooling layer, producing two 1 × 1 × C descriptors, which are sent separately through a shared two-layer neural network. The two outputs are summed element-wise and activated with a Sigmoid function to obtain the channel attention map Mc, which is finally multiplied with the feature map F to obtain the refined features. The formula is as follows:

Mc(F) = σ(W1(W0(F^c_avg)) + W1(W0(F^c_max)))   (1)
For the SAM, the output of the CAM is the input feature of the module. First, a global max-pooling and a global average-pooling operation are applied along the channel axis, producing two H × W × 1 characteristic maps. These maps are concatenated, and a convolution layer with a 7 × 7 kernel reduces the number of channels to one. Finally, these features are passed through a Sigmoid function to create the spatial attention map Ms, which is multiplied by the input features to obtain the final features. The formula is described below.

Ms(F) = σ(f^(7×7)([F^s_avg ; F^s_max]))   (2)
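A minimal PyTorch sketch of a CBAM block following Eqs. (1)–(2) is shown below; the reduction ratio of 16 comes from the original CBAM paper and is an assumption here, and the module is an illustration rather than the exact block used in the proposed design.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Eq. (1): Mc(F) = sigma(W1(W0(F_avg)) + W1(W0(F_max))) with a shared two-layer MLP.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    # Eq. (2): Ms(F) = sigma(f7x7([F_avg; F_max])) computed over the channel axis.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()
    def forward(self, x):
        x = x * self.ca(x)        # channel refinement first
        return x * self.sa(x)     # then spatial refinement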
3.3 Single Image Super Resolution Techniques
Detecting small objects is one of the most challenging tasks in object detection, and many techniques have been proposed to address it. Early research applied conventional methods for upscaling image resolution, including nearest-neighbor interpolation and bilinear and bicubic algorithms [12]. However, these methods did not reach the desired results. Therefore, research on using DNNs for resolution upscaling became
one of the topics of interest, giving rise to Super Resolution (SR) techniques. At first, supervised SR models reached remarkable results, such as the Enhanced Deep Super-Resolution network (EDSR) [21], the Laplacian Pyramid Super-Resolution Network (LapSRN) [8], the Fast Super-Resolution Convolutional Neural Network (FSRCNN) [4], and the Efficient Sub-Pixel Convolutional Neural network (ESPCN) [13], with reasonable PSNR. While LapSRN, FSRCNN, and ESPCN were created to optimize generation speed, EDSR was proposed for high-quality generation. However, supervised SR models still have drawbacks; a common problem is producing convincing results at large scale factors. Therefore, GAN-based (Generative Adversarial Network) SR models were introduced to address this problem. Most GAN SR models are built to generate photo-realistic images at 4× or higher upsampling scales. Research on GAN-SR methods supports the observation that an image with a higher PSNR does not necessarily deliver a better perceptual impression. Currently, the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) [17] and the Generative Facial Prior Generative Adversarial Network (GFPGAN) [16] are the two most popular methods, with impressive results. While ESRGAN was introduced to reduce unpleasant artifacts, GFPGAN is another novel GAN-SR method proposed for facial restoration. We adopted all of these SR methods in our experiments for X-ray medical image upscaling.
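In practice, several of these supervised SR models can be applied through OpenCV's dnn_superres module (shipped with opencv-contrib-python); the sketch below assumes a pre-trained FSRCNN ×2 model file has been downloaded separately, and the file and image names are illustrative only.

import cv2

# Requires opencv-contrib-python; the .pb weights must be downloaded beforehand.
sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("FSRCNN_x2.pb")        # illustrative weight file name
sr.setModel("fsrcnn", 2)            # method name and 2x upscale factor

image = cv2.imread("kub_xray.png")  # illustrative input path
upscaled = sr.upsample(image)
cv2.imwrite("kub_xray_x2.png", upscaled)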
3.4 The Proposed Re-Designed YOLOv7 Architecture
YOLOv7 [14] is the latest model of the YOLO series, known as one of the most successful one-stage detectors for object detection. One of the significant improvements of the YOLOv7 architecture is the Extended ELAN (E-ELAN) block, based on the original ELAN, which gives the model more learning capability while still preserving the gradient path. Moreover, the authors apply re-parameterization combined with convolutional layers to obtain the detection outputs, which considerably increases model performance. However, the current YOLOv7 architecture is not well suited to the task of kidney stone detection (see Table 1). Thus, we decided to re-design YOLOv7 to adapt it to the task so that it achieves higher accuracy while maintaining reasonable model parameters and complexity. This study therefore examines two main points: first, the comparison of the re-designed YOLOv7 architecture against the baseline; second, the behavior of each model on X-ray images upscaled with different SR methods. Figure 3 shows our proposed architecture. YOLOv7 uses feature maps at different scales for detection. However, overly large feature maps can cause high computation cost without much performance gain. Therefore, we propose the following changes. First, assuming that the last-scale maps normally have a poor receptive field, a MaxPool with kernel and stride 2 was added right after each ELAN-H block in P3 and P4, before RepConv, to address this problem and reduce the computation cost.
Fig. 3. Re-designed CBAM-YOLOv7
Each of the feature maps in P3 and P4 was thus shrunk to 40 × 40 and 20 × 20, respectively. While the feature map of size 40 is used for detecting small objects, we avoided using MaxPool in P5 because the reduction was unnecessary: a feature map of size 20 is sufficient for large-object detection. Second, a CBAM was added in the last ELAN block of the backbone to help the architecture focus on important features, as this is the phase where object information is extracted. Next, the remaining CBAMs were attached right after each RepConv to assist the detection phase, and the outputs of these CBAMs were fed to a CBM with a 3 × 3 convolution kernel to maintain a large receptive field for detection. Last, the SiLU activation function was replaced with Mish, as this activation function was shown to reach higher performance and lower inference cost in the research of Diganta Misra [11].
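Swapping SiLU for Mish (x · tanh(softplus(x))) can be done generically in PyTorch, as in the sketch below; this is an illustration of the activation replacement only, not the authors' training code.

import torch.nn as nn

def replace_silu_with_mish(module: nn.Module) -> None:
    # Recursively swap every SiLU activation in a model for Mish.
    for name, child in module.named_children():
        if isinstance(child, nn.SiLU):
            setattr(module, name, nn.Mish(inplace=True))
        else:
            replace_silu_with_mish(child)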
4 Experimental Results
SimAM, proposed by Yang et al. in [20], is an attention module that adaptively assigns 3D weights to create attention for feature maps without adding parameters. To compare various models with ours, we evaluated YOLOv7, YOLOv7-tiny, and the re-designed YOLOv7 with simAM, CA, and ECA attached. The idea of this module is based on neuroscience theory: it finds the important neurons and assigns them higher priority using an energy function that separates neurons linearly. Therefore, the module avoids excessive heuristics and tuning work. The ECA module was proposed by Qilong Wang et al. [15] to avoid dimensionality reduction by using a local cross-channel interaction strategy. This mechanism helps the model focus on object regions without the negative impacts caused by channel attention learning. The ECA module consists of a 1D convolution whose kernel size k is determined adaptively by capturing local cross-channel interaction of each channel via its k neighbors.
Hence, ECA uses very few parameters and is a lightweight attention module with a significant impact. Coordinate Attention (CA), proposed by Hou et al. [7], comprises two parts: coordinate information embedding and coordinate attention generation. First, two pooling kernels are used to extract features horizontally and vertically. Second, a 1 × 1 convolutional transformation function concentrates the information from the spatial features extracted by the two pooling kernels. Next, the module splits the resulting tensor into two separate tensors and captures the attention vectors. In this way, the receptive field is significantly increased.
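To make the comparison of these lightweight attention modules concrete, a minimal ECA block might look like the sketch below; the γ and b constants follow the original ECA-Net paper and are assumptions here.

import math
import torch.nn as nn

class ECA(nn.Module):
    # Efficient Channel Attention: a 1D convolution over channel descriptors,
    # with the kernel size k chosen adaptively from the channel count.
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1               # force an odd kernel size
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()
    def forward(self, x):
        y = self.pool(x)                                # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2))  # 1D conv across channels
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))
        return x * y                                    # re-weight the input channels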
4.1 Environment and Hyperparameters Setup
All experiments were performed on Google Colab, using a Tesla T4 GPU to accelerate the training phase. For simAM, the lambda was set to 1e−3. We set the following hyperparameters for all models: the input size was rescaled to 640 × 640, the batch size was 16, the weight decay was 5 × 10−4, the momentum was 0.937, and the learning rate was 0.01.
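For reference, these settings can be collected in a single configuration; the dictionary below simply restates the values from the text, and the key names are illustrative rather than the exact names used by the YOLOv7 training scripts.

# Training configuration reported in the text; key names are illustrative.
train_config = {
    "img_size": 640,        # input rescaled to 640 x 640
    "batch_size": 16,
    "weight_decay": 5e-4,
    "momentum": 0.937,
    "lr0": 0.01,            # initial learning rate
    "simam_lambda": 1e-3,   # lambda used by the simAM module
    "device": "cuda",       # Tesla T4 on Google Colab
}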
4.2 Evaluation Results
Table 1 below shows the results of the models on the original (non-upscaled) X-ray images. Precision, Sensitivity, F1 Score, and mAP50 are used to evaluate model performance. The annotation R denotes the re-designed architecture of YOLOv7.

Table 1. Object detection network performance on KUB X-ray images.

Model           | Precision | Sensitivity | F1 Score | mAP50 | Parameters | GFLOPS
YOLOv7          | 72.4      | 63.6        | 67.7     | 68.8  | 37.1M      | 105.1
YOLOv7-tiny     | 82.6      | 66.9        | 73.9     | 77.3  | 6.0M       | 13.0
simAM-YOLOv7-R  | 83.6      | 76.0        | 79.6     | 80.0  | 36.8M      | 98.4
ECA-YOLOv7-R    | 89.4      | 77.7        | 83.1     | 83.9  | 36.8M      | 98.4
CA-YOLOv7-R     | 90.5      | 78.5        | 84.0     | 85.0  | 36.9M      | 98.5
CBAM-YOLOv7-R   | 92.2      | 87.6        | 89.8     | 91.2  | 37.0M      | 98.7
As the table indicates, the YOLOv7 baseline model has the lowest scores, with only 68.8% mAP50 and 63.6% sensitivity compared to the other models. Even though the YOLOv7-tiny version achieves more reasonable results, with 77.3% mAP50 and 66.9% sensitivity, it is still not the best solution for the task, so these figures show that the current architectures are not a great fit for kidney stone detection. In contrast, the re-designed YOLOv7 variants with attention modules attached reach higher performance with more reasonable parameters and GFLOPS. Our re-designed YOLOv7 with CBAM is the best solution, as it reaches
the highest scores, with 91.2% mAP50 and 87.6% sensitivity. Therefore, the proposed architectures not only allow the model to use fewer parameters and lower GFLOPS but also greatly improve performance. These results support the idea that a large feature map is not the only factor affecting model performance; the way the model extracts information from objects is the crucial one. Thus, multi-scale feature maps may not affect the outcome much without a corresponding receptive field at each scale. Next, we assess the re-designed and baseline models on SR-upscaled images. SR methods increase the number of pixels significantly. This study uses a 2× upscale factor: the original image size is 1213 × 733, so the upscaled image size is 2426 × 1466. Table 2 shows the results of both models with each SR method.
Fig. 4. Several results of CBAM-YOLOv7-R on upscaled X-ray image
The table shows that all SR methods enable both models to reach better performance. Our proposed YOLOv7 with EDSR and GFPGAN upscaling reaches the highest precision of 97.3%, FSRCNN and ESRGAN give the highest sensitivity of 91.7%, and each SR image needs 18.3 ms for inference. For the YOLOv7 baseline, while the precision on SR X-ray images is significantly higher than on normal X-ray images, the sensitivity score shows a smaller change. Hence, SR upscaling of X-ray images can be used to increase model precision rather than sensitivity. Thus, both GAN-based and supervised SR methods are suitable for improving X-ray resolution quality to enhance model performance, especially the model's confidence in object classification. Figure 4 shows some predicted samples of our model.
Table 2. CBAM-YOLOv7-R and YOLOv7 performance on the upscaled dataset.

Model          | SR Method   | Precision     | Sensitivity  | F1 Score     | mAP50
CBAM-YOLOv7-R  | EDSR [21]   | 97.3 (↑ 5.1)  | 90.8 (↑ 3.2) | 93.9 (↑ 4.1) | 93.6 (↑ 2.4)
CBAM-YOLOv7-R  | FSRCNN [4]  | 94.8 (↑ 2.6)  | 91.7 (↑ 4.1) | 93.2 (↑ 3.4) | 93.3 (↑ 2.1)
CBAM-YOLOv7-R  | ESPCN [13]  | 95.6 (↑ 3.4)  | 90.8 (↑ 3.2) | 93.1 (↑ 3.3) | 93.4 (↑ 2.2)
CBAM-YOLOv7-R  | LapSRN [8]  | 96.5 (↑ 4.3)  | 90.8 (↑ 3.2) | 93.5 (↑ 3.7) | 93.6 (↑ 2.4)
CBAM-YOLOv7-R  | ESRGAN [17] | 96.5 (↑ 4.3)  | 91.7 (↑ 4.1) | 94.0 (↑ 4.2) | 95.0 (↑ 3.8)
CBAM-YOLOv7-R  | GFPGAN [16] | 97.3 (↑ 5.1)  | 90.8 (↑ 3.2) | 93.9 (↑ 4.1) | 94.0 (↑ 2.8)
YOLOv7         | EDSR [21]   | 87.7 (↑ 15.3) | 65.8 (↑ 2.2) | 75.1 (↑ 7.4) | 75.2 (↑ 6.4)
YOLOv7         | FSRCNN [4]  | 87.8 (↑ 15.4) | 65.8 (↑ 2.2) | 75.2 (↑ 7.5) | 74.8 (↑ 6.0)
YOLOv7         | ESPCN [13]  | 82.8 (↑ 10.4) | 69.2 (↑ 5.6) | 75.3 (↑ 7.6) | 75.2 (↑ 6.4)
YOLOv7         | LapSRN [8]  | 87.6 (↑ 15.2) | 65.8 (↑ 2.2) | 75.1 (↑ 7.4) | 75.1 (↑ 6.3)
YOLOv7         | ESRGAN [17] | 88.4 (↑ 16.0) | 64.2 (↑ 0.6) | 74.3 (↑ 6.6) | 75.3 (↑ 6.5)
YOLOv7         | GFPGAN [16] | 86.3 (↑ 13.9) | 68.3 (↑ 4.7) | 76.2 (↑ 8.5) | 75.9 (↑ 7.1)

5 Conclusion
Object detection tasks in the medical field are becoming more popular. In this research, the YOLOv7 architecture was re-designed to adapt it to the kidney stone detection task. As a result, the new model maintained a reasonable number of parameters and reached lower GFLOPS while still outperforming the current architecture. Nonetheless, there is potential for research aimed at further improving the YOLO model's parameters and complexity, as this study only achieved a slight change in these figures. Moreover, each SR method was used to evaluate the influence of upscaled images on model performance. However, the models were only evaluated at a 2× upscale factor, so in future work we aim to expand the upscale factors to make the evaluation more general. This research contributes another perspective on the YOLO architecture and Super Resolution methods to assist kidney stone detection and improve the quality of X-ray images.
References
1. Aksakalli, I., Kaçdioğlu, S., Hanay, Y.S.: Kidney X-ray images classification using machine learning and deep learning methods. Balkan Journal of Electrical and Computer Engineering 9(2), 144–151 (2021)
2. Bayram, A.F., Gurkan, C., Budak, A., Karataş, H.: A detection and prediction model based on deep learning assisted by explainable artificial intelligence for kidney diseases. Avrupa Bilim ve Teknoloji Dergisi 40, 67–74 (2022)
3. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016: 19th International Conference, Athens, Greece, October 17–21, 2016, Proceedings, Part II, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_49
4. Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part II, pp. 391–407. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_25
5. Elton, D.C., Turkbey, E.B., Pickhardt, P.J., Summers, R.M.: A deep learning system for automated kidney stone detection and volumetric segmentation on noncontrast CT scans. Med. Phys. 49(4), 2545–2554 (2022)
6. Fujii, K., Aoyama, T., Koyama, S., Kawaura, C.: Comparative evaluation of organ and effective doses for paediatric patients with those for adults in chest and abdominal CT examinations. Br. J. Radiol. 80(956), 657–667 (2007)
7. Hou, Q., Zhou, D., Feng, J.: Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13713–13722 (2021)
8. Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H.: Deep Laplacian pyramid networks for fast and accurate super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 624–632 (2017)
9. Liu, Y.Y., Huang, Z.H., Huang, K.W.: Deep learning model for computer-aided diagnosis of urolithiasis detection from kidney-ureter-bladder images. Bioengineering 9(12), 811 (2022)
10. Metaxas, V.I., Messaris, G.A., Lekatou, A.N., Petsas, T.G., Panayiotakis, G.S.: Patient doses in common diagnostic X-ray examinations. Radiat. Prot. Dosimetry 184(1), 12–27 (2019)
11. Misra, D.: Mish: a self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681 (2019)
12. Parsania, P.S., Virparia, P.V.: A comparative analysis of image interpolation algorithms. Int. J. Adv. Res. Comput. Commun. Eng. 5(1), 29–34 (2016)
13. Shi, W., et al.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883 (2016)
14. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696 (2022)
15. Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q.: ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11534–11542 (2020)
16. Wang, X., Li, Y., Zhang, H., Shan, Y.: Towards real-world blind face restoration with generative facial prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9168–9178 (2021)
17. Wang, X., et al.: ESRGAN: enhanced super-resolution generative adversarial networks. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018)
18. Wang, Z., Zhang, Y., Zhang, J., Deng, Q., Liang, H.: Recent advances on the mechanisms of kidney stone formation. Int. J. Mol. Med. 48(2), 1–10 (2021)
19. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
20. Yang, L., Zhang, R.Y., Li, L., Xie, X.: Simam: A simple, parameter-free attention module for convolutional neural networks. In: International Conference on Machine Learning, pp. 11863–11874. PMLR (2021) 21. Yu, J., et al.: Wide activation for efficient and accurate image super-resolution. arXiv preprint arXiv:1808.08718 (2018)
Hardware-Based Lane Detection System Architecture for Autonomous Vehicles

Duc Khai Lam1,2(B), Pham Thien Long Dinh1,2, and Thi Ngoc Diem Nguyen1,2

1 University of Information Technology, Ho Chi Minh City, Vietnam
[email protected], {18521021,18520597}@gm.uit.edu.vn
2 Vietnam National University, Ho Chi Minh City, Vietnam
Abstract. The Hough Transform (HT) algorithm is a method for extracting straight lines from an edge image. In the Hough Transform, the parameters of edge pixels, i.e., points in an image with sharp intensity changes, are treated as “votes”. These votes are accumulated over the Hough Space (HS) variables (ρ, θ) to find the maxima, in other words, sets of pixels that share the same θ and the same distance ρ. However, this algorithm requires a huge amount of memory and has high computational complexity. In this paper, we propose an HT architecture that uses a Look-Up Table (LUT) to store trigonometric values and uses the orientation θ computed by the Sobel Edge Detection algorithm instead of sweeping over small angle increments as in the standard HT. This reduces the processing time per image frame so that the design can be applied to real-time processing. Our design runs at 170 MHz, and the processing time per 1024 × 1024 frame is 6.17 ms with an accuracy of 94%. The design is synthesized on the FPGA Virtex-7 VC707.
Keywords: Traffic Lane Detection · Hough Transform · Hardware · FPGA

1 Introduction
Lane detection is a task of particular significance in image processing and computer vision. It is applied in industry as a vehicle indication aid or in Advanced Driver Assistance Systems (ADAS). Although many alternative algorithms have been used for lane recognition, the HT method has found widespread use because it consistently handles issues such as non-contiguous lines, the appearance of numerous broken lines, and the impact of traffic interference caused by the environment. However, due to the numerous trigonometric calculations and the substantial memory requirements, this technique has a high computational cost and operates slowly in real-world situations. Implementing a lane detection system for driverless vehicles based on the HT method is thus still challenging.
In this paper, we propose a straight-lane detection system that combines well-known image pre-processing algorithms, such as gray-scale conversion and Sobel Edge Detection, with a main processing algorithm based on the standard HT but adapted to be more suitable for FPGA implementation, in order to achieve a real-time lane detection system.
2 Related Work
In recent years, the operating mechanism and voting procedure of the Hough Transform have been the subject of numerous proposed upgrades and optimizations to make it more suitable for real-time applications. A hardware design for HT implementation on a Field Programmable Gate Array (FPGA) with a parallelized voting technique was published in [2]. The Hough space in parametric form (ρ, θ), a 2-dimensional accumulator array used to measure the strength of each line through a voting process, is mapped onto a 1-dimensional array with regular increments of θ. The Hough Space is then split into several parallel parts, so both the voting process for determining a straight line and the computation of (ρ, θ) for the edge pixels can be executed in parallel. Furthermore, a synchronized initialization of the Hough space speeds up straight-line identification and makes XGA video processing viable. The average processing speed of that HT implementation is 5.4 ms per XGA frame at a working frequency of 200 MHz. Similarly, the authors of [3] provided an HT hardware architecture for road-lane detection with a parallelized voting process, a local-maximum algorithm, and an FPGA prototype implementation. The overall design is parallelized based on θ-value discretization in the Hough space: edge detection in the original video frames, computation of the characteristic edge-pixel values (ρ, θ) in Hough Space, and a voting procedure for each (ρ, θ) value with parallel local-maximum-based peak-voting-point extraction in HS to determine the detected straight lines. An average detection speed of 135 frames/s for VGA 640 × 480 frames was achieved at a 50 MHz working frequency. In [4], the HS and lane orientation were built using a (Y-intercept, θ) parameterization, which also simplifies the Inverse Hough Transform (IHT) operation and shrinks the size of the accumulator. The efficiency of the proposed architecture is demonstrated by a software-hardware co-simulation on a Xilinx Virtex-5 ML505 platform. The architecture of [4] allowed a processing time per frame of 1.47 ms for an image size of 640 × 480 pixels at a 200 MHz working frequency. The Angular Regions - Line Hough Transform (AR-LHT), reported in [5], is a memory-efficient approach for the detection of lines in images. It was shown that memory usage can be reduced by exploiting the small peak dispersion of the Hough Parameter Space (HPS). Two distinct, smaller memories (a 1-bit Region Bitmap (RBM) and a reduced-size HPS) were used to significantly lower memory usage compared to the standard LHT. The RBM was used to determine the precise orientation of peaks in the optimized HPS after voting was finished. Results showed that, for an image of 1024 × 1024 pixels, around 48% less memory was used than with the standard LHT. A single image could be processed by the FPGA architecture in 9.03 ms.
Fig. 1. Proposed hardware design architecture
In [6], the authors showed the implementation of a real-time monocular vision-based lane departure warning system. To deal with varying lighting conditions, they used algorithms such as Otsu's thresholding method and the Hough Transform. For lane detection, vanishing-point detection is crucial for identifying the Region of Interest and minimizing computational complexity. The main contribution of [7] was an efficient implementation of the gradient-based Hough Transform for straight-line detection using a Xilinx Virtex-7 FPGA with embedded Digital Signal Processing (DSP) slices and block RAMs. The architecture ran at a 260.061 MHz working frequency and required 2n + (√2 + 2)n + 232 clock cycles for an n × n gray-scale image. The authors of [8] built a new processing algorithm comprising Gaussian blur, image graying, a DLD-threshold (Dark-Light-Dark threshold) algorithm, correlation-filter edge extraction, and the Hough Transform. This processing algorithm handles the situation where the lane contains arrows, text, and other signs. The maximum recognition rate in its verification was 97.2%. In [9], the authors presented a simple and stream-friendly line detection algorithm based on the line segment detector (LSD). This system maintained 60 fps throughput for VGA (640 × 480) images on the PYNQ-Z1 board with a Xilinx XC7Z020-1CLG400C FPGA. However, the improvements and optimizations of the HT algorithm in these studies still cannot completely satisfy the demand for high operation speed in real-time applications. Also, all the reported solutions mainly optimize the HT itself. Thus, in this paper, we propose a full pipeline of image processing algorithms starting from the original video, while improving the execution speed and the processed frame resolution. A lane detection and simulation system was designed and simulated in Matlab Simulink combined with the Xilinx System Generator [10] library to check the system's functionality. The lane detector system can be implemented on the FPGA through the PCIe interface.
3 Proposed Hough Transform Hardware Design Architecture
As shown in Fig. 1, our proposed hardware design architecture has four modules: a Masking module, a Gray Scale module, a Sobel Edge Detection module, and a Hough Transform module. The Hough Transform is the key module of this system, since it mainly determines the processing time. The key feature of our lane detection is the Hough Transform operator applied to the output of an edge detection operator such as Sobel to detect straight lines. Figure 2 shows the geometric diagram of the Hough Transform.
Fig. 2. Hough transform from 2D space to Hough space
In geometric space, the (x, y) coordinate can represent an edge pixel. The following equation can be used to define any straight line: y = mx + b
(1)
where m is the slope of the line and b is the y-intercept, i.e., the y coordinate of the point where the line crosses the y-axis. However, for vertical lines the slope is undefined, so lines are instead expressed in polar coordinates using the following equation: ρ = x cos θ + y sin θ (0° ≤ θ ≤ 179°)
(2)
Each straight line can be represented by a single point (ρ, θ) in Hough's parameter space. Plotting Eq. (2) for the various possible lines that pass through a Cartesian point (x, y) produces a curve in the parameter space. To identify the likely lane boundary, the Hough Transform employs a voting scheme. These operations are carried out in tandem on the right half of the image to identify the right lane boundary and on the left half of the image to find the left lane boundary. The Hough Parameter Space (HPS) is represented by a pre-allocated accumulator array in memory, A(ρ, θ). The edge pixels in the image are processed using Eq. (2) over a discrete range of values [0 : Δθ : maxθ], where Δθ = 1 denotes the discretization step and maxθ = 179 denotes the highest value of θ. A vote is applied to the associated location in the HPS by use
of the generated parameters, which create an address. The HPS is checked for peaks when the voting is over. As these relate to linear elements in the image, parameter coordinates with a sizable number of votes are noted.
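A compact software model of this voting scheme (a NumPy sketch, not the hardware design itself) can make the accumulator idea concrete; the array layout and rounding below are illustrative choices.

import numpy as np

def hough_vote(edge_mask, d_theta=1):
    # Accumulate votes A(rho, theta) for every edge pixel, theta in [0, 179] degrees.
    h, w = edge_mask.shape
    thetas = np.deg2rad(np.arange(0, 180, d_theta))
    max_rho = int(np.ceil(np.hypot(h, w)))               # upper bound on |rho|
    acc = np.zeros((2 * max_rho, len(thetas)), dtype=np.int32)
    ys, xs = np.nonzero(edge_mask)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + max_rho, np.arange(len(thetas))] += 1  # one vote per (rho, theta)
    return acc, max_rho

# Peak extraction: the (rho, theta) pair with the most votes describes the strongest line.
# acc, max_rho = hough_vote(edges)
# rho_idx, theta_idx = np.unravel_index(acc.argmax(), acc.shape)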
Fig. 3. Hough Transform block diagram
The remainder of this section describes our Hough Transform architecture and the functionality of each block in Fig. 3. Coordinate Counter: to calculate ρ for each segmented point, the coordinates (x, y) of an edge pixel must be selected. We use counter blocks in this module to select the coordinates. Calculate rho: the transformation rule from (x, y) space to (ρ, θ) space, used in the HT for lane detection, is defined above in Eq. (2). Here, θ represents the angle between the x-axis and the vector orthogonal to the straight line, and ρ is the shortest distance between the origin and the straight line. As a result, all sinusoidal curves specified by Eq. (2) for points on the same straight line in Cartesian space intersect at the same point in (ρ, θ) space. Figure 4 (left) shows the Calculate ρ module. We use two LUTs to store the values of sin(θ) and cos(θ), two multipliers, and one adder to implement Eq. (2). These LUTs employ a fixed-point format and sample θ every 1° in this implementation. Generate Address HPS: to avoid potential read-write conflicts at the same address in the HPS, the 2-dimensional Hough space (ρ, θ) is mapped onto 1-dimensional memory blocks.
Fig. 4. Calculate ρ and Generate HPS Address
Fig. 5. HPS Memory with Voting process odd (left) and Voting process even (right)
This mapping creates a new address, as illustrated in Fig. 4 (middle). The new address is generated according to the following formula:

addr_hps = (ρ + ρ_offset) + θ × max_ρ × 2   (3)
where max_ρ is the maximum distance that ρ can reach, which is determined by the diagonal of the input image frame. With the origin at the center of a height × width image, the maximum value of ρ is √((height/2)² + (width/2)²). Initial Address: generates an address in the HPS space and resets the initial vote value of this address in the HPS memory to zero before the voting process starts. HPS Memory: controls the voting process to select the (ρ, θ) value with the largest vote. The HPS Memory is divided into two parts, one for the left region and one for the right region, as in Fig. 4 (right). These parts control the parallel voting process to select the (ρ, θ) with the largest vote. A particular point is that every address of this memory must be reset after each image frame has been processed, and resetting the memory after every frame would take a long time. That is why we propose to use two memories, through the two modules Voting process odd and Voting process even, to implement an alternating voting process in each region. If the voting for a frame is performed in the Voting process even module, the Voting process odd module simultaneously starts resetting all addresses in the HT space in preparation for the next frame's voting, and vice versa. The hardware implementation for each memory is shown in Fig. 5. A dual-port RAM is employed as the accumulator for the various (ρ, θ) values to provide simultaneous read-write operations. The two ports can execute the accumulation and voting operations simultaneously since they operate in separate modes (read/write) and have a configurable clock frequency. Following the mapping and accumulation of the candidate (ρ, θ) pairs, the largest vote in the accumulator is identified using a simple comparison
that requires little processing effort. A simple comparator makes up the parallel peak-detection unit. When the corresponding vote count reaches a predetermined empirical threshold, the (ρ, θ) value is made available at the output, and the stored vote value at this address is immediately reset to zero. The other RAM then begins voting for the next frame. Peak Detection: after voting, the (ρ, θ) with the largest vote is used to detect the two lines in the left and right regions of the image.
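A behavioural sketch of the address mapping in Eq. (3) together with the odd/even double-buffering idea is given below; it is a software illustration only, since the real design operates on fixed-point values stored in block RAM.

def hps_address(rho, theta, rho_offset, max_rho):
    # Flatten (rho, theta) into a 1-D accumulator address, Eq. (3).
    return (rho + rho_offset) + theta * max_rho * 2

# Two vote memories alternate roles frame by frame (odd/even double buffering):
# while one accumulates votes for the current frame, the other is being cleared.
banks = [{}, {}]

def process_frame(frame_idx, edge_points, rho_offset, max_rho):
    active = banks[frame_idx % 2]
    banks[(frame_idx + 1) % 2].clear()      # the idle bank is reset in parallel
    for rho, theta in edge_points:
        addr = hps_address(rho, theta, rho_offset, max_rho)
        active[addr] = active.get(addr, 0) + 1
    if not active:
        return None
    return max(active.items(), key=lambda kv: kv[1])   # peak (address, votes)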
4 Proposed Design Verification
Table 1. Comparison of our work with different architectures

                            | [2]        | [3]       | [4]       | [5]         | This work
Image Resolution            | 1024 × 768 | 640 × 480 | 640 × 480 | 1024 × 1024 | 1024 × 1024
Fmax (MHz)                  | 200        | 50        | 200       | 145         | 170
Processing Speed (ms/frame) | 5.4        | 7.4       | 1.47      | 9.03        | 6.17
Normalized Speed (ns/pixel) | 6.8        | 24.08     | 4.78      | 8.61        | 5.88
The hardware-based lane detection system for autonomous vehicles is created using the Xilinx System Generator and Matlab Simulink tools. We then generate HDL code files to run synthesis and post-timing implementation. The performance comparison of various HT implementations is shown in Table 1. The frame rate is influenced by the architecture, the hardware platform used, the size and number of symbols in the image, and other factors. Since the compared works employ various resolutions, pre-processing techniques, and platforms, we adopt a normalized speed as a merit factor. Research [2] presents a processing system for 1024 × 768 images at a frequency of 200 MHz with a processing speed of 5.4 ms/frame. The maximum frequency of [3] is only 50 MHz, and its speed is 7.4 ms per 640 × 480 frame. In [4], the authors work on the same image size as [3], but the maximum frequency is 200 MHz and the processing speed is 1.47 ms/frame. The architecture of [5] allows a processing speed of 9.03 ms/frame at a frequency of 145 MHz for 1024 × 1024 images. Our work needs only 6.17 ms to process an image of the same size as [5], at a frequency of 170 MHz. This means that our system is faster than [5] but needs considerably more processing time than [4]. The FPGA verification methodology is shown in Fig. 6. The system consists of two components. The host computer has an AMD Ryzen 3 2200G x64 CPU and 16 GB of DDR4 RAM. The FPGA part includes the board containing the lane detection system (IP LANE DETECTOR), connected via the PCIe interface. It was demonstrated that the Block Design could attain a clock frequency of
Fig. 6. Verification methodology (left). Realtime Host PC-FPGA verification platform (right)
170 MHz on a Virtex-7 VC707 board by synthesizing and implementing it with the Xilinx Vivado Design Suite for various image resolutions. The model is then written to the Virtex-7 VC707 FPGA board through the PCIe link. The Virtex-7 board connected to the host PC's PCIe slot is shown in Fig. 6. The prototype system comprises three primary components around its core circuit, which uses the FPGA to implement lane detection on the input video sequence. The first component, implemented in Python, prepares the frames for image edge extraction; it can read videos and resize them to the input size of the proposed architecture IP. The second component is the FPGA, which receives the data and returns the results in a results table. The third component, also built as a Python script, performs the post-processing, plots the detected lanes, and shows the results on the screen. The proposed implementation is tested on numerous videos with different lighting and road-scene conditions, including road type (urban street, highway), road condition, occlusion, poor lane paint, day, and night. We note that the voting threshold was the same for all images. By visual comparison, we can demonstrate that the implemented architecture successfully recognizes the straight lane lines. With a 1024 × 1024 pixel resolution, the frame rate is approximately 97 fps, i.e., a processing speed of approximately 10.3 ms/frame. The processing time after implementing the system on the FPGA board is larger than in the post-timing implementation because the read and write data transfers are discontinuous and the PCIe interface frequency on the Virtex-7 board is limited to 100 MHz, which causes bottlenecks. This means that the processing time of the system on the real FPGA device cannot reach the 6.17 ms/frame obtained in the post-timing implementation.
To fully evaluate the effectiveness of our system, we generate labels for the frames of our videos. Straight lines are marked and defined by four points on the image, and the data are saved in a .txt file. We then build a Python script to compare the results using the Precision, Recall, and F1-score metrics. Precision is also known as the positive predictive value, and Recall is known as sensitivity in diagnostic binary classification. The F1-score is the harmonic mean of precision and recall (assuming that these two quantities are non-zero), and its values range from 0 to 1. They are defined as follows:

Precision = TP / (TP + FP)   (4)

Recall = TP / (TP + FN)   (5)

F1-score = 2 × (Precision × Recall) / (Precision + Recall)   (6)

where TP (true positive) is an outcome where the model correctly predicts the positive class, TN (true negative) is an outcome where the model correctly predicts the negative class, FP (false positive) is an outcome where the model incorrectly predicts the positive class, and FN (false negative) is an outcome where the model incorrectly predicts the negative class.
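A direct implementation of Eqs. (4)–(6) from the raw counts might look like the following sketch; the example counts in the last line are illustrative.

def precision_recall_f1(tp, fp, fn):
    # Compute the metrics of Eqs. (4)-(6) from raw detection counts.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 97 correctly detected lane lines, 3 false detections, 2 missed lines.
print(precision_recall_f1(tp=97, fp=3, fn=2))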
Table 2. Accuracy of lane detection under different light conditions

Road type      | Number of frames | Precision | Recall  | F-score
Normal         | 1260             | 96.65%    | 98.46%  | 97.55%
Poor condition | 1802             | 97.31%    | 98.12%  | 97.65%
Urban road     | 1810             | 83.61%    | 96.44%  | 88.41%
Total          | 4872             | 92.53%    | 97.67%  | 94.54%
Table 2 shows, using metrics derived from these four outcomes, that the architecture can correctly detect the straight lines under different lighting and road conditions. The average precision, recall, and F1-score are about 92.53%, 97.67%, and 94.54%, respectively. Some frames of videos under different conditions are shown in Fig. 7. In general, the actual processing speed of our architecture is about 27–35 fps.
Fig. 7. Lane detection results in various conditions. Original image (left). Lane detector system results (right)
5 Conclusions
This paper presents the design of a lane detection system for autonomous vehicles. The algorithm uses gray-scale conversion, Sobel Edge Detection, and the Hough Transform. The hardware architecture is built with the Matlab Simulink and Xilinx System Generator tools and then connected to the FPGA through a PCIe interface. The resulting detection accuracy meets the requirements of real-time road-lane detection applications. Our implemented FPGA system can handle a 1024 × 1024 resolution frame in around 6.17 ms at a frequency of 170 MHz. In addition, when implemented on the FPGA board with real-time video, the processing speed is about 97 fps and the accuracy is about 94%. Acknowledgment. This research is funded by Vietnam National University HoChiMinh City (VNU-HCM) under grant number DS2023-26-02.
References 1. Illingworth, J., Kittler, J.: A survey of the Hough transform. Comput. Vision, Graph. Image Process. 44(1), 87–116 (1988) 2. Guan, J., An, F., Zhang, X., Chen, L., Mattausch, H.J.: Real-time straight-line detection for XGA-size videos by Hough transform with parallelized voting procedures. Sensors 17, 270 (2017). https://doi.org/10.3390/s17020270 3. Guan, J., An, F., Zhang, X., Chen, L., Mattausch, H.J.: Energy-efficient hardware implementation of road-lane detection based on hough transform with parallelized voting procedure and local maximum algorithm. IEICE Trans. Inf. Syst. 102-D, 1171–1182 (2019) 4. Hajjouji, I.E., Mars, S., Asrih, Z., Mourabit, A.E.: A novel FPGA implementation of Hough Transform for straight lane detection. Eng. Sci. Technol. Int. J. 23, 274– 280 (2020) 5. Northcote, D., Crockett, L.H., Murray, P.: FPGA implementation of a memoryefficient Hough parameter space for the detection of lines. IEEE Int. Symp. Circ. Syst. (ISCAS) 2018, 1–5 (2018). https://doi.org/10.1109/ISCAS.2018.83511 6. Kortli, Y., Marzougui, M., Bouallegue, B., Bose, J.S.C., Rodrigues, P., Atri, M.: A novel illumination-invariant lane detection system. In: 2017 2nd International Conference on Anti-Cyber Crimes (ICACC), pp. 166–171 (2017). https://doi.org/ 10.1109/Anti-Cybercrime.2017.7905284. 7. Zhou, X., Ito, Y., Nakano, K.: An Efficient Implementation of the Gradient-Based Hough Transform Using DSP Slices and Block RAMs on the FPGA, pp. 762-770 (2014). https://doi.org/10.1109/IPDPSW.2014.88 8. Zhang, Z., Ma, X.: Lane recognition algorithm using the Hough transform based on complicated conditions. J. Comput. Commun. 7, 65–75 (2019). https://doi.org/ 10.4236/jcc.2019.711005 9. Manabe, T., et al.: Autonomous vehicle driving using the stream-based real-time hardware line detector. Int. Conf. Field-Programm. Technol. (ICFPT) 2019, 461– 464 (2019). https://doi.org/10.1109/ICFPT47387.2019.00093 10. Vivado Design Suite. Reference Guide. Model-Based DSP Design Using System. Generator. UG958 (v2018.1) April 4 2018
Video Classification Based on the Behaviors of Children in Pre-school Through Surveillance Cameras

Tran Gia The Nguyen, Pham Phuc Tinh Do, Dinh Duy Ngoc Cao, Huu Minh Tam Nguyen, Huynh Truong Ngo, and Trong-Hop Do(B)

University of Information Technology, VNU-HCM, Vietnam
{20521940,20522020,20521661,20521871,20522085}@gm.uit.edu.vn, [email protected]
Abstract. In preschool, children are active and curious about the world around them. Their tendency to engage in unusual or dangerous behaviors can pose a significant risk to their safety. Constantly monitoring surveillance camera footage and analyzing it to determine whether any abnormal behavior is occurring requires considerable attention and effort from human observers. Therefore, applying technology to monitor the abnormal behavior of children is crucial in preschool education settings. With the development of technology, the problem of classifying video through surveillance cameras can be addressed. However, there is still a shortage of datasets for this task with video data from preschool environments, owing to difficulties in collecting data and potential violations of children's privacy and safety. Therefore, in this paper, we propose Behaviors of Children in Preschool (BCiPS), a new dataset for the above problem. BCiPS consists of 4268 videos with lengths ranging from 3 to 6 seconds. We evaluate several machine learning and deep learning models on BCiPS. The CNN-LSTM model achieves the highest performance, over 75%, on four performance metrics: accuracy, recall, precision, and F1-score. Additionally, we analyze cases where the models fail to identify abnormal behavior to determine the reasons for these failures, identify weaknesses in the dataset, and improve its quality to create a reliable tool for building an effective warning system in real-life situations.
Keywords: Video Classification · BCiPS · Pre-school · Deep Learning · Surveillance Cameras

1 Introduction
The purpose of action classification in videos is to determine what is happening in the video. Human activities can be classified into various categories, including human-human interaction and human-object interaction. This classification is based on human actions specified by their gestures, poses, etc. Human action
recognition is challenging due to variations in motion, illumination, partial occlusion of humans, viewpoint, and the anthropometry of the people involved in the various interactions. There is also the issue of a person's style when performing a gesture, not only in its timing but also in how the gesture is performed. In recent years, advancements in neural networks and deep learning have provided effective methods for many tasks, including video action classification. With the emergence of pre-trained models, action classification has also improved significantly. However, only a few datasets exist in the school domain, and especially in preschool, because personal privacy and child protection are sensitive issues affecting many aspects of the lives of those recorded by the camera. Understanding this situation, we have created the BCiPS dataset, which includes videos from the preschool domain capturing the activities of children in preschool. BCiPS is a valuable resource for researchers, enabling them to conduct analyses and develop models to identify children's abnormal behaviors. Regarding legal matters, we contacted the relevant authorities and obtained permission to use videos from the preschool surveillance cameras. In this study, we therefore address the task of video classification based on the behaviors of children in preschool observed through surveillance cameras. The input of the task is a video, and the output is the classification of the human actions as normal or abnormal. After completing the experiments and research, we achieved two results:
• We defined the problem and provided guidelines for creating the BCiPS dataset, which consists of 4,268 videos ranging from 3 to 6 s in length. BCiPS is one of the first datasets containing videos extracted from surveillance cameras in the preschool domain.
• We experimented with deep learning models for the video classification task on the BCiPS dataset and evaluated their performance. The best-performing model was CNN+LSTM, achieving an accuracy of 75.88%. Additionally, we analyzed error cases to identify challenging scenarios, which can help future research avoid similar errors and improve model performance for real-world applications.
The remaining parts of this paper are organized as follows. Section 2 introduces previous datasets for classifying human actions in videos. Section 3 presents the BCiPS dataset. Section 4 describes the baseline models used in the study. Section 5 reports the performance of the models. Section 6 presents the achieved results and future work.
2 Related Works
Many datasets have been published for human action recognition in videos. These are large-scale datasets with many labels and various topics in various fields. Some examples of these datasets include HMDB-51: this dataset spans 51 action classes and contains 6,766 clips extracted from 3,312 videos. The UCF101 [12] dataset contains 101 action classes grouped into five types of actions and a total of 13,320 clips extracted from 2,500 videos. The Kinetics Human Action Video Dataset [5] has 400 action classes and 306,245 clips extracted from 306,245 videos. NTU RGB+D [10] is a human action recognition dataset in a school domain; it includes 60 action classes, divided into three groups: 40 daily actions, nine health-related actions, and 11 mutual actions. Kinetics-700 [2] contains over 650,000 videos and 700 action classes, with an average length of 10 s per video; the data domain includes sports actions, movies, online videos, etc., collected from YouTube and from sports and movie websites. The Something-Something dataset [4] contains over 1,000,000 videos with 174 action classes.
3 The BCiPS Dataset
3.1 Dataset Creation
Phase 1. Data Collection and Pre-processing: Data Collection: We describe creating the BCiPS dataset for the task of human action classification in videos. The dataset was created by collecting video footage from two surveillance cameras in two preschool classrooms. The cameras recorded the daily activities of students during various times of the day, including studying, entertaining, and napping. Careful consideration was given to ensuring the privacy and safety of the children. Data Pre-Processing: Video data were collected in .dav format and then converted to .mp4 format using the web tool 123APPS (https://123apps.com/). This format is widely used and easy to work with in various tools. As the study focuses on human actions, we removed from the videos any periods when no students were in the classroom. Subsequently, we used the FFmpeg library from Python to split the videos into segments of 3–6 seconds (a short sketch of this splitting step is given after the labeling guidelines below).
Phase 2. Guidelines and Agreement: Guidelines: The purpose of the task is to classify whether the input video contains normal or abnormal actions. Therefore, the output will be one of two labels: "Normal" or "Abnormal". We define the labels as follows:
• Abnormal: Videos contain unusual and dangerous actions for children, such as fighting (children physically impact each other or play excessively, causing harm to other children), falling (falling on the ground suddenly), chasing (more than two students run fast, without a certain trajectory, making noise and possibly causing an accident), carrying heavy or oversized objects (tables, chairs, beds, or similarly large-sized items can cause injury to children if they trip or if heavy objects fall on them), and other abnormal activities (other less common cases can still pose dangers to children). These actions can be dangerous, leading to injury and accidents or affecting the children's mental health.
• Normal: Videos contain actions that are not dangerous in the observed environment, for example walking, talking, eating, playing, or studying, performed normally and regularly in a school or preschool domain. For videos in which the children are not sleeping, we propose watching the actions in the video and evaluating whether they fall within the range of normal actions. If the activities in the video do not belong to the list of abnormal actions and do not harm children, we label the video as "Normal".
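A minimal sketch of the splitting step described in Phase 1 is given below. It shells out to the FFmpeg command-line tool from Python; the folder names and the 5-second segment length are placeholders for illustration, not the authors' exact settings.

```python
import subprocess
from pathlib import Path

def split_video(src: Path, out_dir: Path, seg_len: int = 5) -> None:
    """Split one .mp4 recording into fixed-length segments with ffmpeg."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # The segment muxer cuts the input into consecutive clips of seg_len seconds;
    # "-c copy" avoids re-encoding, which keeps the operation fast.
    subprocess.run(
        ["ffmpeg", "-i", str(src),
         "-c", "copy", "-map", "0",
         "-f", "segment", "-segment_time", str(seg_len),
         "-reset_timestamps", "1",
         str(out_dir / f"{src.stem}_%04d.mp4")],
        check=True,
    )

if __name__ == "__main__":
    # "converted_mp4" is a hypothetical folder holding the .mp4 conversions.
    for video in Path("converted_mp4").glob("*.mp4"):
        split_video(video, Path("segments"), seg_len=5)
```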
Fig. 1. Three frames of three videos labeled as ’Abnormal’.
Regarding the frame captured from CAM15, it shows a child carrying a heavy object (a chair) while another child is standing on the chair. This could lead to a fall and harm the children. As for the middle frame (left panel of CAM16), two children are chasing each other, which could also lead to falling and cause danger. In the frame on the right panel of CAM16, a girl is pulling a shirt and physically impacting someone else in the frame. Therefore, the three videos are labeled as "Abnormal".
Annotators Agreement
To ensure the quality and objectivity of the dataset, we hired five annotators to label the data, and these annotators were provided with labeling guidelines. Each annotator independently labeled the same 150 videos to assess their labeling consistency and determine the official labels. We measured the inter-annotator agreement using Cohen's Kappa index and the average annotation agreement [3]. According to the ranking table for annotation agreement in classification data [8], Table 1 shows that the average agreement among pairs is 0.68, which is considered good as it falls within the range of 0.6 to 0.81. In addition to calculating the annotation agreement for each pair using Cohen's Kappa index, we also calculated the agreement among all five annotators using Krippendorff's alpha [7]. The agreement was 67.55%, which is higher than the acceptable value (66.7%) for Krippendorff's alpha. Therefore, after labeling the data according to the guidelines in Sect. 3.1, our agreement was higher than the minimum acceptable result for Krippendorff's alpha and achieved good agreement according to Cohen's Kappa index. As a result, we can officially label the collected data.
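For illustration, pairwise Cohen's Kappa and its average can be computed as in the sketch below; the annotator labels shown are toy values, not the study's 150 calibration videos.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

# labels[i] is the list of "Normal"/"Abnormal" decisions of annotator i
# on the same calibration videos (toy values shown here).
labels = {
    "A1": ["Normal", "Abnormal", "Normal", "Abnormal", "Normal"],
    "A2": ["Normal", "Abnormal", "Abnormal", "Abnormal", "Normal"],
    "A3": ["Normal", "Normal", "Normal", "Abnormal", "Abnormal"],
}

# Cohen's Kappa for every pair of annotators, then the average agreement.
pairwise = {(a, b): cohen_kappa_score(labels[a], labels[b])
            for a, b in combinations(labels, 2)}
print(pairwise)
print("average agreement:", np.mean(list(pairwise.values())))
```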
Table 1. The agreement among annotators as measured by Cohen's Kappa.

       A1   A2    A3    A4    A5
A1     1    0.70  0.64  0.71  0.73
A2     -    1     0.46  0.56  0.56
A3     -    -     1     0.81  0.73
A4     -    -     -     1     0.86
A5     -    -     -     -     1
Average                       0.68
Phase 3. Labeling and Data Splitting: The inter-annotator agreement results for both the Cohen's Kappa and Krippendorff measures were good, meeting the set requirements. Therefore, we proceeded to label the videos in our dataset. It is inevitable to encounter challenging and ambiguous cases during the labeling process. To address these issues, the annotators held meetings to discuss and agree on how to label these problematic and ambiguous cases and updated the guidelines to make them more complete. Following this process, we obtained a total of 4,268 labeled videos. Finally, we split the 4,268 videos into three sets: a training set, a validation set (development set), and a test set.
Fig. 2. Dataset creation processing
3.2 Dataset Analysis
After collecting, preprocessing, and labeling the data, BCiPS includes two labels and 4,268 videos with a total size of 19.9 GB and a total length of 21,353 seconds. We divided the dataset into three sets: training, development, and testing, with a ratio of 8:1:1.
Fig. 3. Distribution of labels in the training, validation, and testing sets.
Figure 3 shows that the number of Normal and Abnormal labels is relatively balanced, and the ratio of the two labels is similar across the three sets. The training set is much larger than the test and validation sets because we want the model to learn from more cases and cover the data well. The test and validation sets are kept at a 1:1 ratio, which helps make the evaluation objective and close to real conditions.
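A minimal sketch of such an 8:1:1 stratified split is given below; the clip paths and labels are placeholders, and scikit-learn's train_test_split is applied twice to carve out the validation and test sets.

```python
from sklearn.model_selection import train_test_split

# Placeholder inputs: in the real dataset these would be the 4,268 clip paths
# and their "Normal"/"Abnormal" labels.
video_paths = [f"clip_{i:04d}.mp4" for i in range(100)]
labels = ["Normal" if i % 2 else "Abnormal" for i in range(100)]

# 80% train, 20% held out, keeping the label ratio (stratified).
train_x, rest_x, train_y, rest_y = train_test_split(
    video_paths, labels, test_size=0.2, stratify=labels, random_state=42)
# Split the held-out 20% evenly into validation and test (1:1 ratio).
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
print(len(train_x), len(val_x), len(test_x))   # 80 10 10
```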
4 Baseline Models
4.1 CNN+LSTM (ConvLSTM2D)
The Convolutional LSTM (ConvLSTM) network was first introduced in the work of [11]. In a fully connected LSTM network, flattening the image into a 1D space does not retain any spatial information, hence the need for a CNN to extract spatial features and transform them into a 1D vector space. Therefore, the ConvLSTM network was proposed for video classification tasks [9], using 2D structures as inputs. It can directly work with a sequence of images and perform convolutional operations on the input images to extract spatial features, while LSTM layers can extract temporal dynamics between frames. Therefore, the ConvLSTM network can capture spatial and temporal signals, which a fully connected LSTM cannot achieve.
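As a concrete illustration, a minimal Keras sketch of a ConvLSTM2D classifier for fixed-length clips is given below; the input size, number of layers, and filter counts are assumptions for illustration, not the exact configuration used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C = 20, 64, 64, 3   # frames per clip, frame size, channels (assumed)
NUM_CLASSES = 2                    # Normal / Abnormal

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, H, W, C)),
    # ConvLSTM2D applies convolutions inside the LSTM gates, so spatial
    # structure is preserved while temporal dynamics are modelled.
    layers.ConvLSTM2D(16, kernel_size=(3, 3), activation="tanh",
                      return_sequences=True),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.ConvLSTM2D(32, kernel_size=(3, 3), activation="tanh"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```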
4.2 CNN+SVM
CNN is used to extract features from video data. It is typically used to learn 2D image features from each frame. The output results of CNN for each frame generate a feature vector, which is subsequently utilized to train the SVM model.
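A minimal sketch of this two-stage pipeline is shown below; the paper does not name the CNN backbone, so MobileNetV2 is used here purely as an illustrative feature extractor, and the clips and labels are toy placeholders.

```python
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from sklearn.svm import SVC

# Frozen CNN used purely as a per-frame feature extractor (backbone is assumed).
backbone = MobileNetV2(weights="imagenet", include_top=False, pooling="avg",
                       input_shape=(224, 224, 3))

def video_feature(frames: np.ndarray) -> np.ndarray:
    """frames: (num_frames, 224, 224, 3) -> one pooled feature vector per clip."""
    feats = backbone.predict(preprocess_input(frames.astype("float32")), verbose=0)
    return feats.mean(axis=0)   # average the per-frame features over time

# Toy clips and labels: 0 = Normal, 1 = Abnormal.
clips = [np.random.randint(0, 255, (8, 224, 224, 3)) for _ in range(10)]
y = np.array([0, 1] * 5)

X = np.stack([video_feature(clip) for clip in clips])
clf = SVC(kernel="rbf").fit(X, y)   # the SVM is trained on the pooled features
print(clf.predict(X[:2]))
```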
4.3 CNN+Random Forest
CNN is used to extract features from video data. It is typically used to learn 2D image features from each frame. The output of the CNN for each frame generates a feature vector, which is then used to train the Random Forest model.
4.4 TimeSformer
As TimeSformer (Time-Space Transformer) [1] is one of the most advanced methods for video classification that has recently emerged, we decided to use this architecture in our benchmark. The TimeSformer model is designed for videos; it is pre-trained on the ImageNet dataset and fine-tuned on our dataset. TimeSformer does not contain convolutional layers. Instead, it employs a self-attention mechanism, adapting the transformer architecture for computer vision and video processing tasks.
4.5 MoViNets
MoViNets [6] is a video classification model designed for online video streaming or real-time inference in tasks such as action recognition. The classifier of MoViNets is based on efficient and simple 2D frame-level processing that runs over the entire video or streams frames one by one. As such processing cannot account for temporal context, it has limited accuracy and may give inconsistent results from one frame to another; a simple 3D CNN that also uses temporal context can increase accuracy and temporal consistency.
4.6 (2+1)D Resnet-18
The following 3D convolutional neural network model is based on the work published by D. Tran et al. The (2+1)D convolution separates the spatial and temporal dimensions into two consecutive steps. One advantage of this method is that it saves parameters while still analyzing combinations of spatial and temporal information.
4.7 EfficientNetB0
EfficientNet [13] was introduced by Tan and Le, who studied the scaling of models and determined that carefully balancing a network’s depth, width, and resolution can lead to better performance. They proposed a novel scaling method that evenly scales all dimensions of a network, including depth, width, and resolution. They used a neural architecture search tool to design a new base network and extended it to obtain a family of deep learning models. There are 8 variants of EfficientNet (B0 through B7), with EfficientNetB0 having 5.3 million parameters. Figure 5 illustrates the architecture of the EfficientNet network.
Table 2. Models Performance

Model                 Accuracy  Recall  Precision  F1
CNN + LSTM            75.88     75.63   75.43      75.53
TimeSformer           74.71     74.80   74.82      74.81
CNN + Random Forest   74.00     73.97   73.98      73.98
EfficientNetB0        73.30     73.48   73.70      73.59
MoViNet               70.02     69.61   71.30      70.44
CNN + SVM             65.57     65.61   65.59      65.60
(2+1)D Resnet-18      61.12     60.42   63.37      61.86

5 Results
5.1 Results of Baseline Models
Table 2 shows that the CNN-LSTM model achieves the highest results in all four metrics (accuracy, recall, precision, and F1-score), with 75.88%, 75.63%, 75.43%, and 75.53%, respectively. The results of the CNN-LSTM model are noticeably different from those of the other models. The (2+1)D Resnet-18 model has the lowest performance, with only 61.12%, 60.42%, 63.37%, and 61.86%, much lower than the other models. It can be seen that the pre-trained models performed worse on our dataset than the combined models.
5.2 Error Analysis
According to Fig. 4, the precision of the “Normal” class is 73.31%, while the precision of the “Abnormal” class is 79.55%. This suggests that the model tends to classify videos as abnormal rather than ignoring abnormal videos. The difference in precision between the two classes indicates that the model is moving in the right direction for the development of the dataset, but further improvement is still required. The error rate and the difference in the precision of the two classes suggest that the model is still prone to misclassifications and needs to be fine-tuned for better accuracy. Figure 5 shows that the video contains many ambiguous actions (e.g., “Running”), which caused the model to predict it as abnormal. The main reason is that some abnormal behaviors have not been fully defined. Additionally, the video also includes some ambiguous actions, and the dataset needs to be more diverse.
Fig. 4. Confusion Matrix
Fig. 5. Predicted label and actual label of a sample video from the test set.
6 Conclusion and Future Works
In this paper, we have created a dataset for classifying the behaviors of children in preschool, named BCiPS, with two labels, "Normal" and "Abnormal". After experimenting with the dataset on several video classification models, including basic, combined, and pre-trained models, we achieved the highest accuracy of 75.88% with the CNN + LSTM model. This demonstrates the potential of the dataset we have constructed for video classification in the preschool domain. However, specific errors still need to be addressed, which hinder the optimal performance of the models. Observing the existing limitations in the dataset, we intend to take several improvement steps towards applying the dataset in practical applications. Specifically, we will collect additional data, re-label the dataset more carefully, and establish stricter labeling guidelines. Finally, we will explore more suitable models for this dataset.
References 1. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: ICML, vol. 2, p. 4 (2021) 2. Carreira, J., Noland, E., Hillier, C., Zisserman, A.: A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987 (2019) 3. Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Measur. 20(1), 37–46 (1960) 4. Goyal, R., et al.: The “something something” video database for learning and evaluating visual common sense. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017) 5. Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017) 6. Kondratyuk, D., Yuan, L., Li, Y., Zhang, L., Tan, M., Brown, M., Gong, B.: Movinets: Mobile video networks for efficient video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16,020–16,030 (2021) 7. Krippendorff, K.: Content analysis (2004) 8. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. biometrics, pp. 159–174 (1977) 9. Luo, W., Liu, W., Gao, S.: Remembering history with convolutional LSTM for anomaly detection. In: 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 439–444. IEEE (2017) 10. Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+ D: A large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019 (2016) 11. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c.: Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In: Advances in neural information processing systems 28 (2015) 12. Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012) 13. Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114. PMLR (2019)
Land Subsidence Susceptibility Mapping Using Machine Learning in the Google Earth Engine Platform Van Anh Tran 1,4(B) , Thanh Dong Khuc2 , Trung Khien Ha2 , Hong Hanh Tran1,4 , Thanh Nghi Le1,4 , Thi Thanh Hoa Pham1 , Dung Nguyen1 , Hong Anh Le1 , and Quoc Dinh Nguyen3 1 Hanoi University of Mining and Geology, 18 Vien Street, Hanoi 100000, Vietnam
{tranvananh,tranhonghanh,lethanhnghi,phamthithanhhoa, nguyenthimaidung,lehonganh}@humg.edu.vn 2 Hanoi University of Civil Engineering, 55 Giai Phong Street, Hanoi 100000, Vietnam {dongkt,khienht}@huce.edu.vn 3 Phenikaa University, Nguyen Trac, Yen Nghia, Ha Ðong, Hanoi 100000, Vietnam [email protected] 4 Geomatics in Earth Sciences Group, HUMG, Hanoi, Vietnam
Abstract. This study aims to compare the effectiveness of two predictive models, CART regression and Random Forest, in mapping land subsidence susceptibility. The analysis is supported by the Google Earth Engine cloud computing platform. The study focuses on Camau province, located in the Mekong Delta, where significant land subsidence occurs annually. Eight variables were considered in the models, including elevation, slope, aspect, land cover, NDVI, soil map, geology, and groundwater level. Land subsidence points, obtained through the PSInSAR method, were used in the study, comprising a total of 989 points. These points were divided into a 70% training dataset and a 30% testing dataset for both models. The results produced a land subsidence sensitivity map categorized into five levels: very low, low, moderate, high, and very high. The performance of the models was evaluated using the ROC curve and the area under the curve (AUC). The AUC values for the Random Forest (RF) model are 0.86 and 0.87 for the training and validation datasets, respectively. In comparison, the CART model achieves AUC values of 0.79 and 0.73 for the training and validation datasets, respectively. The research findings demonstrate a 7% superior performance of the RF model compared to the CART method. Therefore, the RF model is chosen as the final model for land subsidence susceptibility mapping in Camau. Keywords: Camau · CART · Random Forest · subsidence
1 Introduction
The Mekong Delta in Vietnam has been experiencing serious land subsidence in recent years due to various natural and artificial causes. Over the past few decades, large-scale land use changes have occurred due to rapid population growth, urbanization, and the
increase in agricultural and aquacultural production. These have contributed to the subsidence of land and exacerbated the severity of flooding. One of the provinces most affected by land subsidence is Camau. Located in the southernmost part of Vietnam, Camau is facing the dangers of land subsidence, sea level rise, flooding, and saltwater intrusion. A meticulous study by Erban et al. [1] demonstrated that the Camau Peninsula and the entire Mekong Delta are subsiding at a rate of several centimeters per year, exceeding the current absolute sea level rise by a significant margin. Meanwhile, Minderhoud's research has shown a close correlation between land use and the rate of land subsidence [2]. In order to study land subsidence and predict its risk effectively, both the input data and the algorithms used for prediction are extremely important. Omid Rahmati [3] employed two machine learning algorithms, the maximum entropy (MaxEnt) and genetic algorithm rule-set production (GARP), to construct a subsidence assessment model in Kashmar, Iran. The model incorporated various data such as land use, lithology, distances to groundwater extraction sites and afforestation projects, distances to fault locations, and groundwater level reductions. The research findings indicate that the GARP algorithm outperforms the MaxEnt algorithm in terms of accuracy and performance. Both algorithms provide reliable subsidence prediction. Recently, the study by Ata Allah Nadiri [4] introduced a method for assessing land subsidence susceptibility using the ALPRIFT framework and various artificial intelligence models, including Sugeno Fuzzy Logic (SFL), Support Vector Machine (SVM), Artificial Neural Network (ANN), and the Group Method of Data Handling (GMDH). The research results indicate that the combination of multiple artificial intelligence models can improve the accuracy of determining the susceptibility to land subsidence in the studied area. Another study conducted in the Marand District of Tehran Province, Iran [5] utilized the adaptive neuro-fuzzy inference system (ANFIS) method with six categories of input data, including subsidence distance from boreholes and faults, elevation, distance to roads, rivers and streams, groundwater depth, slope, and land use. ROC curve validation indicated that the Gauss MF and Dsig MF methods had high accuracy and were comparable. Lee's research [6] applied the ANN method to forecast the risk of land subsidence associated with Korean coal mines and achieved an accuracy of approximately 98.95%. In another study, the authors employed a combination of FR, logistic regression (LR), and ANN models, and it was found that the combined method had higher accuracy than using any single model alone [7]. In 2018, a study by D. Tien Bui utilized four models, Bayesian logic regression (BLR), support vector machine (SVM), logistic model tree (LMT), and intermittent decision tree (IDT), to determine the susceptibility of land subsidence [8]. Eight input factors, including slope, distance to the nearest fault, density of faults, geology, distance to the nearest road, density of roads, land use, and rock mass rating, were used. Evaluation of the four models showed that BLR was the most accurate method (0.941) for mapping the susceptibility of land subsidence. Additionally, some studies have shown a correlation between floods and land subsidence [9].
The study by [10] evaluated the land subsidence susceptibility (LSS) in the Gharabolagh Plain in Fars Province, Iran based on factors such as changes in groundwater, distances to rivers and streams, distance to faults, elevation, slope, aspect, terrain wetness index (TWI), land use, and lithology using the Google Earth Engine (GEE) platform, with two probabilistic models, the evidential belief function (EBF) and Bayesian theory (BT).
Overall, Camau in the Mekong Delta has a dense network of rivers and canals, but has been experiencing water shortages in recent years. This is due to upstream hydropower dams blocking water flow, leading to water shortages downstream and resulting in more frequent saline intrusion and droughts. Agricultural cultivation depends mainly on groundwater, which is being overexploited and depleted, causing land subsidence throughout the region. Due to its low elevation, with an average height of around 1 m above sea level, flooding is a common occurrence when sea level rises [11]. Therefore, the need for predicting land subsidence risk is becoming increasingly urgent. Our study aims to provide an overview of the land subsidence susceptibility of the Camau Peninsula based on existing data sources and using the cloud computing platform Google Earth Engine. The study aims to test two machine learning methods, CART and Random Forest, for subsidence prediction modeling in this area.
2 Study Area
Camau province is located in the southernmost part of the country, surrounded by the sea on three sides. Its east coast extends 107 km and borders the South China Sea, while its west and south coasts extend 147 km and meet the Gulf of Thailand. The north borders Bac Lieu and Kien Giang provinces. This area is characterized by low-lying terrain that is often flooded. The average altitude in Camau is only about 1 m above sea level [2]. Camau has 5 main types of soil: acid, peat, alluvial, saline and canals.
Fig. 1. Distribution map of subsidence and non-subsidence sample points in Camau area and its location on the map of Vietnam.
Camau's ecosystem is home to a unique type of coastal mangrove forest extending 254 km along the coast. In addition, the province also has a melaleuca forest ecosystem located deep in the mainland in the districts of U Minh, Tran Van Thoi, and Thoi Binh, with an area of 35,000 ha. Camau's mangroves cover 77% of the total mangrove area in the Mekong Delta. Figure 1 shows the location of Camau on the map of Vietnam and the administrative boundaries of the province.
3 Materials and Methods
3.1 Methods
Classification and Regression Tree – CART
The Classification and Regression Trees (CART) algorithm is a widely used supervised machine learning technique for predicting a categorical target variable, creating a classification tree, or a continuous target variable, creating a regression tree. The CART classification requires a binary tree, which is a combination of an initial root node, decision nodes, and leaf nodes. The root node and each decision node represent a feature and the threshold value of that feature. Due to its easy-to-understand and straightforward nature, CART is one of the most commonly used machine learning methods today [12]. The schematic of the CART algorithm is presented in Fig. 2.
Fig. 2. Description of the CART algorithm.
The CART algorithm requires the tree to be split in the best possible way. In the CART algorithm, the Gini index is used to evaluate whether the split at the condition nodes is accurate or not. To find the best classification, the weighted total of the Gini index over all branch nodes is calculated, and the split with the lowest Gini index is taken as the one with the best classification accuracy. Mathematically, when analyzing a feature, different threshold divisions will lead to different classification results, and there may be cases where the same threshold for that feature leads to different classification results. Therefore, the Gini index is used to determine noise in the dataset. Assuming the dataset is classified into two classes A and B, the Gini index is determined as follows [12]:

Gini Index = 1 − Σ_{i=1}^{n} (P_i)² = 1 − [(P_A)² + (P_B)²]   (1)

where P_A is the probability of data belonging to class A and P_B is the probability of data belonging to class B. The formula uses these probabilities to determine the Gini index value on each candidate branch, indicating which split best separates the classes.
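The split criterion in Eq. (1) is easy to verify numerically. The following minimal Python sketch (names and toy labels are illustrative, not from the study) computes the Gini impurity of a candidate split, which CART minimizes:

```python
import numpy as np

def gini_index(labels: np.ndarray) -> float:
    """Gini impurity of one branch: 1 - sum_i p_i^2 (Eq. 1)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(left: np.ndarray, right: np.ndarray) -> float:
    """Weighted Gini of a binary split; CART keeps the split with the lowest value."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)

# Toy example: 1 = subsidence, 0 = non-subsidence.
print(split_gini(np.array([1, 1, 1, 0]), np.array([0, 0, 0, 1])))
```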
Random Forest
Random Forest (RF) is an algorithm comprising many single decision trees that act together as an ensemble. The algorithm uses random features to create each tree. The method of bootstrapping is used to create training samples, drawn randomly with replacement from the N instances of the original training set. Finally, the prediction is obtained by combining multiple decision trees [12]. The RF model is very effective for image classification and prediction because it mobilizes hundreds of smaller models with different rules to make the final decision. Each sub-model can be different and weak, but according to the "wisdom of the crowd" principle, the combined classification result is more accurate than that of any single model. Algorithm details can be found in [12].
Model Quality Assessment
To assess the quality of a predictive model, the ROC curve and the area under the ROC curve (AUC) are used. The ROC curve describes the relationship between pairs of the true positive rate (TPR) and false positive rate (FPR) for land subsidence and non-subsidence positions. Reference points with good results will have a high true positive rate and a low false positive rate, and vice versa. TPR and FPR values are usually calculated at different thresholds to evaluate the model's effectiveness. The AUC is a comprehensive performance index of land subsidence prediction models; the closer the AUC value is to 1, the more effective the model. The statistical parameters, namely the true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), and false negative rate (FNR), are computed from the confusion matrix; TPR and FPR are shown in the equations below.
TPR = TP / (TP + FN)   (2)

FPR = FP / (FP + TN)   (3)
where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
3.2 Materials
Inventory of Land Subsidence
The inventory of land subsidence is a critical component in assessing the susceptibility
of subsidence in the study area [13]. The dataset for the inventory comprises 989 sample points of land subsidence obtained from Sentinel-1 radar data using the PSInSAR method. Figure 1 shows the distribution of the subsidence points (red) and non-subsidence points (blue), taken randomly from the set of subsidence PS points published for the Mekong Delta area on the Copernicus website [14].
Input Factors of Subsidence Susceptibility and Tools
In order to identify the factors that contribute to land subsidence susceptibility in the study area, we refer to previous studies in this region [2, 15] to understand the subsidence patterns. Through this analysis, eight key factors were identified: terrain elevation, slope, aspect, land cover, NDVI, soil, geology, and groundwater depth. The land cover map is derived from ESA's 10 m land cover product for 2021. The NDVI is computed from Sentinel-2 images averaged over the entire year of 2021. The geological map of the Camau area is sourced from a 1:100,000 scale map provided by the Vietnam Institute of Geosciences and Minerals. Groundwater level data represent the average water level observed during the years 2020, 2021, and 2022. The soil map of the Camau area is based on a 1:50,000 scale map provided by the Camau Department of Natural Resources and Environment. Additionally, the DEM is obtained from the ALOS World 3D – 30m (AW3D30) dataset.
Fig. 3. Input factors of subsidence susceptibility. (a) Elevation map, (b) Land cover map, (c) Geology map, (d) Ground water level, (e) Soil map, (f) NDVI map, (g) Slope map, (h) Aspect map.
The Google Earth Engine (GEE) cloud computing platform is utilized in this study to take advantage of its ability to gather data from various sources in the cloud [16]. By doing so, we minimize the need for desktop data preparation. Multiple data sources are combined, including the DEM, the land use/land cover map, and the land subsidence inventory data, to train and evaluate the model. The data sets are summarized
in Fig. 3. The Sentinel-2 satellite imagery with 10 m and 20 m resolution is processed on the GEE platform to determine the NDVI vegetation index. To ensure consistency in the model, all data are resampled to a 30 × 30 m resolution. The land subsidence inventory data are then divided into training and testing sets at a ratio of 70:30 (Fig. 4). The training set comprises 692 randomly selected points, from which the values of elevation, slope, aspect, NDVI, LULC, soil, groundwater level, and geology are extracted, together with the land subsidence label (1 for subsidence and 0 for non-subsidence). Figure 4 illustrates the flow of image processing and the construction of the predictive models using the two methods, CART and RF.
Fig. 4. Flow chart of image processing and building predictive models by two methods CART and RF by GEE.
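Both classifiers used in this flow are available directly in the Earth Engine API, so the modeling step can stay entirely in the cloud. The sketch below (Earth Engine Python API) is a minimal illustration under stated assumptions: the asset IDs, band names, and the "subsidence" property are placeholders, not the study's actual assets.

```python
import ee
ee.Initialize()

# Placeholder assets: a stacked 8-band factor image and labelled sample points.
stack = ee.Image("users/example/camau_factor_stack")
samples = ee.FeatureCollection("users/example/camau_subsidence_points")

bands = ["elevation", "slope", "aspect", "landcover",
         "ndvi", "soil", "geology", "groundwater"]

# Extract the factor values at every labelled point (30 m resolution).
training = stack.select(bands).sampleRegions(
    collection=samples, properties=["subsidence"], scale=30)

# 70/30 split of the inventory points using a random column.
training = training.randomColumn("rand", 42)
train_fc = training.filter(ee.Filter.lt("rand", 0.7))
test_fc = training.filter(ee.Filter.gte("rand", 0.7))

cart = ee.Classifier.smileCart().setOutputMode("PROBABILITY") \
    .train(train_fc, "subsidence", bands)
rf = ee.Classifier.smileRandomForest(numberOfTrees=100) \
    .setOutputMode("PROBABILITY").train(train_fc, "subsidence", bands)

# Per-pixel susceptibility (0-1) over the whole province.
susceptibility_rf = stack.select(bands).classify(rf)
```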
4 Results and Discussions
By utilizing the eight variables described above to assess the susceptibility to land subsidence, the two models exhibit clear disparities. Figure 5 provides insight into the significance of the variables in the two machine learning models, CART and RF. In the CART model, up to four of the input layers have no influence on the model, whereas all eight inputs contribute to the prediction process in the RF model. Both models highlight the substantial impact of the groundwater level decline layer on predicting the risk of land subsidence. By referring to Fig. 1, we observe that the distribution of subsidence points aligns with the groundwater depth map: the decline in groundwater level is considerably high in the vicinity of Camau city, where subsidence points are more concentrated than in the surrounding areas. Conversely, the northern part of Camau province, which displays the lowest density of subsidence points, exhibits the least decline in water level. This northern region encompasses the expansive U Minh mangrove area, with a small population and minimal groundwater extraction compared to the surrounding areas.
To evaluate model performance, the receiver operating characteristic (ROC) curve is used; it is a graphical tool that describes the relationship between the false positive rate and sensitivity across different thresholds [17]. This technique is widely used to evaluate the performance of probabilistic models. By adjusting the decision threshold, we can generate an ROC curve by plotting the true positive rate (TPR) against the false positive rate (FPR) [18]. The AUC value represents the area under the ROC curve, providing a quantitative summary of the overall performance of land subsidence models [19]. Higher AUC values indicate superior performance of subsidence models and can be classified into different grades: excellent (0.9–1), very good (0.8–0.9), good (0.7–0.8), moderate (0.6–0.7) and poor (0.5–0.6) [20]. After analyzing the training and test sets, it is clear that both the CART and RF models perform well, exhibiting high accuracy. The CART method achieved an AUC value of 0.79 for the training set and 0.73 for the testing set, while the RF method outperformed it with an AUC value of 0.87 for the training set and 0.86 for the testing set. The AUC values for both machine learning methods are in excess of 0.7, confirming their effectiveness in predicting the susceptibility of land subsidence in the study area; however, the RF model gives very good performance, with a higher AUC, and is therefore more suitable. These findings strongly support the view that the RF model is reliable for such predictions.
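As a small illustration of this validation step, the sketch below computes an ROC curve and AUC with scikit-learn; the labels and scores are toy values, whereas in the study the scores would be the susceptibility values sampled at the validation points.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# y_true: 1 = subsidence, 0 = non-subsidence at the validation points (toy values);
# y_score: susceptibility predicted by the model at the same points.
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.91, 0.78, 0.35, 0.62, 0.48, 0.12, 0.85, 0.55])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # FPR/TPR pairs per threshold
print("AUC =", auc(fpr, tpr))
```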
Fig. 5. The importance of the input variables, CART (left). RF (right).
Fig. 6. The land subsidence susceptibility maps generated using the CART model, along with the ROC curve and AUC value of CART (on the left), and the RF model, along with the ROC curve and AUC value of RF (on the right).
Figure 6 depicts a map showing the susceptibility of land subsidence in the Camau region. This map was created using the CART and RF models in GEE, along with
multiple data layers from various sources. The values on the subsidence susceptibility map range from 0 to 1, indicated by shades of blue to red. These colors represent areas with low to high levels of subsidence susceptibility, which are determined based on factors such as elevation, slope, aspect, land cover, NDVI, soil, geology, and groundwater depth. The regions with the highest susceptibility to land subsidence are concentrated in Camau city, followed by the southern districts of Camau, namely Dam Doi and Nam Can. The northern districts of Camau province, which have extensive forest coverage and a low population, such as U Minh and Thoi Binh, experience a lower rate of subsidence.
5 Conclusions
The study aims to generate subsidence susceptibility maps in the Camau area of the Mekong Delta using GEE cloud computing and a multi-source dataset, employing two machine learning methods: CART and Random Forest (RF). The resulting land subsidence sensitivity map showcases the potential of utilizing free data sources and cloud-based algorithms. Regarding land subsidence sensitivity prediction in Camau, Vietnam, the RF machine learning model demonstrated superior performance compared to the CART model, displaying better accuracy. These findings provide valuable insights into the land subsidence susceptibility map, offering useful information for managers and planners in devising strategies to mitigate this issue and facilitate rational land use conversion. For future research, it is recommended to expand the study by considering additional input variables that influence land subsidence, aiming to further enhance the accuracy of machine learning models on the GEE platform.
Acknowledgement. This research was funded by a scientific research project of the Ministry of Education and Training of Vietnam, code B2022-MDA-13.
References 1. Erban, L.E., Gorelick, S.M., Zebker, H.A.: Groundwater extraction, land subsidence, and sea-level rise in the Mekong Delta, Vietnam. Environ. Res. Lett. 9, 8 (2014) 2. Minderhoud, P.S.J., et al.: Impacts of 25 years of groundwater extraction on subsidence in the Mekong delta, Vietnam. Environ. Res. Lett. 12, 6 (2017) 3. Rahmati, O., Golkarian, A., Biggs, T., Keesstra, S., Mohammadi, F., Daliakopoulos, N.: Land subsidence hazard modeling: Machine learning to identify predictors and the role of human activities. J. Environ. Manag. 236, 466–480 (2019) 4. Nadiri, A.A., Habibi, I., Gharekhani, M., Sadeghfam, S., Barzegar, R., Karimzadeh, S.: Introducing dynamic land subsidence index based on the ALPRIFT framework using artificial intelligence techniques. Earth Sci. Inform. 15, 1007–1021 (2022) 5. Ghorbanzadeh, O., Rostamzadeh, H., Blaschke, T., Gholaminia, K., Aryal, J.: A new GISbased data mining technique using an adaptive neuro-fuzzy inference system (ANFIS) and k-fold cross-validation approach for land subsidence susceptibility mapping. Nat. Hazards 94, 497–517 (2018) 6. Lee, S., Park, I., Choi, J.-K.: Spatial prediction of ground subsidence susceptibility using an artificial neural network. Environ. Manag. 49, 347–358 (2012)
7. Park, I., Lee, J., Saro, L.: Ensemble of ground subsidence hazard maps using fuzzy logic. Cent. Eur. J. Geosci. 6, 207–218 (2014) 8. Tien Bui, D., et al.: Land subsidence susceptibility mapping in South Korea using machine learning algorithms. Sensors 18, 2464 (2018) 9. Yin, J., Dapeng, Y., Wilby, R.: Modelling the impact of land subsidence on urban pluvial flooding: A case study of downtown Shanghai China. Sci. Total. Environ. 544, 744–753 (2016) 10. Najafi, Z., Pourghasemi, H.R. Ghanbarian, G., Fallah Shamsi, S.R.: Land-subsidence susceptibility zonation using remote sensing, GIS, and probability models in a Google Earth Engine platform Environ. Earth Sci. 79, 491 (2020) 11. Tran, V.A., Le, T.L., Nguyen, N.H., Le, T.N., Tran, H.H.: Monitoring vegetation cover changes by sentinel-1 radar images using random forest classification method. In˙z. Mineralna 1(2), 441–451 (2021) 12. Breiman, L., Friedman, J., Olshen, R., Stone, C.: CART. Classification and Regression Trees: Wadsworth and Brooks/Cole, Monterey, CA, USA (1984) 13. Anh, T.V.: Monitoring subsidence in Ca Mau City and Vicinities using the multi temporal Sentinel-1 radar images. In: 4th Asia Pacific Meeting on Near Surface Geoscience & Engineering, vol. 2021, no. 1, pp. 1–5. EAGE Publications BV (2021) 14. https://emergency.copernicus.eu, https://emergency.copernicus.eu/mapping/list-of-compon ents/EMSN062. Accessed 26 Feb 19 15. Minderhoud, P.S.J., Middelkoop, H., Erkens, G., Stouthamer, E.: Groundwater extraction may drown mega-delta: Projections of extraction-induced subsidence and elevation of the Mekong delta for the 21st century. Environ. Res. Commun. 2, 1 (2020) 16. Van Anh, T., et al.: Determination of illegal signs of coal mining expansion in Thai Nguyen Province, Vietnam from a combination of radar and optical imagery. In: Nguyen, L.Q., Bui, L.K., Bui, X.N., Tran, H.T. (eds.) Advances in Geospatial Technology in Mining and Earth Sciences: Selected Papers of the 2nd International Conference on Geo-spatial Technologies and Earth Resources. ESE, pp. 225–242. Springer, Cham (2022). https://doi.org/10.1007/9783-031-20463-0_14 17. Thuiller, W., Araújo, M.B., Lavorel, S.: Generalized models vs. classification tree analysis: Predicting spatial distributions of plant species at different scales. J. Veg. Sci. 14(5), 669–680 (2003) 18. Baeza, C., Lantada, N., Moya, J.: Validation and evaluation of two multivariate statistical models for predictive shallow landslide susceptibility mapping of the Eastern Pyrenees (Spain). Environ. Earth Sci. 61, 507–523 (2010) 19. Pham, B.T., Pradhan, B., Tien Bui, D., Prakash, I., Dholakia, M.B.: A comparativestudy of different machine learning methods for landslide susceptibility assessment: A case study of Uttarakhand area (India). Environ. Model. Softw. 84, 240–250 (2016) 20. Tien Bui, D., et al.: Spatial prediction of rainfall-induced landslides for the Lao Cai area (Vietnam) using a hybrid intelligent approach of least squares support vector machines inference model and artificial bee colony optimization. Landslides 14, 447–458 (2017)
Building an AI-Powered IoT App for Fall Detection Using Yolov8 Approach

Phu Hoang Ng1,2, An Mai1,2(B), and Huong Nguyen3

1 School of Computer Science and Engineering, International University, Ho Chi Minh City, Vietnam
2 Vietnam National University, Ho Chi Minh City, Vietnam
[email protected]
3 Forecast Department, Mobile World Investment Corporation, Ho Chi Minh City, Vietnam
Abstract. Falls are a serious health concern, particularly among older people and the disabled, because they contribute to increased mortality, hospitalization, and loss of independence. This is especially true for the elderly. To address this issue, we have developed a new artificial intelligence system for fall detection. To reliably detect falls, our system employs the powerful and state-of-the-art YOLOv8 algorithm, as well as advanced software development techniques. The YOLOv8 algorithm serves as the foundation for our deep learning model, which helps us detect whether an individual captured by surveillance cameras is falling, within an acceptable time at the staging level. In our opinion, this model successfully differentiates falls from other movements and activities, exhibiting excellent accuracy and a low false positive rate across a wide range of settings and levels of illumination. Our fall detection method has the potential to dramatically improve the safety and well-being of vulnerable groups by enabling immediate aid and support in the event that a fall occurs.
1
· web-app · Streamlit · fall detection
Introduction
Falls among the elderly and individuals with mobility issues have become a significant public health concern. The World Health Organization (WHO) estimates that approximately 684,000 individuals die each year from falls globally, making it the second leading cause of unintentional injury death [1]. Additionally, falls can lead to long-term consequences, including fractures, immobility, and a decline in overall quality of life [2]. As a result, there is an increasing demand for efficient fall detection systems that can provide timely assistance to those in need. In recent years, various approaches have been developed for fall detection, including wearable devices, ambient sensors, and vision-based systems [3]. Among these, vision-based systems have gained popularity due to their nonintrusive nature and the advances in computer vision techniques. These systems c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N.-N. Dao et al. (Eds.): ICIT 2023, LNDECT 188, pp. 65–74, 2023. https://doi.org/10.1007/978-3-031-46749-3_7
66
P. H. Ng et al.
typically rely on cameras to monitor individuals in their environment and use algorithms to analyze the captured images or video feeds to detect falls [4]. Fall detection technology is revolutionizing safety measures by offering enhanced protection and support to individuals, particularly those at risk of falls. It utilizes advanced sensors, algorithms, and alert systems to detect falls, trigger immediate responses, and minimize the potential consequences of such incidents. One of the key benefits of fall detection technology is its ability to provide prompt emergency response. By continuously monitoring movements and employing sophisticated algorithms, these systems can accurately identify falls and promptly alert caregivers, family members, or emergency services. The swift response time ensures that individuals receive timely medical attention, thereby reducing the severity of injuries and potentially saving lives. This rapid emergency response significantly improves the safety and well-being of individuals, giving them reassurance that help will be readily available in times of need. Moreover, the constant monitoring and immediate alert capabilities create a safety net for those prone to falls, enabling them to engage in daily activities with confidence. On the other hand, with the power of CNN-based networks, the performance of these systems has improved significantly, leading to their integration into fall detection solutions [5]. One such example is the You Only Look Once (YOLO) series of object detection algorithms, which have demonstrated state-of-the-art performance in real-time object detection tasks [6,7], with many useful application [17]. Our proposed AI-based fall detection system utilizes advanced object detection algorithms of YOLOv8 to identify falls effectively in real-world environments. It ensures reliable performance by accounting for challenging factors such as lighting conditions, occlusions, and camera angles as well. The system provides real-time monitoring and prompt response, minimizing potential harm and reducing the risk of severe injuries. User privacy is a core principle, with data handled carefully and used solely for fall detection. It seamlessly integrates with existing monitoring systems, enhancing overall efficiency without replacing them. This AI-based system has the potential to revolutionize safety in fall-prone situations by focusing on accurate detection, real-time monitoring, privacy, integration, and scalability.
2
Related Works
In recent years, the convolutional neural network (CNN) has acquired popularity in deep learning due to its ability to select features automatically, thereby eliminating the need for manual selection. There are two primary categories of CNN-based target detection methods [8]. The first category is comprised of two-stage detection algorithms, which divide target detection into two phases: locating and recognizing. The traditional Region-convolutional neural network (R-CNN) algorithm, however, has demonstrated subpar performance and does not meet real-time requirements. Several R-CNN-based enhancements, such as Fast R-CNN [9] and Faster R-CNN [10], have been introduced, but they have yet to achieve the real-time performance expectations of users.
Building an AI-Powered IoT App for Fall Detection Using Yolov8 Approach
67
One-stage detection algorithms, on the other hand, optimize the positioning and recognition of the target in a single phase. This algorithm type is exemplified by the single shot multi-box detector (SSD) and the You Only Look Once (YOLO) series. In 2019, Lu et al. [11]. proposed a method for fall detection using a three-dimensional convolutional neural network (3D CNN) and a spatial visual attention mechanism based on long short-term memory (LSTM). Chen et al. [12] presented in 2020 a robust fall detection method using Mask-CNN and an attention-guided Bidirectional LSTM model in complex backgrounds. Zhang et al. introduced a human fall detection algorithm in which they analyzed temporal and spatial changes in body posture to determine whether a person was about to fall, creating a diagram of the temporal and spatial evolution of human behavior [13]. In 2021, Zhu et al. used a deep vision sensor and a convolutional neural network to train extracted three-dimensional body posture data for fall detection [14]. However, the real-time performance of this method was subpar. Cao et al. proposed a fall detection algorithm that incorporated both motion features and deep learning. They used You Only Look Once version 3 (YOLOv3) to detect human targets and combined human motion features with deep features extracted by CNN to determine whether a fall had taken place [15]. Based on a modified YOLOv5s algorithm, Chen et al. presented a fall detection algorithm. To improve feature extraction, the Backbone network’s existing basic convolution block was replaced with the ACB convolution block. In addition, the algorithm included a spatial attention mechanism module within the residual structure to enhance the precision of feature localization. Modifications were also made to the classifier and feature layer structure to better recognize geriatric fall behavior [16].
3 3.1
Methodology Network Architecture
The YOLOv8 (You Only Look Once version 8) network architecture serves as a powerful tool for building an AI system for fall detection. YOLOv8 is a stateof-the-art object detection algorithm known for its real-time performance and accuracy. The network architecture of YOLOv8 follows a multi-scale approach to detect objects of different sizes, making it suitable for detecting falls in various scenarios. The YOLOv8 architecture (Fig. 1) consists of a backbone network, known as Darknet-53, which is a deep convolutional neural network (CNN) comprising 53 convolutional layers. The Darknet-53 backbone is responsible for extracting rich and hierarchical features from the input images or video frames. It acts as a feature extractor, enabling the network to understand the visual information present in the fall detection scenes. On top of the Darknet-53 backbone, YOLOv8 adds several detection layers, including multiple YOLO detection heads, to perform object detection at different scales. These detection heads predict bounding boxes, class probabilities, and confidence scores for a predefined set of object classes, including the class for
68
P. H. Ng et al.
falls. Each detection head is responsible for detecting objects within a specific range of sizes, allowing YOLOv8 to handle objects of varying scales effectively.
Fig. 1. The architecture of Yolov8. It is divided into three parts: CSPDarknet is the backbone, PANet is the neck, and Yolo Layer is the head. The data is supplied through CSPDarknet for feature extraction before being loaded into PANet for feature fusion. Lastly, Yolo Layer then produces detection findings (class, score, location, size).
To capture features at different scales, YOLOv8 incorporates feature maps from multiple stages of the Darknet-53 backbone. These feature maps are fused together using feature pyramid networks (FPN) or similar techniques, enabling the network to detect objects at both fine-grained and coarse-grained levels. This multi-scale approach improves the detection accuracy of small and large falls alike, ensuring comprehensive fall detection coverage. YOLOv8 also utilizes anchor boxes, which are predefined bounding box priors of different aspect ratios and sizes, to predict accurate object locations. By using anchor boxes, YOLOv8 can handle object localization and achieve precise bounding box regression during the detection process. The overall architecture of YOLOv8 prioritizes efficiency and real-time performance while maintaining competitive accuracy. It takes advantage of parallel processing and optimized network design to enable fast inference, making it suitable for real-time fall detection systems that require timely responses. Building an AI system for fall detection using YOLOv8 involves training the network on fall detection datasets, fine-tuning the model for fall-specific classes,
Building an AI-Powered IoT App for Fall Detection Using Yolov8 Approach
69
and integrating the system with suitable hardware and software components to enable real-time inference and deployment. 3.2
Fall Detection Work Flow
For the main contribution, we develop a modeling flow to deploy and persist various Yolo models, then serve them through a readily accessible web application (the development flow is depicted in Fig. 2). The modeling flow is created in six steps: accumulating training images, using YOLOv8 to create the base models, adding the models’ parameters, compiling the models, training the models, and finally storing the models for use in the application’s subsequent inference step. Four versions were trained and analyzed during the training phase: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. Lastly, we deployed and persisted the models and developed all web APIs using the GUI web application Streamlit as a foundation.
Fig. 2. App development flow
3.3
Performance Metrics
In classification, the model returns a single label that is either correct (fall detected) or incorrect (sitting or walking). Due to the variable model parameters, a perfect match between the predicted and ground-truth bounding boxes is not possible. Object detection systems like YOLO are evaluated using the mean average precision (mAP). The mAP generates a score by comparing the ground-truth bounding box to the detected box. The model’s detection accuracy improves as the score increases. The detail of mAP calculation can be found in [18].
4 4.1
Implementation and Result Data
The data acquired will be organized into two distinct folders named raw images and annotations respectively. This collection currently has 474 photos in it, together with their bounding box in YOLOv8 format. This is just a temporary state. There are three different sorts of falls: detecting falls, sitting falls, and walking falls. It is possible for an image to contain one or more people, including persons who have fallen or who have not fallen, as demonstrated in Fig. 3.
70
P. H. Ng et al.
Fig. 3. Examples of falls dataset
4.2
Comparison Result of Different YOLOv5 Models over 100 Epochs
We are astonished that the model trained with YOLOv8l (large) can outperform the model trained with YOLOv8x (extra-large) by even a small margin. Therefore, we decided to retrain this model with an image size of 640 × 640 pixels and a batch size of 16. After re-training the model over 150 epochs, we skip the training and demonstrate the outcome of applying this model as the primary model for fall detection in the Streamlit-supported web application as described in the following section. It should be noted, however, that to enhance the user experience, we deploy additional versions and allow users to transition between versions for inference. It is beneficial for balancing performance and time. Upon conducting training on the YOLOv8 model with various iterations, namely YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, a comprehensive assessment of the outcomes can be undertaken based on precision, recall, [email protected], and [email protected]:.95. Precision denotes the ratio of accurate positive forecasts to all positive predictions, thereby signifying the model’s capability to precisely identify objects. With regard to precision, YOLOv8l attains the highest value, followed closely by YOLOv8m and YOLOv8n, whereas YOLOv8s exhibits the least precision among the five variations. Recall, on the other hand, gauges the ratio of accurate positive predictions to the total number of actual positive instances, indicating the model’s ability to appropriately capture all relevant objects. In terms of recall, YOLOv8x records the highest value, trailed by YOLOv8m, whereas YOLOv8s boasts the second-highest recall and YOLOv8l and YOLOv8n exhibit comparatively lower recall scores. [email protected], denoting the mean Average Precision, presents an encompassing evaluation of detection performance by considering both precision and recall across distinct IoU (Intersection over Union) thresholds. YOLOv8m garners the highest value according to [email protected], followed by YOLOv8l, while YOLOv8s and YOLOv8x manifest comparable [email protected] scores, and YOLOv8n showcases slightly lower value. [email protected]:.95, a more rigorous variant of mAP, encompasses a broader range of IoU thresholds, evaluating the model’s performance across diverse levels of object overlap. Considering [email protected]:.95, YOLOv8m emerges with the
Building an AI-Powered IoT App for Fall Detection Using Yolov8 Approach
71
Table 1. Evaluation five models of Yolov8 YOLOv8n Class all fall detected walking sitting
Images 111 111 111 111
Labels 114 72 23 19
P 0.826 0.982 0.659 0.838
R 0.613 0.742 0.667 0.431
[email protected] 0.797 0.925 0.695 0.769
[email protected]:.95 0.55 0.643 0.5 0.506
Images 111 111 111 111
Labels 114 72 23 19
P 0.753 0.841 0.782 0.636
R 0.797 0.917 0.598 0.875
[email protected] 0.8 0.947 0.721 0.733
[email protected]:.95 0.556 0.605 0.554 0.509
Images 111 111 111 111
Labels 114 72 23 19
P 0.84 0.743 0.865 0.912
R 0.833 0.917 0.722 0.859
[email protected] 0.885 0.908 0.82 0.928
[email protected]:.95 0.677 0.689 0.67 0.672
Images 111 111 111 111
Labels 114 72 23 19
P 0.844 0.851 0.683 1
R 0.73 0.917 0.778 0.495
[email protected] 0.855 0.945 0.728 0.89
[email protected]:.95 0.614 0.674 0.532 0.635
Images 111 111 111 111
Labels 114 72 23 19
P 0.717 0.719 0.673 0.758
R 0.854 0.958 0.686 0.917
[email protected] 0.84 0.931 0.732 0.859
[email protected]:.95 0.613 0.682 0.556 0.596
YOLOv8s Class all fall detected walking sitting YOLOv8m Class all fall detected walking sitting YOLOv8l Class all fall detected walking sitting YOLOv8x Class all fall detected walking sitting
Considering mAP@0.5:0.95, YOLOv8m again attains the highest value, followed by YOLOv8s, while YOLOv8l, YOLOv8x, and YOLOv8n display relatively lower mAP@0.5:0.95 scores. It is important to note that this ranking is based on the reported metrics and their values; the significance of each metric may vary depending on the specific use case or requirements. Based on the rankings within each metric, the overall order of the YOLOv8 variants can be given as YOLOv8m > YOLOv8l > YOLOv8s > YOLOv8x > YOLOv8n. In summary, the YOLOv8m variant consistently exhibits strong performance across multiple evaluation metrics.
4.3 Fall Detection Web Application Using Streamlit
Figures 4 and 5 illustrate the basic user interface of the application, which allows users to upload images from local directories or capture them from webcams. Users can also feed video from surveillance cameras in the local environment into the application. In addition, the app allows users to select which serving model to use as an option.
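A minimal sketch of how such an upload-and-detect page can be wired together with Streamlit and the Ultralytics YOLOv8 API is shown below; the weight file names and the page layout are illustrative assumptions rather than the authors' actual application code.

import streamlit as st
from PIL import Image
from ultralytics import YOLO

# Let the user switch between model versions, as described above.
weights = st.selectbox("Model", ["yolov8n.pt", "yolov8l.pt"])  # placeholder weight files
model = YOLO(weights)

uploaded = st.file_uploader("Upload an image", type=["jpg", "png"])
if uploaded is not None:
    image = Image.open(uploaded)
    results = model.predict(image)                  # run fall/sitting/walking detection
    annotated = results[0].plot()[:, :, ::-1]       # plot() returns BGR; convert to RGB
    st.image(annotated, caption="Detection result")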
Fig. 4. The user interface for the detection on image mode
Performing Detection on an Image. As seen in Fig. 4, the detection result is returned directly in the app. Also, it can be seen that YOLOv8 has provided excellent detection results, despite the fact that some individuals’ bodies are concealed.
Fig. 5. The user interface for the detection on video mode
Performing Detection on a Video. Although trained only with images of typical falls, the system was still able to detect falls with unusual trajectories, sudden changes in body orientation, or atypical postures. Figure 5 illustrates the user interface of the detection in video mode in the app.
Fig. 6. Two examples of the series of frames illustrating detection in video mode, captured by a smartphone camera
Performing Detection via a Webcam. In this instance, detection was conducted on a series of frames from videos of the primary author captured by a phone camera, and the results were once again promising (see Fig. 6). By connecting the system to the phone's camera instead of the laptop's, we achieved more precise and adaptable detection. This allows accidents to be detected not only in fixed areas but also in other locations, such as outdoor areas, where the system can still function effectively. However, there is still a problem with the system's connection to the camera, which occasionally results in missing frames when the subject falls, leading to inaccurate predictions.
5 Discussion and Conclusion
This paper has presented our app development pipeline, which integrates a powerful fall-detection model and can contribute to the resolution of urgent public health issues. More specifically, we employed YOLOv8, the most recent iteration of the YOLO series, which is more adaptable than previous single-stage detection algorithms. The findings above demonstrate that YOLOv8 produced superior results for fall detection: even if the bodies of some individuals in the image are blurry or obscured, a YOLOv8-based object detection system can detect them accurately. Additionally, at the staging level, the app works very well for small batches of uploaded images or videos. However, this web application still has some limitations, despite offering a beneficial solution to the fall detection problem. For instance, it incorrectly identifies whether a person has fallen when they are captured by a camera from far away. In addition, the speed of detection on video or camera input is still sluggish at the production level, because the web application must process videos frame by frame across numerous frame cuts. Nonetheless, this web application is moderately useful in both public and private settings, such as hospitals and airports with slow-moving crowds, as well as homes with slippery surfaces or obstacles in walkways. Improving the application's ability to detect the bodies of small or distant people in the camera's field of view remains work in progress.
References 1. World Health Organization: WHO, “Falls”. https://www.who.int/news-room/ fact-sheets/detail/falls 2. Rubenstein, L.Z.: Falls in older people: epidemiology, risk factors and strategies for prevention. Age Ageing 35(suppl2), ii37-ii41 (2006) 3. Igual, R., Medrano, C., Plaza, I.: Challenges, issues and trends in fall detection systems. Biomed. Eng. Online 12(1), 66 (2013) 4. Mubashir, M., Shao, L., Seed, L.: A survey on fall detection: principles and approaches. Neurocomputing 100, 144–152 (2013) 5. LeCun, Y., Bengio, Y., Hinton, G.E.: Deep learning. Nature 521(7553), 436–444 (2015) 6. Redmon, J.: You Only Look Once: Unified, Real-Time Object Detection (2016). Redmon You Only Look CVPR 2016 paper.html 7. Bochkovskiy, A.: YOLOv4: Optimal Speed and Accuracy of Object Detection, arXiv.org, (23 Apr 2020) 8. Chen, K.-H., Hsu, Y.-W., Yang, J.-J., Jaw, F.-S.: Enhanced characterization of an accelerometer-based fall detection algorithm using a repository. Instrum. Sci. Technol. (2016) 9. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision(ICCV), pp. 1440-1448 (Dec 2015) 10. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017) 11. Lu, N., Wu, Y., Feng, L., Song, J.: Deep learning for fall detection: threedimensional CNN combined with LSTM on video kinematic data. IEEE J. Biomed. Health Inform. 23(1), 314–323 (2019) 12. Chen, Y., Li, W., Wang, L., Hu, J., Ye, M.: Vision-based fall event detection in complex background using attention guided bi-directional LSTM. IEEE Access 8, 161337–161348 (2020) 13. Zhang, J., Wu, C., Wang, Y.: Human fall detection based on body posture spatiotemporal evolution. Sensors 20(3), 946 (2020) 14. Zhu, Y., Zhang, Y.P., Li, S.S.: Fall detection algorithm based on deep vision sensor and convolutional neural network. Opt. Technique 47(1), 56–61 (2021) 15. Cao, J.R., Lu, J.J., Wu, X.Y.: Fall detection algorithm combining motion features and deep learning. Comput. Appl. 41(2), 583–589 (2021) 16. Elderly Fall Detection Based on Improved YOLOv5s Network. IEEE J. Mag. IEEE Xplore (2022) 17. Nguyen, H., Nguyen, A., Mai, A., Tam, D.N.: AI-app development for Yolov5based face mask wearing detection. In: NAFOSTED Conference on Information and Computer Science (NICS), pp. 294–299 (2022) 18. Henderson, P., Ferrari, V.: End-to-end training of object class detectors for mean average precision. arXiv (2016)
Seam Puckering Level Classification Using AIoT Technology Duc Dang Khoi Nguyen1,4 , Tan Duy Le1,4 , An Mai1,4 , Thi Van Khanh Nguyen2,4(B) , Song Thanh Quynh Le2,4 , Duc Duy Nguyen3,4 , and Kha-Tu Huynh1,4 1
School of Computer Science and Engineering, International University, Ho Chi Minh City, Vietnam 2 Department of Textile and Garment Engineering, Faculty of Mechanical Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam [email protected] 3 Department of Industrial Systems Engineering, Faculty of Mechanical Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam 4 Vietnam National University, Ho Chi Minh City, Vietnam
Abstract. Seam puckers, or seam wrinkles, are defects that commonly occur in garment manufacturing. Despite the advances of modern sewing technologies, such defects are unavoidable. As a result, manufacturers need quality management staff to check whether puckers appear and to evaluate their level, a task that consumes manufacturers' time, effort, and human resources. To address this issue, we propose a framework that uses Deep Learning Neural Networks to classify pucker levels from fabric images, combined with a low-cost IoT technology that reduces the cost of the manufacturing process, as it only requires a phone that can connect to the Internet. Moreover, this study also contributes a dataset of seam puckers for training the models. Our work provides a performance analysis of several Deep Learning neural network models to determine the best performer. As a result, MobileNetV2 is the most promising model, with a highly encouraging precision approaching 80%. Using MobileNetV2, a web application is developed for controlling the phone's camera and storage to capture fabric images and submit them to the server for evaluation. By applying our proposed framework, garment manufacturers can save money, time, effort, and human resources, as the evaluation process becomes more efficient. Keywords: Seam pucker classification · Computer Vision · Image Processing · Image Classification · Deep Learning Neural Network · Internet of Things (IoT)
1 Introduction
Seam wrinkles, a long-recognized issue, diminish a product's aesthetic value and have attracted the interest of numerous researchers. The growing demand for high-quality garment products has made addressing this problem a priority and a key concern for businesses. Seam wrinkling, or seam puckering, is recognized as one of the most important product quality control criteria in the garment manufacturing industry [1,2]. There have been many studies on reducing and overcoming the phenomenon of seam puckering [3–6], but in practice it remains a problem that businesses care strongly about. Seam puckering typically occurs when producing product categories such as shirts, pants, and jackets. Many factors affect seam puckering, such as fabric composition and structure, sewing thread, processing conditions, technological parameters, and sewing equipment, to mention a few [6]. At present, the assessment of wrinkle severity is still based on human perception and requires experienced quality management staff [7]. The appraisal of seam puckering is mainly based on subjective evaluation methods, namely AATCC 88B [8] and ISO 7770:2009 [9]. These methods rely on three-dimensional (3D) and two-dimensional (2D) photographic standards that categorize seam-puckering grades into five levels, ranging from 1 (the worst quality) to 5 (the best quality) [10]. As seam pucker evaluation methods work on 2D and 3D images, utilizing Computer Vision in this area is highly advantageous and reduces time and effort. Computer Vision is a field of Artificial Intelligence (AI) that uses algorithms and Deep Learning models to extract attributes from images, understand image content by analyzing the collected features, and make decisions or predictions. It has shown significant performance and remarkable development in recent years, with widespread application across various areas, including the classification of skin cancer [11] and end-to-end learning for self-driving cars [12], among other problems that can be visualized. Along with Computer Vision, IoT technologies, which refer to the network of physical devices embedded with sensors, software, and connectivity capabilities that enable them to collect and exchange data over the Internet, have also become increasingly widespread, especially as AI can be integrated into IoT to build novel Intelligent Systems applied to human lives. By combining AI with IoT technologies, smart devices can collect and exchange data, make intelligent decisions, and take autonomous actions. Building on this powerful synergy of AI and IoT, this research proposes a framework that combines Computer Vision and IoT technology to classify the levels of seam puckering. The framework gives the apparel manufacturing industry a low-cost IoT solution that reduces the time-consuming effort of identifying and classifying seam puckering by leveraging the automation of Computer Vision and the convenience of IoT.
The main contributions of this paper can be summarized as follows:
• A dataset of 1087 labelled images of seam puckers. The dataset can be accessed at https://github.com/khoindd2000/seam-pucker-dataset.
• A comparison between 5 Deep Learning models to provide an overview of which model is currently the preferred choice for this classification problem.
• The implementation of an IoT framework for automatically classifying the pucker level of a given seam line.
The remainder of the paper is organized as follows: Sect. 2 presents the background and related works. Section 3 describes the methodology used to design the model. Section 4 discusses the findings from using Deep Learning Neural Networks to classify pucker levels. Section 5 concludes the paper by highlighting the research findings and contribution, limitations, and future directions.
2 Background and Related Works
There have been many studies on analyzing seam puckers, using either traditional methods or Artificial Intelligence as a backbone to solve the problem. As mentioned above, current seam puckering evaluation relies mainly on human perception based on the AATCC 88B and ISO 7770:2009 subjective methods. This approach presents several issues, including the potential for misclassification when many products must be inspected, the time consumed, and the human resources spent to ensure product quality. Besides these standard methods for evaluating seam pucker, thickness measurement and the fabric's change in length can also be used to indicate the extent of puckering; these methods were considered better than subjective evaluation because they have explicit formulas for calculating the percentage of seam pucker. However, they are inconsistent and consume more time [1]. In addition, there have been studies on utilizing Artificial Intelligence in the process of evaluating seam pucker, such as using Self-Organizing Maps (SOMs) [13], which are trained using unsupervised learning; this method can learn patterns and structures in the data without the need for labelled training examples, and the classification accuracy for each grade of the model in that paper is relatively high. Using this method, an objective evaluation of seam pucker is achieved instead of the traditional subjective assessment. However, SOMs are primarily used for unsupervised learning and clustering tasks; they might not effectively capture the intricate patterns required for precise classification. Moreover, in practice, the issue should be addressed on fabric pieces with different colours and positions of seam lines to enhance the model's generalization ability. Although the SOM model achieved high accuracy in that research, all samples in the dataset were mapped to monochrome images for training and classification, which might result in a model that is only good at predicting grey-level images, while some fabrics with dark colours, once converted to grey-level images, could make it harder for the model to make a prediction.
Another study proposes evaluating seam pucker using an Adaptive Neuro-Fuzzy Inference System (ANFIS), which combines fuzzy logic with the architectural design of a neural network to establish the relationship between seam pucker grades and textural features of seam pucker images [14]. However, this method has several limitations when applied to the seam puckering classification problem. Firstly, ANFIS relies on manual feature engineering, which can be time-consuming and may struggle to capture complex image patterns and variations. Secondly, similar to the research using SOMs above, this work also transformed the samples to grey-level images; as a result, the model might not have enough perspective on the problem, and its generalization ability might not be as good as expected. Our research uses Deep Neural Networks, i.e., Convolutional Neural Networks (CNNs), which are specially designed for analyzing and classifying images. They have proven effective in capturing spatial dependencies and extracting meaningful features from images. Besides, we also deploy the model in an IoT system to provide an inexpensive IoT solution that helps garment industries reduce the effort and human resources needed in the quality management process.
3 Methodology
3.1 Framework Architecture
Fig. 1. Overview Seam Pucker Classification System Framework.
The framework consists of three sequential operations: (1) receiving input, (2) the prediction process, and (3) displaying the returned output, as shown in Fig. 1. The proposed system uses a web application to manipulate a phone's camera and storage to capture raw images and send them to the server for further processing. In addition, the embedded system allows users to send an image to the server directly from the phone's storage for prediction, and it also displays the final predicted result to the user. Once the server receives the image, it is resized and transformed into a tensor. After a few more preprocessing steps, the tensor can be fed into the trained model for prediction; the model takes the tensor as input and returns the predicted value. By applying this framework, we can significantly reduce the time and effort of workers responsible for ensuring the quality of seam puckering within the garment manufacturing process.
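A minimal sketch of this server-side prediction step, assuming a PyTorch model and a 224 × 224 input size (both assumptions for illustration, since the exact preprocessing is not listed in the paper):

import torch
from PIL import Image
from torchvision import transforms

# Assumed preprocessing: resize to a fixed size, convert to a tensor.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def predict_pucker_level(image_path, model):
    image = Image.open(image_path).convert("RGB")
    tensor = preprocess(image).unsqueeze(0)      # add the batch dimension
    with torch.no_grad():
        logits = model(tensor)
    return int(logits.argmax(dim=1)) + 1         # map class index 0-4 to level 1-5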
3.2 Custom Dataset
Due to the lack of an available seam puckering image dataset, we created one by using a Juki DDL 8100e sewing machine to stitch along random directions on 10 cm × 10 cm fabric pieces; the stitching was performed at the sewing workshop of Ho Chi Minh City University of Technology (HCMUT). In the sampling process, we applied several techniques to collect samples with different densities of wrinkles, including changing the speed of the sewing machine, configuring the stitching density as well as the thread tension, and controlling the speed of the sewing machine feed dog (see https://sewingmachinebuffs.com/what-is-a-feed-dog-on-the-sewing-machine for a definition). The fabric pieces with stitches were then photographed while ensuring that the shadows of the wrinkles were preserved, since the density of shadows on the fabric is used for labelling. In the sampling process, we used fabric and sewing thread with varying materials and colours, combined with capturing images of seam lines from different directions by rotating the fabrics, to obtain a diverse perspective on wrinkles for the training set (26 combinations of fabric and sewing thread, for a total of 1087 samples). As a result, the model can gather a broader range of information to enhance prediction accuracy, as shown in Fig. 2. After collecting the samples, the next step was to label the puckering level of each sample based on the rating scales of the ISO 7770 [9] and AATCC 88B [8] standards, which evaluate the appearance of the seam from level 1 (the worst quality) to level 5 (the best quality). The standard images for the seam puckering levels are shown in Fig. 3. Once the sampling process was finished, resulting in a labelled dataset based on established standards, the dataset was split into a training set (80%) and a test set (20%). To ensure that the models learn as much as possible in the training process, we applied several augmentation methods to the training dataset, including randomly rotating the image between −180° and 180° and randomly flipping it horizontally. By utilizing these methods, the models are exposed to a
broader range of data samples which helps to improve their generalization ability and robustness. In addition, these augmentation methods can help prevent overfitting and improve the models’ performance by providing more diverse and representative training examples.
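Assuming a PyTorch data pipeline (which the paper uses for training, see Sect. 4.1), the augmentation described above can be expressed with torchvision transforms roughly as follows; the resize step and output size are illustrative assumptions:

from torchvision import transforms

# Random rotation in [-180, 180] degrees and a random horizontal flip,
# applied only to the training split.
train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=180),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])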
Fig. 2. Examples of Seam images at different levels.
Fig. 3. Standard image for seam puckering levels [15].
3.3 Models for Seam Puckering Level Classification
When it comes to Computer Vision tasks such as image classification, object detection, and image segmentation, it is undeniable that Convolutional Neural Networks (CNNs) have proved their significant performance. CNNs are a type of deep learning model designed to extract features from input images; their architecture consists of convolutional layers where image attributes are extracted, pooling layers that retain the most important features while reducing the computational complexity of the network, and fully connected layers where the final classification is performed based on the features extracted by the previous layers. This paper utilizes five CNN-based models trained on the seam puckering level dataset: EfficientNetB0, ResNet50, MobileNetV2, VGG19, and
VGG16. This experiment aims to find the most effective model, one that can understand the attributes of seam puckering and provide predictions with the highest performance at a low cost, so that it can be served efficiently and later deployed on an IoT device. The number of units in the output layer of each of the above models is set to 5, according to the 5 levels of seam puckering.
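A sketch of how the output layer of a pre-trained backbone can be replaced with a 5-unit classifier in PyTorch; the layer names follow torchvision's model definitions, and the weight identifiers are only examples:

import torch.nn as nn
from torchvision import models

num_levels = 5  # pucker levels 1-5

# MobileNetV2: the classifier is a Sequential whose last layer is Linear.
mobilenet = models.mobilenet_v2(weights="IMAGENET1K_V1")
mobilenet.classifier[-1] = nn.Linear(mobilenet.last_channel, num_levels)

# ResNet50: the final fully connected layer is called fc.
resnet = models.resnet50(weights="IMAGENET1K_V1")
resnet.fc = nn.Linear(resnet.fc.in_features, num_levels)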
3.4 System Implementation
The proposed system is a combination of a phone and a web-based application with the ability to interact with the phone’s camera and storage. This web application is built with Streamlit, an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science [16]. The application provides users two ways to proceed with the classification process, capturing raw images or submitting an available image in the phone’s storage. The images can be sent through the system to the web server containing the trained model. The prediction process is performed on the server, and the resulting prediction is then returned and displayed. This system significantly streamlines the image processing and classification workflow and provides users with a straightforward and user-friendly interface.
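A minimal sketch of the client side, assuming a recent Streamlit version with camera support and a hypothetical server endpoint (the URL and response fields below are placeholders, not the authors' deployment):

import requests
import streamlit as st

SERVER_URL = "http://example.com/predict"   # placeholder endpoint

# Two input paths, as described above: live capture or an image from storage.
captured = st.camera_input("Capture the seam line")
uploaded = st.file_uploader("...or choose an image", type=["jpg", "jpeg", "png"])

image_file = captured or uploaded
if image_file is not None:
    response = requests.post(SERVER_URL, files={"image": image_file.getvalue()})
    st.write("Predicted pucker level:", response.json().get("level"))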
4 Result and Evaluation
4.1 Result
In this paper, Google Cloud services were used to carry out the experiments. Specifically, Google Colab was used for training, as it offers the most convenient environment for developing deep learning models. The main compute resource was a Tesla V100 GPU, and the models were trained with PyTorch 2.0. After importing all the required libraries, configuring the train and test data loaders, and changing the number of units in the output layer of every model to 5, the training process was ready to proceed. The training of each model ran for 50 epochs with a batch size of 16; the Adam optimizer and the cross-entropy loss were used throughout, since the goal is to classify seam puckering levels, a multiclass classification problem, and the learning rate was set to 0.001. After running all the training processes, we calculated several metrics to evaluate the performance of each model, including Precision, Recall, F1-score, and Training time in minutes. The results are summarized in Table 1. As shown in Table 1, while EfficientNetB0, ResNet50, and MobileNetV2 achieve the highest performance, with results very close to each other, VGG19 and VGG16 do not catch up with the others and their precision is quite low. Among EfficientNetB0, ResNet50, and MobileNetV2 it is hard to choose one for the classification task, as they all have their advantages, but it is clear from Table 1 that
Table 1. Model performance on the seam puckering images dataset

Model Name      Precision (%)  Recall (%)  F1-score (%)  Training Time (minutes)
EfficientNetB0  70.00          68.29       68.02         35.20
ResNet50        75.66          75.27       74.47         33.60
MobileNetV2     77.02          75.81       76.19         32.90
VGG16           4.63           20.00       7.50          35.00
VGG19           4.63           20.00       7.50          35.30
MobileNetV2 has robustly higher performance. That is why this paper proposes to use MobileNetV2 for prediction in practice. With the decision to use MobileNetV2 for classifying seam puckering images, the framework is ready to proceed. As mentioned earlier, the embedded system utilizes a web application to manipulate the phone's camera and storage to capture images of the seam line on the fabric. The MobileNetV2 model serves as the backbone network, responsible for extracting the features of the image and analyzing them to predict the corresponding puckering level; the result is then displayed in the web application. For example, Fig. 4 demonstrates the two ways of running the puckering classification process: users can capture a raw image of the seam line, and the result is immediately shown below the image (Fig. 4a); alternatively, users can select an available image from their phone's storage and submit it to the application, and the result is promptly returned (Fig. 4b).
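The training configuration reported in Sect. 4.1 (50 epochs, batch size 16, Adam, cross-entropy loss, learning rate 0.001) corresponds to a loop of roughly the following shape; the model and the data loaders are assumed to be defined elsewhere:

import torch
import torch.nn as nn

# train_loader and model are assumed to be set up as described in Sect. 4.1.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(50):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()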
(a) Capturing image directly
(b) Choosing an image from storage
Fig. 4. The returned result in embedded system application
4.2 Evaluation
The achieved results highlight the successful development of a framework that uses Computer Vision with IoT technologies for classifying seam puckering. Currently, the model's classification precision may not be sufficiently high for use in practice. However, at 77.02%, and with the relatively limited size of the current set of seam images, misclassification occurs mainly between two adjacent levels. In addition, the recall of MobileNetV2 is higher than that of the other models at 75.81%, meaning that it correctly classifies a larger proportion of the relevant samples. Furthermore, as shown in Table 1, the F1-score of MobileNetV2 is also the highest at 76.19%, showing that this model achieves a better harmonic mean of precision and recall. Within the seam pucker evaluation process of today's textile industry, this framework brings the following advantages:
• Low cost, as it only requires a phone with an Internet connection, and the resources for training the model are relatively inexpensive.
• Short training time and high mobility, resulting in time and effort savings; it can be considered a deploy-once-use-many framework.
• When applied in manufacturing, the framework significantly reduces the time and effort spent classifying or detecting seam pucker.
5 Discussion, Conclusion, and Future Works
It can be observed that the model's accuracy needs to be enhanced for further use, and this can be achieved by increasing the number of seam pucker images in the dataset. Therefore, the first item of future work is to enlarge the dataset, enabling the model to learn more and improve its accuracy. Classification is only an initial step in evaluating the pucker of a seam line; to make the classification more reliable, the framework should also detect and highlight the puckers of the seam lines on the fabrics. Besides, the training process can be enhanced for better performance by utilizing Generative Adversarial Networks (GANs) to automate parts of the training process. The potential of this framework is significant, as it can help reduce the workload of classifying and detecting seam puckering. It can be used by industries seeking low-cost methods to improve manufacturing processes when applying Industry 4.0 technologies. In addition, the framework can be utilized to inspect batches of clothing goods before they are imported or exported, saving time and costs in the transportation stage. The current user interface of the embedded system application is still simple, so in the future we will improve it and make it smoother and more user-friendly to enhance the user experience.
References 1. Hati, S., Das, B.: Seam pucker in apparels: a critical review of evaluation methods. Asian J. Textile 1(2), 60–73 (2011) 2. Shiloh, M.: The evaluation of seam-puckering (1971) 3. Chen, D., Cheng, P., Li, Y.: Investigation of interactions between fabric performance, sewing process parameters and seam pucker of shirt fabric. J. Eng. Fibers Fabr. 16, 15589250211020394 (2021) 4. Kamali, R., Mesbah, Y., Mousazadegan, F.: The impact of sewing thread’s tensile behavior and laundering process on the seam puckering of elastic and normal fabrics. Inter. J. Clothing Sci. Technol. 33(1), 13–24 (2021) 5. Mousazadegan, F., Latifi, M.: Investigating the relation of fabric’s buckling behaviour and tension seam pucker formation. J. Textile Inst. 110(4), 562–574 (2019) 6. Juci˙en˙e, M., Dobilait˙e, V.: Seam pucker indicators and their dependence upon the parameters of a sewing machine. Inter. J. Clothing Sci. Technol. 20(4), 231–239 (2008) 7. Pan, R., Gao, W., Li, W., Xu, B.: Image analysis for seam-puckering evaluation. Text. Res. J. 87(20), 2513–2523 (2017) 8. Smoothness of Seams in Fabrics after Repeated Home Laundering, AATCC Std. 88B (2011) 9. Textiles - Test method for assessing the smoothness appearance of seams in fabrics after cleansing, ISO Std. 7770 (2009) 10. Pan, R., Gao, W., Li, W., Xu, B.: Image analysis for seam-puckering evaluation. Text. Res. J. 87, 10 (2016) 11. Esteva, A.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), 115–118 (2017) 12. Bojarski, M., et al.: End to end learning for self-driving cars, arXiv preprint arXiv:1604.07316 (2016) 13. Mak, K., Li, W.: Objective evaluation of seam pucker on textiles by using selforganizing map. IAENG Inter. J. Comput. Sci. 35(1) (2008) 14. Mak, K.-L., Li, W.: Objective evaluation of seam pucker using an adaptive neurofuzzy inference system. In: VISAPP 2008-3rd International Conference on Computer Vision Theory and Applications, Proceedings (2008) 15. Jana, P., Khan, N.A.: The sewability of lightweight fabrics using x-feed mechanism. Inter. J. Fashion Design Technol. Educ. 7(2), 133–142 (2014) 16. Streamlit. Streamlit documentation. https://docs.streamlit.io/
Classification of Pneumonia on Chest X-ray Images Using Transfer Learning Nguyen Thai-Nghe1 , Nguyen Minh Hong2 , Pham Thi Bich Nhu1 , and Nguyen Thanh Hai1(B) 1 Can Tho University, 3-2 Street, Can Tho City, Vietnam {ntnghe,nthai}@cit.ctu.edu.vn, [email protected] 2 Can Tho University of Medicine and Pharmacy Hospital, 3-2 Street, Can Tho City, Vietnam [email protected]
Abstract. Pneumonia is a severe illness, particularly affecting infants, young children, individuals above 65 years old, and those with compromised health or weakened immune systems. Pneumonia is a dangerous disease and often causes death if not detected and treated promptly. Extensive research has revealed various pathogens, including bacteria, viruses, and fungi, as potential causes of pneumonia. Furthermore, the global spread of COVID-19 has had devastating impacts on the global economy and public health. Therefore, an accurate machine learning-based application for pneumonia diagnosis would significantly save time and resources and enable timely treatment, reducing the risk of complications. This study proposes a transfer learning approach for pneumonia classification. Specifically, this work utilizes a pre-trained model (e.g., the VGG16 model) that already has very good parameters learned on a large dataset. Based on the pre-trained model, we removed the last layer and replaced it with new fully connected layers and an output layer to fit the pneumonia classification problem, then re-trained and fine-tuned the model to classify pneumonia diseases. We collected X-ray images from a variety of data sources to build a dataset with three classes: Normal, COVID-19, and Viral pneumonia. Experimental results on a dataset of 2500 X-ray images show that the transfer learning approach can improve the accuracy of the prediction model. Keywords: Classification of pneumonia · X-ray images · Transfer learning · VGG16 · ResNet50
1 Introduction
Pneumonia is an acute pulmonary infection that can be caused by bacteria, viruses, or fungi; it infects the lungs, causing inflammation of the air sacs and pleural effusion, a condition in which the lung fills with fluid. It accounts for more than 14% of deaths in children under the age of five years [1]. Pneumonia is most common in underdeveloped and developing countries, where overpopulation, pollution, and unhygienic environmental conditions exacerbate the situation and medical resources are scanty. Therefore, early diagnosis and management can play a pivotal role in preventing the disease from becoming fatal.
Radiological examination of the lungs using computed tomography (CT), magnetic resonance imaging (MRI), or radiography (X-rays) is frequently used for diagnosis. X-ray imaging constitutes a non-invasive and relatively inexpensive examination of the lungs [2]. However, chest X-ray images are a difficult task even for specialist radiologists. The appearance of pneumonia in X-ray images is often uncertain, it can be confused with other diseases and behave like many other benign abnormalities. These inconsistencies have caused important subjective decisions and variations among radiologists in the diagnosis of pneumonia [3]. Therefore, computerized support systems are needed to help radiologists diagnose pneumonia from chest X-ray images. In medical imaging, the correct diagnosis or evaluation of a disease depends on both image acquisition and image interpretation. Image acquisition has recently begun to take advantage of computer technology, as image acquisition has improved significantly in recent years and devices are collecting data at faster and higher resolution [4]. Object recognition, perception and segmentation performance results were found to give better results compared to manual diagnosis. In deep learning, radiology imaging has generally shown similar performance improvements in the field of medical image analysis for the detection and segmentation tasks of human anatomical or pathological structures [5]. Thus, the task of automated medical image classification has grown significantly [6]. This task aims to diagnose medical images into predefined classes. Recently, Deep Learning (DL) has become one of the most common and widely used methods for developing medical image classification tasks [7]. Further, DL models produced more effective performance than traditional techniques using chest X-ray images from pneumonia patients [8, 9]. This work proposes an approach to enhance the diagnosis of pneumonia by using transfer learning. First, the input images are pre-processed, then transfer learning models are employed. In this transfer learning approach, we use the pre-trained deep learning model (such as the VGG16), customize and re-train the model using chest X-ray images to identify pneumonia. The rest of the paper is organized as follows: Sect. 2 provides a review of related works. In Sect. 3, dataset and the proposed method are presented. The pneumonia classification performance of the proposed method is given in Sect. 4. Lastly, the conclusion provides future scope in Sect. 5.
2 Related Works Over the past decade, many researchers have automatically used deep learning to detect lung infections and diseases from chest X-ray. For example, [10] explored different CNN architectures, aiming to obtain the best possible results. Furthermore, they analyzed the outcomes separately with and without data augmentation techniques. The findings revealed that when data augmentation was applied, the accuracy of the model improved significantly, reaching 83.38%. In contrast, when using the original dataset without augmentation, the accuracy was observed to be 80.25%. In [11], Asif Iqbal Khan et al. propose CoroNet, a deep neural network model for automatic detection of COVID-19 infection from chest X-ray images. The proposed model is based on the Xception architecture, which is pretrained on the ImageNet dataset, and
fine-tuned on a dataset prepared by collecting COVID-19 and other pneumonia chest X-ray images from different publicly available databases. CoroNet is trained and tested on the prepared dataset, and the experimental results demonstrate that the proposed model achieves high accuracy for the 3-class classification (COVID-19 pneumonia versus bacterial pneumonia versus normal), with a classification accuracy of 95%. The work in [12] utilized the pretrained InceptionV3 model to extract image embeddings and an artificial neural network for classification. The mentioned architecture is capable of accurately and proficiently classifying and distinguishing various respiratory diseases, achieving an exceptionally high accuracy of 99.01%. In another study [13], Mohammad Khan et al. proposed a performance analysis of various network models such as 2D CNN, ResNet-50, Inception ResNet V2, InceptionV3, DenseNet 121, and MobileNet V2 for detecting COVID-19 from chest X-ray images. The dataset consisted of 2905 chest X-ray images categorized into three types: COVID-19-affected images (219 cases), images affected by viral pneumonia (1345 cases), and normal chest X-ray images (1341 cases). The results showed that ResNet-50 exhibited excellent performance in classifying the different cases, with an overall accuracy of 96.91%. Furthermore, Thanh-Nghi Do et al. [14] proposed training a Support Vector Machine (SVM) model on top of deep networks such as DenseNet 121, MobileNet V2, InceptionV3, Xception, ResNet50, VGG16, and VGG19 to detect COVID-19 from chest X-ray images. The classification results for different lung conditions (consolidation and viral pneumonia) showed that SVM-on-Top achieved the highest classification accuracy of 96.57%. In a study [15], the authors preprocessed the data to make it suitable for transfer learning tasks. Different pretrained convolutional neural network (CNN) architectures, including VGG16, InceptionV3, and ResNet50, were utilized, and ensembles were created by combining a CNN with Inception-V3, VGG-16, and ResNet50. Their results indicated that Inception-V3 combined with the CNN displayed superior performance compared to the other models, obtaining 99.29% accuracy. In this work, we propose a transfer learning approach for pneumonia classification. Specifically, this work utilizes a pre-trained model (e.g., the VGG16 model) that already has very good parameters learned on a large dataset. Based on the pre-trained model, we removed the last layer and replaced it with new fully connected layers and an output layer to fit the pneumonia classification problem, then re-trained and fine-tuned the model to classify pneumonia diseases.
3 Dataset and Methods
3.1 Dataset
The dataset has been collected from the chest X-ray pneumonia image dataset on Kaggle (https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia, accessed in April 2023) and the Chest X-ray dataset [16, 17]. After selection, we used 2500 images of different sizes for three classes: Normal (710 images), Viral pneumonia (710 images), and COVID-19 (1080 images). Figure 1 displays chest X-ray images of (a) normal (healthy), (b) COVID-19, and (c) Viral pneumonia patients from the selected dataset.
Fig. 1. Representative chest X-ray images of (a) normal (healthy), (b) COVID-19 and (c) Viral pneumonia patients.
3.2 Proposed Method
A. Data Pre-processing and Augmentation
In the dataset, the samples are resized to 224 × 224 pixels and converted to grayscale images. The contrast of the images is enhanced to highlight the lung regions, and the image quality is adjusted to 900 dpi. Each image is then merged into three channels, resulting in an input size of 224 × 224 × 3, representing the height, width, and number of channels; the number of channels is three, corresponding to the red, green, and blue channels. After resizing the images, data augmentation is performed to increase the size of the dataset to fit the VGG16 model. Subsequently, all images in the dataset are labeled once, where each class is transformed into a binary feature; this process helps the machine learning algorithm understand the format of the input and thus perform better. In addition, this study also employs data augmentation, a widely used technique to generate the necessary number of samples. Since the dataset is relatively small, we apply several image enhancement techniques to expand the scale of our training data. Data augmentation follows a sequence of steps, starting with rotation, followed by the scaling, flip_left_right, and flip_top_bottom functions, and finally the random_distortion function and sample ratio. Additional normal images are generated to balance the dataset, helping to reduce the model's bias towards the majority
class. Specifically, we applied image augmentation by randomly rotating the image by up to 20°, randomly scaling the image in a range from 0.9 to 1.1, and translating the image in height and width proportionally in a range from −0.1 to 0.1. The overall architecture and the proposed approach are described in Figs. 2 and 3, respectively. In this study, the last layer of the pre-trained VGG16 model is removed and replaced with new fully connected layers and an output layer to classify two types of disease, COVID-19 and Viral pneumonia, as well as normal images.
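The paper does not state which framework implements this augmentation, so the following sketch assumes a Keras/TensorFlow pipeline; the parameter values map to the rotation, scaling, and shift ranges described above:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# rotation_range=20      -> random rotation of up to 20 degrees,
# zoom_range=0.1         -> random scaling in [0.9, 1.1],
# shift ranges of 0.1    -> translation of up to 10% of height/width.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    zoom_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1,
)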
Fig. 2. General schema for Classification of Pneumonia disease.
The reason for using transfer learning is that we can utilize a pre-trained model, e.g., the VGG16 model. This model has been trained on the ImageNet dataset with over a million labeled images belonging to 1000 different classes and thus has very good parameters. It is a good choice to reuse these parameters and fine-tune the model on new datasets to achieve better results.
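One common way to realize the modification described above, sketched here under the assumption of a Keras/TensorFlow implementation (the size of the new fully connected layer is illustrative, as the paper does not specify it):

from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG16

# Load VGG16 with ImageNet weights and drop its original classification head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False        # freeze the convolutional base for the first training phase

x = layers.Flatten()(base.output)
x = layers.Dense(256, activation="relu")(x)          # new fully connected layer (size assumed)
outputs = layers.Dense(3, activation="softmax")(x)   # Normal, COVID-19, Viral pneumonia

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])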
4 Experimental Results
The dataset was divided into a training set comprising 80% of the images (2000 images) and a test set comprising 20% of the images (500 images). We used ResNet-50 [20] for comparison. Experimental results are presented in Table 1 and Fig. 4. From these results, we can see that after pre-processing the data, the model improves in prediction accuracy and F1-score. Moreover, the transfer learning approach achieves better results than the baseline, e.g., ResNet-50. Figure 5 presents the confusion matrix, and Fig. 6 presents the accuracy and the loss during the training and validation process. We can observe that the accuracy and loss curves are relatively stable throughout the training process. This indicates that the model has good generalization ability and is able to mitigate overfitting, the phenomenon of a model performing well on the training data but poorly on unseen data.
Fig. 3. Transfer learning approach for Classification of Pneumonia disease.
Table 1. Experimental result comparisons

                    Original Data                     Preprocessed Data
Models              Train acc.  Test acc.  F1         Train acc.  Test acc.  F1
Proposed approach   0.90        0.91       0.91       0.97        0.95       0.95
ResNet-50           0.93        0.88       0.88       0.98        0.92       0.92
Fig. 4. Comparison of the proposed approach and the ResNet-50.
Fig. 5. Confusion matrices of the chest X-ray dataset using transfer learning with VGG16.
Fig. 6. Accuracy and loss on the training and validation sets using transfer learning.
5 Conclusions
This study proposes a transfer learning approach for pneumonia classification. Specifically, this work utilizes a pre-trained model (e.g., the VGG16 model) that already has very good parameters learned on a large dataset. We then removed the last layer and replaced it with new fully connected layers and an output layer to fit the new dataset, and re-trained and fine-tuned the model to classify pneumonia diseases. We collected X-ray images from a variety of data sources to build a dataset with three classes: Normal, COVID-19, and Viral pneumonia. Experimental results on a dataset of 2500 X-ray images show that the transfer learning approach can improve the accuracy of the prediction model. In future research directions, we will explore and apply various data preprocessing techniques to further improve the predictive performance of the model. Additionally, the authors in [18] proposed a disease diagnosis support method using trained models
and the Grad-CAM (Gradient Class Activation Mapping) technique [19]. This forms the basis for our investigation into applying Grad-CAM to assist in localizing the disease area on X-ray images, enabling the selection and application of the appropriate model to accurately identify specific disease regions. Furthermore, it is necessary to collect additional data and expand the range of diseases not only on X-ray images but also on CT scans and MRI images.
References 1. World Health Organization: Pneumonia, KEY Facts (2021). https://www.who.int/news-room/ fact-sheets/detail/pneumonia. Accessed 20 Feb 2022 2. Raoof, S., Feigin, D., Sung, A., Raoof, S., Irugulpati, L., Rosenow, E.C.: 3rd Interpretation of plain chest roentgenogram. Chest 141(2), 545–558 (2012). https://doi.org/10.1378/chest. 10-1302 3. Ayan, E., Ünver, H.M.: Diagnosis of pneumonia from chest X-ray images using deep learning. In: Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT), Istanbul, Turkey, pp. 1–5. IEEE (2019). https://doi.org/10.1109/EBBT.2019. 8741582 4. Greenspan, H., van Ginneken, B. Summers, R.M.: Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Trans. Med. Imaging 35(5), 1153–1159 (2016). https://doi.org/10.1109/TMI.2016.2553401 5. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-Ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 3462–3471. IEEE (2017). https://doi.org/10.1109/CVPR. 2017.369 6. Wang, L., et al.: Trends in the application of deep learning networks in medical image analysis: Evolution between 2012 and 2020. Eur. J. Radiol. 146, 110069 (2022) 7. Singhal, A., Phogat, M., Kumar, D., Kumar, A., Dahiya, M., Shrivastava, V.K.: Study of deep learning techniques for medical image analysis: A review. Mater. Today Proc. 56, 209–214 (2022) 8. Ben Atitallah, S., Driss, M., Boulila, W., Koubaa, A., Ben Ghezala, H.: Fusion of convolutional neural networks based on Dempster–Shafer theory for automatic pneumonia detection from chest X-ray images. Int. J. Imaging Syst. Technol. 32, 658–672 (2022) 9. Iori, M., et al.: Mortality prediction of COVID-19 patients using radiomic and neural network features extracted from a wide chest X-ray sample size: A robust approach for different medical imbalanced scenarios. Appl. Sci. 12, 3903 (2022) 10. Khoiriyah, S.A., Basofi, A., Fariza, A.: Convolutional neural network for automatic pneumonia detection in chest radiography. In: International Electronics Symposium (IES), Surabaya, Indonesia, pp. 476–480. IEEE (2020). https://doi.org/10.1109/IES50839.2020.9231540 11. Khan, A.I., Shah, J.L., Bhat, M.M.: CoroNet: A deep neural network for detection and diagnosis of COVID-19 from chest X-ray images. Comput. Methods Programs Biomed. 196, 105581 (2020). https://doi.org/10.1016/j.cmpb.2020.105581 12. Verma, D., Bose, C., Tufchi, N., Pant, K., Tripathi, V., Thapliyal, A.: An efficient framework for identification of tuberculosis and pneumonia in chest X-ray images using neural network. Procedia Comput. Sci. 171, 217–224 (2020). https://doi.org/10.1016/j.procs.2020.04.023 13. Khan, M.M.R., et al.: Automatic detection of COVID-19 disease in chest X-ray images using deep neural networks. In IEEE 8th R10 Humanitarian Technology Conference (R10-HTC), Kuching, Malaysia, pp. 1–6. IEEE (2020). https://doi.org/10.1109/R10-HTC49770.2020.935 7034
14. Do, T.-N., Le, V.-T., Doa, T.-H.: SVM on top of deep networks for Covid-19 detection from chest X-ray images. J. Inform. Commun. Converg. Eng. 20(3), 219–225 (2022). https://doi. org/10.56977/jicce.2022.20.3.219 15. Mujahid, M., Rustam, F., Álvarez, R., Mazón, J.L.V., de la Torre Díez, I., Ashraf, I.: Pneumonia classification from X-ray images with inception-V3 and convolutional neural network. Diagnostics 12, 1280 (2022). https://doi.org/10.3390/diagnostics12051280 16. Covid-19 Image Dataset. Kaggle.com. https://www.kaggle.com/datasets/pranavraikokte/cov id19-image-dataset. Accessed 10 Apr 2023 17. Kermany, D., Zhang, K., Goldbaum, M.: Labeled optical coherence tomography (OCT) and chest X-ray images for classification. Mendeley Data, V2 (2018). https://doi.org/10.17632/ rscbjbr9sj.2 18. Nguyen, H.T., Huynh, H.T., Tran, T.B., Huynh, H.X.: Explanation of the convolutional neural network classifying chest X-ray images supporting pneumonia diagnosis. EAI Endorsed Trans. Context-aware Syst. Appl. 7, e3 (2020). https://doi.org/10.4108/eai.13-7-2018.165349 19. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: IEEE International Conference on Computer Vision (ICCV), Venice, pp. 618–626. IEEE (2017). https:// doi.org/10.1109/ICCV.2017.74 20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. IEEE (2016)
Bayesian Approach for Static Object Detection and Localization in Unmanned Ground Vehicles Luan Cong Doan1,2(B)
, Hoa Thi Tran2
, and Dung Nguyen2
1 Faculty of Electro-Mechanics, Hanoi University of Mining and Geology, Hanoi, Vietnam
[email protected] 2 Faculty of Information Technology, Hanoi University of Mining and Geology, Hanoi, Vietnam
Abstract. This paper introduces a statistical and probabilistic framework for the automatic search for, and estimation of, a static object's location in an unmanned ground vehicle setting. It relies on the probabilistic nature of the Bayesian approach to statistical inference and decision-making. In Bayesian statistics, probabilities are assigned to model parameters, hypotheses, and predictions, making them inherently probabilistic. In the context of machine learning, the Bayesian approach plays a crucial role in addressing various challenges, particularly in the realm of uncertainty estimation, parameter optimization, and decision-making, and it permits a more comprehensive exploration of the parameter space. In this paper, the proposed approach utilized a lidar sensor and a pan/tilt camera as the primary hardware components. The lidar sensor, with its wide angle of view and high 2D accuracy, first detected the object in the 2D plane. Subsequently, the camera captured and analyzed object features to confirm its presence before guiding the robot to the precise position and orientation. The Recursive Bayesian Estimation (RBE) algorithm was leveraged to continuously track and determine the target's location, ensuring accurate and reliable updates throughout the process. A confidence level was assigned to each detection technique, and the lidar sensor enables continuous detection by incrementally increasing the probability of detection in a generalized manner. The approach was evaluated through the detection of a static object with specific features, and the results were demonstrated using a multi-space Bayesian approach and the lidar detection method. Keywords: Bayesian · Unmanned Ground Vehicle (UGV) · Static Object Detection · Localization · Lidar sensor · Pan/Tilt camera
1 Introduction
In recent years, there has been growing interest in the dynamic engagement of unmanned ground vehicles (UGVs) and mobile manipulators for the development of next-generation manufacturing robots capable of replacing humans in challenging working tasks, particularly in harsh environments [1, 2]. Placing robots in real environments has presented numerous challenges, the primary concern being the ability of robots to detect and recognize targets and navigate to the correct working positions [3].
Utilizing only a lidar sensor, several Simultaneous Localization and Mapping (SLAM) techniques have been developed for mapping environments and detecting objects in both 2D and 3D spaces [4]. While lidar-based scans in SLAM provide accurate range measurements, the limitations of lidar can result in the omission of crucial features, especially object recognition, which is crucial in the current mission [5]. On the other hand, cameras equipped with computer vision and machine learning techniques have proven effective in solving object recognition problems. However, they generally struggle with distance measurement and are prone to range measurement errors [6]. Researchers have explored the integration of both lidar and camera sensors to address the challenges of tracking and detecting objects [7]. Significant advancements have been made by enhancing sensor quality and implementing numerical techniques [8]. Nevertheless, there is still a need to develop new techniques that effectively address current issues such as dealing with near-Gaussian or non-Gaussian losses in the loss and found problem [9]. This paper presents a Bayesian approach for UGVs to accurately locate and approach static targets with the appropriate distance and orientation. The proposed approach utilizes two types of sensors: a lidar sensor and a pan-tilt camera. Leveraging the wide angle of view (AOV) and range measurement capabilities of the lidar sensor, potential targets are initially detected in the 2D plane. The pan-tilt camera is then used to confirm the targets and guide the robot to the precise position. By combining the observations from both sensors and updating the belief using Recursive Bayesian Estimation (RBE) [10], the proposed approach effectively locates the target in the global space. Although the confidence level (LOC) of each individual sensor is low, fusing the information from both sensors using a Bayesian filter increases the LOC and the probability of detecting the true target [11].
2 Methodologies
2.1 Detecting Algorithms
Since SLAM and pattern recognition techniques are both employed in this paper, the features of the target were defined and classified as follows:
• True features: features that are real and usable for the SLAM and vision techniques, namely the size and shape of the target and tools.
• False features: features that are not real but are estimated by an incomplete estimator.
• Noise features: time-varying noise.
In particular, we solve the problem based on the true features; most false features and noise are removed by the numerical techniques in the proposed approach. Consider a target $t$; the state of the target relative to the robot state is given by Eq. (1) below:

$x^{t}_{k+1} = f^{t}(x^{t}_{k}, w^{t}_{k})$  (1)
where xkt ∈ X t it the state of target at time step k, and wkt ∈ W t is the system noise of target.
This target is observed by the UGV, whose global state is estimated by the use of GPS, a compass, or an IMU. The motion model of the UGV is given by (2):

$$x^s_{k+1} = f^s(x^s_k, u^s_k) \quad (2)$$

where $x^s_k \in X^s$ and $u^s_k \in U^s$ represent the state and control input of the UGV, respectively. The UGV carries a set of sensors, each with an observation region, often known as the FOV (Field of View), which is determined by the physical capability of the sensors. Defining the probability of detecting a target as $P_d(x^t_k \mid x^s_k; \pi^{s_j})$, with parameters $\pi^{s_j}$ associated with the physical capability of the jth sensor, the observable region of the jth sensor can be expressed as ${}^{s_j}X^t_0 = \{ x^t_k \mid 0 < P_d(x^t_k \mid x^s_k; \pi^{s_j}) \le 1 \}$. Accordingly, the target state observed from the jth sensor is given by:

$${}^{s_j}z^t_k = \begin{cases} {}^{s_j}h(x^t_k, x^s_k, {}^{s_j}v^t_k) & x^t_k \in {}^{s_j}X^t_0 \\ \emptyset & \text{else} \end{cases} \quad (3)$$

where ${}^{s_j}v^t_k$ represents the observation noise, and $\emptyset$ represents an "empty element", indicating that the observation contains no information on the target or that the target is unobservable when it is not within the observable region.

2.2 Noise Reducing and Navigating Procedures

Recursive Bayesian Estimation: RBE estimates the belief on a dynamical system by representing the belief in terms of a probability density function (PDF) and recursively updating it in both time and observation, through the prediction and correction steps. Let $x^s_{1:k}$ denote a sequence of states of the UGV and ${}^{s}z_{1:k}$ a sequence of observations by the sensors from time step 1 to k.

Prediction: in each time step, the prediction step calculates the belief on the target's current state, denoted as $p(x^t_k \mid {}^s z_{1:k-1}, x^s_{1:k-1})$, using the belief updated at the previous time step, $p(x^t_{k-1} \mid {}^s z_{1:k-1}, x^s_{1:k-1})$. This prediction is performed through the application of the Chapman-Kolmogorov equation:

$$p(x^t_k \mid {}^s z_{1:k-1}, x^s_{1:k-1}) = \int_{X^t} p(x^t_k \mid x^t_{k-1})\, p(x^t_{k-1} \mid {}^s z_{1:k-1}, x^s_{1:k-1})\, dx^t_{k-1} \quad (4)$$

where $p(x^t_k \mid x^t_{k-1})$ is a Markov motion model defined by Eq. (1), and $p(x^t_{k-1} \mid {}^s z_{1:k-1}, x^s_{1:k-1}) = p(x^t_0)$ at k = 1.

Correction: The correction step calculates the belief $p(x^t_k \mid {}^s z_{1:k}, x^s_{1:k})$ based on the predicted belief $p(x^t_k \mid {}^s z_{1:k-1}, x^s_{1:k})$ and a new observation ${}^s z_k$. This computation is derived by applying formulas for the marginal distribution and conditional independence, resulting in the following expression:

$$p(x^t_k \mid {}^s z_{1:k}, x^s_{1:k}) = \frac{ l(x^t_k \mid {}^s z^t_{1:k}, x^t_{1:k})\; p(x^t_k \mid {}^s z_{1:k-1}, x^s_{1:k}) }{ \int_{X^t} l(x^t_k \mid {}^s z^t_{1:k}, x^t_{1:k})\; p(x^t_k \mid {}^s z_{1:k-1}, x^s_{1:k})\, dx^t_k } \quad (5)$$
Here $l(x^t_k \mid {}^s z^t_{1:k}, x^t_{1:k})$ represents the joint likelihood of $x^t_k$ given the observation ${}^s z^t_k$ and $x^t_k$. In the observation process involving multiple sensors, the assumption of conditional independence enables us to fuse the joint likelihood using the observation fusion technique:

$$l(x^t_k \mid {}^s z^t_{1:k}, x^t_{1:k}) = \prod_{j=1}^{n_s} l^{s_j}(x^t_k \mid {}^s z^t_{1:k}, x^t_{1:k}) \quad (6)$$

Observation Likelihood: The formulation of the observation likelihood varies depending on whether the target is detected by the jth sensor. The detectable region of the sensor, denoted as ${}^{s_j}X^t_d$, defines the area where the sensor confidently locates the target. The observation likelihood $l^{s_j}(x^t_k \mid {}^s z_{1:k}, x^t_{1:k})$ is given by two cases as in Eq. (7):

$$l^{s_j}(x^t_k \mid {}^s z_{1:k}, x^t_{1:k}) = \begin{cases} p(x^t_k \mid {}^s z_{1:k}, x^s_{1:k}; \pi^{s_j}) & {}^{s_j}z^t_k \in {}^{s_j}X^t_d \\ 1 - P_d(x^t_k \mid x_{1:k}; \pi^{s_j}) & \text{else} \end{cases} \quad (7)$$

If the target is within the detectable region, the likelihood is described by a high-peaked and near-Gaussian probability density function (PDF) around the observed target state. It is expressed as $p(x^t_k \mid {}^s z_{1:k}, x^s_{1:k}; \pi^{s_j})$, where $\pi^{s_j}$ represents the parameters specific to the jth sensor. If the observed target is outside the detectable region, the likelihood provides information that the target is unlikely to be within that region. It is described as $1 - P_d(x^t_k \mid x_{1:k}; \pi^{s_j})$, where $P_d$ represents the probability of detection, and $\pi^{s_j}$ represents the parameters specific to the jth sensor.

Figure 1 illustrates the RBE with unified likelihood in a one-dimensional target space, focusing on the non-Gaussian behavior when there is no target detection. The posterior belief becomes near-Gaussian when the target is detected, as both the observation likelihood and the prior belief are approximately Gaussian. However, in the absence of target detection, the posterior belief can become highly non-Gaussian due to the non-Gaussian observation likelihood. Dealing with such a highly non-Gaussian belief requires computationally expensive processing. Thus, it is preferable to maintain the target as detected, even if the detection accuracy is low, to mitigate the complexity arising from the non-Gaussian belief.

2.3 Sensor Fusion and Target Localization

Bayesian Estimation: Figure 2 shows the overview of the proposed Bayesian estimation approach. In Stage 1, the target is observed by the lidar sensor only, as a line segment, since the current lidar can only detect in 2D space, but it comes with quite accurate distance (range) measurement. The level of certainty is still low because of noise from the surrounding environment, and without feature (tool) detection from the camera, the working position and direction of the robot cannot be determined.
After identifying the bearing of the target, the pan-tilt camera will adjust its orientation to ensure that the target is within its field of view. With both the lidar and camera observing the target, the level of certainty is enhanced by fusing the two observation likelihoods. At stage 3, as the state of the UGV is known, the RBE enables localization of the target in a globally defined coordinate frame through coordinate transformation. This process ensures that the target’s position is accurately represented in a consistent global reference frame, facilitating further analysis and decision-making.
Fig. 1. Unified joint likelihood
Although lidar has lower confidence for specific object detection, it remains usable for target detection and keeps the estimation problem solvable with Gaussian techniques. Once the target is detected, the zooming function of the pan-tilt camera is utilized while the robot approaches the target; the distance is estimated from the lidar range finder, so a larger-scale image can be obtained to search for features. As soon as all required features are detected or matched with prior knowledge by the camera, the location and orientation of the robot at the working position can be solved. Object recognition and localization techniques have already been mentioned in this paper, employing machine learning techniques, especially computer vision, pattern recognition, and classification.
Fig. 2. Overview of the proposed Bayesian estimation approach: (a) detection by lidar; (b) detection by lidar and camera; (c) recursive Bayesian estimation
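To make the prediction-correction cycle of Eqs. (4)-(7) concrete, the following is a minimal one-dimensional, grid-based sketch of recursive Bayesian estimation with two fused sensor likelihoods. The grid resolution, noise values, and detection probabilities are illustrative assumptions, not values from the paper.

```python
import numpy as np

# 1-D grid over the target space (the paper works in higher dimensions)
grid = np.linspace(0.0, 20.0, 401)
dx = grid[1] - grid[0]
belief = np.full_like(grid, 1.0 / (grid[-1] - grid[0]))   # uniform prior p(x_0^t)

def predict(belief, process_sigma=0.05):
    # Chapman-Kolmogorov step (Eq. 4): for a static target the motion model is a
    # narrow Gaussian kernel, so the prediction slightly diffuses the belief.
    kernel = np.exp(-0.5 * (np.arange(-5, 6) * dx / process_sigma) ** 2)
    kernel /= kernel.sum()
    predicted = np.convolve(belief, kernel, mode="same")
    return predicted / (predicted.sum() * dx)

def likelihood(z, detected, sensor_sigma, p_detect):
    # Unified observation likelihood (Eq. 7): near-Gaussian around the measurement
    # when the target is detected, and a flat 1 - P_d term otherwise.
    if detected:
        return np.exp(-0.5 * ((grid - z) / sensor_sigma) ** 2)
    return np.full_like(grid, 1.0 - p_detect)

def correct(belief, likelihoods):
    # Correction step (Eq. 5) with multi-sensor fusion (Eq. 6): multiply the
    # individual likelihoods, weight the prior, then normalise.
    joint = np.prod(likelihoods, axis=0)
    posterior = joint * belief
    return posterior / (posterior.sum() * dx)

# One cycle: lidar detects the target near 12.3 m; the camera has not confirmed yet
belief = predict(belief)
belief = correct(belief, [likelihood(12.3, True, sensor_sigma=0.4, p_detect=0.6),
                          likelihood(None, False, sensor_sigma=0.2, p_detect=0.3)])
print("MAP estimate of target position:", grid[np.argmax(belief)])
```

Even the undetecting camera contributes a flat, low-information likelihood, which keeps the fused posterior well behaved, mirroring the preference stated above for maintaining the target as detected.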
3 Experiment and Preliminary Results

To validate the proposed methods, a simulation was conducted involving an unmanned ground vehicle (UGV) searching for a toolbox in a simulated environment. The testing field was flat and wide open, providing an ideal scenario for evaluating the effectiveness of the approach. The toolbox was placed randomly within the field, with several wrenches hanging on one side of the box. The UGV initiated the search from a predetermined starting position located at one corner of the field. The simulation aimed to assess the capability of our proposal in successfully detecting and localizing the toolbox under these conditions. The UGV was developed based on Husky, a medium-sized robotic platform from Clearpath Robotics. Husky had very high-resolution encoders that delivered improved state estimation and offered smooth motion profiles even at low speed.

Left Padding. Given a matrix M of size h x w (height x width), the padding-left method sets the left padding (0 or 1) of the matrix M, described as follows:

$$M^{\mathrm{left\,padding}}_{i,j} = \begin{cases} 0 & \text{if } j < \Delta_w \text{ and } \delta > 0.5 \\ 1 & \text{if } j < \Delta_w \text{ and } \delta \le 0.5 \\ M_{i,j-\Delta_w} & \text{if } j \ge \Delta_w \end{cases}$$

where $M^{\mathrm{left\,padding}}$ is the new matrix after applying the padding-left method, $\Delta_w = random(0.25) \times w$, $\delta = random(1)$, $0 \le i < h$, $0 \le j < w + \Delta_w$, random(0.25) is a random real number between 0 and 0.25, and random(1) is a random real number between 0 and 1.
Right Padding. Given a matrix M of size h x w (height x width), the padding-right method sets the right padding (0 or 1) of the matrix M, described as follows:

$$M^{\mathrm{right\,padding}}_{i,j} = \begin{cases} 0 & \text{if } j \ge w \text{ and } \delta > 0.5 \\ 1 & \text{if } j \ge w \text{ and } \delta \le 0.5 \\ M_{i,j} & \text{if } j < w \end{cases}$$

where $M^{\mathrm{right\,padding}}$ is the new matrix after applying the padding-right method, $\Delta_w = random(0.25) \times w$, $\delta = random(1)$, $0 \le i < h$, $0 \le j < w + \Delta_w$, random(0.25) is a random real number between 0 and 0.25, and random(1) is a random real number between 0 and 1.

Top Padding. Given a matrix M of size h x w (height x width), the padding-top method sets the top padding (0 or 1) of the matrix M, described as follows:

$$M^{\mathrm{top\,padding}}_{i,j} = \begin{cases} 0 & \text{if } i < \Delta_h \text{ and } \delta > 0.5 \\ 1 & \text{if } i < \Delta_h \text{ and } \delta \le 0.5 \\ M_{i-\Delta_h,j} & \text{if } i \ge \Delta_h \end{cases}$$

where $M^{\mathrm{top\,padding}}$ is the new matrix after applying the padding-top method, $\Delta_h = random(0.25) \times h$, $\delta = random(1)$, $0 \le i < h + \Delta_h$, $0 \le j < w$, random(0.25) is a random real number between 0 and 0.25, and random(1) is a random real number between 0 and 1.

Bottom Padding. Given a matrix M of size h x w (height x width), the padding-bottom method sets the bottom padding (0 or 1) of the matrix M, described as follows:

$$M^{\mathrm{bottom\,padding}}_{i,j} = \begin{cases} 0 & \text{if } i \ge h \text{ and } \delta > 0.5 \\ 1 & \text{if } i \ge h \text{ and } \delta \le 0.5 \\ M_{i,j} & \text{if } i < h \end{cases}$$

where $M^{\mathrm{bottom\,padding}}$ is the new matrix after applying the padding-bottom method, $\Delta_h = random(0.25) \times h$, $\delta = random(1)$, $0 \le i < h + \Delta_h$, $0 \le j < w$, random(0.25) is a random real number between 0 and 0.25, and random(1) is a random real number between 0 and 1.

Contrast. Contrast defines the variation in brightness or color between different areas of an image, influencing the level of distinction between light and dark regions and enhancing the prominence of features.

Brightness. Brightness adjustments can help highlight specific features within an image.
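A possible NumPy implementation of the four padding methods is sketched below. The random draws and the integer rounding of the padding widths are implementation assumptions not specified in the text.

```python
import numpy as np

def _pad_block(shape, dtype):
    # delta = random(1): pad with 0 if delta > 0.5, otherwise pad with 1
    fill = 0 if np.random.uniform(0.0, 1.0) > 0.5 else 1
    return np.full(shape, fill, dtype=dtype)

def left_padding(m):
    h, w = m.shape
    dw = int(np.random.uniform(0.0, 0.25) * w)      # delta_w = random(0.25) * w
    return np.hstack([_pad_block((h, dw), m.dtype), m])

def right_padding(m):
    h, w = m.shape
    dw = int(np.random.uniform(0.0, 0.25) * w)
    return np.hstack([m, _pad_block((h, dw), m.dtype)])

def top_padding(m):
    h, w = m.shape
    dh = int(np.random.uniform(0.0, 0.25) * h)      # delta_h = random(0.25) * h
    return np.vstack([_pad_block((dh, w), m.dtype), m])

def bottom_padding(m):
    h, w = m.shape
    dh = int(np.random.uniform(0.0, 0.25) * h)
    return np.vstack([m, _pad_block((dh, w), m.dtype)])

# Example: augment a single-channel image matrix
img = np.random.randint(0, 256, size=(224, 224))
augmented = bottom_padding(top_padding(right_padding(left_padding(img))))
```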
3.2 Datasets
As mentioned in Sect. 2, PlantVillage is a dataset commonly used by researchers to compare the results of previous studies with their research results. However, the PlantVillage dataset contains images taken under laboratory setup conditions and is therefore not suitable for practical application. In order to be able to apply the research results in practice, studies should focus on constructing models on the PlantDoc dataset, because this is a dataset collected from real data on the internet. Until the present time, based on the most recent surveys related to the PlantDoc dataset, accuracy is an issue that needs improvement for models trained on this dataset. In this paper, we used several data augmentation methods mentioned in Sect. 3.1 to improve the input quality with the aim of increasing the accuracy of models trained on the PlantDoc dataset. This increase in accuracy is our main goal, since the models on the PlantDoc dataset of prior studies only achieved around 70% accuracy. In addition, the data augmentation methods above are also known as geometric techniques, which Taylor and Nitschke propose in [25], and aim to generate new data based on the existing dataset to avoid overfitting problems and create effective models. Furthermore, choosing to build models on the PlantDoc dataset is an open direction for our future work, because this dataset can be expanded and supplemented by users through image data downloaded from the internet. This is not possible for the limited PlantVillage dataset. The PlantDoc dataset contains 2,598 data points, including 13 plant species and 16 classes of diseases. In addition, the time to annotate internet-scraped images is approximately 300 working hours. It is our conviction that the utilization of this dataset will significantly enhance the proximity of research outcomes to real-world applications.

3.3 CNN Architecture
Convolutional Neural Network Architectures. Convolutional Neural Networks (CNNs) [8] are a class of supervised classification deep learning models specifically designed for analyzing visual data such as image or video data. A CNN is constructed from the fundamental building blocks to be convolutional layers, which apply a set of learnable filters to the input data. These filters perform convolution on the input and produce a set of features from the prior layer, which leads to an increasingly abstract representation of the input data. Through the sequential stacking of multiple convolutional layers, CNNs possess the capacity to acquire intricate and sophisticated representations of visual features. Incorporated within the CNN architectures are two other types of layers, namely the pooling layers and the fully connected layers. Pooling layers downsample the feature maps, reducing their spatial dimensions while retaining important information and promoting the extraction of higher-level abstract representations. Fully connected layers, also known as dense layers, aggregate and integrate the features extracted from the preceding layers to generate final predictions based on the learned features.
MobileNetV3 Architecture. MobileNetV3 [12] is a CNN architecture built upon the foundation of MobileNet, specifically tailored for resource-constrained devices to tackle image classification and object detection tasks. MobileNetV3 has superior performance in terms of accuracy, speed and model size compared to the previous version of MobileNetV2 [18], which makes it increasingly suitable for devices with limited computational resources such as mobile devices, embedded systems, and edge devices.
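As a rough illustration of how a MobileNetV3-based classifier of this kind could be assembled, the sketch below uses TensorFlow/Keras. The 224 × 224 input size matches the resized images used in this work, while the dropout rate, optimizer, and learning rate are illustrative assumptions rather than the authors' exact training configuration.

```python
import tensorflow as tf

NUM_CLASSES = 29  # 13 plant species + 16 disease classes (the combined-label model)

# Pretrained MobileNetV3 backbone, fine-tuned on the augmented PlantDoc images
base = tf.keras.applications.MobileNetV3Large(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = True

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The other two models in the proposed pipeline differ only in the size of the final Dense layer (13 and 16 outputs, respectively).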
4 The Proposed Model
In this paper, we apply the MobileNetV3 architecture on the PlantDoc dataset, following the application of the data augmentation methods on the same dataset. All images in the PlantDoc dataset, which comprised 2,598 data points categorized into 13 plants and 16 diseases, were resized to 224 × 224 pixels prior to the application of data augmentation techniques. The objective of this experiment is to enhance the accuracy of the recognition problem on the PlantDoc dataset by employing an appropriate model. The following steps are undertaken to process the proposed model (shown in Fig. 1): First, we use the data augmentation methods outlined in Sect. 3.1 on the images in the PlantDoc dataset to increase the number of training images further. Second, we construct three classification models, all of which utilize the identical MobileNetV3 architecture and training dataset. It is evident that the output at the final layer of the models consists of probability vectors with varying sizes. The initial model is employed for classifying 13 plant species, resulting in a probability vector size of 13. The second model is utilized to classify 16 classes of diseases, leading to a probability vector size of 16. Lastly, the third model is used for classifying both plant species and classes of diseases, resulting in the probability vector having a size of 29. To integrate the outcomes of all three models, it is necessary to normalize the probability vectors to a uniform size. Specifically, we will expand the size of the probability vectors in the first model and the second model to 29 and ensure no information loss. To accomplish this, we carry out matrix multiplication between the probability vector in the first model and a binary matrix of size 13 × 29. Similarly, we perform matrix multiplication between the probability vector in the second model and a binary matrix of size 16 × 29. Following the normalization of the vectors, we proceed with integrating the outcomes from all three models through the following steps:
– Step 1: Combine the normalized vectors of model 1 and model 2 by adding them together.
– Step 2: Apply the softmax activation function to the sum vector obtained in step 1.
– Step 3: Combine the vector in step 2 with the probability vector of model 3 by performing vector addition.

Ultimately, we employ the softmax activation function to convert the sum vector in step 3 into a probability vector containing values for plant disease image classification. The experimental results of the proposed model are illustrated through the presentation of a confusion matrix and a comparison table, as depicted below:
– The confusion matrix of 13 plants and 16 diseases shown in Fig. 2 depicts the predicted and actual values of the data points. Based on this confusion matrix, we observe that the model we proposed exhibited good performance.
– In order to provide a clearer understanding of the performance of the proposed model, we utilize Table 1 to compare its results with those of state-of-the-art models. By observing the comparison table, we can conclude that our proposed model performs admirably on the PlantDoc dataset. It not only demonstrates superior efficiency compared to other models but also requires less memory space when deployed for data processing.
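The fusion procedure can be sketched as follows. The binary expansion matrices B13 and B16 are placeholders: their exact layout depends on the (unspecified) ordering of the 29 combined labels, so they must be filled according to the actual label map.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# B13[i, j] = 1 when combined class j corresponds to plant species i
# (analogously for B16 and the 16 disease classes); placeholder layout.
B13 = np.zeros((13, 29))
B16 = np.zeros((16, 29))

def fuse(p_plant, p_disease, p_combined):
    # Expand the 13- and 16-dimensional vectors to size 29 and add them (Step 1)
    expanded = p_plant @ B13 + p_disease @ B16
    # Softmax of the sum (Step 2)
    s = softmax(expanded)
    # Add the 29-dimensional vector of the third model and apply softmax again (Step 3)
    return softmax(s + p_combined)
```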
Table 1. Compare the performance of models on the PlantDoc dataset

Model | Training Set | Test Set | Accuracy | F1-Score
VGG16 [21] | 80% | 20% | 60.41 | 0.6
InceptionV3 [21] | 80% | 20% | 62.06 | 0.61
InceptionResNetV2 [21] | 80% | 20% | 70.53 | 0.70
MobileNetV2 [10] | 80% | 20% | 74.96 | –
NASNetMobile [10] | 80% | 20% | 76.48 | –
PI-CNN [1] | 80% | 20% | 80.36 | 0.67
Our proposal model (MobileNetV3) | 80% | 20% | 83.00 | 0.83
Fig. 1. The proposed model on the PlantDoc dataset.
Fig. 2. Confusion matrix of the proposed model on the PlantDoc dataset.
5 Conclusions
A model utilizing the MobileNetV3 architecture along with data augmentation techniques has been introduced in this research. The objective is to develop a compact and efficient model specifically designed for edge devices, embedded systems, and mobile phones. Additionally, the model can be employed in plant disease detection systems. In the experimental phase, we implemented the proposed model on the PlantDoc dataset and achieved highly favorable outcomes in terms of accuracy in the recognition problem. This achievement is made possible due to the application of suitable data augmentation methods and suitable integration of MobileNetV3 within the framework of the proposed model. In the future, we continue to focus on researching other methods to improve the accuracy of plant leaf disease recognition. Specifically, our future studies will concentrate on refining the model and incorporating additional data augmentation techniques to highlight the features of the disease prior to their inclusion in the training model. We believe that future pest and disease diagnostic solutions will contribute to promoting the automation aspect in the field of agricultural IoT.
References

1. Batchuluun, G., Nam, S.H., Park, K.R.: Deep learning-based plant-image classification using a small training dataset. Mathematics 10(17) (2022). https://doi.org/10.3390/math10173091. https://www.mdpi.com/2227-7390/10/17/3091
2. Bi, C., Wang, J., Duan, Y., Fu, B., Kang, J.R.: Mobilenet based apple leaf diseases identification. Mob. Netw. Appl. 27 (2022). https://doi.org/10.1007/s11036-020-01640-1
3. Cassava Dataset. https://www.kaggle.com/datasets/srg9000/cassava-plantdisease-merged-20192020. Accessed 15 Dec 2022
4. Hops Dataset. https://www.kaggle.com/datasets/scruggzilla/hops-classification. Accessed 15 Dec 2022
5. New Plant Diseases Dataset. https://www.kaggle.com/vipoooool/new-plantdiseases-dataset. Accessed 15 Dec 2022
6. PlantifyDr Dataset. https://www.kaggle.com/datasets/lavaman151/plantifydrdataset. Accessed 15 Dec 2022
7. Dataset. https://data.mendeley.com/datasets/fwcj7stb8r/1. Accessed 15 Dec 2022
8. Dhillon, A., Verma, G.K.: Convolutional neural network: a review of models, methodologies and applications to object detection. Prog. Artif. Intell. 9(2), 85–112 (2020). https://doi.org/10.1007/s13748-019-00203-0
9. Enkvetchakul, P., Surinta, O.: Effective data augmentation and training techniques for improving deep learning in plant leaf disease recognition (2021)
10. Enkvetchakul, P., Surinta, O.: Stacking ensemble of lightweight convolutional neural networks for plant leaf disease recognition. ICIC Exp. Lett. 16, 521–528 (2022). https://doi.org/10.24507/icicel.16.05.521
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015). https://doi.org/10.48550/ARXIV.1512.03385. https://arxiv.org/abs/1512.03385
12. Howard, A., et al.: Searching for mobilenetv3 (2019)
13. Howard, A.G., et al.: Mobilenets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017). http://arxiv.org/abs/1704.04861
14. Hughes, D.P., Salathé, M.: An open access repository of images on plant health to enable the development of mobile disease diagnostics through machine learning and crowdsourcing. CoRR abs/1511.08060 (2015). http://arxiv.org/abs/1511.08060
15. KC, K., Yin, Z., Wu, M., Wu, Z.: Depthwise separable convolution architectures for plant disease classification. Comput. Electron. Agric. 165, 104948 (2019). https://doi.org/10.1016/j.compag.2019.104948
16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017). https://doi.org/10.1145/3065386
17. Meshram, V., Patil, K., Meshram, V., Hanchate, D., Ramkteke, S.: Machine learning in agriculture domain: a state-of-art survey. Artif. Intell. Life Sci. 1, 100010 (2021)
18. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: inverted residuals and linear bottlenecks (2019)
19. Shaji, A.P., Hemalatha, S.: Data augmentation for improving rice leaf disease classification on residual network architecture. In: 2022 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI), pp. 1–7 (2022)
20. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). https://doi.org/10.48550/ARXIV.1409.1556. https://arxiv.org/abs/1409.1556
21. Singh, D., Jain, N., Jain, P., Kayal, P., Kumawat, S., Batra, N.: PlantDoc. In: Proceedings of the 7th ACM IKDD CoDS and 25th COMAD. ACM (2020). https://doi.org/10.1145/3371158.3371196
22. Sladojevic, S., Arsenovic, M., Anderla, A., Culibrk, D., Stefanović, D.: Deep neural networks based recognition of plant diseases by leaf image classification. Comput. Intell. Neurosci. 2016 (2016)
23. Szegedy, C., et al.: Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9 (2015). https://doi.org/10.1109/CVPR.2015.7298594
24. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826 (2016). https://doi.org/10.1109/CVPR.2016.308
25. Taylor, L., Nitschke, G.: Improving deep learning using generic data augmentation (2017)
26. Wenchao, X., Zhi, Y.: Research on strawberry disease diagnosis based on improved residual network recognition model. Math. Probl. Eng. 2022, 1–13 (2022)
Real-Time Air Quality Monitoring System Using Fog Computing Technology

Tan Duy Le(1,3), Nguyen Binh Nguyen Le(1,3)(B), Nhat Minh Quang Truong(1,3), Huynh Phuong Thanh Nguyen(2), and Kha-Tu Huynh(1,3)

1 School of Computer Science and Engineering, International University, Ho Chi Minh City, Vietnam
2 Japan Advanced Institute of Science and Technology, Ishikawa, Japan
3 Vietnam National University, Ho Chi Minh City, Vietnam
[email protected]
Abstract. Air pollution has become a significant concern in the twentyfirst century, posing environmental and public health threats. To address this problem, it is crucial to develop an Internet of Things (IoT) system for real-time air quality data visualization and monitoring. However, despite extensive studies on air pollution and monitoring techniques, there are still unresolved challenges in this field. To bridge this gap, our study proposes the use of fog computing technology for real-time air quality monitoring, employing multiple pollution parameters. Our proposed system uses the STM32F429ZIT6 microcontroller, WiFi module ESP8266 NodeMCU, and various low-cost sensors to collect and monitor pollutants such as PM2.5, CO2, CO, UV index, temperature, and humidity. The collected data is transmitted to ThingsBoard, our fog computing platform, which utilizes the processing capabilities of a Raspberry Pi 4. This enables real-time data analysis and visualization on both web and mobile applications. By adopting fog computing technology, our system offers a low delay, highly accurate, and precise approach to addressing air quality concerns actively. The accurate measurement and visualization of air pollutants provide valuable insights for policymakers, researchers, and communities, enabling informed decision-making and targeted interventions. Keywords: Air Quality Monitoring · Fog Computing · STM32F429ZIT6 · ESP8266 NodeMCU · Raspberry Pi 4 · ThingsBoard platform · Web/Mobile Application · Internet of Things (IoT)
1 Introduction
With the continuous development of IoT technology [1], sensor technology, and information communication technology (ICT), it is now easier than ever to monitor and interact with the environment in which people work and live at any time.
One area that has benefited from IoT is air quality control and monitoring. However, there are practical difficulties in IoT systems. Measuring and evaluating air quality is vital for human health, especially in Ho Chi Minh City, which is experiencing rapid urbanization and serious air pollution. Pollution density is increasing, leading to health problems such as lung cancer, heart disease, and respiratory infections. Vietnam ranked fourth in the Western Pacific region for pollution-related mortality in 2017 [2]. Air pollution accounted for nearly 70% of the 71,000 pollution-related deaths in Vietnam. Older people, those with respiratory illnesses, and people from less wealthy backgrounds are most vulnerable to air pollution. Recent research on air quality systems has utilized raw data from sensors without further processing, leading to a lack of consensus on data processing strategies when using low-cost air pollution sensors (with a low Signal-to-Noise Ratio). Alternative air quality monitoring systems are expensive as they rely on costly sensors (with a high Signal-to-Noise Ratio) to collect information on air pollution and require manual input. Moreover, these studies have predominantly employed Arduino microcontrollers for data collection, making it challenging to operate persistently for extended periods. To overcome these challenges, we have developed a real-time air quality monitoring and forecasting system that uses a semiconductor sensor connected to a user-friendly edge device called the STM32F429ZIT6 microcontroller. The microcontroller uses five pieces of equipment with sensor modules to gather data on the outdoor environment for carbon monoxide (CO), carbon dioxide (CO2 ), delicate particulate matter (P M ), U V index, humidity, and temperature. Additionally, the system utilizes fog computing technology [3] and the WiFi module ESP8266 NodeMCU to transport data from the sensors to the ThingsBoard server [4]. This enables better computing, massive storage, and real-time processing compared to cloud computing. The Raspberry Pi 4 hosts the ThingsBoard, allowing us to collect, view, and analyze data and perform other advanced computations. This system is less expensive, easier to maintain, and more versatile than competing enterprise IoT platforms, making it more competitive in comparable scenarios with its low-cost sensor, high precision, and ability to handle complex tasks. In general, a real-time air quality monitoring system employs five air quality sensors: DHT11 for detecting temperature and humidity, MQ135 and MQ7 for collecting CO2 and CO gas, respectively, and GP2Y1010AU0F and GUVAS12SD for computing P M 2.5 and UV index levels. The system sends data to our ThingsBoard server, which is hosted on a Raspberry Pi 4 to reduce costs, boost data transmission, handle more challenging tasks, and perform real-time computations thanks to fog computing. This system has web and mobile apps built with React and Flutter, and a Flask server for simplicity in setup, managing requests, and real-time data visualization. The main contribution of this paper is as follows: – Present the design and implementation of a real-time air quality monitoring system, with a particular focus on fog computing implementation.
– Incorporate multiple pollution parameters such as P M 2.5, CO2 , CO, U V index, temperature, and humidity. – Low-Cost Sensor Implementation.
2 Background and Related Works
Numerous studies have used low-performance microcontrollers to monitor air quality, utilizing cloud computing for data transfer. Firdhous et al. [5] proposed an IoT-based indoor air monitoring system. It monitors only O3 and communicates with a processing node via a Bluetooth-enabled gateway node and WiFi. The device they designed communicates with a gateway node over Bluetooth, which in turn communicates with the processing node via WiFi. Quynh Anh et al.'s work [6] employs IoT and machine learning techniques to design a distributed air quality monitoring and forecasting system with a web interface that predicts pollution levels every five minutes. The system uses sensors on the Arduino UNO R3 to collect essential air pollution elements, such as CO2, CO, dust, UV index, temperature, and humidity. Data is transmitted via Wi-Fi to the IoT cloud platform ThingSpeak for analysis and visualization. However, the study has two significant drawbacks: the use of 8-bit microcontrollers with limited memory and slow clock speeds, and difficulty implementing ThingSpeak because of data transfer delays and limited customization options for data visualization tools. Similarly, the "Arduair" air quality monitoring station, which also uses an Arduino board, was developed by V. Chaudhry et al. [7] and focuses on air pollutants such as CO2, CO, and SO2. With the criteria of reducing system cost as well as being compact, convenient, and accessible to many users, the author created a device to monitor air quality. Although a way to validate the sensor data is provided, the system remains too cumbersome for reading out and controlling air quality, as it lacks web and mobile applications for convenient tracking from anywhere. On the other hand, Z. He, X. Ruan, and colleagues [8] propose a strategy to construct an Internet of Things-based system for monitoring air quality. This system uses the STM32F103RCT6 combined with DS18B20, DHT11, and MQ2 sensors to record essential air elements, namely temperature, humidity, and smoke, respectively. Although the hardware information is provided, this system does not demonstrate data collection or analysis, and it lacks sensors to gather other air pollution factors. In 2020, CeADAR and Vietnam National University - Ho Chi Minh City released the HealthyAir app [9] for iOS platforms only, which is used to track air quality. Additionally, this system produced a web application with an air quality map feature so that users could access and track the air quality index; however, it has since been permanently shut down due to maintenance costs. Nevertheless, Ho Chi Minh City needs a CO concentration meter and air quality estimates because there are only six monitoring sites. Current research has established the foundation for an air quality monitoring system. However, the equipment used is often expensive, has slow transmission
speeds, limited accessibility, or is located too close to the source. Our approach replaces expensive sensors with inexpensive ones and utilizes fog computing technology to enhance data transfer and computing efficiency. This approach provides a simple tool for air quality monitoring and tracking, such as a web or mobile application.
3 Methodology

3.1 System Design
Fig. 1. Real-time Air Quality Monitoring System Overview.
The proposed system is shown in Fig. 1. It is an IoT node that consists of a sensor component, a microcontroller component for data collection, a transfer component for data delivery, and a fog computing component for processing and managing the content of websites and mobile applications. The software uses the following stages to carry out its functions: WiFi connection, initialization of the ThingsBoard server, reading, converting, and publishing data to ThingsBoard every five minutes. The STM32F429ZIT6 microcontroller collects all sensor data using analog and digital inputs. The data is
then combined into specified data structures and sent to the WiFi module using the UART protocol. The obtained data is uploaded using the HTTP protocol to ThingsBoard, which is our server running on the Raspberry Pi 4. This allows for evaluation and monitoring while decreasing costs and enhancing performance compared with using another enterprise platform. Finally, the APIs provided by ThingsBoard are used to access and visualize the data on the web and mobile applications.
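As a sketch of how one reading could be pushed to the locally hosted server, the snippet below uses ThingsBoard's HTTP device API (telemetry is posted as JSON to /api/v1/&lt;token&gt;/telemetry). The host address, access token, and field names are placeholders, and the gateway firmware in the actual system runs on the ESP8266 rather than in Python.

```python
import requests

THINGSBOARD_HOST = "http://192.168.1.10:8080"   # placeholder for the Raspberry Pi 4 host
ACCESS_TOKEN = "DEVICE_ACCESS_TOKEN"            # placeholder device token

def publish_telemetry(pm25, co2, co, uv, temperature, humidity):
    # Post one telemetry record to the local ThingsBoard server over HTTP
    payload = {"pm25": pm25, "co2": co2, "co": co, "uv": uv,
               "temperature": temperature, "humidity": humidity}
    url = f"{THINGSBOARD_HOST}/api/v1/{ACCESS_TOKEN}/telemetry"
    resp = requests.post(url, json=payload, timeout=5)
    resp.raise_for_status()

# Example call, repeated every five minutes in the described workflow
publish_telemetry(18.2, 420.0, 1.3, 3.0, 31.5, 72.0)
```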
3.2 Pollution Parameters Selection
The proposed system is designed to track air pollutants. Moreover, the selected air pollutants are more suitable for studying air pollution than other pollutants, such as VOCs and formaldehyde. The following air pollutants have been identified for the proposed air quality monitoring system that uses fog computing technology:
– Carbon Dioxide (CO2): Considered air pollution due to increased traffic. It is a greenhouse gas that traps heat in the atmosphere, raising temperatures in urban areas.
– Carbon Monoxide (CO): Poisonous gas created during combustion with insufficient oxygen. Real-time monitoring can prevent dangerous situations.
– Particulate Matter (PM) 2.5: Any particle less than 2.5 microns in width that can penetrate deep into the lungs. Short-term exposure can cause irritation, while long-term exposure can cause respiratory issues.
– Humidity: Amount of water vapor in the atmosphere that affects air quality. High humidity encourages the growth of microorganisms such as mold and other germs, which can thrive in a humid environment.
– Temperature: Temperature can affect air quality in many ways, depending on location and time of day. Extreme heat can harm air quality and catalyze reactions that worsen already-polluted air.
– UV Index: International assessment of the intensity of sunburn-causing ultraviolet (UV) radiation at a certain location and time.

3.3 Data Acquisition Module
Our system uses five air quality sensors to determine the primary causes of air pollution. Each of these sensors has a unique configuration since they are made by several companies. For the DHT11 sensor, which measures temperature and humidity, the outcome data is exported as a readable structure that is the digital signal. The other sensors, their output is an analog signal. Therefore, it is required to transform it by implementing the converting function. Take the MQ135 sensor as a prime instance. According to the datasheet, the analog output (AO) pin’s value ranges between 0 and 1023. As a result, we must transform the sensor’s output from an analog signal to a particular type of digital signal, namely the ppm values (parts per million or the concentration of the substance the sensor detects). Different places will have varying ppm levels depending on the degree of pollution in each environment. The average outdoor
CO2 concentrations are generally between 300 and 500 parts per million (ppm). Consequently, there is an approach to computing the data from the analog signal. Our experiments show that the gas sensor can detect CO2 gas at voltage levels between 0 and 3 V in the range of 10 to 1000 ppm. Therefore, the ppm value changes linearly with voltages. It leads us to perform the linear equation shown in Fig. 2.
Fig. 2. The linear equation between voltage and PPM is related to the MQ135 sensor.
This equation is applicable to all sensors whose output is an analog signal that must be converted to a digital format for reading purposes, not just the MQ135 sensor. Afterward, we can establish a system of equations and conclude the ultimate formula for transforming the analog signal of the MQ135 sensor to digital, as shown below:

5 = a × 0 + b
1000 = a × 3 + b

Solving this system for a and b yields the conversion formula:

ppm = 331.667 × voltage + 5

The equation used to obtain the findings may have some allowable inaccuracies since it is a relative method. Although a more precise method for calculating the outcome exists, it is exceedingly complicated and can interfere with the sensors' components. In summary, our suggested system transforms an analog signal into a digital signal to verify the information gathered by unstable and low-cost sensors.
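The resulting conversion can be expressed as a small helper function. The 10-bit ADC range and 3 V span follow the figures quoted above; the function and constant names are ours, for illustration only.

```python
ADC_MAX = 1023          # 10-bit analog reading from the MQ135 AO pin (0..1023)
V_REF = 3.0             # sensor output spans roughly 0-3 V over the measured range

def adc_to_ppm(adc_value):
    # Convert the raw ADC count to a voltage, then apply the fitted linear model
    voltage = adc_value / ADC_MAX * V_REF
    # From the two reference points above: 5 ppm at 0 V and 1000 ppm at 3 V
    return 331.667 * voltage + 5

print(adc_to_ppm(512))   # a mid-scale reading corresponds to roughly 503 ppm
```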
3.4 Communication Module
After structuring the data into a readable format, it will be transmitted to the ESP8266 NodeMCU module via UART protocol. The NodeMCU utilizes the
existing WiFi infrastructure to connect to the internet and is better equipped to manage temperature and operate stably over long periods. Additionally, the NodeMCU can connect to a computer through its built-in dongle and can be easily configured using the Arduino IDE. The data is then delivered to the ThingsBoard server through the connection between the ESP8266 WiFi module and the STM32F429ZIT6 board. With the available connection details, the WiFi module is programmed to access WiFi automatically.

3.5 Fog Computing Module
This study presents our air quality monitoring system, which utilizes the ThingsBoard platform to reduce costs and maximize the benefits of fog computing. Users can collect, visualize, and manage devices using the ThingsBoard platform. Moreover, ThingsBoard permits the integration of gadgets connected to external and legacy systems via standard protocols such as MQTT, HTTP, and others. Additionally, ThingsBoard provides various tools for data description and adaptation, depending on the user’s settings. In this case, ThingsBoard customizes the real-time data functionality based on the data time-series chart in Fig. 3. This enables users to monitor the collected data from the air quality monitoring system, which must be frequently updated and presented in a logical and understandable manner.
Fig. 3. The visualization tool provided by ThingsBoard.
As previously mentioned, cloud computing can be a crucial component in many IoT systems. However, due to its poor reaction time and other characteristics, it is not suitable for studies on time-sensitive data that rely heavily on both accuracy and speed. To address these limitations, this research proposes the use of fog computing technology, which involves the additional usage of a Raspberry Pi 4 to locally host the ThingsBoard server. This will significantly improve data visualization tasks, real-time tracking, and the assessment of collected data.
By hosting the server locally, we can respond effectively to any issues that arise and maintain control over its health. The ThingsBoard server is highly reliable for remote data collection and storage. Our recommended approach leverages a server-side API with an optimized web page and mobile app to enable easy access to air quality conditions from any location. The website’s front end, depicted in Fig. 4, is structured using the React framework and organizes the user interface into areas including data tracking. If air quality exceeds safe limits, the system issues a warning and displays trends. The mobile app’s core is built on the Flutter framework, as shown in Fig. 5, and uses the open-source map OpenStreetMap due to API restrictions with Google Maps.
Fig. 4. A web application visualizing sensor data
Fig. 5. A mobile application with air quality map

4 Implementation and Evaluation

4.1 Implementation
Our air quality monitoring system with fog computing technology in Ho Chi Minh City recorded temperature, humidity, gases, UV index and PM 2.5 concentrations with local climate data. The sensors have accurately measured airborne pollutants dangerous to people, thanks to the research of sensor parameters. As a result, the project is progressing as expected. To lower infrastructure costs and improve system scalability, our air quality monitoring system can self-deploy using open-source IoT platforms like ThingsBoard and can be hosted on a small computer, such as the Raspberry Pi 4 Model B. Using appropriate libraries and frameworks is crucial to providing consumers with an effortless and considerate experience. Since the system aims to serve the community and contribute to a shared objective of protecting the environment and everyone’s health, the entire connecting procedure and tools are accessible to all users.
Table 1. Comparison between Proposal Air Quality System and Related Systems Field
Detail
Sensors
DHT11 MQ135 MQ7 MQ2/MQ5 PM2.5 UV Index Microcontroller STM32 Arduino Module WiFi ESP8266 NodeMCU ESP8266 ESP-01 Publication Open-source IoT Platform ThingsBoard ThingSpeak Result Steps showing Correctness Completeness
Our proposal system Air Quality Monitoring and Forecasting [4] Arduair [5] STM32 Indoor Air [6] x x x x x x x
x x x
x
x
x
x x x x
x
x x x x x x
x x x
x x
x x
4.2 Evaluation
Overall, our research aims to improve the pre-existing architecture of the real-time air quality monitoring system by introducing fog computing as a means of filtering and relaying data to the database. Our system has reached a critical stage and has gained a significant advantage over earlier research in the field, as shown in Table 1. The following benefits are provided by this system during the monitoring of the air quality of the current atmosphere:
– Verify the collected data from the low-cost and unstable sensors.
– Successfully self-deploy a local server with an open-source IoT platform and utilize the computational resources supplied by fog computing technology.
– Implement a cross-platform dashboard system for simpler monitoring, designed as a user-friendly, cost-effective technology that can continually capture data and be used for months at a time.
5 Discussion, Conclusion, and Future Works
This study developed a real-time air quality monitoring system using fog computing technology. Inexpensive equipment was utilized and the results were uploaded to ThingsBoard through its API. The formula for converting analog signals was derived using the characteristics of the sensors. The data obtained were validated, and a user-friendly interface with various functionalities was added to the website and mobile application. These include the ability to display the latest updated data, historical data, and interactive charts. The system can be used as a tool to identify air pollution levels and prompt the government to develop responses. It is an accurate and reasonably priced air quality monitoring system. However, there are still opportunities for further development. Deep Learning techniques and models can be used to estimate the level of air quality and the weather. Additionally, the air quality monitoring network system can be expanded to obtain more precise and trustworthy data.
Although our system operates persistently and with great performance, there is still potential for development. To estimate the level of air quality and weather, we can use some deep-learning techniques and models. Lastly, we intend to expand the air quality monitoring network system to keep track of more precise and trustworthy data. With these goals in mind, this research is a promising technology that can help both individuals and communities greatly. Acknowledgment. This research is funded by Vietnam National University HoChiMinh City (VNU-HCM) under grant number C2023-28-03.
References

1. Zanella, A., Bui, N., Castellani, A., Vangelista, L., Zorzi, M.: Internet of things for smart cities. IEEE Internet Things J. 1(1), 22–32 (2014)
2. Bui, K., No, H.R., Whitehead, N.A.: Analyzing air quality of urban cities in Korea and Vietnam. In: Proceedings of the 2019 International Conference on Big Data and Education, pp. 19–25 (2019)
3. Ashi, Z., Al-Fawa'reh, M., Al-Fayoumi, M.: Fog computing: security challenges and countermeasures. Int. J. Comput. Appl. 175(15), 30–36 (2020)
4. Ismail, A.A., Hamza, H.S., Kotb, A.M.: Performance evaluation of open source IoT platforms. In: 2018 IEEE Global Conference on Internet of Things (GCIoT), pp. 1–5. IEEE (2018)
5. Firdhous, M., Sudantha, B., Karunaratne, P.: IoT enabled proactive indoor air quality monitoring system for sustainable health management. In: 2017 2nd International Conference on Computing and Communications Technologies (ICCCT), pp. 216–221. IEEE (2017)
6. Tran, Q., Dang, Q., Le, T., Nguyen, H.-T., Tan, L.: Air quality monitoring and forecasting system using IoT and machine learning techniques (July 2022)
7. Chaudhry, V.: Arduair: air quality monitoring. Inter. J. Environ. Eng. Manag. 4(6), 639–646 (2013)
8. He, Z., Ruan, X.: Research on indoor air monitoring system based on STM32. Acad. J. Eng. Technol. Sci. 5(7), 46–52 (2022)
9. Phuong Anh, T.T.: Launching AI-integrated air quality warning app (2022). https://vnuhcm.edu.vn/su-kien_33356864/ra-mat-app-canh-bao-chat-luong-khong-khi-tich-hop-ai/343232326864.html
An Intelligent Computing Method for Scheduling Projects with Normally Distributed Activity Times

Nguyen Hai Thanh(B)

Faculty of Applied Sciences, International School, Vietnam National University, Hanoi, Vietnam
[email protected]
Abstract. An effective computing method for scheduling and crashing projects, wherein the project activity duration times are crisp values, is the Program evaluation and review technique. In some real-life situations, however, activity duration times are not crisp values since they possess certain kinds of fuzziness and randomness. Recently, there has been a sufficiently large amount of research work on project scheduling where activity times are modeled as random variables. Still, to deal with such types of projects, it is necessary to enhance the Program evaluation and review technique in its modeling and computing aspects. This paper proposes an innovative intelligent computing method, which is based on the fuzzy linear programming approach and a fuzzified chance constraint treatment, for project scheduling wherein project activity times are normally distributed random variables. As a result, a new algorithm frame has been built up to help in analyzing and managing projects more efficiently.

Keywords: Project scheduling · Program evaluation and review technique · Fuzzy linear programming · Fuzzified chance constraint · Membership function
1 Introduction

An effective computing method for scheduling and crashing projects, wherein the project activity duration times are crisp values, is the Program evaluation and review technique (PERT). In the era of business intelligence, to support big data management and data-driven decision making, PERT is continuously being enhanced regarding its modeling and computing aspects with the aim of dealing with projects wherein activity duration times assume crisp values and/or certain kinds of fuzziness and randomness [4, 8, 11–13, 18, 22]. In Vietnamese higher education as well as in universities of developed countries, for students of many bachelor and master programs, PERT is an important topic of several courses [3, 17, 19] to study to gain knowledge and skills in project analysis and management [7, 9, 10, 16]. However, most of these courses supply only the PERT network diagram computing and do not provide more advanced methods based on linear programming procedures for scheduling and crashing projects. A large amount of literature has recently been
published on PERT wherein the project activity duration times (also referred to as project activity times) assume some kind of randomness [1, 2, 5, 6, 11, 12, 15]. Several recent papers, which focus on project scheduling, can only help in computing the expected value and the variance of the project completion time but cannot find the critical path of the project. The computational procedures presented in recent literature on PERT are complicated and involve difficulties because they are often described as arithmetical computing rules rather than as detailed algorithms. Moreover, there is no research paper on applying fuzzy linear programming for project scheduling wherein the project activity duration times are random variables following normal distributions. This paper proposes an innovative intelligent computing method based on the fuzzy linear programming approach and a fuzzified chance constraint treatment to schedule projects where project activity duration times are random variables following normal distributions. As a result, a new algorithm frame based on linear programming has been built up to help in analyzing and managing projects more efficiently. The other parts of this paper are presented as follows: In Sect. 2, an approach based on linear programming for project scheduling with crisp activity times will be reviewed [20, 21]. In Sect. 3, an approach based on fuzzy linear programming and a fuzzified chance constraint treatment will be built up for project scheduling with normally distributed activity duration times. Preliminary conclusions will be made in Sect. 4.
2 Project Scheduling with Crisp Activity Duration Times Using an Approach Based on Linear Programming

2.1 Program Evaluation and Review Technique: Some Fundamental Definitions and Concepts

Some fundamental definitions and concepts of PERT with crisp activity times (also called classical PERT) can be recollected from references or textbooks used in higher education in Vietnam and developed countries [3, 17, 19]. In this paper, the following notations shall be used, following exactly the symbols and terminologies in [21]:
i) A = {1, 2, …, n} and A+ = {0, 1, 2, …, n, n + 1} denote, respectively, the set of project activities and the extended set of project activities, wherein notations 0 and n + 1 are used for the start activity and the finish activity, respectively;
ii) R and R+ denote, respectively, the relations of precedence in A and in A+: R = {(i, j) | i, j ∈ A, i is an immediate predecessor of j} and R+ = R ∪ {(0, j) | j ∈ A and j has no immediate predecessor} ∪ {(i, n + 1) | i ∈ A and i has no immediate successor};
iii) For an activity j ∈ A, the following notations are used: tj for the project activity duration time, ESj for the earliest time to start, EFj for the earliest time to finish, LSj for the latest time to start, LFj for the latest time to finish, and slack time sj = LSj − ESj = LFj − EFj;
iv) Also denote by T the minimum project completion time.

For activity j to be called a critical activity, it is necessary that LSj = ESj and LFj = EFj, that is, when sj = 0. For a project, a critical path is built up from the start activity, critical activities, and the finish activity, which are linked by arrows according to the relation of precedence R+, i.e., if (i, j) ∈ R+ then there is an arrow from activity i to activity j.
2.2 Scheduling Projects Using Linear Programming

Consider activity j, j = 1, …, n, and use $x_j$ to denote its finish time; then $EF_j \le x_j \le LF_j$. Also assume $x_0 = 0$ and $x_{n+1} = T$, with $x_j \ge 0$ for j = 1, …, n. The integer linear programming problems (ILPPs) Problem 1, Problem 2, and Problem 3, as stated below, can be solved [20, 21] to compute the activity finish times $x_j$ and the project completion time T, the earliest times to finish $EF_j$, and the latest times to finish $LF_j$, j = 1, …, n, respectively:

Problem 1: Min $z = x_{n+1}$, subject to the constraints: $x_j - x_i \ge t_j$ for $(i, j) \in R^+$; where $t_j$ is the project activity duration time for activity j, j = 1, …, n, and $t_{n+1} = 0$.

Problem 2: Min $z = \sum_{j=1}^{n} x_j$, subject to the constraints described in Problem 1 and one more condition: $x_{n+1} = T$, where T assumes its value found in Problem 1.

Problem 3: Max $z = \sum_{j=1}^{n} x_j$, subject to the constraints described in Problem 2.
6
0
6
0
6
0
(2)
1
3
6
9
13
16
7
(3)
1
4
6
10
6
10
0
(4)
2
3
9
12
16
19
7
(5)
3
9
10
19
10
19
0
(6)
2, 3
3
10
13
22
25
12
(7)
4, 5
6
19
25
19
25
0
(8)
3
3
10
13
22
25
12
critical critical critical critical
Example 1: Consider a project with the input data given in Table 1 (activities from 1 to 8 are given in 1st column, immediate predecessors for activities are given in 2nd column, the duration times or activity times of activities are given in 3rd column). By applying Problem 1, the optimal solution can be found: (x0 , x1 , x2 , x3 , x4 , x5 , x6 , x7 , x8 , x9 ) = (0, 6, 16, 10, 19, 19, 19, 25, 13, 25). The results of solving Problem 2 and Problem 3 are reported in Table 1 at EF and LF columns. Earliest and latest times for activities to start can then be calculated using formulas: ESj = EFj − tj , LSj = LFj − tj , for j = 1, 8, and are reported in Table 1 at ES and LS columns. The project completion time T is 25 and (0) → (1) → (3) → (5) → (7) → (9) is the critical path. Note 1: The computing results reported in Table 1 and the results produced by the classical PERT network computing procedure are the same. One could find the mathematical proof of this fact in general in [20].
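To illustrate Problem 1 computationally, the sketch below encodes the precedence relation R+ of the Example 1 project (start activity 0, finish activity 9) and minimizes $x_9$ with SciPy's linprog; it should return T = 25. The variable ordering and the solver are implementation choices, not part of the paper; Problems 2 and 3 can be handled the same way by changing the objective to the sum of the $x_j$ and fixing $x_9 = T$.

```python
from scipy.optimize import linprog

# Activity durations t_j from Table 1 (index 0 = start, 9 = finish)
t = {1: 6, 2: 3, 3: 4, 4: 3, 5: 9, 6: 3, 7: 6, 8: 3, 9: 0}
# Precedence relation R+ as (i, j) pairs: activity i must finish before j
R = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 5), (2, 6), (3, 6),
     (4, 7), (5, 7), (3, 8), (6, 9), (7, 9), (8, 9)]

n_vars = 10                      # x_0 .. x_9 (activity finish times)
c = [0] * n_vars
c[9] = 1                         # Problem 1: minimise x_{n+1}

# x_j - x_i >= t_j  rewritten as  x_i - x_j <= -t_j  for linprog's A_ub form
A_ub, b_ub = [], []
for i, j in R:
    row = [0] * n_vars
    row[i], row[j] = 1, -1
    A_ub.append(row)
    b_ub.append(-t[j])

bounds = [(0, 0)] + [(0, None)] * 9   # x_0 fixed at 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("Project completion time T =", res.fun)   # expected: 25.0
```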
3 Scheduling Projects with Normally Distributed Activity Times Using an Approach Based on Fuzzy Linear Programming

3.1 Solving a Class of Stochastic Programming Problems Using Fuzzy Linear Programming

The following special type of classical stochastic linear programming problem (LPP), where the right-hand-side parameters are random variables, is most important for consideration:

Problem 4: Min $z = \sum_{j=1}^{n} c_j x_j$, subject to the constraints: $\sum_{j=1}^{n} a_{ij} x_j \ge b_i$ (i = 1, …, m); $x_j \ge 0$ (j = 1, …, n); where $b_i = N(m_i, \sigma_i)$ is a random variable which follows a normal distribution with expected value $m_i$ and standard deviation $\sigma_i$, for i = 1, …, m.
Treatment of Constraints: On the basis of the fuzzified chance constraint approach [14], the ith stochastic constraint $\sum_{j=1}^{n} a_{ij} x_j \ge b_i$ can be treated as a fuzzified chance constraint:

$$P\Big(\sum_{j=1}^{n} a_{ij} x_j \ge b_i\Big) \;\overset{\sim}{\ge}\; \tilde{p}_i,$$

where $\tilde{p}_i$ is a fuzzy threshold of type $(p_i, s_i)_{LL}$, $p_i$ is the reference level, and $s_i$ is the left spread of $\tilde{p}_i$.

Note that $\sum_{j=1}^{n} a_{ij} x_j - b_i$ is a normal random variable $N\big(\sum_{j=1}^{n} a_{ij} x_j - m_i, \sigma_i\big)$ with expected value $\sum_{j=1}^{n} a_{ij} x_j - m_i$ and standard deviation $\sigma_i$. Therefore, denoting $P\big(\sum_{j=1}^{n} a_{ij} x_j - b_i \ge 0\big)$ by $Prob_i$, we have

$$Prob_i = P\Big(Z \ge -\Big(\sum_{j=1}^{n} a_{ij} x_j - m_i\Big)/\sigma_i\Big) = \Phi\Big(\Big(\sum_{j=1}^{n} a_{ij} x_j - m_i\Big)/\sigma_i\Big),$$

where Z stands for the standard normal random variable and $\Phi(\cdot)$ stands for the cumulative distribution function of Z.

Consider the ith fuzzy flexible constraint $P\big(\sum_{j=1}^{n} a_{ij} x_j \ge b_i\big) \overset{\sim}{\ge} \tilde{p}_i$; then the following membership function can be built up w.r.t. the decision vector $x = (x_1, x_2, \ldots, x_n) \in R^n$. Accordingly, the level of belonging of vector x to $C_i$, the fuzzy set containing vectors that satisfy the fuzzified chance constraint [24], can be measured by computing the membership function value:

$$\mu_{C_i}(x) = \begin{cases} 1 & \text{if } Prob_i \ge p_i \\ \dfrac{Prob_i - (p_i - s_i)}{s_i} & \text{if } p_i > Prob_i \ge p_i - s_i \\ 0 & \text{if } p_i - s_i > Prob_i. \end{cases}$$

Note that $Prob_i \ge p_i \Leftrightarrow \Phi\big(\big(\sum_{j=1}^{n} a_{ij} x_j - m_i\big)/\sigma_i\big) \ge p_i \Leftrightarrow \sum_{j=1}^{n} a_{ij} x_j \ge m_i + \sigma_i \Phi^{-1}(p_i)$; and, by similar reasoning, the above expression of $\mu_{C_i}(x)$ can be rewritten as follows:

$$\mu_{C_i}(x) = \begin{cases} 1 & \text{if } \sum_{j=1}^{n} a_{ij} x_j \ge m_i + \sigma_i \Phi^{-1}(p_i) \\ \dfrac{Prob_i - (p_i - s_i)}{s_i} & \text{if } m_i + \sigma_i \Phi^{-1}(p_i) > \sum_{j=1}^{n} a_{ij} x_j \ge m_i + \sigma_i \Phi^{-1}(p_i - s_i) \\ 0 & \text{if } m_i + \sigma_i \Phi^{-1}(p_i - s_i) > \sum_{j=1}^{n} a_{ij} x_j. \end{cases}$$
Now, for the sake of computational convenience, in the above expression we replace the increasing function $(Prob_i - (p_i - s_i))/s_i$ of the variable $y = \sum_{j=1}^{n} a_{ij} x_j$ by the linear increasing function $\big(\sum_{j=1}^{n} a_{ij} x_j - (m_i + \sigma_i \Phi^{-1}(p_i - s_i))\big) / \big((m_i + \sigma_i \Phi^{-1}(p_i)) - (m_i + \sigma_i \Phi^{-1}(p_i - s_i))\big)$. Then the membership function assumes the form:

$$\mu_{C_i}(x) = \begin{cases} 1 & \text{if } \sum_{j=1}^{n} a_{ij} x_j \ge m_i + \sigma_i \Phi^{-1}(p_i) \\ \dfrac{\sum_{j=1}^{n} a_{ij} x_j - \big(m_i + \sigma_i \Phi^{-1}(p_i - s_i)\big)}{\big(m_i + \sigma_i \Phi^{-1}(p_i)\big) - \big(m_i + \sigma_i \Phi^{-1}(p_i - s_i)\big)} & \text{if } m_i + \sigma_i \Phi^{-1}(p_i) > \sum_{j=1}^{n} a_{ij} x_j \ge m_i + \sigma_i \Phi^{-1}(p_i - s_i) \\ 0 & \text{if } m_i + \sigma_i \Phi^{-1}(p_i - s_i) > \sum_{j=1}^{n} a_{ij} x_j. \end{cases}$$
Note 2: For i = 1, …, m, denote b_i = m_i + σ_i Φ^{-1}(p_i) and q_i = b_i − (m_i + σ_i Φ^{-1}(p_i − s_i)) = σ_i Φ^{-1}(p_i) − σ_i Φ^{-1}(p_i − s_i). Then the fuzzy flexible constraint P(Σ_{j=1}^{n} a_{ij} x_j ≥ b_i) ≳ p̃_i can be equivalently treated as the fuzzy flexible constraint Σ_{j=1}^{n} a_{ij} x_j ≳ b̃_i, with b̃_i = (b_i, q_i)_LL, where b_i is the reference level and q_i is the left spread of b̃_i. To measure the level of belonging of the decision vector x = (x_1, x_2, …, x_n) ∈ R^n to C_i, the fuzzy set containing vectors that satisfy the fuzzified chance constraint, a membership function may be created as follows:
μ_{C_i}(x) = 1, if Σ_{j=1}^{n} a_{ij} x_j ≥ b_i;
μ_{C_i}(x) = (Σ_{j=1}^{n} a_{ij} x_j − (b_i − q_i)) / q_i, if b_i > Σ_{j=1}^{n} a_{ij} x_j ≥ b_i − q_i;
μ_{C_i}(x) = 0, if b_i − q_i > Σ_{j=1}^{n} a_{ij} x_j.
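A small numeric illustration of Note 2 (not taken from the paper): the helper below derives the crisp threshold pair (b_i, q_i) from (m_i, σ_i, p_i, s_i) using scipy.stats.norm.ppf for Φ^{-1}, and evaluates the piecewise-linear membership μ_{C_i} for a given left-hand side value Σ_j a_{ij} x_j. The example numbers are illustrative only.

```python
# Illustrative sketch of the thresholds and membership function of Note 2.
from scipy.stats import norm

def threshold(m_i, sigma_i, p_i, s_i):
    """b_i = m_i + sigma_i*Phi^-1(p_i); q_i = sigma_i*(Phi^-1(p_i) - Phi^-1(p_i - s_i))."""
    b_i = m_i + sigma_i * norm.ppf(p_i)
    q_i = sigma_i * (norm.ppf(p_i) - norm.ppf(p_i - s_i))
    return b_i, q_i

def mu_C(lhs, b_i, q_i):
    """Membership of a decision vector whose constraint left-hand side equals lhs."""
    if lhs >= b_i:
        return 1.0
    if q_i > 0 and lhs >= b_i - q_i:
        return (lhs - (b_i - q_i)) / q_i
    return 0.0

# Example values: m_i = 6, sigma_i = 1, p_i = 0.99865, s_i = 0.49865 give b_i ~ 9, q_i ~ 3
b, q = threshold(6.0, 1.0, 0.99865, 0.49865)
print(round(b, 3), round(q, 3), round(mu_C(8.0, b, q), 3))   # ~9.0, ~3.0, ~0.667
```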
Treatment of the Objective: For the objective function Min z = Σ_{j=1}^{n} c_j x_j in Problem 4, a fuzzy flexible goal may be generated as follows [24]. In Step 1, the LPP stated below may be considered to compute the value z_u, which is a possible upper limit for z:

Min z = Σ_{j=1}^{n} c_j x_j, subject to constraints: Σ_{j=1}^{n} a_{ij} x_j ≥ b_i (i = 1, …, m); x_j ≥ 0 (j = 1, …, n).
Next, in Step 2, the LPP stated below may be considered to compute the value z_l, which is a possible lower limit for z:

Min z = Σ_{j=1}^{n} c_j x_j, subject to: Σ_{j=1}^{n} a_{ij} x_j ≥ b_i − q_i (i = 1, …, m); x_j ≥ 0 (j = 1, …, n).
Then, the following membership function [21] can be constructed for the fuzzy set G, also called the fuzzy goal, w.r.t. the objective function:

μ_G(x) = 1, if Σ_{j=1}^{n} c_j x_j ≤ z_l;
μ_G(x) = (Σ_{j=1}^{n} c_j x_j − z_u) / (z_l − z_u), if z_l < Σ_{j=1}^{n} c_j x_j ≤ z_u;
μ_G(x) = 0, if z_u < Σ_{j=1}^{n} c_j x_j.
If the value of this membership function is large and close to 1, then the value of the objective function Σ_{j=1}^{n} c_j x_j is close to z_l. The above fuzzy goal is, in fact, a fuzzy flexible constraint Σ_{j=1}^{n} c_j x_j ≲ z̃, wherein z̃ = (z_l, z_s)_RR, with z_l being the reference point and z_s = z_u − z_l being the right spread of z̃, is a fuzzy right threshold. Having denoted by C_i (i = 1, …, m) the fuzzy constraints and by G the fuzzy goal, the following max–min type optimization problem can be considered [14, 21] to compute the optimal decision vector x = (x_1, x_2, …, x_n) of Problem 5 (wherein μ_D(·) is the aggregation membership function):

Problem 5: Max λ, subject to constraints: λ = μ_D(x) = min{μ_G(x), μ_{C_1}(x), μ_{C_2}(x), …, μ_{C_m}(x)}; (x, μ_D(x)) ∈ D; wherein x = (x_1, x_2, …, x_n) and D = (G) ∩ (C_1) ∩ (C_2) ∩ … ∩ (C_m).

Using the expressions of μ_G(x) and μ_{C_i}(x), for i = 1, …, m, as discussed above, Problem 5 can now be converted to the LPP stated below, which is solved in Step 3:

Problem 6: Max λ, subject to constraints: λ(z_l − z_u) − Σ_{j=1}^{n} c_j x_j ≥ −z_u; −λ q_i + Σ_{j=1}^{n} a_{ij} x_j ≥ b_i − q_i (i = 1, …, m); 0 ≤ λ ≤ 1; x_j ≥ 0 (j = 1, …, n).

Note 3: For solving Problem 5, it is required to follow the computing procedure: finding z_u (Step 1), finding z_l (Step 2) and solving Problem 6 (Step 3).
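The three steps of Note 3 map directly onto three LP solves. The sketch below is a minimal illustration with scipy.optimize.linprog (HiGHS backend); it assumes the constraint data have already been reduced to the crisp reference levels b and left spreads q of Note 2, and the function name fuzzy_lp is introduced here for illustration only — it is not taken from the paper.

```python
# Minimal sketch of the Step 1-3 procedure of Note 3 (illustrative only).
import numpy as np
from scipy.optimize import linprog

def fuzzy_lp(c, A, b, q):
    """Min c@x s.t. A@x >=~ (b, q)_LL (fuzzy flexible), x >= 0; returns (lambda*, x*)."""
    c, A, b, q = map(np.asarray, (c, A, b, q))
    m, n = A.shape

    zu = linprog(c, A_ub=-A, b_ub=-b, method="highs").fun          # Step 1: A@x >= b
    zl = linprog(c, A_ub=-A, b_ub=-(b - q), method="highs").fun    # Step 2: A@x >= b - q

    # Step 3 (Problem 6): maximize lambda over (x, lambda)
    c3 = np.r_[np.zeros(n), -1.0]
    rows = [np.r_[c, -(zl - zu)]]              # c@x - lambda*(zl - zu) <= zu
    rhs = [zu]
    for i in range(m):                         # -A_i@x + lambda*q_i <= -(b_i - q_i)
        rows.append(np.r_[-A[i], q[i]])
        rhs.append(-(b[i] - q[i]))
    res = linprog(c3, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(0, None)] * n + [(0, 1)], method="highs")
    return res.x[-1], res.x[:n]
```

In the project-scheduling application of Sect. 3.2 the same frame is used, with the decision variables additionally restricted to integers.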
3.2 Computing the Project Completion Time when Project Activity Duration Times are Random Variables following Normal Distributions

The stochastic ILPP stated below can be considered to compute the project completion time when the project activity duration times are random variables following normal distributions, i.e., t_j = N(m_j, σ_j), with expected value m_j and standard deviation σ_j, for j = 1, …, n (assuming x_0 = 0 and x_{n+1} = T, and x_j ≥ 0 for j = 1, …, n):
Problem 7: Min z = x_{n+1}, subject to constraints: x_j − x_i ≥ t_j for (i, j) ∈ R⁺; where x_j is the finish time of activity j and t_j = N(m_j, σ_j) is the normally distributed activity time of activity j, j = 1, …, n.
Example 2: Assume that the project activity duration times are random variables following normal distributions t_j = N(m_j, σ_j) for j = 1, …, 8, as given in Table 2 (for crisp activity times, the standard deviation σ_j = 0).
All the stochastic constraints in Problem 7, x_j − x_i ≥ t_j for (i, j) ∈ R⁺, can be treated, based on the fuzzified chance constraint approach, as fuzzy flexible constraints with the right-hand side fuzzy left threshold p̃_i = (p_i, s_i)_LL, which may be specified depending on the preferences of the project manager. For easy illustration, we choose p̃_i = (p_i, s_i)_LL = (0.99865, 0.49865); obviously, Φ^{-1}(p_i) = 3 and Φ^{-1}(p_i − s_i) = 0. The relation of precedence is as described in Table 1, and the input data are as described in Table 2 and Table 3 (following the fuzzy flexible constraint approach mentioned in Note 2 in Subsect. 3.1, b_i = m_i + σ_i Φ^{-1}(p_i) and q_i = σ_i Φ^{-1}(p_i) − σ_i Φ^{-1}(p_i − s_i) for i = 1, …, 8). Assume for all LPPs considered below that x_0 = 0 and that x_j ≥ 0 and integer for j = 1, …, 9. With x_9 being the project completion time, Problem 7 takes the form:
Table 2. Normally distributed activity times

Activity  Im. Pred. Activities  m_j   σ_j       Activity  Im. Pred. Activities  m_j   σ_j
(1)       –                     4     2/3       (5)       3                     6     1
(2)       1                     2     1/3       (6)       2, 3                  2     1/3
(3)       1                     4     0         (7)       4, 5                  4     2/3
(4)       2                     3     0         (8)       3                     3     0

(Activity time t_j = N(m_j, σ_j).)
Problem 7a: Min z = x_9, subject to: x_1 − x_0 ≳ (6,2)_LL; x_2 − x_1 ≳ (3,1)_LL; x_3 − x_1 ≳ (4,0)_LL; x_4 − x_2 ≳ (3,0)_LL; x_5 − x_3 ≳ (9,3)_LL; x_6 − x_2 ≳ (3,1)_LL; x_6 − x_3 ≳ (3,1)_LL; x_7 − x_4 ≳ (6,2)_LL; x_7 − x_5 ≳ (6,2)_LL; x_8 − x_3 ≳ (3,0)_LL; x_9 − x_6 ≳ (0,0)_LL; x_9 − x_7 ≳ (0,0)_LL; x_9 − x_8 ≳ (0,0)_LL (where ≳ denotes the fuzzy flexible inequality of Note 2).

Table 3. Fuzzy left thresholds

Activity  b_j   q_j       Activity  b_j   q_j
(1)       6     2         (5)       9     3
(2)       3     1         (6)       3     1
(3)       4     0         (7)       6     2
(4)       3     0         (8)       3     0

(Fuzzy left threshold b̃_j = (b_j, q_j)_LL.)
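As a quick cross-check, the thresholds of Table 3 can be reproduced from the (m_j, σ_j) pairs of Table 2 with the chosen p̃_i = (0.99865, 0.49865)_LL by applying the formulas of Note 2; the snippet below is illustrative only and is not the paper's code.

```python
# Reproduce Table 3 from Table 2 (illustrative check).
from fractions import Fraction
from scipy.stats import norm

p, s = 0.99865, 0.49865
k1 = int(round(norm.ppf(p)))       # Phi^-1(p)     = 3
k0 = int(round(norm.ppf(p - s)))   # Phi^-1(p - s) = 0

table2 = {1: (4, Fraction(2, 3)), 2: (2, Fraction(1, 3)), 3: (4, 0), 4: (3, 0),
          5: (6, 1), 6: (2, Fraction(1, 3)), 7: (4, Fraction(2, 3)), 8: (3, 0)}

for j, (m, sig) in table2.items():
    b = m + sig * k1               # b_j = m_j + sigma_j * Phi^-1(p)
    q = sig * (k1 - k0)            # q_j = sigma_j * (Phi^-1(p) - Phi^-1(p - s))
    print(f"activity ({j}): b = {b}, q = {q}")
```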
Now, to solve Problem 7, one could follow the computing procedure mentioned in Note 3. To perform Step 1 and Step 2 of this procedure, the following ILPPs, namely Problem 7b and Problem 7c, can be solved to compute the values z_u = 25 and z_l = 18:
Problem 7b: (Step 1) Min z = x_9, subject to: x_1 − x_0 ≥ 6; x_2 − x_1 ≥ 3; x_3 − x_1 ≥ 4; x_4 − x_2 ≥ 3; x_5 − x_3 ≥ 9; x_6 − x_2 ≥ 3; x_6 − x_3 ≥ 3; x_7 − x_4 ≥ 6; x_7 − x_5 ≥ 6; x_8 − x_3 ≥ 3; x_9 − x_6 ≥ 0; x_9 − x_7 ≥ 0; x_9 − x_8 ≥ 0.
Problem 7c: (Step 2) Min z = x_9, subject to: x_1 − x_0 ≥ 4; x_2 − x_1 ≥ 2; x_3 − x_1 ≥ 4; x_4 − x_2 ≥ 3; x_5 − x_3 ≥ 6; x_6 − x_2 ≥ 2; x_6 − x_3 ≥ 2; x_7 − x_4 ≥ 4; x_7 − x_5 ≥ 4; x_8 − x_3 ≥ 3; x_9 − x_6 ≥ 0; x_9 − x_7 ≥ 0; x_9 − x_8 ≥ 0.
Applying the approach shown in Problem 6 and using the data in Table 3 together with z_l − z_u = −7, the following Problem 7d can be considered to compute the finish times of activities j = 1, …, 8:
Problem 7d: (Step 3) Max z = x_10, subject to: −x_9 − 7x_10 ≥ −25; x_1 − x_0 − 2x_10 ≥ 4; x_2 − x_1 − x_10 ≥ 2; x_3 − x_1 ≥ 4; x_4 − x_2 ≥ 3; x_5 − x_3 − 3x_10 ≥ 6; x_6 − x_2 − x_10 ≥ 2; x_6 − x_3 − x_10 ≥ 2; x_7 − x_4 − 2x_10 ≥ 4; x_7 − x_5 − 2x_10 ≥ 4; x_8 − x_3 ≥ 3; x_9 − x_6 ≥ 0; x_9 − x_7 ≥ 0; x_9 − x_8 ≥ 0; 0 ≤ x_10 ≤ 1; wherein x_10 represents λ as mentioned in Problem 6.

Problem 7d produces the optimal solution x = (x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10) = (0, 5, 8, 9, 16, 17, 12, 22, 12, 22, 0.42857). This solution provides the project completion time x_9 = T = 22 and x_10 = λ = μ_D(x°) = 0.42857, which is the maximum level of belonging of vector x° = (x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9) = (0, 5, 8, 9, 16, 17, 12, 22, 12, 22) to the fuzzy set of decision vectors of Problem 7a. For finding the optimal solution x° of Problem 7a such that μ_D(x°) is 0.42857 and x_9 has the smallest possible value, the following Problem 7e can be solved:

Problem 7e: Min z = x_9, subject to the constraints described in Problem 7d and one more condition: x_10 = 0.42857. Problem 7e provides the optimal solution x° = (x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9) = (0, 5, 8, 9, 17, 17, 12, 22, 12, 22), with only one difference, x_4 = 17, when compared to the optimal solution of Problem 7d.

3.3 Finding Critical Activities when Project Activity Duration Times are Random Variables following Normal Distributions

Applying the approach stated in Problem 2 and Problem 3, the earliest and latest times to finish of activity j, for j = 1, …, n, can be found.

Example 2 (contd.): Using the input data described in Table 3, Problem 7f and Problem 7g, as stated below, can be considered to compute the earliest time to finish EF_j and the latest time to finish LF_j, for j = 1, …, 8:

Problem 7f: Min z = Σ_{j=1}^{8} x_j under all restrictions described in Problem 7e and one more condition: x_9 = 22. Problem 7g: Max z = Σ_{j=1}^{8} x_j under all restrictions described in Problem 7f.

Problem 7f provides the optimal solution x = (x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10) = (0, 5, 8, 9, 11, 17, 12, 22, 12, 22, 0.42857), where x_9 = 22 is the project completion time and x_10 = μ_D(x°) = 0.42857. Problem 7g provides the optimal solution x = (x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10) = (0, 5, 14, 9, 17, 17, 22, 22, 22, 22, 0.42857), where x_9 = 22 is the project completion time and x_10 = μ_D(x°) = 0.42857.

Now, one can calculate slack times for all activities using the formula s_j = LF_j − EF_j. Activities 1, 3, 5, 7 are critical since s_j = 0 for j = 1, 3, 5, 7. Therefore, the critical path is (0) → (1) → (3) → (5) → (7) → (9). The activity times for the critical activities are: t_1 = x_1 = 5, t_3 = x_3 − x_1 = 4, t_5 = x_5 − x_3 = 8, t_7 = x_7 − x_5 = 5. Based on the relation of precedence and the formula t_j = min_{i ∈ pred(j)} (x_j − x_i), where x_j = EF_j for j = 1, …, 8, activity times can now be computed for all non-critical activities 2, 4, 6 and 8: t_2 = min{x_2 − x_1} = 3, t_4 = min{x_4 − x_2} = 3, t_6 = min{x_6 − x_2, x_6 − x_3} = 3, t_8 = min{x_8 − x_3} = 3. The computing results are summarized in Table 4:
Table 4. Project's activity start and finish times when project activity duration times follow normal distributions

Activity  ES  EF  LS  LF  t  Slack      Activity  ES  EF  LS  LF  t  Slack
(1)       0   5   0   5   5  0          (5)       9   17  9   17  8  0
(2)       5   8   11  14  3  6          (6)       9   12  19  22  3  10
(3)       5   9   5   9   4  0          (7)       17  22  17  22  5  0
(4)       8   11  14  17  3  6          (8)       9   12  19  22  3  10
Note 4: Problem 7f, Problem 7g and Problem 7e provide optimal solutions that also belong to the set of optimal solutions of Problem 7a; therefore, these ILPPs provide decision vectors with the maximum level μ_D(·) = 0.42857 of belonging to the fuzzy set of decision vectors of Problem 7a. All the time chance constraints are satisfied with a probability of at least α ≈ 0.71 (α ≈ 0.5 + 0.42857 × (0.99865 − 0.49865)).
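To reproduce the key numbers of Example 2, Problem 7d can be solved directly as a mixed-integer LP (integer x_1, …, x_9, continuous x_10 = λ). The sketch below uses the integrality argument of scipy.optimize.linprog, which requires SciPy ≥ 1.9; the arc data are the (b_j − q_j, q_j) pairs derived from Table 3. It is an illustrative reconstruction, not the author's code, and the non-critical components of the solution it returns may differ from those printed above, since Problem 7d has alternative optima.

```python
# Illustrative reconstruction of Problem 7d (Step 3) as a mixed-integer LP.
import numpy as np
from scipy.optimize import linprog

# arcs (i, j, b, q) from Table 3; each gives x_j - x_i >= (b - q) + q*lam
arcs = [(0, 1, 6, 2), (1, 2, 3, 1), (1, 3, 4, 0), (2, 4, 3, 0), (3, 5, 9, 3),
        (2, 6, 3, 1), (3, 6, 3, 1), (4, 7, 6, 2), (5, 7, 6, 2), (3, 8, 3, 0),
        (6, 9, 0, 0), (7, 9, 0, 0), (8, 9, 0, 0)]
zu, zl = 25, 18
nv = 11                                      # x0..x9 and lam (the paper's x10)

c = np.zeros(nv); c[10] = -1.0               # maximize lam
A, rhs = [], []
row = np.zeros(nv); row[9], row[10] = 1.0, zu - zl
A.append(row); rhs.append(zu)                # x9 + (zu - zl)*lam <= zu
for i, j, b, q in arcs:                      # x_i - x_j + q*lam <= -(b - q)
    row = np.zeros(nv); row[i], row[j], row[10] = 1.0, -1.0, q
    A.append(row); rhs.append(-(b - q))

res = linprog(c, A_ub=np.array(A), b_ub=np.array(rhs),
              bounds=[(0, 0)] + [(0, None)] * 9 + [(0, 1)],
              integrality=[1] * 10 + [0], method="highs")
print("lambda =", round(res.x[10], 5), " T =", round(res.x[9]))   # ~0.42857, 22
```

Problems 7e–7g then only add the extra conditions described above (fixing λ, fixing x_9 = 22, and minimizing or maximizing Σ_j x_j).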
4 Concluding Observations

This paper has provided a review of a recently proposed approach based on linear programming to project scheduling wherein the project activity duration times are crisp values [20, 21]. As a scientific contribution, this paper proposes an intelligent computing method using the fuzzy linear programming approach and a fuzzified chance constraint treatment for project scheduling, that is, for finding the project completion time and the critical activities as well as the critical path when the project activity duration times are random variables following normal distributions. The new intelligent computing method is based on an algorithm frame that is built up by a series of ILPPs (see Subsects. 3.2 and 3.3) as proposed in Sect. 3. The new intelligent computing method shows its computational efficiency by supplying optimal solutions having the maximum level of the aggregation membership function μ_D(·), which determines the level of belonging of these solutions to the fuzzy set of decision vectors (see Note 4). Moreover, the integer restrictions can be omitted to obtain more accurate solutions.

The research results presented in this paper may be improved by specifying the membership functions μ_G(x) and μ_{C_i}(x), for i = 1, …, m (see Subsect. 3.1), of nonlinear type and piecewise linear type. The application of these types of membership functions will help in making more effective decisions for project scheduling with uncertain activity times in general, and with normally distributed activity times in particular. For this purpose, instead of the linear programming algorithms shown in this paper, other value-function-based optimization algorithms such as genetic, evolutionary, swarm intelligence algorithms or any suitable metaheuristic search algorithms may be of use. Moreover, using the fuzzy linear programming approach and the fuzzified chance constraint treatment presented in this paper, an algorithm frame can be built up for crashing projects wherein the project activity duration times are random variables following normal distributions.
Acknowledgment. This research is funded by International School, Vietnam National University, Hanoi (VNU-IS) under project CS.2021-03.
References

1. Abdelkader, Y.H., Al-Ohali, M.E.: Estimating the completion time of stochastic activity networks with uniform distributed activity times. J. Arch. Des. Sci. 66(4), 115–134 (2013)
2. Afruzi, E.N., Aghaie, A., Naja, A.A.: Robust optimization for the resource-constrained multiproject scheduling problem with uncertain activity times. Int. J. Sci. Technol. Scientia Iranica 27(1), 361–376 (2020)
3. Anderson, D., Sweeney, D., Williams, T., Camm, J., Cochran, J.: Quantitative Methods for Business, 13th edn. Cengage Learning, South-Western (2015)
4. Bagshaw, K.B.: PERT and CPM in project management with practical examples. Am. J. Oper. Res. 11, 215–226 (2021)
5. Biruk, S., Rzepecki, L.: Simulation model for resource-constrained construction project. Open Eng. (2019). https://doi.org/10.1515/eng-2019-0037
6. Bordley, R.F., Keisler, J.M., Loganc, T.M.: Managing projects with uncertain deadlines. Eur. J. Oper. Res. 274(1), 291–302 (2019)
7. Chao, O.Y., Chen, W.L.: A hybrid approach for project crashing optimization strategy with risk consideration: a case study for an EPC project. Math. Probl. Eng. (2019). https://doi.org/10.1155/2019/9649632
8. Ehsani, E., Kazemi, N., Udoncy Olugu, E., Grosse, E.H., Schwindld, K.: Applying fuzzy multi-objective linear programming to a project management decision with non-linear fuzzy membership functions. Neural Comput. Appl. 28, 2193–2206 (2017)
9. EVN Central Power Corporation, PC3-INVEST: An application of PERT in scheduling and controlling hydropower plant Da Krong construction project (2015)
10. He, Y., He, Z., Wang, N.: Tabu search and simulated annealing for resource-constrained multi-project scheduling to minimize maximal cash flow gap. J. Ind. Manag. Optim. 17(5), 2451–2474 (2021)
11. Kamburowski, J.: Normally distributed activity times in PERT networks. J. Oper. Res. Soc., 1051–1057 (2017). https://doi.org/10.1057/jors.1985.184
12. Lin, L., Lou, T., Zhan, N.: Project scheduling problem with uncertain variables. Appl. Math. 5, 685–690 (2014)
13. Mahmoudi, A., Feylizadeh, M.: A mathematical model for crashing projects by considering time, cost, quality and risk. J. Proj. Manag. 2(1), 27–36 (2017)
14. Mohan, C., Thanh, N.H.: An interactive satisficing method for solving multiobjective mixed fuzzy-stochastic programming problems. Fuzzy Sets Syst. 117(1), 61–79 (2001)
15. Ortiz-Pimiento, N.R., Diaz-Serna, J.F.: The project scheduling problem with nondeterministic activities time: a literature review. J. Ind. Eng. Manag. 11(1), 116–134 (2018)
16. Phong, N.H., Van, L.T.: Quantitative Methods for Business and Project Management in Construction Industry. Construction Publishing House (2015)
17. Taha, A.H.: Operations Research: An Introduction, 10th edn. Prentice Hall (2017)
18. Tenjo-García, J.S., Figueroa-García, J.C.: Simulation-based fuzzy PERT problems. In: 2019 IEEE Colombian Conference on Applications in Computational Intelligence Proceedings, pp. 1–5 (2019). https://doi.org/10.1109/ColCACI.2019.8781978
19. Thanh, N.H.: Applied Mathematics, Postgraduate Textbook. University of Education Publishing House (2005)
20. Thanh, N.H.: Application of linear programming to project scheduling and project crashing. Vietnam J. Agri. Sci. 19(11), 1499–1508 (2021)
21. Thanh, N.H.: A new intelligent computing method for scheduling and crashing projects with fuzzy activity completion times. In: Nguyen, N.T., Dao, N.N., Pham, Q.D., Le, H.A. (eds.) ICIT 2022. LNDECT, vol. 148, pp. 44–57. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-15063-0_4
22. Wang, X., Ning, Y.: Uncertain chance-constrained programming model for project scheduling problem. J. Oper. Res. Soc. 19(3), 384–391 (2018)
Security and Privacy
An Improved Hardware Architecture of Ethereum Blockchain Hashing System

Duc Khai Lam(1,2)(B), Quoc Linh Phan(1,2), Quoc Truong Nguyen(1,2), and Van Quang Tran(1,2)

1 University of Information Technology, Ho Chi Minh City, Vietnam
[email protected], {18520993,18521568}@gm.uit.edu.vn
2 Vietnam National University, Ho Chi Minh City, Vietnam
Abstract. Blockchain technology has become extremely popular, especially in recent years, and has been successfully applied in the field of cryptocurrencies. Prominent among them is Ethereum, one of the most popular and highly valued cryptocurrencies on the market; it is used in many aspects of society, including healthcare, education, and the economy. Ethereum mining is the process of adding a block of transactions to the Ethereum Blockchain. This process generates a new block, which processes, stores, and secures the transaction information created by users' transactions. Ethash is Ethereum's current mining algorithm, and its primary cryptographic hash functions are Keccak512 and Keccak256. This paper proposes a 2-stage pipelined architecture with high throughput and high efficiency for the Keccak hash function, and then applies this architecture to implement the Ethash algorithm in hardware. The design is implemented on a Xilinx Virtex-7 VC707 FPGA. Our proposed Keccak architecture achieves a Frequency of 816.742 MHz, while its Throughput and Efficiency reach 35.545 Gbps and 36.196 Mbps/Slices, respectively. The Hashrate of the proposed system achieves 414.14 KH/s at an operating Frequency of 66.263 MHz.
Keywords: Ethash Algorithm · Blockchain · SHA-3 · Keccak · FPGA

1 Introduction
Ethereum is defined and understood, in its simplest terms, as a decentralized platform based on Blockchain technology, best known for its native cryptocurrency, called Ether (ETH). Ethereum was invented by Vitalik Buterin (a programmer who co-founded Bitcoin Magazine) in 2015 [15]. Currently, Ethereum is applied in many fields, such as healthcare [9], education [11], and economics [4]. A new block is formed by a competitive and decentralized mining process; by adding new validated blocks to the Ethereum Blockchain, this process keeps the network stable and secure. Ethereum, like many other cryptocurrencies such as Bitcoin and Litecoin, uses the Proof-of-Work (PoW) consensus mechanism to do this [7]. PoW is a consensus mechanism that also serves as the proof of the work of
miners in solving the problem to find the nonce parameter satisfying the problem requirements. Ethash is the PoW algorithm for Ethereum [15]. According to our study and analysis, the Ethash algorithm uses two major cryptographic hash functions, Keccak512 and Keccak256 [15]. Currently, there are many related works on the implementation of these algorithms. In [10], the Keccak256 algorithm is implemented according to the Iterative architecture, a traditional architecture, with a Frequency of 301.57 MHz. In [6], the author optimizes the Iterative architecture by optimizing the controller of the system; the results are better than those of [10], with a Frequency of 309.6 MHz and throughput and efficiency of 14.04 Gbps and 11.24 Mbps/Slices, respectively. In [1], the author presents a pipelined architecture in which registers are placed between computational modules to shorten the critical path, which helps to increase the computation speed of the hash function on the FPGA while keeping the area cost moderate. The design in [1] is implemented on the Virtex-7 platform; the resulting Frequency is 832.848 MHz and the throughput is 36.985 Gbps, so the Efficiency of the architecture is very good, at 22.11 Mbps/Slices. Besides, the paper [12] also implements a 2-stage pipelined architecture, but on the Arria 10 platform, with quite good results: a Frequency of 498.62 MHz, and throughput and efficiency of 22.604 Gbps and 15.89 Mbps/Slices. Concerning the implementation of the Ethash algorithm on hardware, the paper [14] proposed a hardware design and implementation of the Ethash algorithm on an FPGA board, specifically the Xilinx Virtex-7 FPGA VC707. The Hashrate of [14] achieved 88 KH/s at a frequency of 19.231 MHz, and its throughput was 0.045 Gbps. It can be seen that the operating frequency of this system is still quite low, mainly because the modulo function has too much delay; the operating frequency of the system therefore depends entirely on this function. Currently, GPUs are widely utilized in mining [5]. The Hashrate in [8,13] is high due to GPU processing with multiple cores; however, the trade-off is high energy consumption, so the efficiency is not good. In this paper, we propose a hardware design architecture to implement the Ethash system. To achieve a highly efficient Ethash system, we propose a design for a two-stage pipelined Keccak function with high Throughput and Efficiency. Besides, a pipelined architecture for the modulo function is also built, based on the paper [3]. The outline of the paper is as follows: the hardware architecture is presented in Sect. 2, Sect. 3 presents the implementation results, and, finally, the conclusion is given in Sect. 4.
2 Proposed Ethash Hardware Architecture
In this section, the proposed hardware architecture is presented. We propose a two-stage pipelined Keccak design and then build a three-stage pipelined modulo architecture that is applied in the five-stage pipelined Main Loop. Moreover, using a pipelined architecture for the component modules allows the Ethash system to take two consecutive inputs and produce two consecutive Ethash value outputs.
[Figure: the Ethash datapath — the 320-bit {header, nonce} input feeds Keccak512_2stage; the 512-bit Seed is held in REG_Seed1/REG_Seed2; Main_Loop(Seed, full_size) reads the DAG and produces the 256-bit Cmix; Keccak256_2stage({Seed, Cmix}) outputs ethash_value.]
Fig. 1. The architecture of Ethash system
2.1 The Architecture of Ethash
The Ethash system consists of three main processing modules, Keccak512_2stage, Main_Loop, and Keccak256_2stage, which are generally described in Fig. 1. Our proposed Ethash architecture processes two consecutive input values to produce two corresponding Ethash values consecutively; this design helps to improve the hash rate of the system. The Keccak512_2stage module is a two-stage pipelined Keccak512 function that receives a 320-bit input value made up of two values, a 256-bit header and a 64-bit nonce. The output of the Keccak512_2stage module is a Seed value of length 512 bits. The Seed value is the input to the Main_Loop module. In addition, the Main_Loop module also receives the full_size value, which is the size in bytes of the DAG file. The Main_Loop module, which is pipelined with five stages, performs 64 loops; each loop fetches data from the DAG file and performs the necessary calculations, finally producing a 256-bit Cmix value. The Keccak256_2stage module is a two-stage pipelined Keccak256 function, receiving an input value of 768 bits made up of two values: the 512-bit Seed and the 256-bit Cmix. The output of this module is the Ethash value of the Ethash system.
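As a rough software model of this data flow (not the paper's RTL, and with the DAG page indexing and word widths of the full Ethash specification simplified), the chain Seed → 64 FNV-based mixing loops over DAG words → Cmix → Keccak256 can be sketched as follows; keccak_512, keccak_256 and dag_page are assumed helper callables, and the 32-bit FNV mix fnv(a, b) = (a · 0x01000193) ⊕ b is the one conventionally used by Ethash.

```python
# Simplified, illustrative model of the Fig. 1 data flow (hashimoto-style sketch).
import struct

FNV_PRIME = 0x01000193

def fnv(a: int, b: int) -> int:
    """32-bit FNV mix used by Ethash."""
    return ((a * FNV_PRIME) ^ b) & 0xFFFFFFFF

def ethash_value(header32: bytes, nonce: int, num_pages: int,
                 dag_page, keccak_512, keccak_256) -> bytes:
    seed = keccak_512(header32 + struct.pack("<Q", nonce))   # 512-bit Seed
    s = list(struct.unpack("<16I", seed))                    # Seed as 16 x uint32
    mix = s * 2                                              # 1024-bit working mix

    for i in range(64):                                      # Main_Loop: 64 iterations
        p = fnv(i ^ s[0], mix[i % 32]) % num_pages           # index of a DAG page
        page = dag_page(p)                                   # 32 x uint32 read from the DAG
        mix = [fnv(a, b) for a, b in zip(mix, page)]

    cmix = [fnv(fnv(fnv(mix[i], mix[i + 1]), mix[i + 2]), mix[i + 3])
            for i in range(0, 32, 4)]                        # compress to 256-bit Cmix
    return keccak_256(seed + struct.pack("<8I", *cmix))      # ethash_value
```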
The two registers REG_Seed1 and REG_Seed2 are used to store two consecutive Seed values generated from the Keccak512_2stage module. These two Seed values are used as input to the Keccak256_2stage module of the Ethash system. The two modules Keccak512_2stage and Keccak256_2stage have the same architecture at the KeccakRound stage; they only differ in the processing stages before and after KeccakRound, which are simple operations such as bit concatenation in Padding, bit XOR in Mapping, and bit trimming in Truncating to give the desired output length. Since our results are compared with the Keccak256 function in related studies, we will discuss the architecture, the synthesis results, and the hardware resources of the Keccak256_2stage module in the following sections. For the Ethash system, one clock cycle after the start signal is asserted, the system starts computing in the Keccak512_2stage module. After 49 clock cycles, the Keccak512_2stage module outputs the first result and puts it into the Main_Loop. The Main_Loop module takes 320 clock cycles to output the first result and put it into the Keccak256_2stage module. After 50 clock cycles, the Keccak256_2stage module produces two Ethash values. Thus, the system takes 1 + 49 + 320 + 50 = 420 clock cycles to produce the first two Ethash values. While the Main_Loop module is calculating, the Keccak512_2stage module receives a new input and starts calculating, so that right after the Main_Loop has finished executing, the Keccak512_2stage module has the results to start a new Main_Loop. Therefore, after we have the first two Ethash values, it only takes 320 clock cycles to get the next two Ethash values. In the worst case, we theoretically have to iterate the Ethash algorithm 2^64 times to find 2^64 Ethash values if no Ethash value less than or equal to the target is found. We need to repeat the Ethash algorithm 2^64 times because the 64-bit nonce value that is part of the Ethash system input is initially set to zero, and if the Ethash value does not satisfy the requirements, the system increases the nonce value by 1. According to our design, after each execution of the Ethash algorithm, two consecutive Ethash values are produced, so in the worst case we only need to repeat the Ethash algorithm 2^64 / 2 = 2^63 times. So we have the average number of cycles to get one Ethash value as Eq. (1):
Cycles = (420 × 1 + 320 × (2^63 − 1)) / 2^64 ≈ 160 (cycles)    (1)
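Equation (1) and the Hashrate quoted in the abstract can be checked with a few lines of arithmetic; relating the two through hashrate = f / cycles-per-hash, with f = 66.263 MHz, is an assumption about how the reported 414.14 KH/s was obtained.

```python
# Quick numeric check of Eq. (1) and the resulting hashrate (illustrative arithmetic).
cycles_per_hash = (420 * 1 + 320 * (2**63 - 1)) / 2**64   # average cycles per Ethash value
f_hz = 66.263e6                                            # operating frequency (abstract)
print(round(cycles_per_hash, 3))                           # ~160.0
print(round(f_hz / cycles_per_hash / 1e3, 2), "KH/s")      # ~414.14
```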
2.2 The Architecture of Keccak256_2stage
By analyzing the research on implementing the Keccak algorithm on hardware, we found that the 2-stage pipelined architecture presented in the article [1] has many advantages. Placing the registers between the sub-functions in KeccakRound helps to shorten the critical path, which means that the operating frequency is increased and the processing speed is high, while the area cost is still moderate. In particular, this also allows the system to process two blocks of two different input messages at the same time, called message in 1 and message in 2.
[Figure: the Keccak256_2stage datapath — the 768-bit {Seed, Cmix} input goes through Padding (to the 1088-bit rate) and Mapping into the 1600-bit state; KeccakRound is split into two pipeline stages (θ, then ι·χ·π·ρ) separated by registers, with a 24:1 multiplexer driven by Counter24 selecting the round constants RC0–RC23; Truncating produces the 256-bit output Z (ethash_value).]
Fig. 2. The architecture of Keccak256 2stage
Because of the above advantages, we decided to follow the two-stage pipelined architecture to build and optimize the hardware design. At this point, a problem arose when we started to build the design, namely when performing Mapping for the first block of the input message. The first block value is always XORed with the output of the Iota function, which is risky if the output of the Iota function is non-zero: it would affect this first block value before it is passed into the KeccakRound loop and, in the worst case, cause errors in the calculation process, so that the resulting hash value Z is no longer accurate. However, XORing each message in block with the Iota output is consistent with the theory of the Keccak function built on the Sponge construction [2] if the block is the second or a later block of the message in. To solve the above problem, the author of [1] proposes to modify the first block value when hashing a new message in. Through the research process, we propose a new method that does not need to modify the first block. Instead, we reset a particular value into the REG1 register, which sits behind the Theta function and in front of the Rho function. Then, on every reset, this value goes through the four functions Rho → Pi → Chi → Iota and gives the result 0. This method does not use any logic gates to modify the first block, thus reducing hardware resources compared to the idea presented in [1].

In our proposed design, from the moment the first block of a message in is passed into the design, it takes 49 clock cycles to output that block. If a message in consists of n blocks, it takes n × 49 clock cycles to produce the hash of that message in. In case the system processes two message in at the same time, when we get the hash value of message in 1, we need to add one clock cycle to produce the hash value of message in 2. Thus, the total number of cycles to produce the hash values Z of two message in is determined by Eq. (2):

Latency = n × 49 + 1 (cycles)    (2)
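One plausible way to connect Eq. (2) with the throughput figure quoted in the abstract (35.545 Gbps at 816.742 MHz) is that two 1088-bit rate blocks are absorbed every 50 cycles when two single-block messages are interleaved; the snippet below only evaluates that relation and is an assumed derivation, not a statement from the paper.

```python
# Illustrative throughput estimate for the 2-stage pipelined Keccak256 (assumed derivation).
RATE_BITS = 1088                    # Keccak256 bitrate per absorbed block
f_hz = 816.742e6                    # reported operating frequency
n_blocks = 1                        # single-block messages
cycles = n_blocks * 49 + 1          # Eq. (2): cycles to hash two interleaved messages
throughput_gbps = 2 * n_blocks * RATE_BITS * f_hz / cycles / 1e9
print(round(throughput_gbps, 3), "Gbps")   # ~35.545
```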
[Figure: the Main_Loop datapath — the 512-bit Seed and 32-bit full_size inputs feed the Cal_Mix block; Headmix and Mix values are held in registers (REG_Headmix1/2, REG_Mix1/2) and combined as P = FNV(Headmix ^ count, Mix) under the 64-count loop counter Counter64; the three-stage modulo unit computes Index = mod_3stage(P, full_size >> 7), which addresses the DAG block whose slices (Block[30:23], Block[22:15], Block[14:7], …) are fetched through pipeline registers REG01–REG04.]