Proceedings of Data Analytics and Management: ICDAM 2023, Volume 4 (Lecture Notes in Networks and Systems, 788) 9819965527, 9789819965526

This book includes original unpublished contributions presented at the International Conference on Data Analytics and Management (ICDAM 2023).


English · 568 pages · 2023



Table of contents:
ICDAM 2023 Steering Committee Members
Preface
Contents
Editors and Contributors
Deep Spectral Feature Representations Via Attention-Based Neural Network Architectures for Accented Malayalam Speech—A Low-Resourced Language
1 Introduction
2 Related Work
3 Methodology
3.1 Data Collection
3.2 Data Preprocessing
3.3 Accented Model Construction
3.4 Conclusion
References
Improving Tree-Based Convolutional Neural Network Model for Image Classification
1 Introduction
1.1 Contribution of the Research Work
2 Literature Review
2.1 Previous Work
2.2 Contribution
3 Methodology
3.1 Overview
3.2 Dataset
3.3 1D Convolutions and Strides
3.4 Removal of Max Pooling
3.5 Leaky ReLU
3.6 Model Architecture
4 Results and Conclusion
References
Smartphone Malware Detection Based on Enhanced Correlation-Based Feature Selection on Permissions
1 Introduction
1.1 Motivation
1.2 Contributions
2 Related Work
3 Proposed Methodology
3.1 Datasets
3.2 Feature Extraction
3.3 Feature-Feature Correlation with ENMRS
3.4 Feature-Class Correlation Measure: crRelevance
3.5 Proposed Feature Selection Technique: ECFS
3.6 Machine Learning Techniques Used
4 Results and Discussion
4.1 n1 = 0.1 and n2 = 0.9
4.2 n1 = 0.2 and n2 = 0.8
4.3 n1 = 0.3 and n2 = 0.7
4.4 n1 = 0.4 and n2 = 0.6
4.5 n1 = 0.5 and n2 = 0.5
4.6 n1 = 0.6 and n2 = 0.4
4.7 n1 = 0.7 and n2 = 0.3
4.8 n1 = 0.8 and n2 = 0.2
4.9 n1 = 0.9 and n2 = 0.1
5 Conclusion
References
Fake News Detection Using Ensemble Learning Models
1 Introduction
2 Related Works
3 Proposed Methodology
3.1 Dataset Description and Data Preprocessing
3.2 Feature Extraction
3.3 Algorithms
3.4 Evaluation Metrics
3.5 Web Application
4 Results and Discussion
5 Conclusion
References
Ensemble Approach for Suggestion Mining Using Deep Recurrent Convolutional Networks
1 Introduction
2 Related Work
3 Proposed Architecture
4 Experiments
4.1 Dataset and Pre-processing
4.2 Experimental Setup
5 Results and Discussion
6 Limitations
7 Conclusion
References
A CNN-Based Self-attentive Approach to Knowledge Tracing
1 Introduction
2 Related Works
3 Proposed Method
3.1 Architecture for the Proposed Model
4 Experimentations
4.1 Dataset
4.2 Evaluation Methodology
5 Results and Discussion
6 Conclusion and Future Work
References
LIPFCM: Linear Interpolation-Based Possibilistic Fuzzy C-Means Clustering Imputation Method for Handling Incomplete Data
1 Introduction
2 Related Work
3 Proposed Methodology
4 Experimental Framework
4.1 Dataset Description
4.2 Missing Value Simulation
4.3 Evaluation Criteria
5 Experimental Results and Discussion
6 Conclusion and Future Work
References
Experimental Analysis of Two-Wheeler Headlight Illuminance Data from the Perspective of Traffic Safety
1 Introduction
2 Literature Review
3 Methodology
3.1 Methodology
3.2 Experimental Setup
4 Results and Discussion
5 Conclusion and Future Scope
References
Detecto: The Phishing Website Detection
1 Introduction
2 Literature Review
2.1 Methodologies for Phishing Website Detection
2.2 Dataset
2.3 Feature Extraction
2.4 Machine Learning Algorithm
3 Result
4 Limitation
5 Conclusion
References
Synergizing Voice Cloning and ChatGPT for Multimodal Conversational Interfaces
1 Introduction
2 Related Works
3 The Proposed System
3.1 Voice-Enabled ChatGPT
3.2 Voice Cloning Model
4 Methodology
4.1 Voice-Enabled ChatGPT
5 Voice Cloning
6 Result and Discussion
7 Conclusion
8 Future Scope
References
A Combined PCA-CNN Method for Enhanced Machinery Fault Diagnosis Through Fused Spectrogram Analysis
1 Introduction
2 Proposed Methodology
2.1 Dataset
2.2 Data Preprocessing
2.3 Fusion
2.4 CNN
3 Results and Discussion
4 Conclusion
References
FPGA-Based Design of Chaotic Systems with Quadratic Nonlinearities
1 Introduction
2 Design Methodology
2.1 Mathematical Representation
3 Results
3.1 Synthesis Results
3.2 Simulation Results
4 Conclusion
References
A Comprehensive Survey on Replay Strategies for Object Detection
1 Introduction
2 Object Detectors
3 Continual Learning and Catastrophic Forgetting
4 Replay Strategies for Object Detection
5 Conclusions
References
Investigation of Statistical and Machine Learning Models for COVID-19 Prediction
1 Introduction
2 Related Work
3 Methodology
3.1 Data Collection
3.2 Data Preprocessing
3.3 Performance Metrics
4 Algorithms Used
4.1 Statistical Model
4.2 Machine Learning Algorithms
5 Result Analysis
6 Conclusion
References
SONAR-Based Sound Waves’ Utilization for Rocks’ and Mines’ Detection Using Logistic Regression
1 Introduction
2 Literature Review
3 Proposed Work
4 Implementation Analysis
5 Conclusion and Future Scope
References
A Sampling-Based Logistic Regression Model for Credit Card Fraud Estimation
1 Introduction
2 Literature Review
3 Proposed Model
4 Result Analysis and Discussion
5 Conclusion and Future Scope
References
iFlow: Powering Lightweight Cross-Platform Data Pipelines
1 Introduction
2 Literature Review
3 Proposed Methodology
4 Result Analysis and Discussion
5 Conclusion and Future Work
References
Developing a Deep Learning Model to Classify Cancerous and Non-cancerous Lung Nodules
1 Introduction
2 Related Work
3 Proposed Method
3.1 Dataset
3.2 Model Architecture
4 Results and Discussion
5 Conclusion
References
Concrete Crack Detection Using Thermograms and Neural Network
1 Introduction
1.1 Related Work
2 Experiment Design
2.1 Simulation Dataset Creation
2.2 Camera and Concrete Blocks Specifications
2.3 Compression-Exposed Concrete Data Collection
2.4 Simulation Dataset Model
2.5 Laboratory Dataset Model
3 Results and Analysis
3.1 Simulation Dataset Model Results
3.2 Laboratory Dataset Model Results
3.3 The Challenges of Using Thermal Images
4 Conclusion
References
Wind Power Prediction in Mediterranean Coastal Cities Using Multi-layer Perceptron Neural Network
1 Introduction
2 Material and Method
2.1 Study Area and Dataset
2.2 MLPNN Model
2.3 Statistical Indices (SI)
3 Results and Discussions
4 Conclusions
References
Next Generation Intelligent IoT Use Case in Smart Manufacturing
1 Introduction
2 Literature Review
2.1 Research Objectives of This Study
2.2 Research Methodology
3 Next Generation Technology Development
4 Defining ‘4*S Model’
4.1 Conceptualization of ‘4*S Model’
5 Challenges in Smart Manufacturing
6 Advantages in Smart Manufacturing
6.1 Direct Cost Savings
6.2 Indirect Cost Savings
7 Limitations of This Study
8 Conclusion
References
Forecasting Financial Success App: Unveiling the Potential of Random Forest in Machine Learning-Based Investment Prediction
1 Introduction
2 Literature Review
3 Concept
3.1 Financial Investment
3.2 Machine Learning
4 Methodology
5 Results
6 Discussions
7 Limitations
8 Conclusion and Future Scope
References
Integration of Blockchain-Enabled SBT and QR Code Technology for Secure Verification of Digital Documents
1 Introduction
2 Literature Review
3 Methodology
4 Performance Analysis
4.1 Time
4.2 Scalability
4.3 Authentication and Security
4.4 Automation
5 Conclusion and Future Scope
References
Time Series Forecasting of NSE Stocks Using Machine Learning Models (ARIMA, Facebook Prophet, and Stacked LSTM)
1 Introduction
2 Literature Review
3 Dataset Description
4 Data Preparation
5 Assessment Metric
6 Models
6.1 ARIMA
6.2 Facebook Prophet
6.3 LSTM
7 Proposed Methodology
8 Observations and Results
9 Result Analysis
10 Limitations
11 Conclusion
12 Social Impact
13 Future Scope
References
Analysis of Monkey Pox (MPox) Detection Using UNETs and VGG16 Weights
1 Introduction
2 Previous Works
3 Methodology
3.1 VGG16
3.2 CNN
3.3 Custom CNN
3.4 UNET
4 Implementation Analysis
4.1 Data Preprocessing
4.2 Extracting Features
4.3 Measures of Performance
5 Discussion
6 Conclusion
7 Future Enhancement
References
Role of Robotic Process Automation in Enhancing Customer Satisfaction in E-commerce Through E-mail Automation
1 Introduction
2 Review of Literature
3 The E-learning Environment
4 Need for Robotic Process Automation
5 RPA Implementation
5.1 Payment Management
5.2 Moodle Account Creation
5.3 E-mail Automation
6 Discussions
7 Conclusion
References
Gene Family Classification Using Machine Learning: A Comparative Analysis
1 Introduction
1.1 Problem Statement
1.2 Machine Learning in Bioinformatics
1.3 Motivation
2 Literature Survey
3 Proposed Work
3.1 Architecture
3.2 Implementation of Machine Learning Algorithms
4 Implementation
4.1 k-mer Counting
4.2 Dataset Description
4.3 Limitations and Challenges
5 Assessment Metrics
6 Comparative Analysis
7 Result Analysis
8 Conclusion and Future Scope
References
Dense Convolution Neural Network for Lung Cancer Classification and Staging of the Diseases Using NSCLC Images
1 Introduction
2 Related Work
2.1 Lung Lesion Classification Using Artificial Neural Network
2.2 K-Nearest Neighbor Classification Model for Lung Lesion Classification
3 Current Approach
3.1 Image Preprocessing
3.2 Image Segmentation—Gradient Vector Flow (GAV)
3.3 Feature Extraction—ABCD Rule
3.4 Dense Convolution Neural Network
4 Experimental Results
5 Conclusion
References
Sentiment Analysis Using Bi-ConvLSTM
1 Introduction
2 Literature Survey
3 Problem Statement
4 Proposed Methodology
4.1 Preprocessing
4.2 Feature Extraction
4.3 Classification
5 Results and Discussions
5.1 Dataset Description
5.2 Evaluation Metrics
5.3 Performance Evaluation
6 Conclusion
References
A New Method for Protein Sequence Comparison Using Chaos Game Representation
1 Introduction
2 Proposed Method
2.1 Classification of Amino Acids
2.2 Proposed Method Using Chaos Game Theory on Hexagonal Model
2.3 Algorithm of Proposed Method
3 Results and Analysis
3.1 Comparison Analysis of Proposed Model with Earlier Approaches
4 Conclusion
References
Credit Card Fraud Detection and Classification Using Deep Learning with Support Vector Machine Techniques
1 Introduction
2 Related Works
3 Methodology
3.1 MLP
3.2 Support Vector Machine (SVM)
4 Result Analysis and Discussion
4.1 Performance Matrices
4.2 Dataset
4.3 Result Discussion
5 Conclusion
References
Prediction of Criminal Activities Forecasting System and Analysis Using Machine Learning
1 Introduction
2 Literature Review
3 Proposed Method
3.1 Face Recognition
4 Results
4.1 Face Recognition
5 Crime Detection
6 Conclusion and Future Works
References
Comparing Techniques for Digital Handwritten Detection Using CNN and SVM Model
1 Introduction
2 Related Work
3 Methodology
3.1 Dataset
3.2 Support Vector Machine
3.3 Multilayered Perceptron
3.4 Convolutional Neural Network
3.5 Visualization
4 Implementation
4.1 Preprocessing
4.2 Support Vector Machine
4.3 Multilayered Perceptron
4.4 Convolutional Neural Network
5 Result
6 Conclusion
7 Future Enhancement
References
Optimized Text Summarization Using Abstraction and Extraction
1 Introduction
2 Literature Survey
3 Proposed Methodology
3.1 Preprocessing
3.2 Algorithm
4 Experimental Results and Analysis
5 Conclusion
References
Mall Customer Segmentation Using K-Means Clustering
1 Introduction
2 Literature Review
3 Research Methodology
4 Result Analysis
5 Conclusion
6 Future Direction
References
Modified Local Gradient Coding Pattern (MLGCP): A Handcrafted Feature Descriptor for Classification of Infectious Diseases
1 Introduction
1.1 Limitation of the Related Work
1.2 Contributions to the Current Work
2 Proposed Methodology
2.1 Modified Local Gradient Coding Pattern
3 Experimental Design
3.1 Dataset Description
3.2 Experimentation
4 Results and Discussions
4.1 Discussions and Performance Comparison
5 Conclusions and Future Scope
References
Revolutionising Food Safety Management: The Role of Blockchain Technology in Ensuring Safe and High-Quality Food Products
1 Introduction
1.1 Background and Context of the Problem
1.2 Purpose and Significance of the Study
1.3 Research Questions and Objectives
1.4 Overview of the Paper
2 Literature Review
2.1 Overview of BCT
2.2 Existing Research of Blockchain in the Food Industry
2.3 Advantages and Limitations of BCT
3 Proposed Food Safety System
3.1 Farmers
3.2 Processors
3.3 Distributors
3.4 Retailers
3.5 Consumers
4 Discussion
4.1 Summary of the Main Findings
4.2 Answers of Research Questions
5 Conclusions
References
Securing the E-records of Patient Data Using the Hybrid Encryption Model with Okamoto–Uchiyama Cryptosystem in Smart Healthcare
1 Introduction
2 Related Works
3 Proposed Methodology
3.1 Okamoto–Uchiyama Cryptosystem
3.2 Proposed Hybrid Technique
3.3 Process for Encrypting Alpha or Alphanumeric Sensed Data
3.4 Process for Decrypting Alpha or Alphanumeric Sensed Data
3.5 Process for Obfuscating Numerical Detected Data
3.6 Technique for De-obfuscating Numerical Sensed Data
4 Results and Discussion
4.1 Breast Cancer Analysis
4.2 Details of Wisconsin Breast Cancer Dataset
5 Conclusion
References
Cuttlefish Algorithm-Based Deep Learning Model to Predict the Missing Data in Healthcare Application
1 Introduction
2 Related Works
3 Proposed System
3.1 Bidirectional LSTM Unit
3.2 Multi-criteria Decision Analysis (MCFA)
3.3 Multi-head Self-attention Apparatus
3.4 Outline of Self-attentive Encoding and Decoding in a Bidirectional Perspective
4 Results and Discussion
4.1 Data Set Description of Public Data Sets
4.2 Performance Measures
5 Conclusion
References
Drowsiness Detection System Using DL Models
1 Introduction
1.1 Research Objectives
1.2 Target Group
1.3 The Proposed System
1.4 Novelty of the Project
1.5 Why Will this Solution Work?
2 Related Work
3 Proposed System
3.1 The Detection of Facial Landmark
3.2 Detection of ROI
3.3 Classification of Eyes Using CNN
3.4 Score Calculation Algorithm
4 Proposed System
4.1 Dataset Description
4.2 Preprocessing
5 Algorithmic Steps
5.1 Face Detection
5.2 Eye Extraction
5.3 Eye Classification
5.4 EAR Calculation (Eye Aspect Ratio)
5.5 Drowsiness Detection
6 Flow/Block Diagram
7 Result
8 Conclusion and Future Scope
References
Author Index

Lecture Notes in Networks and Systems 788

Abhishek Swaroop · Zdzislaw Polkowski · Sérgio Duarte Correia · Bal Virdee, Editors

Proceedings of Data Analytics and Management ICDAM 2023, Volume 4

Lecture Notes in Networks and Systems Volume 788

Series Editor

Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors

Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Türkiye
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems, and others. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them.

Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

For proposals from Asia please contact Aninda Bose ([email protected]).

Abhishek Swaroop · Zdzislaw Polkowski · Sérgio Duarte Correia · Bal Virdee Editors

Proceedings of Data Analytics and Management ICDAM 2023, Volume 4

Editors

Abhishek Swaroop, Department of Information Technology, Bhagwan Parshuram Institute of Technology, New Delhi, Delhi, India
Zdzislaw Polkowski, Jan Wyzykowski University, Polkowice, Poland
Sérgio Duarte Correia, Polytechnic Institute of Portalegre, Portalegre, Portugal
Bal Virdee, Centre for Communications Technology, London Metropolitan University, London, UK

ISSN 2367-3370 ISSN 2367-3389 (electronic)
Lecture Notes in Networks and Systems
ISBN 978-981-99-6552-6 ISBN 978-981-99-6553-3 (eBook)
https://doi.org/10.1007/978-981-99-6553-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Paper in this product is recyclable.

ICDAM 2023 Steering Committee Members

Patrons

Prof. (Dr.) Don MacRaild, Pro-Vice Chancellor, London Metropolitan University, London
Prof. (Dr.) Wioletta Palczewska, Rector, The Karkonosze State University of Applied Sciences in Jelenia Góra, Poland
Prof. (Dr.) Beata Telążka, Vice-Rector, The Karkonosze State University of Applied Sciences in Jelenia Góra

General Chairs

Prof. Dr. Janusz Kacprzyk, Polish Academy of Sciences, Systems Research Institute, Poland
Prof. Dr. Karim Ouazzane, London Metropolitan University, London
Prof. Dr. Bal Virdee, London Metropolitan University, London
Prof. Cesare Alippi, Polytechnic University of Milan, Italy

Honorary Chairs

Prof. Dr. Aboul Ella Hassanien, Cairo University, Egypt
Prof. Dr. Vaclav Snasel, Rector, VSB-Technical University of Ostrava, Czech Republic
Prof. Chris Lane, London Metropolitan University, London


Conference Chairs

Prof. Dr. Vassil Vassilev, London Metropolitan University, London
Dr. Pancham Shukla, Imperial College London, London
Prof. Dr. Mak Sharma, Birmingham City University, London
Dr. Shikun Zhou, University of Portsmouth
Dr. Magdalena Baczyńska, Dean, The Karkonosze State University of Applied Sciences in Jelenia Góra, Poland
Dr. Zdzislaw Polkowski, Adjunct Professor KPSW, The Karkonosze State University of Applied Sciences in Jelenia Góra
Prof. Dr. Abhishek Swaroop, Bhagwan Parshuram Institute of Technology, Delhi, India
Prof. Dr. Anil K. Ahlawat, Dean, KIET Group of Institutes, India

Technical Program Chairs

Dr. Shahram Salekzamankhani, London Metropolitan University, London
Dr. Mohammad Hossein Amirhosseini, University of East London, London
Dr. Sandra Fernando, London Metropolitan University, London
Dr. Qicheng Yu, London Metropolitan University, London
Prof. Joel J. P. C. Rodrigues, Federal University of Piauí (UFPI), Teresina—PI, Brazil
Dr. Ali Kashif Bashir, Manchester Metropolitan University, UK
Dr. Rajkumar Singh Rathore, Cardiff Metropolitan University, UK

Conveners

Dr. Ashish Khanna, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi, India
Dr. Deepak Gupta, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi, India

Publicity Chairs

Dr. Józef Zaprucki, Prof. KPSW, Rector’s Proxy for Foreign Affairs, The Karkonosze State University of Applied Sciences in Jelenia Góra
Dr. Umesh Gupta, Bennett University, India
Dr. Puneet Sharma, Assistant Professor, Amity University, Noida


Dr. Deepak Arora, Professor and Head (CSE), Amity University, Lucknow Campus
João Matos-Carvalho, Lusófona University, Portugal

Co-conveners

Mr. Moolchand Sharma, Maharaja Agrasen Institute of Technology, India
Dr. Richa Sharma, London Metropolitan University, London

Preface

We are delighted to announce that London Metropolitan University, London, in collaboration with The Karkonosze University of Applied Sciences, Poland, Politécnico de Portalegre, Portugal, and Bhagwan Parshuram Institute of Technology, India, has hosted the eagerly awaited and much coveted International Conference on Data Analytics and Management (ICDAM 2023). The fourth edition of the conference attracted a diverse range of engineering practitioners, academicians, scholars, and industry delegates, with the received abstracts involving more than 7000 authors from different parts of the world. The committee of professionals dedicated to the conference strove to achieve a high-quality technical program with tracks on data analytics, data management, big data, computational intelligence, and communication networks. All the tracks chosen for the conference are interrelated and prominent in the present-day research community, and a great deal of research is under way in these tracks and their related sub-areas. More than 1200 full-length papers were received, among which the contributions are focused on theoretical, computer simulation-based research, and laboratory-scale experiments. Among these manuscripts, 190 papers have been included in the Springer proceedings after a thorough two-stage review and editing process. All the manuscripts submitted to ICDAM 2023 were peer-reviewed by at least two independent reviewers, who were provided with a detailed review pro forma. The comments from the reviewers were communicated to the authors, who incorporated the suggestions in their revised manuscripts. The recommendations from two reviewers were taken into consideration while selecting a manuscript for inclusion in the proceedings. The exhaustiveness of the review process is evident, given the large number of articles received addressing a wide range of research areas. The stringent review process ensured that each published manuscript met rigorous academic and scientific standards. It is an exalting experience to finally see these elite contributions materialize into the four book volumes of the ICDAM proceedings, published by Springer as “Proceedings of Data Analytics and Management: ICDAM 2023”. ICDAM 2023 invited four keynote speakers, who are eminent researchers in the field of computer science and engineering, from different parts of the world.


In addition to the plenary sessions on each day of the conference, seventeen concurrent technical sessions were held every day to ensure the oral presentation of around 190 accepted papers. Keynote speakers and session chair(s) for each of the concurrent sessions have been leading researchers from the thematic area of the session. The delegates were provided with a book of extended abstracts so they could quickly browse through the contents and participate in the presentations, giving the work access to a broad audience. The research part of the conference was organized in a total of 22 special sessions. These special sessions provided the opportunity for researchers conducting research in specific areas to present their results in a more focused environment.

An international conference of such magnitude, and the release of the ICDAM 2023 proceedings by Springer, has been the remarkable outcome of the untiring efforts of the entire organizing team. The success of an event undoubtedly involves the painstaking efforts of several contributors at different stages, dictated by their devotion and sincerity. Fortunately, since the beginning of its journey, ICDAM 2023 has received support and contributions from every corner. We thank all who have wished the best for ICDAM 2023 and contributed by any means toward its success. The edited proceedings volumes by Springer would not have been possible without the perseverance of all the steering, advisory, and technical program committee members. All the contributing authors are owed thanks from the organizers of ICDAM 2023 for their interest and exceptional articles. We would also like to thank the authors of the papers for adhering to the time schedule and for incorporating the review comments. We wish to extend our heartfelt acknowledgment to the authors, peer reviewers, committee members, and production staff whose diligent work gave shape to the ICDAM 2023 proceedings. We especially want to thank our dedicated team of peer reviewers who volunteered for the arduous and tedious step of quality checking and critiquing the submitted manuscripts. We wish to thank our faculty colleague Mr. Moolchand Sharma for extending his enormous assistance during the conference. The time spent by him and the midnight oil burnt are greatly appreciated, for which we will ever remain indebted. The management, faculty, administrative, and support staff of the college have always extended their services whenever needed, for which we remain thankful to them. Lastly, we would like to thank Springer for accepting our proposal to publish the ICDAM 2023 conference proceedings. Help received from Mr. Aninda Bose, the senior acquisitions editor, in the process has been very useful.

Abhishek Swaroop, New Delhi, India
Zdzislaw Polkowski, Polkowice, Poland
Sérgio Duarte Correia, Portalegre, Portugal
Bal Virdee, London, UK

Contents

Deep Spectral Feature Representations Via Attention-Based Neural Network Architectures for Accented Malayalam Speech—A Low-Resourced Language . . . 1
Rizwana Kallooravi Thandil, K. P. Mohamed Basheer, and V. K. Muneer

Improving Tree-Based Convolutional Neural Network Model for Image Classification . . . 15
Saba Raees and Parul Agarwal

Smartphone Malware Detection Based on Enhanced Correlation-Based Feature Selection on Permissions . . . 29
Shagun, Deepak Kumar, and Anshul Arora

Fake News Detection Using Ensemble Learning Models . . . 53
Devanshi Singh, Ahmad Habib Khan, and Shweta Meena

Ensemble Approach for Suggestion Mining Using Deep Recurrent Convolutional Networks . . . 67
Usama Bin Rashidullah Khan, Nadeem Akhtar, and Ehtesham Sana

A CNN-Based Self-attentive Approach to Knowledge Tracing . . . 77
Anasuya Mithra Parthaje, Akaash Nidhiss Pandian, and Bindu Verma

LIPFCM: Linear Interpolation-Based Possibilistic Fuzzy C-Means Clustering Imputation Method for Handling Incomplete Data . . . 87
Jyoti, Jaspreeti Singh, and Anjana Gosain

Experimental Analysis of Two-Wheeler Headlight Illuminance Data from the Perspective of Traffic Safety . . . 101
Aditya Gola, Chandra Mohan Dharmapuri, Neelima Chakraborty, S. Velmurugan, and Vinod Karar

Detecto: The Phishing Website Detection . . . 115
Ashish Prajapati, Jyoti Kukade, Akshat Shukla, Atharva Jhawar, Amit Dhakad, Trapti Mishra, and Rahul Singh Pawar


Synergizing Voice Cloning and ChatGPT for Multimodal Conversational Interfaces . . . 131
Shruti Bibra, Srijan Singh, and R. P. Mahapatra

A Combined PCA-CNN Method for Enhanced Machinery Fault Diagnosis Through Fused Spectrogram Analysis . . . 141
Harshit Rajput, Hrishabh Palsra, Abhishek Jangid, and Sachin Taran

FPGA-Based Design of Chaotic Systems with Quadratic Nonlinearities . . . 151
Kriti Suneja, Neeta Pandey, and Rajeshwari Pandey

A Comprehensive Survey on Replay Strategies for Object Detection . . . 163
Allabaksh Shaik and Shaik Mahaboob Basha

Investigation of Statistical and Machine Learning Models for COVID-19 Prediction . . . 181
Joydeep Saggu and Ankita Bansal

SONAR-Based Sound Waves’ Utilization for Rocks’ and Mines’ Detection Using Logistic Regression . . . 191
Adrija Mitra, Adrita Chakraborty, Supratik Dutta, Yash Anand, Sushruta Mishra, and Anil Kumar

A Sampling-Based Logistic Regression Model for Credit Card Fraud Estimation . . . 201
Prapti Patra, Srijal Vedansh, Vishisht Ved, Anup Singh, Sushruta Mishra, and Anil Kumar

iFlow: Powering Lightweight Cross-Platform Data Pipelines . . . 211
Supreeta Nayak, Ansh Sarkar, Dushyant Lavania, Nittishna Dhar, Sushruta Mishra, and Anil Kumar

Developing a Deep Learning Model to Classify Cancerous and Non-cancerous Lung Nodules . . . 225
Rishit Pandey, Sayani Joddar, Sushruta Mishra, Ahmed Alkhayyat, Shaid Sheel, and Anil Kumar

Concrete Crack Detection Using Thermograms and Neural Network . . . 237
Mabrouka Abuhmida, Daniel Milne, Jiping Bai, and Ian Wilson

Wind Power Prediction in Mediterranean Coastal Cities Using Multi-layer Perceptron Neural Network . . . 253
Youssef Kassem, Hüseyin Çamur, and Abdalla Hamada Abdelnaby Abdelnaby

Next Generation Intelligent IoT Use Case in Smart Manufacturing . . . 265
Bharati Rathore


Forecasting Financial Success App: Unveiling the Potential of Random Forest in Machine Learning-Based Investment Prediction . . . 279
Ashish Khanna, Divyansh Goyal, Nidhi Chaurasia, and Tariq Hussain Sheikh

Integration of Blockchain-Enabled SBT and QR Code Technology for Secure Verification of Digital Documents . . . 293
Ashish Khanna, Devansh Singh, Ria Monga, Tarun Kumar, Ishaan Dhull, and Tariq Hussain Sheikh

Time Series Forecasting of NSE Stocks Using Machine Learning Models (ARIMA, Facebook Prophet, and Stacked LSTM) . . . 303
Prabudhd Krishna Kandpal, Shourya, Yash Yadav, and Neelam Sharma

Analysis of Monkey Pox (MPox) Detection Using UNETs and VGG16 Weights . . . 321
V. Kakulapati

Role of Robotic Process Automation in Enhancing Customer Satisfaction in E-commerce Through E-mail Automation . . . 333
Shamini James, S. Karthik, Binu Thomas, and Nitish Pathak

Gene Family Classification Using Machine Learning: A Comparative Analysis . . . 343
Drishti Seth, KPA Dharmanshu Mahajan, Rohit Khanna, and Gunjan Chugh

Dense Convolution Neural Network for Lung Cancer Classification and Staging of the Diseases Using NSCLC Images . . . 361
Ahmed J. Obaid, S. Suman Rajest, S. Silvia Priscila, T. Shynu, and Sajjad Ali Ettyem

Sentiment Analysis Using Bi-ConvLSTM . . . 373
Durga Satish Matta and K. Saruladha

A New Method for Protein Sequence Comparison Using Chaos Game Representation . . . 389
Debrupa Pal, Sudeshna Dey, Papri Ghosh, Subhram Das, and Bansibadan Maji

Credit Card Fraud Detection and Classification Using Deep Learning with Support Vector Machine Techniques . . . 399
Fatima Adel Nama, Ahmed J. Obaid, and Ali Abdulkarem Habib Alrammahi

Prediction of Criminal Activities Forecasting System and Analysis Using Machine Learning . . . 415
Mahendra Sharma and Laveena Sehgal


Comparing Techniques for Digital Handwritten Detection Using CNN and SVM Model . . . 431
M. Arvindhan, Shubham Upadhyay, Avdeep Malik, Sudeshna Chakraborty, and Kimmi Gupta

Optimized Text Summarization Using Abstraction and Extraction . . . 445
Harshita Patel, Pallavi Mishra, Shubham Agarwal, Aanchal Patel, and Stuti Hegde

Mall Customer Segmentation Using K-Means Clustering . . . 459
Ashwani, Gurleen Kaur, and Lekha Rani

Modified Local Gradient Coding Pattern (MLGCP): A Handcrafted Feature Descriptor for Classification of Infectious Diseases . . . 475
Rohit Kumar Bondugula and Siba K. Udgata

Revolutionising Food Safety Management: The Role of Blockchain Technology in Ensuring Safe and High-Quality Food Products . . . 487
Urvashi Sugandh, Swati Nigam, and Manju Khari

Securing the E-records of Patient Data Using the Hybrid Encryption Model with Okamoto–Uchiyama Cryptosystem in Smart Healthcare . . . 499
Prasanna Kumar Lakineni, R. Balamanigandan, T. Rajesh Kumar, V. Sathyendra Kumar, R. Mahaveerakannan, and Chinthakunta Swetha

Cuttlefish Algorithm-Based Deep Learning Model to Predict the Missing Data in Healthcare Application . . . 513
A. Sasi Kumar, T. Rajesh Kumar, R. Balamanigandan, R. Meganathan, Roshan Karwa, and R. Mahaveerakannan

Drowsiness Detection System Using DL Models . . . 529
Umesh Gupta, Yelisetty Priya Nagasai, and Sudhanshu Gupta

Author Index . . . 543

Editors and Contributors

About the Editors

Prof. (Dr.) Abhishek Swaroop completed his B.Tech. (CSE) from GBP University of Agriculture and Technology, M.Tech. from Punjabi University Patiala, and Ph.D. from NIT Kurukshetra. He has industrial experience of 8 years in organizations like Usha Rectifier Corporations and Envirotech Instruments Pvt. Limited. He has 22 years of teaching experience. He has served in reputed educational institutions such as Jaypee Institute of Information Technology, Noida, Sharda University Greater Noida, and Galgotias University Greater Noida. He has served at various administrative positions such as Head of the Department, Division Chair, NBA Coordinator for the university, and Head of training and placements. Currently, he is serving as Professor and HoD, Department of Information Technology in Bhagwan Parshuram Institute of Technology, Rohini, Delhi. He is actively engaged in research and has more than 60 quality publications, out of which eight are SCI and 16 Scopus.

Prof. (Dr.) Zdzislaw Polkowski is Adjunct Professor at the Faculty of Technical Sciences at the Jan Wyzykowski University, Poland. He is also Rector’s Representative for International Cooperation and Erasmus Program and Former Dean of the Technical Sciences Faculty during the period of 2009–2012. His area of research includes management information systems, business informatics, IT in business and administration, IT security, small and medium enterprises, CC, IoT, big data, business intelligence, and blockchain. He has published around 60 research articles. He has served the research community in the capacity of Author, Professor, Reviewer, Keynote Speaker, and Co-editor. He has attended several international conferences in various parts of the world. He is also playing the role of Principal Investigator.

Prof. Sérgio Duarte Correia received his Diploma in Electrical and Computer Engineering from the University of Coimbra, Portugal, in 2000, the master’s degree in Industrial Control and Maintenance Systems from Beira Interior University, Covilhã, Portugal, in 2010, and the Ph.D. in Electrical and Computer Engineering from the University of Coimbra, Portugal, in 2020. Currently, he is Associate Professor at the Polytechnic Institute of Portalegre, Portugal. He is Researcher at COPELABS—Cognitive and People-centric Computing Research Center, Lusófona University of Humanities and Technologies, Lisbon, Portugal, and Valoriza—Research Center for Endogenous Resource Valorization, Polytechnic Institute of Portalegre, Portalegre, Portugal. Over the past 20 years, he has worked with several private companies in the field of product development and industrial electronics. His current research interests are artificial intelligence, soft computing, signal processing, and embedded computing.

Prof. Bal Virdee graduated with a B.Sc. (Engineering) Honors in Communication Engineering and M.Phil. from Leeds University, UK. He obtained his Ph.D. from the University of North London, UK. He worked as an academic at the Open University and Leeds University. Prior to this, he was Research and Development Electronic Engineer in the Future Products Department at Teledyne Defence (formerly Filtronic Components Ltd., Shipley, West Yorkshire) and at PYE TVT (Philips) in Cambridge. He has held numerous duties and responsibilities at the university, i.e., Health and Safety Officer, Postgraduate Tutor, Examinations Officer, Admissions Tutor, Short Course Organizer, and Course Leader for M.Sc./M.Eng. Satellite Communications, B.Sc. Communications Systems, and B.Sc. Electronics. In 2010, he was appointed as Academic Leader (UG Recruitment). He is a member of the ethics committee and of the school’s research committee and research degrees committee.

Contributors

Abdalla Hamada Abdelnaby Abdelnaby Faculty of Engineering, Mechanical Engineering Department, Near East University, Nicosia, North Cyprus, Cyprus Mabrouka Abuhmida University of South Wales, Cardiff, UK Parul Agarwal Department of Computer Science and Engineering, School of Engineering Sciences and Technology, Jamia Hamdard University, New Delhi, India Shubham Agarwal School of Information Technology, VIT University, Vellore, India Nadeem Akhtar Department of Computer Engineering and Interdisciplinary Centre for Artificial Intelligence, Aligarh Muslim University, Aligarh, Uttar Pradesh, India Ahmed Alkhayyat Faculty of Engineering, The Islamic University, Najaf, Iraq Ali Abdulkarem Habib Alrammahi National University of Science and Technology, Thi-Qar, Nasiriyah, Iraq


Yash Anand Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, India Anshul Arora Delhi Technological University, Rohini, Delhi, New Delhi, India M. Arvindhan School of Computing Science and Engineering, Galgotias University, Greater Noida, Uttar Pradesh, India Ashwani Chitkara University Institute of Engineering and Technology, Rajpura, Punjab, India Jiping Bai University of South Wales, Cardiff, UK R. Balamanigandan Department of Computer Science and Engineering, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Chennai, Tamil Nadu, India Ankita Bansal Netaji Subhas University of Technology, Dwarka, India Shaik Mahaboob Basha N.B.K.R. Institute of Science and Technology, Affiliated to Jawaharlal Nehru Technological University Anantapur, Vidyanagar, Ananthapuramu, Andhra Pradesh, India Shruti Bibra SRM Institute of Science and Technology, Ghaziabad, India Rohit Kumar Bondugula AI Lab, School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India Hüseyin Çamur Faculty of Engineering, Mechanical Engineering Department, Near East University, Nicosia, North Cyprus, Cyprus Adrita Chakraborty Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, India Neelima Chakraborty CSIR—Central Road Research Institute, New Delhi, India Sudeshna Chakraborty School of Computing Science and Engineering, Galgotias University, Greater Noida, Uttar Pradesh, India Nidhi Chaurasia Maharaja Agrasen Institute of Technology, Guru Gobind Singh Indraprastha, University Delhi, Delhi, India Gunjan Chugh Department of Artificial Intelligence and Machine Learning, Maharaja Agrasen Institute of Technology, Delhi, India Subhram Das Narula Institute of Technology, Kolkata, India Sudeshna Dey Narula Institute of Technology, Kolkata, India Amit Dhakad Medi-Caps University, Indore, India Nittishna Dhar Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, India


KPA Dharmanshu Mahajan Department of Artificial Intelligence and Machine Learning, Maharaja Agrasen Institute of Technology, Delhi, India Chandra Mohan Dharmapuri G. B. Pant Government Engineering College, New Delhi, India Ishaan Dhull Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, GGSIPU, Delhi, India Supratik Dutta Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, India Sajjad Ali Ettyem National University of Science and Technology, Thi-Qar, Iraq Papri Ghosh Narula Institute of Technology, Kolkata, India Aditya Gola G. B. Pant Government Engineering College, New Delhi, India Anjana Gosain USICT, Guru Gobind Singh Indraprastha University, New Delhi, India Divyansh Goyal Maharaja Agrasen Institute of Technology, Guru Gobind Singh Indraprastha, University Delhi, Delhi, India Kimmi Gupta School of Computing Science and Engineering, Galgotias University, Greater Noida, Uttar Pradesh, India Sudhanshu Gupta SCSET, Bennett University, Greater Noida, Uttar Pradesh, India Umesh Gupta SCSET, Bennett University, Greater Noida, Uttar Pradesh, India Stuti Hegde School of Information Technology, VIT University, Vellore, India Shamini James Kalasalingam Academy of Research and Education, Krishnankoil, Tamil Nadu, India Abhishek Jangid Department of Electronics and Communication Engineering, Delhi Technological University, Delhi, India Atharva Jhawar Medi-Caps University, Indore, India Sayani Joddar Kalinga Institute of Industrial Technology, Deemed to Be University, Bhubaneswar, India Jyoti USICT, Guru Gobind Singh Indraprastha University, New Delhi, India V. Kakulapati Sreenidhi Institute of Science and Technology, Yamnampet, Ghatkesar, Hyderabad, Telangana, India Prabudhd Krishna Kandpal Department of Artificial Intelligence and Machine Learning, Maharaja Agrasen Institute of Technology, Delhi, India Vinod Karar CSIR—Central Road Research Institute, New Delhi, India


S. Karthik Kalasalingam Academy of Research and Education, Krishnankoil, Tamil Nadu, India Roshan Karwa Department of CSE, Prof Ram Meghe Institute of Technology and Research, Badnera-Amravati, India Youssef Kassem Faculty of Engineering, Mechanical Engineering Department, Near East University, Nicosia, North Cyprus, Cyprus; Faculty of Civil and Environmental Engineering, Near East University, Nicosia, North Cyprus, Cyprus; Near East University, Energy, Environment, and Water Research Center, Nicosia, North Cyprus, Cyprus Gurleen Kaur Chitkara University Institute of Engineering and Technology, Rajpura, Punjab, India Ahmad Habib Khan Delhi Technological University, New Delhi, India Usama Bin Rashidullah Khan Interdisciplinary Centre for Artificial Intelligence, Aligarh Muslim University, Aligarh, India Ashish Khanna Maharaja Agrasen Institute of Technology, Guru Gobind Singh Indraprastha, University Delhi, Delhi, India; Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, GGSIPU, Delhi, India Rohit Khanna Department of Artificial Intelligence and Machine Learning, Maharaja Agrasen Institute of Technology, Delhi, India Manju Khari School of Computer and System Sciences, Jawaharlal Nehru University, New Delhi, India Jyoti Kukade Medi-Caps University, Indore, India Anil Kumar DIT University, Dehradun, India; Tula’s Institute, Dehradun, India Deepak Kumar Delhi Technological University, Rohini, Delhi, New Delhi, India Tarun Kumar Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, GGSIPU, Delhi, India Prasanna Kumar Lakineni Department of CSE, GITAM School of Technology, GITAM University, Visakhapatnam, India Dushyant Lavania Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, India R. P. Mahapatra SRM Institute of Science and Technology, Ghaziabad, India R. Mahaveerakannan Department of Computer Science and Engineering, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Chennai, Tamil Nadu, India


Bansibadan Maji National Institute of Technology, Durgapur, India Avdeep Malik School of Computing Science and Engineering, Galgotias University, Greater Noida, Uttar Pradesh, India Durga Satish Matta Department of Computer Science and Engineering, Puducherry Technological University, Puducherry, India

Shweta Meena Delhi Technological University, New Delhi, India R. Meganathan Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, AP, India Daniel Milne University of South Wales, Cardiff, UK Pallavi Mishra School of Information Technology, VIT University, Vellore, India Sushruta Mishra Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, India Trapti Mishra Medi-Caps University, Indore, India Adrija Mitra Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, India K. P. Mohamed Basheer Sullamussalam Science College, Areekode, Kerala, India Ria Monga Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, GGSIPU, Delhi, India V. K. Muneer Sullamussalam Science College, Areekode, Kerala, India Yelisetty Priya Nagasai SCSET, Bennett University, Greater Noida, Uttar Pradesh, India Fatima Adel Nama Faculty of Computer Science and Mathematics, University of Kufa, Kufa, Iraq Supreeta Nayak Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, India Swati Nigam Department of Computer Science, Faculty of Mathematics and Computing, Banasthali Vidyapith, Banasthali, India Ahmed J. Obaid Faculty of Computer Science and Mathematics, University of Kufa, Kufa, Iraq; Department of Computer Technical Engineering, Technical Engineering College, Al-Ayen University, Thi-Qar, Iraq Debrupa Pal Narula Institute of Technology, Kolkata, India; National Institute of Technology, Durgapur, India Hrishabh Palsra Department of Electronics and Communication Engineering, Delhi Technological University, Delhi, India


Neeta Pandey Delhi Technological University, Delhi, India Rajeshwari Pandey Delhi Technological University, Delhi, India Rishit Pandey Kalinga Institute of Industrial Technology, Deemed to Be University, Bhubaneswar, India Akaash Nidhiss Pandian Delhi Technological University, New Delhi, India Anasuya Mithra Parthaje Delhi Technological University, New Delhi, India Aanchal Patel School of Information Technology, VIT University, Vellore, India Harshita Patel School of Information Technology, VIT University, Vellore, India Nitish Pathak Bhagwan Parshuram Institute of Technology (BPIT), GGSIPU, New Delhi, India Prapti Patra Kalinga Institute of Industrial Technology, Deemed to Be University, Bhubaneswar, India Rahul Singh Pawar Medi-Caps University, Indore, India Ashish Prajapati Medi-Caps University, Indore, India Saba Raees Department of Computer Science and Engineering, School of Engineering Sciences and Technology, Jamia Hamdard University, New Delhi, India T. Rajesh Kumar Department of Computer Science and Engineering, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Chennai, Tamil Nadu, India Harshit Rajput Department of Electronics and Communication Engineering, Delhi Technological University, Delhi, India Lekha Rani Chitkara University Institute of Engineering and Technology, Rajpura, Punjab, India Bharati Rathore Birmingham City University, Birmingham, UK Joydeep Saggu Netaji Subhas University of Technology, Dwarka, India Ehtesham Sana Department of Computer Engineering, Aligarh Muslim University, Aligarh, India Ansh Sarkar Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, India K. Saruladha Department of Computer Science and Engineering, Puducherry Technological University, Puducherry, India A. Sasi Kumar Inurture Education Solutions Pvt. Ltd., Bangalore, India; Department of Cloud Technology and Data Science, Institute of Engineering and Technology, Srinivas University, Surathkal, Mangalore, India


V. Sathyendra Kumar Department of CSE, BIHER, Chennai, India; Annamacharya Institute of Technology and Sciences, Rajampet, Andhra Pradesh, India Laveena Sehgal IIMT College of Engineering, Greater Noida, UP, India Drishti Seth Department of Artificial Intelligence and Machine Learning, Maharaja Agrasen Institute of Technology, Delhi, India Shagun Delhi Technological University, Rohini, Delhi, New Delhi, India Allabaksh Shaik Jawaharlal Nehru Technological University Anantapur, Ananthapuramu, Andhra Pradesh, India; Sri Venkateswara College of Engineering Tirupati, Affiliated to Jawaharlal Nehru Technological University Anantapur, Ananthapuramu, Andhra Pradesh, India Mahendra Sharma IIMT College of Engineering, Greater Noida, UP, India Neelam Sharma Department of Artificial Intelligence and Machine Learning, Maharaja Agrasen Institute of Technology, Delhi, India Shaid Sheel Medical Technical College, Al-Farahidi University, Baghdad, Iraq Tariq Hussain Sheikh Department of Computer Science, Shri Krishan Chander Government Degree College Poonch, Jammu and Kashmir, India Shourya Department of Artificial Intelligence and Machine Learning, Maharaja Agrasen Institute of Technology, Delhi, India Akshat Shukla Medi-Caps University, Indore, India T. Shynu Department of Biomedical Engineering, Agni College of Technology, Chennai, Tamil Nadu, India S. Silvia Priscila Bharath Institute of Higher Education and Research, Chennai, Tamil Nadu, India Anup Singh Kalinga Institute of Industrial Technology, Deemed to Be University, Bhubaneswar, India Devansh Singh Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, GGSIPU, Delhi, India Devanshi Singh Delhi Technological University, New Delhi, India Jaspreeti Singh USICT, Guru Gobind Singh Indraprastha University, New Delhi, India Srijan Singh SRM Institute of Science and Technology, Ghaziabad, India Urvashi Sugandh Department of Computer Science, Faculty of Mathematics and Computing, Banasthali Vidyapith, Banasthali, India


S. Suman Rajest Bharath Institute of Higher Education and Research, Chennai, Tamil Nadu, India Kriti Suneja Delhi Technological University, Delhi, India Chinthakunta Swetha Department of Computer Science and Technology, Yogi Vemana University, Kadapa, YSR District Kadapa, Andhra Pradesh, India Sachin Taran Department of Electronics and Communication Engineering, Delhi Technological University, Delhi, India Rizwana Kallooravi Thandil Sullamussalam Science College, Areekode, Kerala, India Binu Thomas Marian College Kuttikkanam, Peermade, Idukki, Kerala, India Siba K. Udgata AI Lab, School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India Shubham Upadhyay School of Computing Science and Engineering, Galgotias University, Greater Noida, Uttar Pradesh, India Vishisht Ved Kalinga Institute of Industrial Technology, Deemed to Be University, Bhubaneswar, India Srijal Vedansh Kalinga Institute of Industrial Technology, Deemed to Be University, Bhubaneswar, India S. Velmurugan CSIR—Central Road Research Institute, New Delhi, India Bindu Verma Delhi Technological University, New Delhi, India Ian Wilson University of South Wales, Cardiff, UK Yash Yadav Department of Artificial Intelligence and Machine Learning, Maharaja Agrasen Institute of Technology, Delhi, India

Deep Spectral Feature Representations Via Attention-Based Neural Network Architectures for Accented Malayalam Speech—A Low-Resourced Language Rizwana Kallooravi Thandil , K. P. Mohamed Basheer , and V. K. Muneer

Abstract This study presents a novel methodology for Accented Automatic Speech Recognition (AASR) in Malayalam speech, utilizing Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) architectures, both integrated with attention blocks. The authors constructed a comprehensive accented speech corpus comprising speech samples from five distinct accents of the Malayalam language. The study was conducted in four phases, each exploring a different combination of features and model architecture. In the first phase, the authors utilized Mel frequency cepstral coefficients (MFCC) as the feature vectorization technique and combined it with an RNN to model the accented speech data. This configuration yielded a Word Error Rate (WER) of 11.98% and a Match Error Rate (MER) of 76.03%. In the second phase, the experiment utilized MFCC and tempogram methods for feature vectorization, combined with an RNN incorporating an attention mechanism, yielding a WER of 7.98% and an MER of 82.31% for the unified accented data model. In the third phase, MFCC and tempogram feature vectors along with the LSTM mechanism were employed to model the accented data, resulting in a WER of 8.95% and an MER of 83.64%. In the fourth phase, the researchers used the same feature set as in phases two and three and introduced LSTM with attention mechanisms to construct the accented model. This configuration led to the best results of the experiment, a WER of 3.8% and an MER of 87.11%. Remarkably, the study demonstrated the effectiveness of the LSTM with attention mechanism architecture, showcasing its ability to perform well even for unknown accents when combined with the appropriate accent attributes. The evaluation of performance using WER and MER showed a significant reduction of 10–15% when incorporating attention mechanisms in both the RNN and LSTM approaches, indicating the effectiveness of attention mechanisms in improving the accuracy of accented speech recognition systems.

Keywords Automatic speech recognition · Human computer interface · Speech feature vectorization · Attention mechanism · Deep neural networks · Accented Malayalam speech processing

R. K. Thandil (B) · K. P. Mohamed Basheer · V. K. Muneer
Sullamussalam Science College, Areekode, Kerala, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_1

1 Introduction

This study shows that the experiment constructed using MFCC and tempogram features with attention mechanisms outperforms the other methods used in this experiment for recognizing accented speech. The key contributions of the paper are as follows:
1. The authors have constructed an accented speech dataset for conducting this study.
2. The experiment is conducted in four different phases with four different approaches to identify the best approach.
3. The implications of the spectral features in accent identification are analyzed in this study.
4. A novel approach is proposed to minimize vanishing and exploding gradients and maximize the expectation.

2 Related Work

AASR has been a challenging task due to the variability in speech patterns, which arises from differences in pronunciation, intonation, and rhythm, among other factors. Recently, attention-based neural network architectures have emerged as promising approaches for improving accented speech recognition. Low-resourced languages with a diverse range of accents pose a significant challenge for accurate speech recognition [1]. Ajay et al. [2] proposed an attention-based convolutional neural network (CNN) architecture to extract deep spectral features for Malayalam speech recognition. The attention mechanism was used to identify the most relevant frames for recognition, and the model achieved an accuracy of 76.45% on the accented Malayalam speech dataset. In another study, Devi et al. [3] proposed a deep attention-based neural network architecture for constructing AASR for the Malayalam language. The model exhibited an accuracy rate of 82.07% on the accented Malayalam speech dataset.


Similarly, Sasikumar et al. [4] proposed an attention-based LSTM architecture for accented Malayalam speech recognition. They used Mel-scaled frequency cepstral coefficients (MFSC) as the input features and employed an attention mechanism to construct the ASR. The model achieved an accuracy of 80.02% on the accented Malayalam speech dataset, outperforming the baseline models. Sandeep Kumar et al. [5] explore a new approach to modeling emotion recognition from human speech. The authors propose a method that combines CNNs and tensor neural networks (TNNs) with attention mechanisms for constructing speech emotion recognition (SER) models. The method achieves promising results compared to several other approaches on publicly available datasets. Zhao et al. [6] propose a feature extraction method that utilizes MFCC and its time–frequency representations as input to the neural network, which yielded a better emotion recognition model on the datasets they used to conduct the study. Kumar and Reddy [7] combined MFCC and PLP methods for feature extraction and employed an attention mechanism to identify the most relevant features for recognition. The model achieved an accuracy of 83.15% on the Hindi-accented speech dataset. Ghosh et al. [8] used a combination of MFCC and shifted delta cepstral (SDC) features as input and employed an attention mechanism to learn the importance of different features for recognition. Their model achieved an accuracy of 89.7%, outperforming the baseline models. In another study, Kim et al. [9] used the same combination of MFCC and SDC features with an attention mechanism; their model achieved an accuracy of 80.2%, outperforming the baseline models. Similarly, Wang et al. [10] used a combination of MFCC and gammatone frequency cepstral coefficients (GFCC) as input and employed an attention mechanism to identify the most relevant frames for recognition. Their model achieved an accuracy of 80.3%, outperforming the baseline models. Parvathi and Rajendran [11] proposed an attention-based RNN architecture for Tamil-accented speech recognition. They used MFSC as the input features and employed an attention mechanism to identify the most relevant frames for recognition. In another study, Xiong et al. [12] proposed a deep spectral feature representation approach for Mandarin-accented speech recognition. They used a combination of MFCC and GFCC as the input features and employed a self-attention mechanism to identify the relevant features for recognition. They constructed an ASR model with an accuracy of 79.7% on the Mandarin-accented speech dataset, outperforming the baseline models. Prajwal et al. [13] proposed a Malayalam ASR system that can handle diverse accents, including those spoken by non-native speakers. The researchers collected a corpus of spoken Malayalam from native and non-native speakers and used it to train a deep neural network-based speech recognition model. They found that their system achieved an average recognition accuracy of 86.8% on accented speech, which was higher than the accuracy achieved by a baseline system trained only on native speech. Bineesh et al. [14] proposed a speaker adaptation algorithm to improve accented speech recognition in Malayalam. The researchers used a combination of acoustic modeling and speaker adaptation techniques to develop an accent-independent speech recognition system. They found that their system achieved an average recognition accuracy of 78.5% on accented speech, which was higher than the accuracy achieved by a baseline system that did not use speaker adaptation.

3 Methodology

AASR is a challenging task for low-resource languages like Malayalam. The publicly available data for conducting the study is very scarce, and hence the authors constructed an accented dataset of multisyllabic words for the purpose. The entire experiment was conducted in four phases (Fig. 1).

3.1 Data Collection

The authors have constructed a speech corpus recorded under natural recording conditions, consisting of approximately 1.17 h of accented speech. The corpus was constructed by considering individual utterances of multisyllabic words lasting between two and five seconds. Data was collected from forty speakers, including twenty males and twenty females, from five different districts of Kerala where people speak Malayalam with different accents. The speech samples were collected from native speakers ranging from five to eighty years of age.

Fig. 1 Functional block diagram of the proposed methodology

3.2 Data Preprocessing

In the present study, we employed two distinct data preprocessing techniques: the MFCC algorithm and the tempogram approach. The MFCC algorithm was chosen for its effectiveness in feature vectorization, allowing us to extract meaningful features from the speech signals, while the tempogram approach was utilized to specifically capture accent- and rhythm-related characteristics in the speech data. By utilizing these preprocessing methods, we aimed to optimize the data representation for subsequent analysis and modeling. Specifically, the vectors obtained from the MFCC algorithm, and a combination of the MFCC and tempogram vectors concatenated together, were utilized in different phases of the experiment (Table 1).

In this study, a total of 40 deep spectral coefficients were extracted from the accented speech signals using the MFCC approach, with each coefficient capturing specific characteristics of the signal. Initially, 13 spectral coefficients were extracted from the accented speech data. The first and second derivatives of these coefficients, which correspond to the rate of change of the spectral values with respect to time, were then calculated, resulting in 39 vector representations. Finally, the mean value of all 39 coefficients was calculated and appended to the vector list, resulting in a final set of 40 MFCC coefficients. The tempogram features were then extracted to specifically capture accent- and rhythm-related characteristics of the speech signals, resulting in 384 speech vectors per signal.

Table 1 District-wise data collection statistics

District       No. of audio samples
Kasaragod      1360
Kannur         1360
Kozhikode      1690
Malappuram     1360
Wayanad        1300
Total          7070
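To make this pipeline concrete, the following is a minimal sketch of how the 424-dimensional feature vector (40 MFCC-derived values plus 384 tempogram values) could be assembled with librosa. The time-averaging used to collapse frame-level features into one vector per utterance is our assumption, as the paper does not state its exact aggregation; the tempogram's default window length of 384 happens to match the 384 vectors reported above.

```python
import numpy as np
import librosa

def extract_features(path):
    """Build a 424-dim vector: 40 MFCC-derived values + 384 tempogram values."""
    y, sr = librosa.load(path, sr=None)  # utterances last roughly 2-5 s

    # 13 base MFCCs plus first- and second-order derivatives -> 39 coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    coeffs = np.vstack([mfcc, d1, d2])               # shape (39, frames)

    # Time-average each coefficient, then append the overall mean -> 40 values
    mfcc_vec = coeffs.mean(axis=1)                   # assumption: mean over frames
    mfcc_vec = np.append(mfcc_vec, mfcc_vec.mean())  # shape (40,)

    # Tempogram captures rhythm/accent cues; its default win_length of 384
    # matches the 384 tempogram values reported in the paper
    tempo = librosa.feature.tempogram(y=y, sr=sr)    # shape (384, frames)
    tempo_vec = tempo.mean(axis=1)                   # shape (384,)

    return np.concatenate([mfcc_vec, tempo_vec])     # shape (424,)
```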


Fig. 2 Proposed RNN

3.3 Accented Model Construction

Malayalam has a unique set of phonological features that can make it challenging for speech recognition systems to accurately transcribe accented speech. Accented speech poses a challenge to speech recognition systems since the pronunciation, intonation, and rhythm of accented speech differ from standard speech [1, 24]. The authors have constructed four accented speech models for the Malayalam language. The experiment has been conducted in four different phases to investigate the best approach to solving the problem. Each phase of the experiment is discussed in detail in the following sections.

3.3.1 Phase 1: Unified Accented Model Construction Using RNN Architecture

An RNN can be used for accented speech recognition by sequentially processing the audio signal and capturing the temporal dependencies between the audio frames. RNNs are particularly well suited for this task because they can handle variable-length sequences of audio data and can learn long-term dependencies. The 40 MFCC features are given as input to the RNN architecture shown in Fig. 2. The RNN processes the input sequences one at a time and maintains a context within the network. The output of the RNN layer is fed to a batch normalization layer to normalize the data, and then passed on to the Sigmoid layer. The output vectors are concatenated and added together before passing to the softmax layer, which predicts the target class with maximum probability.
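The following is a hedged Keras-style sketch of the Phase 1 pipeline just described (40 MFCC coefficients in, RNN, batch normalization, Sigmoid layer, softmax output). Layer widths are illustrative assumptions; the Adam optimizer and categorical cross-entropy loss follow the training setup reported in the evaluation section, and the concatenation of intermediate outputs mentioned above is omitted for brevity.

```python
from tensorflow.keras import layers, models

def build_phase1_rnn(num_classes, feat_dim=40):
    inp = layers.Input(shape=(feat_dim, 1))        # 40 MFCC coefficients per utterance
    x = layers.SimpleRNN(128)(inp)                 # 128 units is an assumption
    x = layers.BatchNormalization()(x)             # normalize the RNN output
    x = layers.Dense(64, activation="sigmoid")(x)  # Sigmoid layer; width assumed
    out = layers.Dense(num_classes, activation="softmax")(x)  # class with max probability
    model = models.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```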

3.3.2 Phase 2: Unified Accented Model Construction Using RNN with Attention Mechanism

The experiment in this phase is conducted using the 424 MFCC and tempogram feature vectors. We constructed the accented model using an RNN with an attention block architecture to focus on the relevant information in the accented speech. The feature input is fed into the RNN architecture, which is then fed into a dense layer. A dropout layer has been included in the architecture to avoid overfitting, followed by a dense layer. The activation functions used in the network are Sigmoid and ReLU. The predictions made by the softmax layer are fed into the RNN network with the attention layer. The output generated has improved significantly in this approach (Fig. 3).

Fig. 3 Proposed RNN with attention mechanism
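A minimal sketch of how such an RNN-with-attention model could be assembled in Keras, assuming self-attention over the 424-dimensional feature sequence; layer widths, the dropout rate, and the pooling used to flatten the attended sequence are our assumptions.

```python
from tensorflow.keras import layers, models

def build_phase2_rnn_attention(num_classes, feat_dim=424):
    inp = layers.Input(shape=(feat_dim, 1))               # 424 MFCC + tempogram values
    h = layers.SimpleRNN(128, return_sequences=True)(inp)
    att = layers.Attention()([h, h])                      # self-attention over the sequence
    x = layers.GlobalAveragePooling1D()(att)              # flatten the attended sequence
    x = layers.Dense(64, activation="relu")(x)            # ReLU dense layer
    x = layers.Dropout(0.3)(x)                            # dropout to avoid overfitting (rate assumed)
    x = layers.Dense(64, activation="sigmoid")(x)         # Sigmoid dense layer
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)
```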

3.3.3 Phase 3: Unified Accented Model Construction Using LSTM Architecture

The feature input layer here contains the vectors obtained by applying the MFCC and tempogram methods to the speech signals, concatenated together. A total of 424 vectors from each speech signal were extracted in this phase. These vectors are given as input to the LSTM layers, which reduce the vanishing gradient problem of the RNN architecture. The output of this layer is normalized by feeding it to a batch normalization layer. The concatenated output of these layers is then fed into the dense layers and finally to the softmax layer to make the predictions (Fig. 4).

Fig. 4 Proposed LSTM

3.3.4 Phase 4: Unified Accented Model Construction Using LSTM with Attention Mechanism

The fourth phase of the experiment was conducted using the LSTM with attention block architecture. The feature vectors are the 424 MFCC and tempogram vectors extracted from the accented audio data. The proposed LSTM has two main branches: an operational block and a skip-connection branch that is used to highlight only relevant activations during training. The attention block in the proposed LSTM reduces the computational resources that would otherwise be wasted on irrelevant activations.
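A hedged sketch of the Phase 4 design: an LSTM operational branch, an attention branch, and a skip connection that re-injects the raw activations. Unit counts are assumptions; the RMSprop optimizer with a 0.01 initial learning rate follows the training setup reported below.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_phase4_lstm_attention(num_classes, feat_dim=424):
    inp = layers.Input(shape=(feat_dim, 1))
    h = layers.LSTM(128, return_sequences=True)(inp)  # operational branch (units assumed)
    att = layers.Attention()([h, h])                  # attention highlights relevant activations
    x = layers.Add()([h, att])                        # skip connection re-injects raw activations
    x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling1D()(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.01),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```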

3.3.5 Result and Evaluation

The four phases of the study, with different sets of feature vectors, methodologies, experimental parameters, and architectural frameworks, lead to different conclusions in the study of AASR for the Malayalam language. All the experiments were conducted with similar training parameters and environmental setups. The optimizer used in Phase I is Adam, while RMSprop is used in all other phases; the initial learning rate is 0.001 for Phases I and II and 0.01 for Phases III and IV. The experiment was set up to run for 3000 epochs for Phases I and II, while for Phases III and IV it was set to 2000 since the model was learning at a faster rate. The loss function used in all phases was categorical cross-entropy (Fig. 5; Table 2).


Fig. 5 Proposed LSTM with attention block

Table 2 Evaluation metrics in terms of accuracy and loss

Phase       Train accuracy (%)   Validation accuracy (%)   Train loss   Validation loss   Number of epochs
Phase I     87.18                65.15                     0.0096       0.0277            3000
Phase II    92.02                72.61                     0.0074       0.0317            3000
Phase III   94.10                64.87                     0.0050       0.0309            2000
Phase IV    96.27                73.03                     0.0031       0.0291            2000

Overall, these findings emphasize the significance of incorporating attention mechanisms and LSTM architectures in the construction of accented speech recognition models. The improved performance achieved in Phase IV validates the effectiveness of LSTM with attention mechanisms in accurately recognizing and processing accented speech, leading to reduced error rates and higher accuracy. In our research, we employed WER as a key evaluation metric to assess the effectiveness of our proposed techniques for accented speech recognition. By comparing the recognized output with the ground truth transcript, we were able to quantify the quality of the ASR system in accurately transcribing accented speech. A lower WER indicates a higher level of accuracy and performance in capturing the intended words and linguistic content. We meticulously computed the WER for each experimental phase in our study, considering different combinations of feature vectorization techniques and model architectures. Through these evaluations, we were able to observe the impact of

10

R. K. Thandil et al.

various factors on the recognition accuracy of accented speech. The results demonstrated a reduction in WER as we introduced attention mechanisms and utilized deep spectral feature representations. The WER values obtained in our experiments provided valuable insights into the performance and suitability of our proposed approach for accented speech recognition in the Malayalam language. These quantitative measures contribute to a comprehensive assessment of the system’s capability to handle variations in pronunciation, intonation, and rhythm across different accents. Furthermore, the obtained WER values serve as a basis for comparing our approach with existing systems and highlight the advancements and contributions of our research in the field of accented speech recognition. In the context of our research on Accented Automatic Speech Recognition (AASR) for Malayalam speech, we also considered the evaluation metric known as Match Error Rate (MER). While Word Error Rate (WER) provides insights into the accuracy of word-level recognition, MER offers a more comprehensive assessment of the system’s performance by considering higher-level linguistic features and semantic understanding. Accented speech poses challenges in accurately capturing not only the individual words but also the overall meaning and intent behind the spoken input. By incorporating MER in our evaluation, we aimed to assess the system’s ability to correctly match the intended meaning of the accented speech, accounting for variations in pronunciation, intonation, and rhythm. Our study considered a range of accents in Malayalam and employed MER to evaluate the system’s performance in capturing the semantic understanding and overall coherence of the spoken input. By analyzing the errors at a higher level, we gained insights into the system’s ability to handle accent-specific variations and produce meaningful and contextually relevant transcriptions. The inclusion of MER in our research provided a more comprehensive assessment of the AASR system’s effectiveness in recognizing and understanding accented Malayalam speech. By considering both WER and MER, we obtained a well-rounded evaluation that addressed both surface-level recognition accuracy and higher-level linguistic aspects, contributing to a more thorough understanding of the system’s capabilities in handling accented speech (Fig. 6; Table 3).
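For reference, both metrics can be computed with an open-source alignment tool such as jiwer (our choice here, not necessarily what the authors used); the transliterated Malayalam strings below are purely hypothetical placeholders.

```python
# pip install jiwer
import jiwer

reference = "oru udaaharanam vaakku"   # hypothetical ground-truth transcript
hypothesis = "oru udaharanam vaaku"    # hypothetical ASR output

# WER = (S + D + I) / N, with substitutions S, deletions D and insertions I
# needed to align the hypothesis against a reference of N words
print(jiwer.wer(reference, hypothesis))

# MER relates the same edit counts to the total number of aligned words
print(jiwer.mer(reference, hypothesis))
```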

Fig. 6 Performance evaluation using WER and MER

Table 3 Comparison with existing research

References        Year   Methodology   Accuracy   SER     WER      PER      MER
[15]              2021   Att-DNN       –          –       7.52%    –        –
[16]              2020   CNN-LSTM      –          –       18.45%   42.25%   –
[17]              2019   DNN           –          –       6.94%    26.68%   –
[18]              2021   Bi-Att-RNN    –          –       10.18%   –        –
[19]              2020   DNN           –          –       16.6%    –        –
[20]              2020   Att-LSTM      –          –       12.56%   –        –
[21]              2019   Att-LSTM      –          –       7.94%    –        –
[22]              2019   DNN           –          3.93%   13.21%   –        –
[23]              2020   ML            93.6%      –       –        –        –
Proposed method   –      RNN           87.18%     –       11.98%   –        21.025%
Proposed method   –      Att-RNN       92.02%     –       7.98%    –        18.23%
Proposed method   –      LSTM          94.10%     –       8.95%    –        18.17%
Proposed method   –      Att-LSTM      96.27%     –       7.26%    –        17.21%

3.4 Conclusion

The authors here propose a novel methodology for constructing a better model for AASR for the Malayalam language using different spectral feature combinations and architectural frameworks. The experimental results show that the LSTM with attention block architecture gave a lower WER and a higher MER when compared to the other approaches. This work concludes that using an attention block with LSTM architecture with proper feature vectors would be ideal for modeling accented speech for any low-resourced language. The novelty in extracting the accented speech features also contributed to the better construction of the accented model. The model constructed here also worked well when tested with unknown accents. The dataset constructed for this study contains representations of male and female voices of different age groups; hence, the variations of prosodic values based on gender and age are well represented in the feature vectors. Malayalam has a rich variety of accents that still need to be considered for constructing AASRs. The unavailability of a benchmark dataset for conducting research in the area poses a huge gap in research and makes the study in the area challenging. The authors will therefore initiate the construction of an accented dataset and make it available to the public for conducting various studies. In the future, we plan to propose better approaches for constructing unified accented models that recognize all accents in the language and that can be adopted for modeling other low-resourced languages.

References
1. Thandil RK, Mohamed Basheer KP (2023) Exploring deep spectral and temporal feature representations with attention-based neural network architectures for accented Malayalam speech—A low-resourced language. Eur Chem Bull 12(Special Issue 5):4786–4795. https://doi.org/10.48047/ecb/2023.12.si5a.0388. https://www.eurchembull.com/uploads/paper/a41a80a80b4fb50e88445aef896102a6.pdf
2. Ajay M, Sasikumar S, Soman KP (2020) Attention-based deep learning architecture for accented Malayalam speech recognition. In: 2020 11th International conference on computing, communication and networking technologies (ICCCNT). IEEE, pp 1–6
3. Devi SR, Bhat R, Pai RM (2021) Deep attention-based neural network architecture for accented Malayalam speech recognition. In: 2021 IEEE 11th annual computing and communication workshop and conference (CCWC). IEEE, pp 0277–0281
4. Sasikumar S, Ajay M, Soman KP (2021) Attention-based LSTM architecture for accented Malayalam speech recognition. In: 2021 IEEE 11th annual computing and communication workshop and conference (CCWC). IEEE, pp 0369–0373
5. Pandey SK, Shekhawat HS, Prasanna SRM (2022) Attention gated tensor neural network architectures for speech emotion recognition. Biomed Signal Process Control 71(Part A):103173. https://doi.org/10.1016/j.bspc.2021.103173. ISSN 1746-8094
6. Zhao Z et al (2019) Exploring deep spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition. IEEE Access 7:97515–97525. https://doi.org/10.1109/ACCESS.2019.2928625
7. Kumar A, Reddy VV (2020) Deep attention-based neural network architecture for Hindi accented speech recognition. In: 2020 11th international conference on computing, communication and networking technologies (ICCCNT). IEEE, pp 1–6
8. Ghosh P, Das PK, Basu S (2020) Deep attention-based neural network architecture for Bengali accented speech recognition. In: Proceedings of the 5th international conference on intelligent computing and control systems. Springer, pp 764–769
9. Kim D, Lee D, Shin J (2019) Attention-based deep neural network for Korean accented speech recognition. J Inf Sci Eng 35(6):1387–1403
10. Wang C, Lu L, Wu Z (2019) Deep attention-based neural network for Mandarin accented speech recognition. In: 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 7060–7064
11. Parvathi PS, Rajendran S (2020) Attention-based RNN architecture for Tamil accented speech recognition. In: 2020 international conference on smart electronics and communication (ICOSEC). IEEE, pp 341–346
12. Xiong Y, Huang W, He Y (2020) Deep spectral feature representations via self-attention based neural network architectures for Mandarin accented speech recognition. J Signal Process Syst 92(11):1427–1436
13. Prajwal KR, Mukherjee A, Sharma D (2019) Malayalam speech recognition using deep neural networks for non-native accents. In: Proceedings of the 4th international conference on intelligent human computer interaction. Springer, pp 191–201
14. Bineesh PV, Vijayakumar C, Rajan S (2020) Speaker adaptation for accented speech recognition in Malayalam using DNN-HMM. In: Proceedings of the 12th international conference on advances in computing, communications and informatics. IEEE, pp 1373–1380
15. Goodfellow I, Bengio Y, Courville A (2016) Deep learning, vol 1. MIT Press


16. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
17. Liu Y, Xu B, Xu C (2021) Accented speech recognition based on attention mechanism and deep neural networks. Appl Sci 11(3):1238
18. Li X, Li H, Li Y, Li Y (2020) Accented speech recognition with deep learning models: a comparative study. IEEE Access 8:98252–98261
19. Bhatia A, Sharma V (2019) Accent robust speech recognition using spectral features and deep neural networks. J Intell Syst 28(2):271–283
20. Duong NQ, Nguyen TH (2021) Speech recognition for Vietnamese accented speech using bidirectional attention based recurrent neural network. In: Proceedings of the 14th international conference on knowledge and systems engineering, pp 159–167
21. Geetha PR, Balasubramanian R (2020) Attention based speech recognition for Indian accented English. In: Proceedings of the international conference on computer communication and informatics, pp 1–6
22. Luong M, Nguyen H, Nguyen T, Pham D (2020) Speech recognition for Vietnamese accented speech using attention-based long short-term memory neural networks. J Sci Technol 58(6):139–151
23. Farahmandian M, Hadianfard MJ, Tahmasebi N (2019) Persian accented speech recognition using an attention-based long short-term memory network. J Electr Comput Eng Innov 7(2):105–112
24. Thandil RK, Mohamed Basheer KP, Muneer VK (2023) A multi-feature analysis of accented multisyllabic Malayalam words—A low-resourced language. In: Chinara S, Tripathy AK, Li KC, Sahoo JP, Mishra AK (eds) Advances in distributed computing and machine learning. Lecture notes in networks and systems, vol 660. Springer, Singapore. https://doi.org/10.1007/978-981-99-1203-2_21

Improving Tree-Based Convolutional Neural Network Model for Image Classification Saba Raees and Parul Agarwal

Abstract In recent years, convolutional neural networks (CNNs) have shown remarkable success in image classification tasks. However, the computational complexity of these networks increases significantly as the number of layers and neurons grows, making them computationally expensive and challenging to deploy on resource-limited devices. In this paper, we propose a novel CNN architecture based on a tree data structure to address the computational complexity of standard CNNs. The proposed model has each node in the tree representing a convolution operation that extracts spatial information from the input data. The primary objective of our work is to design a computationally efficient model that delivers competitive performance on the CIFAR-10 dataset. Our proposed model achieved a test accuracy of 81% on the CIFAR-10 dataset, which is comparable to previous work. The model's training time is also significantly lower than the standard CNNs, and it uses fewer parameters, making it easier to deploy on resource-limited devices. Our work offers a promising direction for designing efficient and effective deep neural networks for image classification tasks. The proposed CNN architecture based on a tree data structure provides a novel approach to address the computational complexity of standard CNNs while maintaining competitive performance levels. Additionally, our work improves upon previous tiny models by addressing their shortcomings and achieving comparable performance levels while being more efficient. Our proposed model is suitable for deployment on resource-limited devices, such as mobile devices and edge computing devices.

Keywords Convolutional neural networks · Tree data structure · Deep learning · Image classification · CIFAR-10


1 Introduction

Deep learning [1] models that fall within the category of convolutional neural networks (CNNs) [2] are frequently employed for image and video processing applications. The main principle of CNNs is to take advantage of the spatial structure of images by applying a number of convolutional filters to the input image. The first layer of a CNN consists of a set of convolutional filters, which are compact matrices that move across the input image to extract local features. Each filter generates a feature map, which identifies the presence of a specific pattern in the image. In the second layer, a pooling [3] procedure is used to down-sample the feature maps by taking the highest or average value within a narrow window. In order to extract progressively complicated characteristics from the input image, the convolutional and pooling layers are often stacked multiple times. Fully connected layers make up the last layers of a CNN, which combine the learned information to produce a final prediction.

The capacity of CNNs to automatically learn feature representations from raw image data, without the need for manual feature engineering, is one of their main strengths. For applications like object detection, image segmentation and image classification, CNNs are therefore quite effective. Another crucial aspect of CNNs is their ability to make extensive use of data for training: by utilizing methods like data augmentation and transfer learning [4], CNNs may acquire robust and generalizable features from comparatively small datasets.

Given that our CNN image classification model is built on tree data structures, here is a quick overview of tree-based data structures [5]. A tree is formed by a group of nodes; the edges that connect these nodes to one another form a hierarchical framework (without looping). Trees are preferred when we wish to lower processing cost and memory utilization. General tree, binary tree [6], binary search tree [7], AVL tree, red–black tree [8], spanning tree [9] and B-tree are some of the several types of trees, classified by their properties. Trees are frequently used in database indexing [10], dictionary implementation, quick pattern searching [11] and shortest-distance calculations. Data may be searched and sorted quickly using binary trees. Unlike arrays, linked lists, stacks and queues, which are linear data structures, trees are nonlinear. A tree is a structure made up of a root and one or more child subtrees. Trees are a wonderful modelling tool because they take advantage of the hierarchical unidirectional links between the data. Many real-world structures can be represented as trees, such as the organizational hierarchy of a company where individual contributors report to a team leader, who reports to higher management and so on up to the CEO. This hierarchical structure can be visualized as a tree-like structure, as illustrated in Fig. 1.

Fig. 1 A dummy hierarchical structure of a company

Convolutional neural networks frequently employ the global average pooling (GAP) [12] method to shrink the spatial dimensions of feature maps and obtain a global representation of the input image. While GAP provides a lot of advantages, such as lowering the number of network parameters and enhancing computing efficiency, it also has a big disadvantage: information loss. When utilizing GAP, the feature maps are averaged along their spatial dimensions to generate a single value for each feature channel. As a result, just the channel-wise statistics are kept and the spatial information present in the feature maps is ignored. Important spatial features that are necessary for some tasks, such as object localization and segmentation, may be lost as a result. Furthermore, because GAP averages the feature maps, it is less sensitive to subtle variations between related objects or regions in an image. This may result in reduced accuracy on tasks like fine-grained image classification or style transfer, which depend on minute differences. Additionally, GAP takes the stance that, regardless of their spatial position, all elements in a particular channel are of identical importance. For applications like object detection and segmentation, where spatial information is essential, this can be troublesome. Although GAP offers several advantages, it should be utilized with caution in CNNs, and alternative techniques should be considered for tasks requiring spatial information. Figure 2 illustrates the process of global average pooling.


Fig. 2 Process of global average pooling
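To make the loss concrete, a tiny NumPy sketch: GAP keeps one mean per channel and discards where in the map each activation occurred.

```python
import numpy as np

fmap = np.random.rand(8, 8, 64)  # a (height, width, channels) feature map
gap = fmap.mean(axis=(0, 1))     # shape (64,): one channel-wise mean each

# Two maps with very different spatial layouts can yield identical GAP outputs,
# which is why object localization cues are lost.
```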

Convolutional neural networks also frequently employ the popular max pooling technique to minimize the spatial dimensions of feature maps and derive a summary of the most important information. Max pooling slides a window over the feature map and preserves the maximum value found in each window as the output. Max pooling [13] provides several benefits, such as enhancing translation invariance and lowering overfitting, but it also has some disadvantages. One of its key drawbacks is that it discards the non-maximum values within each window. As a result, it may cause information loss, especially for tasks like object detection and segmentation where fine-grained spatial information is crucial. Additionally, because max pooling keeps only the maximum value in each window, it is less sensitive to small fluctuations in the feature maps, which might impair the model's accuracy. Convolutional layers with strides [14], on the other hand, are a substitute method for shrinking the spatial dimensions of feature maps without sacrificing information. With this method, the convolutional filters move over the input with a wider stride than usual, skipping certain positions, which results in a smaller output feature map. In contrast to max pooling, this method sees all the data contained in the feature map and may therefore be more effective at maintaining spatial details.
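The two downsampling routes side by side, as a hedged Keras sketch (shapes and filter counts are illustrative): both halve the spatial resolution, but the strided convolution learns its own summary instead of keeping only window maxima.

```python
from tensorflow.keras import Input, layers

x = Input(shape=(32, 32, 64))

# Max pooling: keeps only the maximum in each 2x2 window
pooled = layers.MaxPooling2D(pool_size=2)(x)                              # -> (16, 16, 64)

# Strided convolution: learned downsampling that sees every input value
strided = layers.Conv2D(64, kernel_size=3, strides=2, padding="same")(x)  # -> (16, 16, 64)
```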

1.1 Contribution of the Research Work

• A novel tree data structure-based convolutional neural network for image classification on the CIFAR-10 dataset is proposed, and its performance metrics are compared with other existing models.
• The proposed architecture focuses on reducing the loss of information during the propagation of signals or feature maps through the network. It also introduces stability in the training of the network.
The rest of the paper is organized as follows: Sect. 2 presents the literature review, including previous related work. The proposed model, information on the dataset and details of the technical solution are provided in Sect. 3. Finally, the paper is concluded in Sect. 4 with future enhancements.


2 Literature Review

The subject of computer vision has been completely transformed by convolutional neural networks (CNNs), and a variety of topologies have been suggested to enhance their functionality. We provide a new CNN architecture based on trees in this study and evaluate its performance in comparison with current cutting-edge models.

2.1 Previous Work

The most accurate model for the CIFAR-10 [16] dataset was the transformer-based ViT-H/14 (2020) [17] model with 632 M parameters. It was the first work to successfully train a Transformer encoder on ImageNet in comparison with well-known convolutional architectures. While requiring significantly fewer compute resources to train, the model achieves great outcomes when compared to SOTA convolutional networks. ViT-H/14 has demonstrated good generalization performance on CIFAR-10 despite having been initially developed for larger datasets, indicating that it might be useful for a variety of image classification problems. The flexibility and adaptability of ViT-H/14 allow it to be used for a variety of image classification tasks; it can be scaled up or down to fit varied problem sizes. In some real-world applications, where faster or more effective models may be preferred, ViT-H/14 may not be feasible due to its high computing requirements. It yielded an accuracy score of 99.6 on the CIFAR-10 dataset.

In classical deep convolutional neural networks, the problem of vanishing gradients was pretty common until ResNet [18] came up with the skip connection architecture. The Residual Network (ResNet) architecture, which was proposed for image classification problems, has a variant called ResNet-56. With the potential for transfer learning to other tasks, ResNet-56 is a strong and efficient architecture for image classification tasks on the CIFAR-10 dataset, though it may need a lot of processing power to train and should be used cautiously on tiny datasets. To combat the vanishing gradient issue that can arise in extremely deep networks, ResNet-56 makes use of residual connections; as a result, the network's ability to learn is enhanced during training by allowing gradients to flow back through it. On small datasets like CIFAR-10, overfitting is a concern when using a deep architecture; this can be mitigated by using regularization strategies like dropout and weight decay.

Another interesting architecture is BiT-L [19]. With ResNet, you may train hundreds or even thousands of layers while still getting outstanding results; empirical research shows that these networks are simpler to optimize and can attain accuracy with substantially more depth. After ViT-H/14 (2020) and CaiT-M-36 (224) (2022), this model achieved among the best accuracy scores for the CIFAR-10 dataset. The BiT-L model is a specialized version of the ResNet architecture created for use with bigger image datasets like ImageNet; however, it can also be applied to smaller image datasets such as CIFAR-10. The ResNet design may be trained reasonably quickly because of its residual connections, which improve gradient flow and hasten convergence. BiT-L (ResNet) has been demonstrated to be resistant to numerous types of visual distortions, including noise, blur and rotation. Due to the deep architecture of this model, overfitting may occur when applied to smaller datasets like CIFAR-10; utilizing strategies like weight decay or dropout can help to alleviate this. The model is also computationally very costly. On the CIFAR-10 dataset, the BiT-L (ResNet) model achieved an accuracy score of 99.37.


of visual distortions, including noise, blur and rotation, have been demonstrated to be resistant to BiT-L (ResNet). Due to the deep architecture of this model, overfitting may occur when applied to smaller datasets like CIFAR-10. Utilizing strategies like weight decay or dropout can help to alleviate this. This model is computationally very costly. On the CIFAR-10 dataset, the BiT-L (ResNet) model achieved an Accuracy Score of 99.37. Machine learning techniques are used by neural architecture search (NAS) [20] to automatically create neural network structures. It has resulted in considerable increases in accuracy and efficiency and can significantly minimize the requirement for manual model design trial and error. EffiecientNetV2 [21] model has been created to train and infer picture recognition problems in an efficient and effective manner. Using a coefficient, the EfficientNet CNN architecture and scaling method uniformly scale all the dimensions. By using a set of predefined scaling coefficients, the EfficientNet scaling method uniformly increases depth, resolution and network width. Compared to SOTA models, EfficientNetV2 models train substantially more quickly. While being up to 6.8 times smaller, EfficientNetV2 can learn up to 11 times faster. It is more dependable for use in practical applications because it is made to be resilient to several kinds of image distortions, such as noise, blur and rotation. It can be tricky to determine which features EfficientNetV2-L is using to produce its predictions because the network can be confusing to analyse and comprehend. EfficientNetV2L [21] obtained an accuracy score of 91.1 with 121 M parameters on the CIFAR-10 dataset. Another class of architectures in convolutional neural networks is the DenseNets [22], which significantly outperform most SOTA models while requiring less processing power. This model is developed for image identification problems, especially with datasets like CIFAR-10 that have few training examples. Relative to other deep neural network architectures, has been demonstrated to be less prone to overfitting, making it more resilient when working with sparse training data. Due to the tight interconnectedness between the model’s layers, it may be more challenging to use on devices with low memory capacity. Understanding and interpreting this model can be complex, making it difficult to pinpoint the features the network is using to form its predictions. On the CIFAR-10 dataset, the model DenseNet-BC-190 received an accuracy score of 96.54. With the potential to be used for various computer vision tasks, PyramidNet [23] is a potent deep neural network design that has demonstrated outstanding performance on the CIFAR-10 dataset. The findings on the CIFAR-10 dataset demonstrated that PyramidNet achieved SOTA performance with a much lower error rate than earlier SOTA models, proving the efficacy of the pyramid structure and other design decisions made in the architecture. PyramidNet’s pyramid structure enables significant accuracy increases while lowering computational costs and memory use. In comparison with previous models, PyramidNet employs a larger network, which may enhance its capacity to identify intricate elements in the data. PyramidNet training can be laborious and computationally costly, like with many deep neural network architectures, especially when employing larger datasets or better-resolution

Improving Tree-Based Convolutional Neural Network Model for Image …

21

images. If the model is too intricate or there aren’t enough training data, overfitting is a possibility. DINOv2 [24] is a new computer vision model that uses self-supervised learning to achieve results that match or surpass the standard approach used in the field. Selfsupervised learning is a powerful, flexible way to train AI models because it does not require large amounts of labelled data. DINOv2 does not require fine-tuning, providing high-performance features that can be directly used as inputs for simple linear classifiers. DINOv2 1100 M parameter model achieved 99.5% accuracy on validation set of CIFAR-10. Astroformer [25] is a hybrid transformer-convolutional neural network that uses relative attention, depth-wise convolutions and self-attention techniques. The model employs a careful selection of augmentation and regularization strategies, with a combination of mix-up and RandAugment for augmentation and stochastic depth regularization, weight decay and label smoothing for regularization. The authors find that strong augmentation techniques provide higher performance gains than stronger regularization. The model is effective in low-data regime tasks due to the careful selection of augmentation and regularization, great generalizability and inherent translational equivalence. It can learn from any collection of images and can learn features, such as depth estimation, that the current standard approach cannot. It attains an impressive accuracy score of 99.12%. In the past, the tree data structure-based convolutional networks were based on trinary trees for the initial layers meaning each node in the architecture will have exactly three child nodes. This network used max pooling to reduce downscale the size of the feature maps during convolutions and global average pooling in the end of the network to feed the output to the dense layers while generating less trainable parameters. This tiny and promising network had as little as 1.8 M parameters while achieving an accuracy of ~81% on the validation set of the CIFAR-10 dataset (Table 1). Table 1 Accuracy scores on CIFAR-10 dataset

Table 1 Accuracy scores on CIFAR-10 dataset

References   Model name                   Accuracy scores
[15]         TBCNN                        81.14
[17]         ViT-H/14                     99.6
[18]         ResNet-56                    88.8
[19]         BiT-L (ResNet)               99.37
[21]         EfficientNetV2-L             91.1
[22]         DenseNet (DenseNet-BC-190)   96.54
[23]         PyramidNet                   97.14
[24]         DINOv2                       99.5
[25]         Astroformer                  99.12


2.2 Contribution

In this paper, we aim to explore the effects of dimensionality reduction techniques such as 1D convolutions, and of letting convolutional layers learn to scale down dimensions themselves, thereby replacing the traditional techniques of global average pooling and max pooling, respectively.

3 Methodology

3.1 Overview

The central concern of this paper is the information loss that occurs when using pooling layers in convolutional neural networks (CNNs). The primary objective of this research is to eliminate the need for pooling layers and instead develop a method for reducing the dimensions of feature maps while retaining all relevant information. This paper aims to address the critical issue of preserving spatial details in CNNs and proposes a novel approach to achieve this objective. In this section, we provide details about the dataset we use and the techniques we use to modify the model, and discuss the network architectural details.

3.2 Dataset

The CIFAR-10 dataset is a popular image dataset used in computer vision research. It was created by Krizhevsky et al. [16]. The dataset comprises 60,000 images that belong to ten different classes, including dogs, cats, horses, deer, aeroplanes, trucks, ships, birds, frogs and automobiles. These images are split into a set of training and testing images, with the training set containing 50,000 images and the remaining 10,000 used for testing the model's generalization performance. The training dataset is balanced, with 5000 images for each class. Each image in the dataset is of size (32, 32) and has three colour channels (RGB).

3.3 1D Convolutions and Strides

Convolution operations are applied to a wide range of data: 2D convolutions on images, where there is pixel location; 3D convolutions on videos [26], where in addition to pixel location there is the time component; and 1D convolutions, which can be applied to a sequence such as a signal [27–29]. The goal, however, remains the same across all of them. 1D convolutions use kernels in a single dimension. These kernels are responsible for learning features and patterns inside the sequence. They do so by sliding across the sequence and taking a sliding dot product at each position; with appropriate padding, this results in a new sequence containing contextual information that is of the same size as the input sequence.

The next concept in convolutions is strides: the amount by which the kernel jumps after it completes a computation in the sequence. For instance, if you have a sequence of length 12 and a kernel of size 3, the kernel starts the operation by aligning itself with the first three elements of the sequence and taking a dot product to compute the first result. To compute the next result it would, in principle, shift by one data point and repeat the operation; with strides, however, it shifts by the stride value instead. If we take the stride to be 2, the second window starts at the 3rd data point rather than the 2nd, and the output has floor((12 − 3)/2) + 1 = 5 values instead of 10. As this process produces a smaller resultant sequence, we achieve dimensionality reduction using strided convolutions. In our case, we have used 1D convolutions just after the last 2D convolution layer to reduce the dimensionality of the feature maps further while retaining important information.
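A quick numerical check of the stride arithmetic above:

```python
seq_len, kernel, stride = 12, 3, 2

out_len = (seq_len - kernel) // stride + 1     # floor((12 - 3) / 2) + 1 = 5
starts = [i * stride for i in range(out_len)]  # windows start at positions 0, 2, 4, 6, 8
print(out_len, starts)                         # 5 outputs instead of 10 at stride 1
```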

3.4 Removal of Max Pooling

Recent studies investigating generative models have found that models perform better when the convolution layers are allowed to learn how to downscale the feature maps on their own. Previously, max pooling had been the norm: the image or sequence is divided into patches of the desired size, and the maximum within each patch is taken as the input to the new feature map for that whole patch. We follow in the footsteps of those studies and perform strided convolutions, removing max pooling from the network entirely.

3.5 Leaky ReLU

The rectified linear unit, most commonly known as ReLU, has been a very widely used nonlinearity in neural networks, and the tree-based convolutional neural networks use the same [30]. ReLU takes a value and returns it unchanged if it is positive; otherwise it returns 0. This provides sparsity in the feature maps, which creates a lasso-type regularization. This sparsity seems beneficial at first, but as training continues, the positions in the feature map that have become 0 will not receive any gradient, and hence the gradient dies there. That is where Leaky ReLU comes into play [31]. It provides a small gradient for values less than zero, hence allowing gradients to flow across the entire feature maps, which allows relative weight updates to happen in the kernels corresponding to the locations that would otherwise have been zeroed out.
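A minimal NumPy sketch of the two activations; the slope value alpha = 0.01 is an illustrative assumption, not necessarily the value used in the paper.

```python
import numpy as np

def relu(x):
    # Negatives become exactly zero, so no gradient flows through them
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # A small slope for negative inputs keeps gradients alive everywhere
    return np.where(x > 0, x, alpha * x)
```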

3.6 Model Architecture

The previous model employed the traditional approach of using trees in its design, which resulted in a unique structure. Notably, the top convolutional layers consisted of three distinct blocks that utilized kernel sizes of 3, 5 and 7. This approach aimed to extract diverse information from the input images. The model architecture is shown in Fig. 3.

Fig. 3 TBCNN model architecture [15]

To scale down the feature maps generated from the convolution layers directly above them, the model used max pooling. Another significant feature of the model was its use of channel-wise addition of the feature maps at the output of blocks after the top convolution layer. Overall, this design represents an innovative attempt to extract a wide range of relevant information from input images while optimizing performance through feature map scaling. The final layer of the model incorporated global average pooling (GAP) to decrease the dimensionality of the feature maps before inputting them into the dense layers. As a result, the model was relatively shallow and significantly smaller than current state-of-the-art models, containing only 1.8 million parameters. This design approach is an efficient means of reducing model complexity and optimizing performance in settings where computational resources are limited.

Our proposed model follows a similar architecture with some modifications that result in better data retention. We apply the modifications mentioned in the previous sections to the network, which results in a deeper network containing ~2.2 M parameters. The choice of optimizer [32] remains the same, Adam [33]. The learning rate is set to ~0.0007, and we use a decay rate of ~0.00006 for better convergence. We also replaced the addition layers with concatenation layers, so rather than adding feature maps together, we concatenate them channel-wise so that information is preserved and not lost in the process of addition [34]. Another small tweak we made was adding the batch normalization layer before the nonlinearity. This was done keeping in mind that if the nonlinearity is applied before batch normalization, the resultant output reflects the positive values only, whereas this ordering allows us to incorporate the negative values resulting from the convolution operation. The updated model architecture is shown in Fig. 4.

Fig. 4 Modified TBCNN model architecture
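To summarize the three modifications in code, here is a hedged Keras sketch of one block of the modified architecture: strided convolutions instead of max pooling, batch normalization placed before the Leaky ReLU nonlinearity, and channel-wise concatenation of the three branches instead of addition. Filter counts and the Leaky ReLU slope are illustrative assumptions, not the paper's exact settings.

```python
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size):
    # Strided convolution performs the learned downsampling (no max pooling)
    x = layers.Conv2D(filters, kernel_size, strides=2, padding="same")(x)
    # Batch normalization before the nonlinearity keeps negative values in play
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(alpha=0.1)(x)

def tree_level(x):
    # Branches with kernel sizes 3, 5 and 7, merged channel-wise rather than added
    branches = [conv_block(x, 32, k) for k in (3, 5, 7)]
    return layers.Concatenate(axis=-1)(branches)
```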


4 Results and Conclusion

While the achieved validation accuracy of 81% on the CIFAR-10 dataset with just 2.2 million parameters is impressive, there are still some limitations to consider. Firstly, the model's limited capacity due to the small number of parameters may not allow it to capture all the complex features and patterns in the dataset, leading to underfitting on the task. Secondly, the model's performance may not generalize well to other datasets with different characteristics. Lastly, the CIFAR-10 dataset only contains ten classes, limiting the task scope of the model (Table 2).

Table 2 Comparison of results of TBCNN and modified TBCNN model

Model                        Training accuracy   Validation accuracy   Parameters (M)
TBCNN                        87                  81.14                 1.8
Modified TBCNN (Our model)   83                  80.57                 2.2

The proposed model, in a similar training setup, surpassed the baseline TBCNN model in training stability and achieved a comparable accuracy score. The loss and accuracy curves for the model are presented in Fig. 5. These results suggest that the proposed model offers a promising solution for optimizing accuracy in image classification tasks without sacrificing model complexity or computational efficiency. Using the early stopping callback, we were able to stop before the model overfit too much, and the best weights were restored. It is noteworthy that the inclusion of the proposed modification resulted in a reduction in the learning capacity of the model, as evidenced by a decrease in training accuracy during the training process. By comparison, the previous model attained a training accuracy of 87%, indicating a higher model capacity than ours. However, the validation results obtained with our model demonstrate less overfitting and a more stable training process. These outcomes offer promising evidence that our model may effectively balance learning capacity with generalization performance.

Fig. 5 Baseline performance metrics

In future, additional modifications to the model architecture could include the incorporation of skip connections to improve gradient flow throughout the network. Although we opted to keep the model relatively shallow and fast for practical considerations, a deeper network with a similar architecture may yield even better results. Therefore, further exploration of model depth and complexity may be warranted in future research endeavours.

References
1. Ramana K, Kumar MR, Sreenivasulu K, Gadekallu TR, Bhatia S, Agarwal P, Idrees SM (2022) Early prediction of lung cancers using deep saliency capsule and pre-trained deep learning frameworks. Front Oncol 12
2. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, Santamaría J, Fadhel MA, Al-Amidie M, Farhan L (2021) Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 8:1–74
3. Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. In: Neural information processing systems, vol 25. https://doi.org/10.1145/3065386
4. Han D, Liu Q, Fan W (2018) A new image classification method using CNN transfer learning and web data augmentation. Expert Syst Appl 1(95):43–56
5. Dijkstra EW (1959) A note on two problems in connexion with graphs. Numer Math 1:269–271
6. Guibas LJ, Sedgewick R (1978) A dichromatic framework for balanced trees. In: 19th Annual symposium on foundations of computer science (SFCS 1978), Ann Arbor, MI, USA, pp 8–21. https://doi.org/10.1109/SFCS.1978.3
7. Hoare CAR (1961) Algorithm 64: Quicksort. Commun ACM 4(7):321–322. https://doi.org/10.1145/366622.366644
8. Zegour DE, Bounif L (2016) AVL and Red Black tree as a single balanced tree, pp 65–68. https://doi.org/10.15224/978-1-63248-092-7-28
9. Cunha SDA (2022) Improved formulations and branch-and-cut algorithms for the angular constrained minimum spanning tree problem. J Comb Optim. https://doi.org/10.1007/s10878-021-00835-w
10. Saringat M, Mostafa S, Mustapha A, Hassan M (2020) A case study on B-tree database indexing technique. https://doi.org/10.30880/jscdm.2020.01.01.004
11. Liu L, Zhang Z (2013) Similar string search algorithm based on Trie tree. J Comput Appl 33:2375–2378. https://doi.org/10.3724/SP.J.1087.2013.02375
12. Gousia H, Shaima Q (2022) GAPCNN with HyPar: Global Average Pooling convolutional neural network with novel NNLU activation function and HYBRID parallelism. Front Comput Neurosci 16:1004988. https://doi.org/10.3389/fncom.2022.1004988. ISSN 1662-5188
13. Wang S-H, Satapathy SC, Anderson D, Chen S-X, Zhang Y-D (2021) Deep fractional max pooling neural network for COVID-19 recognition. Front Public Health 9(2021):726144. https://doi.org/10.3389/fpubh.2021.726144. ISSN 2296-2565
14. Radford A, Metz L, Chintala S (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proceedings of the international conference on learning representations (ICLR)
15. Ansari AA, Raees S, Nafisur R (2022) Tree based convolutional neural networks for image classification. https://eudl.eu/doi/10.4108/eai.24-3-2022.2318997
16. Krizhevsky A (2012) Learning multiple layers of features from tiny images. University of Toronto


Smartphone Malware Detection Based on Enhanced Correlation-Based Feature Selection on Permissions Shagun, Deepak Kumar, and Anshul Arora

Abstract In the present day, smartphones are becoming increasingly ubiquitous, with people of all ages relying on them for daily use. The number of app downloads continues to skyrocket, with 1.6 million apps downloaded every hour in 2022, amounting to a staggering total of 142.6 billion downloads. Google Play outpaces iOS with 110.1 billion downloads compared to iOS's 32.6 billion. Given the growing threat of malware applications for Android users, it is essential to quickly and effectively identify such apps. App permissions represent a promising approach to malware detection, particularly for Android users, and researchers are actively exploring various techniques for analyzing app permissions to enhance the accuracy of malware detection. Understanding the importance of app permissions in identifying potentially harmful apps is thus a critical step in protecting smartphone users from malware threats. In this paper, we implemented the enhanced correlation-based feature selection (ECFS) technique, which uses both feature-feature and feature-class correlation scores (ENMRS and crRelevance, respectively), to predict whether an app is malicious or non-malicious. We then evaluated the accuracy of various machine learning techniques on the basis of the ECFS scores and obtained the highest accuracy of 92.25% for $n_1$ and $n_2$ values of 0.9 and 0.1, respectively; this accuracy was achieved by the random forest ML technique.

Keywords ECFS · ECFS scores · Smartphones · ML techniques

1 Introduction

Over the past decade, smartphones have experienced an unprecedented rise in popularity, transforming from niche gadgets to ubiquitous necessities in our modern world. With each passing year, smartphones have become more affordable, technologically advanced, and accessible to a wider range of users, leading to explosive growth in adoption rates.


According to industry reports, the global smartphone market is projected to continue its upward trajectory, with an estimated 5.5 billion smartphone users by 2025. The global smartphone market size was valued at USD 457.18 billion in 2021 and is projected to grow from USD 484.81 billion in 2022 to USD 792.51 billion by 2029, exhibiting a CAGR of 7.3% during the forecast period. This phenomenal growth can be attributed to several factors, including the increasing demand for mobile Internet access, the proliferation of social media, the rise of e-commerce, and the integration of smartphones into various aspects of our daily lives. From communication and entertainment to productivity and beyond, smartphones have become an indispensable tool for people of all ages and backgrounds. As smartphones continue to evolve with new features and capabilities, such as augmented reality, artificial intelligence, and 5G connectivity, their popularity is expected to continue growing in the foreseeable future, shaping the way we live, work, and connect in a rapidly changing digital landscape.

The versatility of smartphones is a key factor that contributes to their widespread appeal. They have become an all-in-one device that seamlessly integrates various aspects of our lives; in fact, for many people, smartphones have become the primary means of accessing the Internet, checking emails, and staying connected with the world. The adaptability of smartphones, their affordability, and the continuous evolution of technology have all contributed to their widespread popularity, offering versatility, convenience, and accessibility that appeal to a wide range of consumers. As technology continues to advance, smartphones are likely to remain a dominant force in the realm of consumer electronics.

When it comes to mobile operating systems, Android has gained significant popularity in recent years, emerging as a dominant force in the smartphone market. According to recent findings by StatCounter, Android commands a staggering 71.45% of the worldwide market share, while iOS trails behind with a 27.83% share. Together, these two giants account for over 99% of the total market share, leaving scant room for other contenders like Samsung and KaiOS, which collectively make up less than 1% of the market. These numbers highlight the dominance of Android and iOS as the preeminent mobile operating systems (https://www.appmysite.com/blog/android-vs-ios-mobile-operating-system-market-share-statistics-you-must-know/). The ability to customize and personalize the user experience has been a significant draw for many Android users. Moreover, Android's seamless integration with Google services, such as Google Drive, Google Maps, and Google Assistant, has played a pivotal role in its widespread adoption. Finally, Android's compatibility with a wide range of third-party devices and accessories, such as smartwatches, smart TVs, and smart home devices, has further cemented its position as a preferred choice for tech-savvy users who seek seamless connectivity across different devices. Overall, Android's flexibility, affordability, customization options, and compatibility have contributed to its growing popularity and market dominance in the realm of mobile operating systems.



1.1 Motivation

Android has emerged as the primary target for malware apps due to several factors. First and foremost, Android's widespread adoption as the most widely used mobile operating system makes it an attractive target for cybercriminals seeking a large user base to exploit. Additionally, the open-source nature of Android allows for customization and flexibility, but it also means that potential vulnerabilities can be exploited by malicious actors. The decentralized nature of the Android app ecosystem, with multiple app stores and varying levels of app review, can also create opportunities for malware to slip through the cracks. Furthermore, the diverse hardware and software configurations across Android devices make it challenging to implement uniform security measures. Lastly, the popularity of third-party app stores and the availability of apps outside the official Google Play Store increase the risk of downloading malware-laden apps. Collectively, these factors make Android the biggest target for malware apps, necessitating robust security measures to safeguard users' devices and data. During 2022, the worldwide number of malware attacks reached 5.5 billion, an increase of two percent compared to the preceding year. In recent years, the highest number of malware attacks was detected in 2018, when 10.5 billion such attacks were reported across the globe (https://www.statista.com/statistics/873097/malware-attacks-per-year-worldwide/).

Malware, or malicious software, can pose various risks to Android devices, including:

• Data theft: Malware can be designed to steal sensitive information from Android devices, such as passwords, credit card numbers, and personal data.
• Unauthorized charges: Some types of malware, such as premium-rate SMS malware, can send text messages to premium-rate numbers, resulting in unauthorized charges on the mobile bill.
• Spread to other devices: Some malware can spread to other devices on the same network or through infected apps.
• Financial loss: Some types of malware, such as ransomware, can encrypt files and demand a ransom for their release.

Efforts to develop effective techniques for detecting malware in application stores are critical due to the dynamic nature of permission usage in apps. Common issues with permission-based detection methods include:

• Variability in permission usage across apps, making it challenging to establish consistent correlations with malicious behavior.
• False positives, as benign apps may use permissions in similar ways to malicious apps.

By leveraging innovative techniques, the proposed work aims to provide a fresh perspective on detecting malicious apps in Android. The novelty of our proposed work lies in the statistical feature selection procedure we have adopted.



Many existing works use only feature-class correlation, whereas we use both feature-feature and feature-class correlation through enhanced correlation-based feature selection (ECFS). The preliminary results of this work are satisfactory, but evaluation over a wider range of $n_1$ and $n_2$ values remains for future work.

1.2 Contributions

In this paper, we have used a statistical feature selection technique called enhanced correlation-based feature selection (ECFS), which uses feature-feature correlation scores evaluated with ENMRS and feature-class correlation scores evaluated with crRelevance. The ECFS method was introduced by the authors of [1] for using these correlations effectively to extract relevant feature subsets from multi-class gene expression and other machine learning datasets. We have adapted this feature selection technique to a binary dataset whose features are the different permissions requested by malicious and benign apps. The objects required by the adopted method are the malicious and non-malicious application names, and for the class parameter we have defined Class A for non-malicious applications and Class B for malicious applications. The following points summarize the contributions of this work.

• We extracted permissions from malicious and non-malicious applications.
• We evaluated ENMRS scores for both malicious and non-malicious applications.
• We evaluated crRelevance scores for both malicious and non-malicious applications.
• Next, we evaluated ECFS scores for different values of $n_1$ and $n_2$.
• Lastly, we evaluated the accuracy for each combination of $n_1$ and $n_2$ using various machine learning techniques.
• We conclude our paper by noting that the highest accuracy achieved is 92.25% for the combination $n_1 = 0.9$ and $n_2 = 0.1$ with the random forest technique.

2 Related Work

In this section, we review existing studies related to this domain. There are numerous studies in the literature that focus on detecting intrusions or anomalies in the desktop domain [2–4]. However, since we aim to build a malware detector for Android OS, we focus our discussion on Android malware. Some Android malware detection techniques have analyzed dynamic network traffic features, such as [5–8]. Since we work on static detection, we limit our discussion


mostly to static detection techniques. The authors in [1] proposed a permission-ensemble-based mechanism to detect Android malware with permission combinations. The authors in [9] introduced a permission-based malware detection system and reimplemented Juxtapp for malware and piracy detection; performance was evaluated on a dataset with original, pirated, and malware-infected applications. The authors in [10] introduced DynaMalDroid, a dynamic analysis-based framework for detecting malicious Android apps that employs system call extraction and three modules: dynamic analysis, feature engineering, and detection. The authors in [11] developed a new method for Android application analysis that uses static analysis to collect important features and passes them to a functional API deep learning model. Li et al. [12] described a reliable Android malware classifier using a factorization machine architecture and app feature extraction; their results showed that interactions among features are critical to revealing malicious behavior patterns. Qiu et al. [13] proposed Multiview Feature Intelligence (MFI) for detecting evolving Android malware with similar capabilities; MFI extracts features via reverse engineering to identify specific capabilities from known malware groups and detect new malware with the same capability. The authors in [14] proposed a hybrid deep learning-based malware detection method, utilizing convolutional neural networks and bidirectional long short-term memory (BiLSTM) to accurately detect long-lasting malware. The authors in [15] introduced malware capability annotation (MCA) to detect security-related functionalities of discovered malware. The authors in [16] proposed a malware detection mechanism using transparent artificial intelligence that leverages app attributes to distinguish harmful apps from harmless ones. Khalid and Hussain [17] analyzed the impact of dynamic analysis categories and features on Android malware detection, using filter and wrapper methods to identify the most significant categories and list the important features within them.

The authors in [18] introduced SHERLOCK, a deep learning algorithm that uses self-supervision and the ViT model to identify malware. The authors in [19] identified and ranked permissions commonly found in normal and malicious apps. Li et al. [20] proposed a stealthy backdoor that is triggered when a specific app is introduced and demonstrated the attack on common malware detectors. The authors in [21] introduced AndroOBFS, an obfuscated malware dataset spanning three years (2018–2020), consisting of 16,279 real-world malware samples across six obfuscation categories and providing valuable temporal information. The authors in [22] proposed AdMat, a framework that uses an adjacency matrix to treat Android apps as images, enabling a convolutional neural network to differentiate between benign and malicious apps. Canfora et al. [23] designed LEILA, a tool that uses model checking to verify Java bytecode and detect Android malware families. Yousefi-Azar et al. [24] proposed Byte2vec, which improves static malware detection by embedding the semantic similarity of byte-level codes into feature and context vectors, allowing binary file feature representation and selection. The authors in [25] presented Alterdroid, a dynamic analysis approach for detecting obfuscated malware components within apps. It works by creating modified versions of the original app and observing the behavioral differences. Eom et al. [26] used three feature selection methods to build a machine


learning-based Android malware detector, showing its effectiveness on the Malware Genome Project dataset and their own collected data. Zhang and Jin [27] proposed a process for Android malware detection using static analysis and ensemble learning. Dissanayake et al. [28] evaluated the K-nearest neighbor (KNN) algorithm's performance with different distance metrics and principal component analysis (PCA); their results show improved classification accuracy and efficiency with the right distance metric and PCA. The authors in [29] focused on detecting Android malware in APK files by analyzing obfuscation techniques, permissions, and API calls, highlighting the challenges faced by traditional antivirus software in detecting these malware variants. Amenova et al. [30] proposed a CNN-LSTM deep learning approach for Android malware detection, achieving high accuracy through efficient feature extraction. Mantoro et al. [31] employed dynamic analysis using the Mobile Security Framework to detect obfuscated malware, showcasing the effectiveness of dynamic analysis in detecting various types of malware. The authors in [32] compared state-of-the-art mobile malware detection methods, addressing Android malware and various detection classifiers, and provided insights into the progress of the Android platform and the advancements in malware detection. The authors in [33] proposed FAMD, a framework for fast Android malware detection based on a combination of multiple features; the original feature set is constructed by extracting permissions and Dalvik opcode sequences from samples. Awais et al. [34] introduced ANTI-ANT, a framework that detects and prevents malware on mobile devices using three detection layers, static and dynamic analysis, and multiple classifiers. Islam et al. [35] investigated the effectiveness of unigrams, bigrams, and trigrams with stacked generalization and found that unigrams have the highest detection rate, with over 97% accuracy. The authors in [36–39] have analyzed various manifest file components, such as permissions, intents, and hardware features, for Android malware detection.

To the best of our knowledge, no other existing work has used the enhanced correlation-based feature selection method on permission features for Android malware detection. We explain the methodology in detail in the next section.

3 Proposed Methodology

We explain the system design in various sub-phases described below.

3.1 Datasets

Our study involved the use of two datasets, one comprising normal apps and the other containing malicious apps. The dataset for normal apps was collected from the Google Play Store, while the dataset for malicious apps was obtained from AndroZoo (https://androzoo.uni.lu).


It is important to note that our study solely focused on apps available in the Google Play Store and did not consider apps from other platforms. AndroZoo is a growing library of Android apps collected from various sources, including the official Google Play app market, together with app-related metadata aimed at facilitating research on Android devices. The library currently contains 15,097,876 unique APKs, which have been scanned by multiple antivirus programs to identify malicious software; each app in the dataset has over 20 different types of metadata, including VirusTotal reports. Our dataset consisted of 111,010 applications, with 55,505 labeled as malicious and the remaining 55,505 labeled as normal.
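As an illustration, the snippet below sketches how samples can be fetched from AndroZoo, assuming its documented download endpoint; the API key and SHA-256 value are placeholders (AndroZoo requires registration, and hashes come from its published index).

```python
import requests

API_KEY = "YOUR_ANDROZOO_API_KEY"  # placeholder: issued upon registration
SAMPLE_SHA256 = "..."              # placeholder: taken from the AndroZoo index
BASE_URL = "https://androzoo.uni.lu/api/download"

def download_apk(sha256: str, out_path: str) -> None:
    """Fetch one APK, identified by its SHA-256 digest, and save it to disk."""
    resp = requests.get(
        BASE_URL, params={"apikey": API_KEY, "sha256": sha256}, timeout=300
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)

download_apk(SAMPLE_SHA256, "sample.apk")
```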

3.2 Feature Extraction

Feature extraction is a vital process in Android malware analysis, as it helps in identifying the characteristics of malware and distinguishing it from benign applications. Android permissions are commonly used as features for building a predictive model for Android malware analysis. Permissions related to the network, reading private data, receiving and sending SMS, dialing, and others are considered dangerous permissions and are used to distinguish between malicious and benign applications. Hence, we have selected permissions as the feature for the experiments in this proposed work. We followed a static approach to extract permissions, which involves decompiling the app's APK file using tools such as Apktool, JADX, or Androguard to extract the manifest file, which contains the app's permission declarations; these declarations are then extracted from the manifest file using XML parsing libraries. We had a total of 129 unique permissions from both datasets.
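A minimal sketch of this extraction step is shown below, assuming the Androguard library; the APK paths are placeholders, and newer Androguard releases expose APK under androguard.core.apk instead of the path used here.

```python
import numpy as np
from androguard.core.bytecodes.apk import APK  # import path in Androguard 3.x

def extract_permissions(apk_path: str) -> set:
    """Parse the APK's manifest and return its declared permissions."""
    return set(APK(apk_path).get_permissions())

apk_paths = ["benign/app1.apk", "malicious/app2.apk"]  # placeholder paths
perm_sets = [extract_permissions(p) for p in apk_paths]

# Global permission vocabulary (129 unique permissions in our datasets)
# and the binary app-by-permission feature matrix.
all_perms = sorted(set().union(*perm_sets))
X = np.array([[1 if perm in s else 0 for perm in all_perms] for s in perm_sets])
```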

3.3 Feature-Feature Correlation with ENMRS

To assess the ECFS scores in our dataset, we utilized the effective normalized mean residue similarity (ENMRS) measure, an extension of the normalized mean residue similarity (NMRS) measure. While Pearson's correlation coefficient is a commonly employed correlation measure, we selected NMRS as it exclusively focuses on detecting shifting correlation rather than scaling correlation. However, both NMRS and Pearson's correlation coefficient are highly sensitive to atypical or noisy values, potentially leading to the exclusion of significant features from the optimal feature subset. To overcome this limitation, we substituted the object mean with object local means in ENMRS.



These local means are computed by averaging the element with its neighboring elements, both to the left and right. This characteristic is crucial in feature-feature correlation analysis when there is correlation within a subset of homogeneous objects. In our study, we considered only a single left and a single right neighbor, i.e., the single neighborhood scheme, in our local mean computation. The ENMRS similarity between a pair of objects $d_1 = [a_1, a_2, \ldots, a_n]$ and $d_2 = [b_1, b_2, \ldots, b_n]$ is defined as follows:

$$\mathrm{ENMRS}(d_1, d_2) = 1 - \frac{\sum_{i=1}^{n} \left| a_i - a_{\mathrm{lmean}(i)} - b_i + b_{\mathrm{lmean}(i)} \right|}{2 \times \max\left( \sum_{i=1}^{n} \left| a_i - a_{\mathrm{lmean}(i)} \right|,\; \sum_{i=1}^{n} \left| b_i - b_{\mathrm{lmean}(i)} \right| \right)}$$

where

$$a_{\mathrm{lmean}(i)} = \begin{cases} (a_{i-1} + a_i + a_{i+1})/3, & \text{if } 1 < i < n, \\ (a_i + a_{i+1})/2, & \text{if } i = 1, \\ (a_{i-1} + a_i)/2, & \text{if } i = n, \end{cases}$$

and $b_{\mathrm{lmean}(i)}$ is defined analogously.
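A minimal NumPy sketch of this computation (single-neighborhood scheme, vectors of length at least two) follows.

```python
import numpy as np

def local_means(x: np.ndarray) -> np.ndarray:
    """Single-neighborhood local means: each element averaged with its
    immediate left and right neighbors (two-element averages at the ends)."""
    lm = np.empty_like(x, dtype=float)
    lm[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
    lm[0] = (x[0] + x[1]) / 2.0
    lm[-1] = (x[-2] + x[-1]) / 2.0
    return lm

def enmrs(d1: np.ndarray, d2: np.ndarray) -> float:
    """ENMRS similarity between two equal-length vectors d1 and d2."""
    ra = d1 - local_means(d1)  # residues of d1 around its local means
    rb = d2 - local_means(d2)
    denom = 2.0 * max(np.abs(ra).sum(), np.abs(rb).sum())
    if denom == 0.0:           # both vectors are perfectly flat
        return 1.0
    return 1.0 - np.abs(ra - rb).sum() / denom
```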

3.4 Feature-Class Correlation Measure: crRelevance

The crRelevance measure assesses how well a feature can differentiate between different class labels (in our case, malware and normal) and yields a value within the [0, 1] range. A class range refers to a range of values for a feature where all objects share the same class label; it is determined by assigning a consecutive range of values to a feature with identical class labels. The crRelevance measure is built upon four definitions that establish its theoretical foundation. The first definition states that, for a feature with values corresponding to the n objects or instances in the dataset, a class range is a range in which all objects have the same class label. The second definition defines the cardinality of a class range, rcard, as the number of objects in the given range for the feature. The third definition describes the class-cardinality of class A, ccard(A), as the number of objects with the class label A. The fourth definition pertains to the core class range of class A, ccrange(A), which is the largest class range for class A. The crRelevance of a feature $f_i$ for class $A$ is defined as follows:

$$\mathrm{crRelevance}^{\mathrm{class}}_{f_i}(A) = \frac{\mathrm{rcard}(\mathrm{ccrange}(A))}{\mathrm{ccard}(A)}$$

For a dataset $D$, the core class relevance of a feature $f_i \in F$ is defined as the highest crRelevance over the classes. Mathematically, the crRelevance of a feature $f_i$ for a dataset with $n$ classes $A_1, A_2, \ldots, A_n$ is

$$\mathrm{crRelevance}(f_i) = \max_{1 \le j \le n} \mathrm{crRelevance}^{\mathrm{class}}_{f_i}(A_j)$$
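The sketch below is one reading of these four definitions: objects are ordered by the feature's values, the core class range of a class is taken as the longest consecutive run of that class's label in the ordering, and its cardinality is divided by the class-cardinality. How ties between equal feature values are ordered is an assumption here.

```python
import numpy as np

def cr_relevance(feature: np.ndarray, labels: np.ndarray) -> float:
    """crRelevance of a single feature: the largest class range of any class,
    normalized by that class's cardinality (a value in [0, 1])."""
    order = np.argsort(feature, kind="stable")  # order objects by feature value
    sorted_labels = labels[order]
    best = 0.0
    for cls in np.unique(labels):
        ccard = int((labels == cls).sum())      # class-cardinality of cls
        run = longest = 0                       # longest consecutive same-class run
        for lab in sorted_labels:
            run = run + 1 if lab == cls else 0
            longest = max(longest, run)
        best = max(best, longest / ccard)       # core class range / class-cardinality
    return best
```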

Smartphone Malware Detection Based on Enhanced …

37

3.5 Proposed Feature Selection Technique: ECFS

The proposed feature selection method combines ENMRS and crRelevance to compute an ECFS value in the range [0, 1] for each pair of features. The method ensures that a higher ECFS value corresponds to a stronger crRelevance score (representing feature-class correlation) and a weaker ENMRS score (representing feature-feature correlation). This is achieved by subtracting the ENMRS value of the feature pair from 1 and adding the average crRelevance score of the two features. The ECFS value of a pair of features $f_1, f_2$ is computed as follows:

$$\mathrm{ECFS}(f_1, f_2) = n_1 \times \left(1 - \mathrm{ENMRS}(f_1, f_2)\right) + n_2 \times \mathrm{avgRelevance}(f_1, f_2)$$

where $n_1$ and $n_2$ are constants such that $n_1 + n_2 = 1$, and

$$\mathrm{avgRelevance}(f_1, f_2) = \frac{\mathrm{crRelevance}(f_1) + \mathrm{crRelevance}(f_2)}{2}.$$

ENMRS is calculated directly between each pair of features, while the crRelevance values of the two individual features are averaged to obtain the crRelevance value of the feature pair. Multiplying the computed feature-feature and feature-class components by the constants $n_1$ and $n_2$ scales the range from [0, 2] to [0, 1] and regulates the influence of each component on the ECFS score. When both $n_1$ and $n_2$ are set to 0.5, both components contribute equally to the score; if $n_1$ is greater than 0.5, ENMRS (feature-feature correlation) contributes more, and conversely, if $n_2$ is greater than 0.5, crRelevance (feature-class correlation) contributes more. The method selects a user-defined number of features by iteratively choosing the next highest unprocessed feature pair that shares at least one common feature with the features chosen so far and including the pair's features in the selected subset; the resulting subset of selected features is presented as the output.
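A compact sketch of the pair scoring and a greedy reading of the selection rule is given below; it reuses the enmrs and cr_relevance helpers sketched earlier, and the exact tie-breaking of the original algorithm in [1] may differ.

```python
import itertools
import numpy as np

def ecfs_scores(X: np.ndarray, y: np.ndarray, n1: float, n2: float) -> dict:
    """ECFS value for every feature pair of the matrix X with labels y."""
    assert abs(n1 + n2 - 1.0) < 1e-9            # the method requires n1 + n2 = 1
    rel = [cr_relevance(X[:, j], y) for j in range(X.shape[1])]
    scores = {}
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        avg_rel = (rel[i] + rel[j]) / 2.0        # avgRelevance(f_i, f_j)
        ff = 1.0 - enmrs(X[:, i].astype(float), X[:, j].astype(float))
        scores[(i, j)] = n1 * ff + n2 * avg_rel
    return scores

def select_features(scores: dict, k: int) -> list:
    """Greedy pass over pairs in descending ECFS order; a pair contributes its
    features when it is the first pair or shares a feature with the subset."""
    selected = []
    for (i, j), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if not selected or i in selected or j in selected:
            for f in (i, j):
                if f not in selected:
                    selected.append(f)
        if len(selected) >= k:
            break
    return selected[:k]
```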

3.6 Machine Learning Techniques Used

We used the following machine learning techniques to evaluate the efficiency of the ECFS scores for different values of $n_1$ and $n_2$; a minimal training sketch is given after the list.

• Decision Tree: Decision trees are easy to interpret and can handle both categorical and numerical data.


• Support Vector Machine: SVM is particularly useful when the number of features is high and the data is not linearly separable. It can handle both linear and nonlinear classification by using different types of kernels.
• Logistic Regression: Logistic regression is a simple yet powerful algorithm that can handle binary and multi-class classification problems and provides interpretable results in terms of the contribution of each feature to the prediction.
• Random Forest: Random forest is a popular algorithm known for its high accuracy and robustness to overfitting. It can handle both classification and regression problems and can provide insights into feature importance.
• K-Nearest Neighbor Classifier (KNN): KNN is a simple and intuitive algorithm that does not make any assumptions about the underlying distribution of the data. It can handle both classification and regression problems and can adapt to changes in the data.
• Gaussian Naive Bayes: Gaussian Naive Bayes is a fast and efficient algorithm that can handle high-dimensional data. It is particularly useful when the number of features is much larger than the number of samples.
• Perceptron: Perceptron is a simple and efficient algorithm that can handle linearly separable binary classification problems. It converges quickly and is computationally efficient.
• SGD Classifier: The SGD classifier is a fast and scalable algorithm that can handle large datasets with high-dimensional features. It is particularly useful for online learning and can adapt to changes in the data.

Each of these machine learning algorithms has its strengths and weaknesses, and the choice of algorithm depends on the specific problem and the characteristics of the data. It is important to understand the underlying assumptions, limitations, and trade-offs of each algorithm before applying it to real-world problems.
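The following is a minimal sketch of this evaluation loop, assuming scikit-learn. The paper does not state the exact split protocol, so the ten repeated hold-out splits below are an assumption made to mirror the per-model score lists reported in Tables 1, 2, 3, 4, 5, 6, 7, 8 and 9; X_sel and y stand for the ECFS-selected permission matrix and the app labels from the previous steps.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Decision tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(),
    "KNeighbors classifier": KNeighborsClassifier(),
    "Gaussian NB": GaussianNB(),
    "Perceptron": Perceptron(),
    "SGD classifier": SGDClassifier(),
}

# X_sel: app-by-permission matrix restricted to the ECFS-selected features;
# y: binary labels (0 = benign, 1 = malicious). Both assumed from earlier steps.
for name, model in models.items():
    scores = []
    for run in range(10):  # ten repeated hold-out splits (assumed protocol)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_sel, y, test_size=0.2, random_state=run
        )
        scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))
    print(f"{name}: {100 * np.mean(scores):.2f}%")
```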

4 Results and Discussion

As discussed in the previous section, we can apply different values of $n_1$ and $n_2$ under the constraint that their sum equals one. Hence, we have used different combinations of $n_1$ and $n_2$, and we summarize the results for each combination in the subsections below.

4.1 $n_1 = 0.1$ and $n_2 = 0.9$

From Table 1, we conclude that for $n_1 = 0.1$ and $n_2 = 0.9$, the highest accuracy, i.e., 89.21%, is obtained with random forest, and the lowest accuracy, i.e., 70.16%, is obtained by perceptron. As $n_1 < n_2$, the accuracy scores are more inclined toward the crRelevance component of the ECFS score.


Table 1 ML accuracy results for $n_1 = 0.1$ and $n_2 = 0.9$

ML model | Accuracy scores | Accuracy (%)
Decision tree | 0.8762, 0.8744, 0.8675, 0.8819, 0.8788, 0.8650, 0.8788, 0.8625, 0.8681, 0.8681 | 87.21
SVM | 0.8100, 0.8197, 0.8125, 0.8184, 0.8150 | 81.51
Logistic regression | 0.7881, 0.8038, 0.8125, 0.7994, 0.8019, 0.7838, 0.8000, 0.8038, 0.7925, 0.8031 | 79.88
Random forest | 0.8956, 0.8888, 0.8988, 0.8919, 0.8938, 0.8806, 0.9038, 0.8844, 0.8888, 0.8950 | 89.21
KNeighbors classifier | 0.8694, 0.8669, 0.8669, 0.8675, 0.8744, 0.8575, 0.8681, 0.8631, 0.8706, 0.8738 | 86.78
Gaussian NB | 0.6844, 0.7006, 0.7013, 0.7044, 0.6913, 0.6819, 0.7075, 0.7131, 0.7006, 0.7038 | 69.88
Perceptron | 0.7394, 0.7400, 0.7369, 0.7463, 0.7356, 0.7713, 0.5388, 0.7550, 0.5031, 0.7500 | 70.16
SGD classifier | 0.7881, 0.8044, 0.8131, 0.7963, 0.8044, 0.7825, 0.8006, 0.8019, 0.7944, 0.8056 | 79.91


4.2 $n_1 = 0.2$ and $n_2 = 0.8$

From Table 2, we conclude that for $n_1 = 0.2$ and $n_2 = 0.8$, the highest accuracy, i.e., 87.90%, is obtained with random forest, and the lowest accuracy, i.e., 68.08%, is obtained by perceptron. As $n_1 < n_2$, the accuracy scores are more inclined toward the crRelevance component of the ECFS score.

4.3 $n_1 = 0.3$ and $n_2 = 0.7$

From Table 3, we conclude that for $n_1 = 0.3$ and $n_2 = 0.7$, the highest accuracy, i.e., 88.54%, is obtained by random forest, and the lowest accuracy, i.e., 70.91%, is obtained by Gaussian NB. As $n_1 < n_2$, the accuracy scores are more inclined toward the crRelevance component of the ECFS score.

4.4 $n_1 = 0.4$ and $n_2 = 0.6$

From Table 4, we conclude that for $n_1 = 0.4$ and $n_2 = 0.6$, the highest accuracy, i.e., 89.91%, is obtained by random forest, and the lowest accuracy, i.e., 62.14%, is obtained by perceptron. As $n_1 < n_2$, the accuracy scores are more inclined toward the crRelevance component of the ECFS score.

4.5 $n_1 = 0.5$ and $n_2 = 0.5$

From Table 5, we conclude that for $n_1 = 0.5$ and $n_2 = 0.5$, the highest accuracy, i.e., 87.80%, is obtained by random forest, and the lowest accuracy, i.e., 71.79%, is obtained by Gaussian NB. As $n_1 = n_2$, the accuracy scores are balanced between the crRelevance and ENMRS components of the ECFS score.

4.6 $n_1 = 0.6$ and $n_2 = 0.4$

From Table 6, we conclude that for $n_1 = 0.6$ and $n_2 = 0.4$, the highest accuracy, i.e., 90.96%, is obtained by random forest, and the lowest accuracy, i.e., 72.19%, is obtained by Gaussian NB. As $n_1 > n_2$, the accuracy scores are more inclined toward the ENMRS component of the ECFS score.


Table 2 ML accuracy results for $n_1 = 0.2$ and $n_2 = 0.8$

ML model | Accuracy scores | Accuracy (%)
Decision tree | 0.8581, 0.8638, 0.8713, 0.8581, 0.8650, 0.8613, 0.8463, 0.8606, 0.8488, 0.8738 | 86.07
SVM | 0.7778, 0.7922, 0.7850, 0.7859, 0.7900 | 78.62
Logistic regression | 0.7694, 0.7581, 0.7894, 0.7744, 0.7844, 0.7688, 0.7594, 0.7881, 0.7544, 0.7931 | 77.39
Random forest | 0.8831, 0.8856, 0.8925, 0.8694, 0.8769, 0.8769, 0.8669, 0.8769, 0.8713, 0.8906 | 87.90
KNeighbors classifier | 0.8563, 0.8631, 0.8756, 0.8488, 0.8563, 0.8400, 0.8463, 0.8588, 0.8394, 0.8738 | 85.58
Gaussian NB | 0.7094, 0.6919, 0.7169, 0.6931, 0.7169, 0.6944, 0.6950, 0.7131, 0.6938, 0.7256 | 70.50
Perceptron | 0.6275, 0.5019, 0.5275, 0.6813, 0.7394, 0.7325, 0.7269, 0.7625, 0.7313, 0.7775 | 68.08
SGD classifier | 0.7713, 0.7619, 0.7888, 0.7681, 0.7844, 0.7813, 0.7575, 0.7731, 0.7519, 0.7944 | 77.33


Table 3 ML accuracy results for $n_1 = 0.3$ and $n_2 = 0.7$

ML model | Accuracy scores | Accuracy (%)
Decision tree | 0.8656, 0.8662, 0.8563, 0.8613, 0.8631, 0.8613, 0.8606, 0.8500, 0.8456, 0.8544 | 85.84
SVM | 0.8016, 0.8050, 0.8000, 0.7991, 0.7938 | 79.99
Logistic regression | 0.7594, 0.7813, 0.7806, 0.7825, 0.7688, 0.7769, 0.7894, 0.7625, 0.7813, 0.7725 | 77.55
Random forest | 0.8831, 0.8894, 0.9013, 0.8881, 0.8919, 0.8838, 0.8819, 0.8869, 0.8731, 0.8750 | 88.54
KNeighbors classifier | 0.8588, 0.8662, 0.8713, 0.8538, 0.8656, 0.8588, 0.8613, 0.8588, 0.8506, 0.8494 | 85.94
Gaussian NB | 0.6938, 0.7106, 0.7344, 0.7150, 0.7019, 0.6994, 0.7213, 0.7069, 0.7019, 0.7063 | 70.91
Perceptron | 0.7406, 0.7863, 0.7656, 0.6881, 0.7688, 0.7531, 0.7731, 0.7525, 0.6613, 0.7388 | 74.28
SGD classifier | 0.7669, 0.8019, 0.7925, 0.7938, 0.7550, 0.7925, 0.7925, 0.7725, 0.7850, 0.7863 | 78.39


Table 4 ML accuracy results for $n_1 = 0.4$ and $n_2 = 0.6$

ML model | Accuracy scores | Accuracy (%)
Decision tree | 0.8856, 0.8681, 0.8788, 0.8762, 0.8738, 0.8712, 0.8819, 0.88, 0.8831, 0.8725 | 87.71
SVM | 0.8281, 0.8281, 0.8203, 0.8153, 0.8278 | 82.39
Logistic regression | 0.7962, 0.7938, 0.7844, 0.7875, 0.7831, 0.7831, 0.8006, 0.775, 0.7925, 0.7863 | 78.82
Random forest | 0.905, 0.8994, 0.9062, 0.8988, 0.8962, 0.8906, 0.8994, 0.8981, 0.9025, 0.8944 | 89.91
KNeighbors classifier | 0.8731, 0.8813, 0.88, 0.8688, 0.8681, 0.8763, 0.8869, 0.865, 0.8656, 0.8688 | 87.34
Gaussian NB | 0.7219, 0.7113, 0.7056, 0.7238, 0.7106, 0.7256, 0.7375, 0.7088, 0.725, 0.7188 | 71.89
Perceptron | 0.595, 0.78, 0.7625, 0.7644, 0.4369, 0.6, 0.4988, 0.4981, 0.7794, 0.4988 | 62.14
SGD classifier | 0.8044, 0.7956, 0.7894, 0.7663, 0.7856, 0.7913, 0.8113, 0.7863, 0.7956, 0.8119 | 79.38


Table 5 ML accuracy results for $n_1 = 0.5$ and $n_2 = 0.5$

ML model | Accuracy scores | Accuracy (%)
Decision tree | 0.8556, 0.8431, 0.8506, 0.8437, 0.8594, 0.8681, 0.8513, 0.8475, 0.8650, 0.8363 | 85.21
SVM | 0.7981, 0.7903, 0.7984, 0.8034, 0.7934 | 79.68
Logistic regression | 0.7812, 0.7750, 0.7750, 0.7700, 0.7681, 0.7800, 0.8006, 0.7644, 0.7669, 0.7725 | 77.54
Random forest | 0.8788, 0.8775, 0.8713, 0.8575, 0.8825, 0.8931, 0.8913, 0.8781, 0.8856, 0.8644 | 87.80
KNeighbors classifier | 0.8306, 0.8431, 0.8263, 0.8194, 0.8419, 0.8419, 0.8513, 0.8250, 0.8506, 0.8306 | 83.61
Gaussian NB | 0.7188, 0.7188, 0.7250, 0.7138, 0.7194, 0.7156, 0.7425, 0.7019, 0.7131, 0.7100 | 71.79
Perceptron | 0.7325, 0.5006, 0.7594, 0.7775, 0.7719, 0.7844, 0.7888, 0.7550, 0.7663, 0.7581 | 73.94
SGD classifier | 0.7863, 0.7863, 0.7719, 0.7688, 0.7675, 0.7856, 0.8044, 0.7669, 0.7694, 0.7788 | 77.86


Table 6 ML accuracy results for $n_1 = 0.6$ and $n_2 = 0.4$

ML model | Accuracy scores | Accuracy (%)
Decision tree | 0.8875, 0.9, 0.9075, 0.8919, 0.8819, 0.8938, 0.88, 0.8831, 0.89, 0.8938 | 89.09
SVM | 0.8528, 0.8519, 0.8422, 0.8519, 0.8553 | 85.08
Logistic regression | 0.8256, 0.8013, 0.8238, 0.8125, 0.805, 0.7994, 0.8069, 0.8113, 0.81, 0.795 | 80.91
Random forest | 0.9113, 0.91, 0.9231, 0.915, 0.8969, 0.9063, 0.9013, 0.9069, 0.9125, 0.9125 | 90.96
KNeighbors classifier | 0.8925, 0.8888, 0.885, 0.8931, 0.8806, 0.8894, 0.8875, 0.8794, 0.8806, 0.8819 | 88.59
Gaussian NB | 0.7481, 0.7013, 0.7281, 0.7081, 0.7194, 0.7081, 0.745, 0.7175, 0.7275, 0.7156 | 72.19
Perceptron | 0.8331, 0.5775, 0.8156, 0.81, 0.7781, 0.7675, 0.8056, 0.7763, 0.7938, 0.77 | 77.28
SGD classifier | 0.83, 0.8069, 0.8138, 0.8169, 0.7919, 0.8075, 0.8125, 0.8138, 0.8238, 0.7906 | 81.08


4.7 $n_1 = 0.7$ and $n_2 = 0.3$

From Table 7, we conclude that for $n_1 = 0.7$ and $n_2 = 0.3$, the highest accuracy, i.e., 91.64%, is obtained by random forest, and the lowest accuracy, i.e., 72.64%, is obtained by Gaussian NB. As $n_1 > n_2$, the accuracy scores are more inclined toward the ENMRS component of the ECFS score.

4.8 $n_1 = 0.8$ and $n_2 = 0.2$

From Table 8, we conclude that for $n_1 = 0.8$ and $n_2 = 0.2$, the highest accuracy, i.e., 91.96%, is obtained by random forest, and the lowest accuracy, i.e., 70.76%, is obtained by perceptron. As $n_1 > n_2$, the accuracy scores are more inclined toward the ENMRS component of the ECFS score.

4.9 $n_1 = 0.9$ and $n_2 = 0.1$

From Table 9, we conclude that for $n_1 = 0.9$ and $n_2 = 0.1$, the highest accuracy, i.e., 92.25%, is obtained by random forest, and the lowest accuracy, i.e., 71.20%, is obtained by perceptron. As $n_1 > n_2$, the accuracy scores are more inclined toward the ENMRS component of the ECFS score.

5 Conclusion

From Table 10, we conclude that the highest accuracy, i.e., 92.25%, was achieved by the random forest ML technique for $n_1 = 0.9$ and $n_2 = 0.1$. As $n_1 > n_2$, our best accuracy results were more inclined toward the ENMRS component of the ECFS scores. Table 10 also shows a pattern: accuracy tends to increase for higher $n_1$ values, i.e., when the feature-feature correlation (ENMRS) factor dominates, and tends to decrease for higher $n_2$ values, i.e., when the feature-class correlation (crRelevance) factor dominates. Thus, the ML techniques achieve better accuracy for a higher $n_1$ value and a lower $n_2$ value in the ECFS score. The preliminary results of this work are satisfactory, but evaluation over a wider range of $n_1$ and $n_2$ values remains for future work.


Table 7 ML accuracy results for $n_1 = 0.7$ and $n_2 = 0.3$

ML model | Accuracy scores | Accuracy (%)
Decision tree | 0.9000, 0.8981, 0.8894, 0.8894, 0.9069, 0.8969, 0.8881, 0.8856, 0.9081, 0.8875 | 89.50
SVM | 0.8538, 0.8547, 0.8634, 0.8588, 0.8638 | 85.89
Logistic regression | 0.8138, 0.7944, 0.8038, 0.8244, 0.8269, 0.8213, 0.8206, 0.8000, 0.8294, 0.8138 | 81.48
Random forest | 0.9150, 0.9175, 0.9038, 0.9113, 0.9294, 0.9275, 0.9144, 0.9056, 0.9219, 0.9175 | 91.64
KNeighbors classifier | 0.8894, 0.8925, 0.8813, 0.8781, 0.9025, 0.9019, 0.9025, 0.8875, 0.9006, 0.8938 | 89.30
Gaussian NB | 0.7244, 0.7063, 0.7250, 0.7219, 0.7413, 0.7331, 0.7263, 0.7125, 0.7375, 0.7363 | 72.64
Perceptron | 0.7425, 0.7756, 0.7619, 0.5519, 0.7975, 0.8500, 0.7981, 0.8319, 0.8656, 0.5463 | 75.21
SGD classifier | 0.8056, 0.7975, 0.8088, 0.8281, 0.8263, 0.8288, 0.8250, 0.7994, 0.8300, 0.8106 | 81.60


Table 8 ML accuracy results for $n_1 = 0.8$ and $n_2 = 0.2$

ML model | Accuracy scores | Accuracy (%)
Decision tree | 0.8994, 0.9156, 0.8931, 0.8912, 0.9019, 0.8969, 0.9094, 0.8950, 0.9019, 0.8931 | 89.975
SVM | 0.8628, 0.8641, 0.8663, 0.8741, 0.8713 | 86.7688
Logistic regression | 0.8088, 0.8356, 0.8244, 0.8163, 0.8294, 0.8281, 0.8250, 0.8269, 0.8319, 0.8163 | 82.4250
Random forest | 0.9169, 0.9150, 0.9169, 0.9163, 0.9288, 0.9194, 0.9244, 0.9163, 0.9225, 0.9200 | 91.9625
KNeighbors classifier | 0.8938, 0.8963, 0.9000, 0.8925, 0.8988, 0.8931, 0.9100, 0.8956, 0.9075, 0.8975 | 89.8500
Gaussian NB | 0.7150, 0.7363, 0.7500, 0.7219, 0.7494, 0.7425, 0.7419, 0.7363, 0.7469, 0.7306 | 73.7063
Perceptron | 0.5588, 0.8050, 0.8113, 0.7213, 0.7375, 0.7175, 0.7988, 0.8013, 0.7281, 0.3969 | 70.7625
SGD classifier | 0.8081, 0.8381, 0.8100, 0.8100, 0.8363, 0.8256, 0.8238, 0.8219, 0.8306, 0.8156 | 82.2000


Table 9 ML accuracy results for $n_1 = 0.9$ and $n_2 = 0.1$

ML model | Accuracy scores | Accuracy (%)
Decision tree | 0.9013, 0.9056, 0.9000, 0.9100, 0.9063, 0.9069, 0.9094, 0.9019, 0.8913, 0.8969 | 90.2938
SVM | 0.8831, 0.8797, 0.8747, 0.8813, 0.8559 | 87.4938
Logistic regression | 0.8188, 0.8319, 0.8263, 0.8369, 0.8269, 0.8344, 0.8313, 0.8256, 0.8200, 0.8188 | 82.7063
Random forest | 0.9113, 0.9256, 0.9256, 0.9225, 0.9319, 0.9325, 0.9256, 0.9206, 0.9194, 0.9100 | 92.2500
KNeighbors classifier | 0.8925, 0.9000, 0.9156, 0.9094, 0.9063, 0.9188, 0.9019, 0.9094, 0.9088, 0.8913 | 90.5375
Gaussian NB | 0.7388, 0.7344, 0.7381, 0.7538, 0.7381, 0.7413, 0.7450, 0.7281, 0.7388, 0.7313 | 73.8750
Perceptron | 0.8500, 0.7988, 0.8388, 0.5619, 0.6325, 0.8544, 0.6206, 0.8506, 0.5619, 0.5506 | 71.2000
SGD classifier | 0.8163, 0.8256, 0.8250, 0.8363, 0.8200, 0.8213, 0.8294, 0.8219, 0.8175, 0.8131 | 82.2625

Table 10 Highest ML accuracy for different values of $n_1$ and $n_2$

$n_1$ and $n_2$ | 0.1 and 0.9 | 0.2 and 0.8 | 0.3 and 0.7 | 0.4 and 0.6 | 0.5 and 0.5 | 0.6 and 0.4 | 0.7 and 0.3 | 0.8 and 0.2 | 0.9 and 0.1
Accuracy (%) | 89.21 | 87.89 | 88.54 | 89.90 | 87.80 | 90.95 | 91.63 | 91.96 | 92.25


References

1. Borah P, Ahmed HA, Bhattacharyya DK (2014) A statistical feature selection technique. Netw Model Anal Health Inform Bioinf 3:55
2. Sushmakar N, Oberoi N, Gupta S, Arora A (2022) An unsupervised based enhanced anomaly detection model using features importance. In: 2022 2nd international conference on intelligent technologies (CONIT), Hubli, India, pp 1–7
3. Raman SKJ, Arora A (2022) An enhanced intrusion detection system using combinational feature ranking and machine learning algorithms. In: 2022 2nd international conference on intelligent technologies (CONIT), Hubli, India, pp 1–8
4. Sharma Y, Sharma S, Arora A (2022) Feature ranking using statistical techniques for computer networks intrusion detection. In: 2022 7th international conference on communication and electronics systems (ICCES), Coimbatore, India, pp 761–765
5. Arora A, Garg S, Peddoju SK (2014) Malware detection using network traffic analysis in android based mobile devices. In: 2014 Eighth international conference on next generation mobile apps, services and technologies, Oxford, UK, pp 66–71
6. Arora A, Peddoju SK (2017) Minimizing network traffic features for android mobile malware detection. In: Proceedings of the 18th international conference on distributed computing and networking (ICDCN '17). Association for Computing Machinery, New York, NY, USA, Article 32, pp 1–10
7. Arora A, Peddoju SK (2018) NTPDroid: a hybrid android malware detector using network traffic and system permissions. In: 2018 17th IEEE international conference on trust, security and privacy in computing and communications/12th IEEE international conference on big data science and engineering (TrustCom/BigDataSE), New York, NY, USA, pp 808–813
8. Arora A, Peddoju SK, Chouhan V, Chaudhary A (2018) Hybrid android malware detection by combining supervised and unsupervised learning. In: Proceedings of the 24th annual international conference on mobile computing and networking (MobiCom '18). Association for Computing Machinery, New York, NY, USA, pp 798–800
9. Kumari N, Chen M (2022) Malware and piracy detection in android applications. In: 2022 IEEE 5th international conference on multimedia information processing and retrieval (MIPR), CA, USA, pp 306–311
10. Haidros Rahima Manzil H, Naik MS (2022) DynaMalDroid: dynamic analysis-based detection framework for android malware using machine learning techniques. In: 2022 International conference on knowledge engineering and communication systems (ICKES), Chickballapur, India, pp 1–6
11. İbrahim M, Issa B, Jasser MB (2022) A method for automatic Android malware detection based on static analysis and deep learning. IEEE Access 10:117334–117352
12. Li C, Mills K, Niu D, Zhu R, Zhang H, Kinawi H (2019) Android malware detection based on factorization machine. IEEE Access 7:184008–184019
13. Qiu J et al (2023) Cyber code intelligence for Android malware detection. IEEE Trans Cybern 53(1):617–627
14. Haq IU, Khan TA, Akhunzada A (2021) A dynamic robust DL-based model for android malware detection. IEEE Access 9:74510–74521
15. Qiu J et al (2019) A3CM: automatic capability annotation for Android malware. IEEE Access 7:147156–147168
16. Alani MM, Awad AI (2022) PAIRED: an explainable lightweight Android malware detection system. IEEE Access 10:73214–73228
17. Khalid S, Hussain FB (2022) Evaluating dynamic analysis features for Android malware categorization. In: 2022 International wireless communications and mobile computing (IWCMC), Dubrovnik, Croatia, pp 401–406


18. Seneviratne S, Shariffdeen R, Rasnayaka S, Kasthuriarachchi N (2022) Self-supervised vision transformers for malware detection. IEEE Access 10:103121–103135
19. Upadhayay M, Sharma A, Garg G, Arora A (2021) RPNDroid: Android malware detection using ranked permissions and network traffic. In: 2021 Fifth world conference on smart trends in systems security and sustainability (WorldS4), London, United Kingdom, pp 19–24
20. Li C et al (2022) Backdoor attack on machine learning based Android malware detectors. IEEE Trans Dependable Secure Comput 19(5):3357–3370
21. Kumar S, Mishra D, Panda B, Shukla SK (2022) AndroOBFS: time-tagged obfuscated Android malware dataset with family information. In: 2022 IEEE/ACM 19th international conference on mining software repositories (MSR), Pittsburgh, PA, USA, pp 454–458
22. Vu LN, Jung S (2021) AdMat: a CNN-on-matrix approach to Android malware detection and classification. IEEE Access 9:39680–39694
23. Canfora G, Martinelli F, Mercaldo F, Nardone V, Santone A, Visaggio CA (2019) LEILA: formal tool for identifying mobile malicious behaviour. IEEE Trans Softw Eng 45(12):1230–1252
24. Yousefi-Azar M, Hamey L, Varadharajan V, Chen S (2020) Byte2vec: malware representation and feature selection for Android. Comput J 63(1):1125–1138
25. Suarez-Tangil G, Tapiador JE, Lombardi F, Pietro RD (2016) Alterdroid: differential fault analysis of obfuscated smartphone malware. IEEE Trans Mobile Comput 15(4):789–802
26. Eom T, Kim H, An S, Park JS, Kim DS (2018) Android malware detection using feature selections and random forest. In: 2018 International conference on software security and assurance (ICSSA), Seoul, Korea (South), pp 55–61
27. Zhang X, Jin Z (2016) A new semantics-based android malware detection. In: 2016 2nd IEEE international conference on computer and communications (ICCC), Chengdu, pp 1412–1416
28. Dissanayake S, Gunathunga S, Jayanetti D, Perera K, Liyanapathirana C, Rupasinghe L (2022) An analysis on different distance measures in KNN with PCA for Android malware detection. In: 2022 22nd international conference on advances in ICT for emerging regions (ICTer), Colombo, Sri Lanka, pp 178–182
29. Hassan M, Sogukpinar I (2022) Android malware variant detection by comparing traditional antivirus. In: 2022 7th international conference on computer science and engineering (UBMK), Diyarbakir, Turkey, pp 507–511
30. Amenova S, Turan C, Zharkynbek D (2022) Android malware classification by CNN-LSTM. In: 2022 International conference on smart information systems and technologies (SIST), Nur-Sultan, Kazakhstan, pp 1–4
31. Mantoro T, Stephen D, Wandy W (2022) Malware detection with obfuscation techniques on android using dynamic analysis. In: 2022 IEEE 8th international conference on computing, engineering and design (ICCED), Sukabumi, Indonesia, pp 1–6
32. Jebin Bose S, Kalaiselvi R (2022) A state-of-the-art analysis of android malware detection methods. In: 2022 6th international conference on trends in electronics and informatics (ICOEI), Tirunelveli, India, pp 851–855
33. Bai H, Xie N, Di X, Ye Q (2020) FAMD: a fast multifeature Android malware detection framework, design, and implementation. IEEE Access 8:194729–194740
34. Awais M, Tariq MA, Iqbal J, Masood Y (2023) Anti-ant framework for android malware detection and prevention using supervised learning. In: 2023 4th International conference on advancements in computational sciences (ICACS), Lahore, Pakistan, pp 1–5
35. Islam T, Rahman S, Hasan M, Rahaman A, Jabiullah I (2020) Evaluation of N-gram based multi-layer approach to detect malware in Android. Procedia Comput Sci 171:1074–1082
36. Arora A, Peddoju SK, Conti M (2020) PermPair: Android malware detection using permission pairs. IEEE Trans Inf Forensics Secur 15:1968–1982
37. Khariwal K, Singh J, Arora A (2020) IPDroid: Android malware detection using intents and permissions. In: 2020 Fourth world conference on smart trends in systems, security and sustainability (WorldS4), London, UK, pp 197–202
38. Garg G, Sharma A, Arora A (2021) SFDroid: Android malware detection using ranked static features. Int J Recent Technol Eng 10(1):142–152


39. Gupta S, Sethi S, Chaudhary S, Arora A (2021) Blockchain based detection of Android malware using ranked permissions. Int J Eng Adv Technol (IJEAT) 10(5):68–75

Fake News Detection Using Ensemble Learning Models Devanshi Singh, Ahmad Habib Khan, and Shweta Meena

Abstract People are finding it simpler to find and consume news as a result of the easy access, rapid expansion, and profusion of information on social media and in traditional news outlets. However, it is becoming increasingly difficult to distinguish between true and false information, which has resulted in the proliferation of fake news. Fake news is a term that refers to comments and journalism that intentionally mislead readers. Additionally, the legitimacy of social media sites, where this news is primarily shared, is at stake. These fake news stories can have significant negative effects on society, so it is becoming increasingly important for researchers to focus on how to identify them. In this research paper, we have compared ensemble learning models for identifying fake news by analyzing a report's content and determining its veracity. The paper's objective is to use natural language processing (NLP) and machine learning (ML) algorithms to identify false news based on the content of news stories. Algorithms such as decision tree, random forest, AdaBoost, and XGBoost classification are used in this work. A web application has been developed using the Python Flask framework to mitigate the challenges associated with identifying false information.

Keywords Decision tree · Random forest · AdaBoost · XGBoost · Performance analysis

1 Introduction

The concept of fake news and hoaxes existed prior to the rise of online media. However, the rapid expansion of the availability of online news has made it challenging to distinguish between genuine information and fake news. Fake news has two key features: authenticity and objective.


Authenticity means verifying the legitimacy of information; in our case, fake news comprises false information that is often challenging to verify. The second feature of fake news is objective, or in simple terms intent, meaning that the information is intentionally created to mislead consumers, deceive the public into believing certain lies, and promote particular ideas or agendas.

The widespread circulation of false news can have a negative impact on our society as a whole. Firstly, it can alter how people perceive and react to news. Secondly, the abundance of fake news can undermine public trust in the media, leading to skepticism and damaging the credibility of news sources. Thirdly, fake news can manipulate people into accepting biased and false narratives. Politicians and officials often manipulate and exploit fake news for political purposes, influencing consumers and promoting their own agendas [1]. While social media giants like WhatsApp, Instagram, Facebook, and Twitter acknowledge that their platforms are misused, they are also confronted with the vast scale of the problem. Fake news can take various forms, including fake user accounts posting fake content, photoshopped images, skillfully created network-based content intended to mislead or delude a specific group of people, and fabricated stories offering seemingly scientific or simplistic explanations of unresolved issues, all of which eventually lead to the proliferation of false information. As a result of these characteristics, detecting fake news poses new and challenging problems.

The work proposed here tackles this issue. We have developed a web application for determining whether a news article is real or misleading using ensemble learning algorithms. In the next section, we thoroughly review and discuss related works to understand the complexities of the problem in every domain, including how fake news can conveniently change people's perspectives. The third section elaborates on the research methodology, where we compare classifiers for fake news detection; it contains a brief description of the dataset used, data preprocessing, feature extraction, and evaluation metrics, and describes the algorithms and their workings. The fourth section discusses the results obtained, and the last section concludes our paper.

Main contributions:

• The work opens opportunities to research new areas in fake news detection.
• The study compares ensemble-based learning methods.
• It gives a new understanding of the veracity of the news that we digest on a daily basis.

2 Related Works

This section reviews work on fake news detection, why the problem of false news persists, and the effects of its spread around the world. Fake news has become a major concern for society, as it is generated for various reasons, including commercial interests and malicious agendas. The diffusion of fake news on social media platforms has created a double-edged sword for news consumers: it grants people accessibility but also allows low-quality news to spread with the intention of misleading the public. Since false news online spreads faster than ever, it can alter people's perception of reality and have an adverse effect on society. This makes fake news detection a vital research area, and it has attracted significant attention in the past few years. Social media encompasses websites, applications, or software specifically designed for creating and sharing content, social networking, open discussion forums, and microblogging [2, 3]. Some scholars believe that fake news may arise unintentionally, for example through a lack of education or inadvertent actions, as occurred in the case of the Nepal earthquake [4, 5]. During 2020, a significant amount of fake news about health circulated, posing a threat to global health; in February 2020, the World Health Organization (WHO) issued a cautionary statement regarding the COVID-19 pandemic, stating that it had caused a widespread "infodemic" of both accurate and inaccurate information [1]. Misleading news stories have been a persistent issue, with some claiming that they played a role in influencing the outcome of the 2016 United States Presidential Election [6]. Many methods have been developed to tackle online misinformation, which has become rampant, especially through the sharing of articles not based on facts; the issue has caused problems chiefly in politics but also in areas such as sports, health, and science [7].

In response to the problem of fake news, researchers have focused on developing detection algorithms. One approach is to use machine learning algorithms such as decision trees, random forests, AdaBoost, and XGBoost. In one project, researchers focused on news articles marked as fake or real and employed the decision tree algorithm [8], a classifier, to aid in detecting misleading news. In a different study, researchers used the random forest classifier: multiple random forests [9] were implemented, each assigning a score to every potential outcome, and the result with the most votes was chosen. Machine learning algorithms show promising results for fake news detection, but the challenges posed by fake news on social media require continued research and the development of more effective solutions. One study performed a comparative analysis of SVM, Naïve Bayes, random forest, and logistic regression classifiers for fake news detection on different datasets; SVM and logistic regression did not perform well on some of them [10]. In another study, a dataset of 10,000 tweets and online posts of fake and real news concerning COVID-19 was analyzed using logistic regression (LR), K-nearest neighbor (KNN), linear support vector machine (LSVM), and stochastic gradient descent (SGD) [11].
As far as we know, the suggested methodology in this paper has not been applied in any studies that have attempted to address this problem statement.


3 Proposed Methodology

In our work, we examine ensemble learning models, namely the decision tree algorithm, the random forest algorithm (bagging), the AdaBoost algorithm (boosting), and the XGBoost algorithm (boosting), on a fake news dataset obtained from Kaggle. The models are evaluated, the accuracy of each is calculated, and the algorithms are then compared and ranked by efficiency. This section briefly describes the dataset, data preprocessing, feature extraction, all the evaluated algorithms, the evaluation metrics, and the web application we developed. The use-case diagram summarizes all the processes involved in the application.

3.1 Dataset Description and Data Preprocessing

The data preprocessing module performs all preprocessing required to prepare the training data collected from the primary dataset. This includes identifying any null or missing values in the training data and applying operations such as tokenization, which breaks a character string into smaller components (keywords, phrases, symbols, and words), and stemming, an NLP technique that reduces a word to its base form, or stem, so that similar words are recognized. The dataset used in our work was obtained from Kaggle and was created by Clément Bisaillon. Preparing the dataset for determining the authenticity of the news articles involved several discrete steps. Initially, the data was examined to ensure its availability and accessibility and then loaded into CSV files. It then required preprocessing to improve its quality and ensure compatibility with the machine learning pipeline. Preprocessing eliminated unnecessary information such as punctuation (redundant commas, question marks, quotes, and apostrophes) as well as irrelevant columns, missing values, numeric text, and URLs.
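A minimal sketch of the cleaning, tokenization, and stemming steps described above, assuming NLTK's Porter stemmer is available; the column names follow the Kaggle fake/real news CSVs, and the exact cleaning rules are illustrative rather than the paper's exact pipeline.

```python
import re
import pandas as pd
from nltk.stem import PorterStemmer  # pure-Python stemmer, no corpus downloads needed

stemmer = PorterStemmer()

def clean_text(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # drop punctuation and numeric text
    tokens = text.lower().split()               # simple whitespace tokenization
    return " ".join(stemmer.stem(tok) for tok in tokens)

df = pd.read_csv("Fake.csv")                    # file name as in the Kaggle dataset
df["clean"] = df["text"].fillna("").apply(clean_text)
```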

3.2 Feature Extraction

Because documents contain an enormous number of terms, words, and phrases, text categorization must cope with high-dimensional data, which places a significant computational load on the learning process. In addition, duplicate and irrelevant features may adversely affect a classifier's performance and accuracy. Feature reduction is therefore crucial to shrink the text feature set and avoid high-dimensional feature spaces. In this work, two feature representation methods were investigated: term frequency (TF) and term frequency-inverse document frequency (TF-IDF). They are explained below.

Term Frequency (TF). The TF technique measures how similar documents are to one another based on the frequency of the words they contain. Each document is represented by an equal-length vector of word counts, which is normalized so that its components sum to one; the normalized counts estimate the probability of each word appearing in the document. In the simplest binary variant, a word receives a value of one if it appears in a given document and zero otherwise, so each document is represented by its set of words.

Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is a weighting measure frequently used in information retrieval and natural language processing. It is a statistical tool for assessing a term's importance to a document within a dataset: a term's relevance grows with its frequency inside a document but is balanced by the word's frequency across the corpus. A key property of IDF is that it lessens the weight of very frequent terms while boosting uncommon ones. For instance, with TF alone, frequently occurring words such as "the" and "then" may dominate the frequency count; applying IDF reduces the significance of these terms.
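A brief sketch of TF-IDF feature extraction with scikit-learn; the vectorizer settings shown (feature cap, stop-word list) are illustrative assumptions, not the paper's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["breaking news the president said something",
        "scientists discover a new species of frog"]

# max_features caps the dimensionality of the text feature set,
# addressing the feature-reduction concern discussed above.
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(docs)   # sparse matrix: documents x terms
print(X.shape)
```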

3.3 Algorithms

The classifiers used in the study are decision tree, random forest, AdaBoost, and XGBoost. This section explains the algorithms and how they work.

Decision Tree Algorithm. A decision tree can, in simple terms, be described as a divide-and-conquer algorithm. In huge databases it can identify the characteristics and patterns that are crucial for classification and predictive modeling, and these characteristics, together with their intuitive interpretation, are key to extracting meaning from data. This is why decision trees have been used extensively in predictive modeling for many years, and why they have a solid foundation in both machine learning and artificial intelligence. A decision tree forms a tree-like structure that almost resembles a flowchart: the top-most element is the root node, an internal node signifies a feature or attribute, an edge signifies a decision rule, and a leaf node signifies the outcome. The tree learns to split on attribute values, and the process is repeated until a final outcome is obtained; this is called recursive partitioning. The resulting schematic diagram closely mimics human-level reasoning. The fundamental procedure of decision tree algorithms is as follows (see Figs. 1 and 2):

Fig. 1 Decision tree structure

Fig. 2 Decision tree algorithm

- The best attribute for dividing the data into subsets is chosen with the help of attribute selection measures (ASM).
- That attribute is used as a decision node to divide the data into smaller groups.
- The procedure is carried out recursively for each child node until one of the following conditions is satisfied: there are no more attributes to evaluate, there are no more data instances to split, or all data tuples share the same value of an attribute.

An advantage of decision tree modeling is the interpretability of the constructed model: beyond extracting relevant patterns, important features can also be identified. Because of this interpretability, information about interclass relationships can support future experiments and data analysis.

Fig. 3 Working of random forest algorithm

Decision tree methods apply to many different fields: they can be used to improve search engines, find new applications in the medical disciplines, discover data, extract text, spot data gaps in a class, and replace statistical techniques. Numerous decision tree algorithms have been created, differing in both accuracy and cost-effectiveness.

Random Forest Algorithm. Random forest is an advanced version of decision trees that uses multiple trees to make predictions; the ultimate forecast is based on the majority decision of the individual trees. This method yields a low error rate because the trees are only weakly correlated with each other. The random forest algorithm operates through the following sequence of steps (Fig. 3):

- Random samples are chosen from the provided dataset.
- A decision tree is created for each selected sample, and a prediction is obtained from each tree.
- The predictions are subjected to voting, where the mode is used for classification problems and the mean for regression problems.
- The final prediction is the result with the highest number of votes.

The random forest algorithm addresses the limitations of decision trees by improving accuracy and mitigating overfitting, and it eliminates the need for complicated package configurations. Its notable characteristics include heightened accuracy compared with single decision trees, the ability to handle missing data proficiently, the capacity to generate accurate predictions without hyper-parameter tuning, and a built-in remedy for the overfitting issue of decision trees. Additionally, each tree in a random forest selects a random subset of features at each splitting point [12, 13]. A brief training sketch comparing a single decision tree with a random forest follows.
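A hedged sketch of the two tree-based classifiers with scikit-learn. The synthetic data is a stand-in for the TF-IDF features; the split ratio and estimator count are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; in practice X, y would be the vectorized news articles
# and their fake/real labels.
X, y = make_classification(n_samples=2000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("decision tree accuracy:", dt.score(X_test, y_test))
print("random forest accuracy:", rf.score(X_test, y_test))  # majority vote over 100 trees
```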


Fig. 4 Working of AdaBoost algorithm

AdaBoost Algorithm. AdaBoost, short for "Adaptive Boosting," is a machine learning algorithm used for classification and regression tasks. It combines numerous "weak" classifiers into a "strong" classifier. In each iteration, AdaBoost trains a new classifier on the dataset and assigns higher weights to the samples misclassified by the previous classifiers, so that subsequent classifiers focus on the difficult samples the earlier ones could not classify accurately (Fig. 4). Because the weights are re-assigned, the ultimate "strong" classifier is a weighted average of the "weak" classifiers, with each classifier's weight proportional to its accuracy. For a new sample, AdaBoost integrates all of the weak classifiers' predictions, weighting them according to their significance, to create a final prediction. AdaBoost has been widely used in real-world applications and is known for its simplicity and high accuracy; however, if the number of iterations is too large, it may overfit the training data, and it is sensitive to noisy data and outliers.

XGBoost Algorithm. XGBoost, short for "Extreme Gradient Boosting," is a carefully designed package created to be an efficient, flexible, scalable, and portable distributed library. It implements machine learning algorithms on top of the gradient boosting framework, which forms the foundation of the algorithm. With the help of this library's parallel tree boosting technology, a variety of data science problems can be solved quickly and precisely. XGBoost has gained significant popularity in recent years and is now a prominent tool in applied machine learning.

Fig. 5 Working of XGBoost algorithm

It is also a famous tool in Kaggle competitions due to its high scalability in almost all cases; many winning solutions have used XGBoost to train their models. In essence, XGBoost can be seen as an improved version of gradient boosted decision trees (GBM), designed to enhance speed and performance (Fig. 5). Its main features are listed below; a boosting training sketch follows the list.

- Regularized learning: by refining the learned weights, regularization reduces the likelihood of overfitting. The regularized objective gives priority to models that use simple and predictive functions.
- Parallelization: the most time-consuming step in tree learning is sorting the data. To lower sorting costs, data is kept in in-memory units termed "blocks," with the data columns in each block organized according to the associated feature value. This computation only needs to be performed once before training and can be reused. Block sorting can be done separately and distributed among the CPU's parallel threads, and multiple CPU cores are used to train the model; collecting statistics for each column in parallel parallelizes the split finding.
- Two more methods, shrinkage and column subsampling, further help avoid overfitting. Shrinkage, introduced by Friedman, scales freshly added weights by a factor η after every step of tree boosting, decreasing the impact of each tree and allowing future trees to enhance the model. Column subsampling speeds up the parallel algorithm's calculations while preventing overfitting even more than conventional row subsampling does.
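A hedged sketch of the two boosting classifiers, continuing the train/test split from the earlier listing; the `xgboost` package and the hyperparameter values shown are assumptions, not settings reported by the paper.

```python
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier  # requires the xgboost package

ada = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# learning_rate plays the role of the shrinkage factor eta discussed above.
xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=6,
                    eval_metric="logloss").fit(X_train, y_train)

print("AdaBoost:", ada.score(X_test, y_test))
print("XGBoost :", xgb.score(X_test, y_test))
```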


3.4 Evaluation Metrics

To evaluate the accuracy of the algorithms in detecting fake news, various evaluation measures are used. The most commonly used is the confusion matrix, which serves to evaluate classification tasks. By defining fake news detection as a classification problem, the measures derived from the confusion matrix can be used to evaluate its performance [12]:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

where TP denotes True Positive, TN True Negative, FP False Positive, and FN False Negative, as given in Table 1. These measures are commonly used in machine learning to assess the effectiveness of a classifier from various perspectives. Accuracy indicates how close the predicted labels are to the actual ones. Precision gives the proportion of correctly identified fake news among all news identified as fake, an important concern in fake news classification. Recall measures sensitivity, i.e., the percentage of actual fake news correctly identified as fake; it matters because fake news datasets are frequently imbalanced and a high precision can be obtained simply by making fewer positive predictions. Higher values of recall, precision, and accuracy indicate better performance [14].

Table 1 Parameters of evaluation metrics

Parameter           | Description
True Positive (TP)  | Prediction is positive and it is correct
True Negative (TN)  | Prediction is negative and it is correct
False Positive (FP) | Prediction is positive but it is incorrect
False Negative (FN) | Prediction is negative but it is incorrect
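The four metrics above can be computed directly from a confusion matrix; a brief scikit-learn sketch, continuing the hypothetical `xgb` model and test split from the earlier listings.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_pred = xgb.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("precision:", precision_score(y_test, y_pred))  # tp / (tp + fp)
print("recall   :", recall_score(y_test, y_pred))     # tp / (tp + fn)
print("accuracy :", accuracy_score(y_test, y_pred))   # (tp + tn) / total
print("F1-score :", f1_score(y_test, y_pred))
```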


Fig. 6 Use case diagram

3.5 Web Application

We have built a web application using Python Flask that uses the machine learning algorithms (decision tree, random forest, AdaBoost, and XGBoost classifiers) to classify news as fake or real. Figure 6 shows the use-case diagram of the system.
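A minimal Flask sketch of such a service; the route name, form field, label mapping, and pickled artifact file names are hypothetical, chosen only for illustration.

```python
import pickle
from flask import Flask, request

app = Flask(__name__)

# Hypothetical artifacts: the fitted TF-IDF vectorizer and the chosen classifier
# saved after training.
vectorizer = pickle.load(open("tfidf.pkl", "rb"))
model = pickle.load(open("model.pkl", "rb"))

@app.route("/predict", methods=["POST"])
def predict():
    text = request.form.get("news_text", "")          # form field name assumed
    label = int(model.predict(vectorizer.transform([text]))[0])
    return {"prediction": "real" if label == 1 else "fake"}  # mapping assumed

if __name__ == "__main__":
    app.run(debug=True)
```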

4 Results and Discussion

This section explains the findings of the study and how different factors were used to gauge the performance of the research model. The evaluation outcomes on the fake news dataset are illustrated through the confusion matrices of the four algorithms used for detection, listed below (Fig. 7):


Fig. 7 a Decision tree, b Random forest, c AdaBoost, d XGBoost

- Decision tree
- Random forest
- AdaBoost
- XGBoost

The confusion matrices are obtained automatically in Python using the scikit-learn library when running the algorithm code on Kaggle. Figure 7 shows the obtained confusion matrices with their respective TP, FP, TN, and FN values. Table 2 lists the outcomes of all the evaluation metrics used to assess how correctly the fake news was categorized. The classification results show accuracies for the decision tree, random forest, AdaBoost, and XGBoost classifiers of 93.4%, 97.2%, 94.0%, and 97.6%, respectively. All the machine learning algorithms gave promising results in the detection of fake news, with XGBoost achieving the highest accuracy of 97.6%.


Table 2 Performance comparison of classification models

Model         | Precision | Recall | F1-score | Accuracy (%)
Decision tree | 0.93      | 0.93   | 0.93     | 93.4
Random forest | 0.97      | 0.97   | 0.97     | 97.2
AdaBoost      | 0.94      | 0.94   | 0.94     | 94.0
XGBoost       | 0.98      | 0.98   | 0.98     | 97.6

5 Conclusion

The widespread distribution of false information on the Internet can have negative consequences for society, as it can confuse and mislead readers. Machine learning can address this issue by predicting whether a news article is genuine or fake. Although different machine learning techniques have shown some success in distinguishing fake news from real news, classifying fake news remains challenging because of its constantly changing characteristics and elements. One key limitation is precisely this evolving nature, which complicates proper classification; acquiring large and diverse datasets that cover the vast landscape of fake news is another constraint that remains a challenge. Supervised ensemble methods such as random forest (bagging) and XGBoost (boosting) have been effective in detecting fake news, but collecting more data and continuously retraining the models on it is necessary to improve their accuracy. In the future, exploring ensemble methods in neural networks could further enhance the performance of fake news detection systems. Furthermore, it is critical to take into account the ethical implications of, and any biases in, automated fake news detection. Future studies should look into ways to overcome these issues, such as creating fairness-aware algorithms and incorporating explainable AI technologies to make the decision-making process transparent and understandable. By resolving these constraints and considering this future scope, advances in machine learning and neural network-based methods can contribute substantially to the creation of efficient and dependable systems for identifying and combating fake news in the online ecosystem.

References

1. Pulido CM, Ruiz-Eugenio L, Redondo-Sama G, Villarejo-Carballido B (2020) A new application of social impact in social media for overcoming fake news in health. Int J Environ Res Public Health 17(7):2430
2. Economic and Social Research Council. Using social media. Available at https://esrc.ukri.org/research/impact-toolkit/social-media/using-social-media
3. Gil P (2019) Available at https://www.lifewire.com/what-exactly-is-twitter-2483331
4. Tandoc EC Jr et al (2017) Defining fake news: a typology of scholarly definitions. Digit J 1–17
5. Radianti J et al (2016) An overview of public concerns during the recovery period after a major earthquake: Nepal Twitter analysis. In: HICSS'16 proceedings of the 2016 49th Hawaii international conference on system sciences (HICSS), Washington, DC, USA. IEEE, pp 136–145
6. Holan AD (2016) 2016 lie of the year: fake news. Politifact, Washington, DC, USA
7. Lazer DMJ, Baum MA, Benkler Y et al (2018) The science of fake news. Science 359(6380):1094–1096
8. Kotteti CMM, Dong X, Li N, Qian L (2018) Fake news detection enhancement with data imputation. In: 2018 IEEE 16th international conference on dependable, autonomic and secure computing, 16th international conference on pervasive intelligence and computing, 4th international conference on big data intelligence and computing and cyber science and technology congress (DASC/PiCom/DataCom/CyberSciTech)
9. Ni B, Guo Z, Li J, Jiang M (2020) Improving generalizability of fake news detection methods using propensity score matching. Soc Inf Netw. https://arxiv.org/abs/2002
10. Choudhury D, Acharjee T (2023) A novel approach to fake news detection in social networks using genetic algorithm applying machine learning classifiers. Multimed Tools Appl 82:9029–9045. https://doi.org/10.1007/s11042-022-12788-1
11. Malhotra R, Mahur A, Achint (2022) COVID-19 fake news detection system. In: 2022 12th international conference on cloud computing, data science & engineering (Confluence), Noida, India, pp 428–433. https://doi.org/10.1109/Confluence52989.2022.9734144
12. Ali J, Khan R, Ahmad N, Maqsood I (2019) Random forests and decision trees
13. Yousif SA, Samawi VW, Elkaban I, Zantout R (2015) Enhancement of Arabic text classification using semantic relations of Arabic WordNet
14. Shu K, Wang S, Tang J, Liu H (2019) Fake news detection on social media: a data mining perspective
15. Manzoor JS, Nikita (2019) Fake news detection using machine learning approaches: a systematic review. In: 2019 3rd international conference on trends in electronics and informatics (ICOEI), pp 230–234. https://doi.org/10.1109/ICOEI.2019.8862770

Ensemble Approach for Suggestion Mining Using Deep Recurrent Convolutional Networks Usama Bin Rashidullah Khan, Nadeem Akhtar, and Ehtesham Sana

Abstract The ability to extract valuable information from customer reviews and feedback is crucial for businesses in today's social media landscape. Many companies use social media networks to offer and deliver a range of services to their customers and to gather data on individual and customer opinions. Suggestion mining is an efficient method for automatically obtaining creative concepts and suggestions from web sources. In this paper, we present an ensemble model for suggestion mining, DRC_Net, that integrates deep neural networks, recurrent neural networks, and convolutional neural networks. We evaluated our model on the SemEval-2019 dataset containing reviews from multiple domains. The proposed model achieved better accuracy and F1-score than state-of-the-art models and performed well on Subtask A and Subtask B, representing in-domain and cross-domain validation: it receives F1-scores of 0.80 and 0.87, respectively. The model's ability to perform well on cross-domain validation suggests that it can be applied to various domains and datasets. Keywords Suggestion mining · Online reviews · Ensemble deep learning


1 Introduction

Businesses, customers, and researchers can all benefit from the knowledge contained in opinionated text found on blogs, social networking sites, discussion forums, and reviews: it provides insights into the quality of products and services as well as the experiences of customers. However, manually analysing large volumes of reviews is a daunting task, particularly as e-commerce continues to grow. Automated approaches, such as natural language processing techniques, offer a promising solution to this problem. One important task in review analysis is suggestion mining, which involves identifying suggestions or recommendations made by reviewers. Suggestion mining has important applications in e-commerce, where businesses can use the suggestions to improve their products and services, and consumers can make more informed purchasing decisions and use services more effectively. It involves the automatic extraction of suggestion sentences or phrases from online text where suggestions of interest are likely to appear [1]. These suggestions can be explicit or implicit: explicit suggestions are expressed unambiguously in the text, while implicit suggestions provide additional information that helps readers classify them as suggestions [2].

In this research, a new architecture for online review suggestion mining is evaluated to address the challenge of analysing large volumes of reviews. The proposed architecture, DRC_Net, combines deep neural networks (DNN), recurrent neural networks (RNN), and convolutional neural networks (CNN). It captures both temporal and spatial dependencies in the reviews, allowing for more accurate suggestion mining; the novelty of our approach lies in its ability to capture the complex relationships between different parts of the reviews, which is critical to accurately identifying suggestions. We evaluate the architecture on the SemEval-2019 Task 9 dataset, a benchmark for suggestion mining that contains reviews from multiple domains. The task comprises two subtasks, A and B, with labelled data for Windows phones from software suggestion forums and from hotel reviews, respectively. In Subtask A, the system is trained and tested in the same domain, while in Subtask B, the system is evaluated using test data from a domain other than the one for which training data is provided. Our approach outperforms existing methods on both subtasks, demonstrating the effectiveness of the architecture. The paper makes the following main contributions:

1. Introduction of a novel ensemble architecture, DRC_Net, which combines deep neural networks (DNN), recurrent neural networks (RNN), and convolutional neural networks (CNN) for suggestion mining in online reviews.
2. Evaluation of the proposed architecture on both Subtasks A and B of the SemEval-2019 Task 9 dataset, a benchmark for suggestion mining containing reviews from various domains.
3. Comparison of the results obtained by the proposed architecture with those of existing studies, giving a comprehensive picture of the performance and effectiveness of DRC_Net.

The rest of the paper is structured as follows: The relevant work in suggestion mining is outlined in Sect. 2, our proposed architecture is described in Sect. 3, our experimental setup is in Sect. 4, findings are shown in Sect. 5, limitations of the work are discussed in Sect. 6, and the paper is concluded with future research possibilities in Sect. 7.

2 Related Work

Early studies in suggestion mining focused on detecting wishes, advice, product defects, and improvements in online reviews. Some of the earliest works involved wish detection from weblogs and online forums [3, 4]. The concept of detecting suggestions in online portals was introduced in 2009 [3], and the term 'suggestion' was later coined by Goldberg et al. [5] in 2013, although their work was limited to suggestions for product improvement. Subsequent studies by Negi [1] explored various types of suggestions, ranging from customer-to-customer suggestions to open-domain suggestion mining. Some researchers have used rule-based approaches, but deep learning has proven more effective for suggestion mining. LSTM, CNN, and their variants are widely used in text classification problems because they capture long-term dependencies and spatial features [6]. Several methods have been applied to extract suggestions from the SemEval-2019 Task 9 dataset. For example, Liu et al. [7] processed it with the random multi-model deep learning (RMDL) method, using GloVe embeddings as input features and choosing the number of layers and nodes of each model at random; the outcome was decided by majority vote. In SemEval-2019 Task 9, Liu et al. [8] proposed an ensemble model that combined BERT with various task-specific modules, including CNN, GRU, and FFA: the BERT model encoded sentence perspectives, while the other modules were stacked over BERT to enhance performance. The proposed model achieved the highest scores in Subtasks A and B, demonstrating the effectiveness of combining different neural network architectures for suggestion mining tasks. Using WordNet hyponyms of the term 'message,' Alekseev et al. [9] evaluated several strategies for labelling unknown inputs and examined zero-shot learning algorithms for text categorization; their strategy was to label content directly as either a suggestion or not, and they verified their work on both SemEval-2019 Task 9 subtasks. Potamias et al. [10] developed a rule-based approach that utilized heuristic, lexical, and syntactic patterns to determine the degree of suggestion content in sentences, with the pattern weights used to rank the sentences; the rule-based classifier was then combined with an R-CNN to enhance performance, and the ensemble model attained the highest rank in Subtask B, indicating state-of-the-art cross-domain performance. In another work of ours [11], a TCN was applied to suggestion mining.

Two word embeddings, BERT and GloVe, were combined to capture both the semantic and the contextual information within a sentence, and the model was evaluated on the SemEval-2019 Subtask A dataset; thanks to the dilation mechanism of the TCN, the proposed model achieved the best results. Ramesh et al. [12] utilized the Subtask A dataset to classify sentences. Their experiment involved feature selection techniques, namely chi-square (CHI2), document frequency difference (DFD), and multivariate relative discriminative criterion (MRDC), to select important features and represent sentence vectors; support vector machine and random forest algorithms were employed for classification, resulting in an accuracy of 83.47% for suggestion mining.

3 Proposed Architecture

In this study, we propose an ensemble model, DRC_Net, for suggestion mining that builds upon the concept of random multimodal deep learning (RMDL) [12]. RMDL incorporates three distinct deep learning architectures: convolutional neural networks (CNNs), recurrent neural networks (RNNs), and deep neural networks (DNNs). The final prediction is obtained by majority voting over the results of the randomly generated models. In the DNN architecture, each layer is directly linked to the layers above and below it, which enables the DNN to handle high-dimensional data with many input features and to learn complex representations of sequential data [13]. The RNN architecture [14] uses prior data points in addition to the current input, so it can handle input sequences of varied lengths and learn and recall long-term dependencies. Finally, the CNN architecture can extract pertinent information from the input: its convolutional layers automatically extract important features from sequential data, and it can capture local patterns and relationships within the input data [15]. Ensemble techniques can reduce variance and bias by combining the predictions of multiple deep learning models. Each individual model in the ensemble may have high variance, but combining their predictions reduces the overall variance. Deep learning models can also suffer from bias, particularly when the training data is limited or skewed towards certain examples; combining the predictions of multiple models, each with different sources of bias, mitigates this [16]. Figure 1 illustrates the relationship between model complexity and bias and variance.

DRC_Net extends the RMDL architecture by randomly generating three models of each type, from which we select the best of each kind. The proposed system architecture is shown in Fig. 2. The ensemble consists of three neural networks: a DNN, an RNN, and a CNN. The DNN has two dense layers with 64 and 32 neurons, respectively, each followed by a dropout layer with a dropout rate of 0.2. The RNN has GRU, LSTM, and GRU layers with 128, 128, and 64 units, respectively, separated by dropout layers with a dropout rate of 0.5.

Fig. 1 Relationship between the model’s complexity with the bias and variance

Fig. 2 System architecture of proposed model

The CNN has one 1D convolutional layer with kernel size 3, followed by a 1D max-pooling layer and a dense layer of ten units. Three types of features are thus extracted by the three networks: complex (nonlinear), temporal, and spatial. We used BERT word embeddings, which have been shown to handle unbalanced classes effectively and to achieve good performance in several text classification tasks, including suggestion mining [17]. The embeddings are generated by the pre-trained DistilBERT model, a compact, quicker, and lighter transformer based on the BERT architecture. Each model receives input from the DistilBERT model, which creates word embeddings with a dimension of 746. The output of each model is produced by a dense layer made up of a single unit with a sigmoid activation function.

Once the outputs of the three models have been concatenated, the concatenated features are sent to the final dense layer, which has two neurons and softmax activation. Instead of using a voting process as RMDL does, the final output is taken from this last dense layer.
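A minimal Keras sketch of the three-branch architecture just described, assuming TensorFlow/Keras (which the paper reports using). The filter count, pooling choices, and sequence length are assumptions not specified in the text; layer types, unit counts, and dropout rates follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN, EMB_DIM = 128, 746  # assumed max length; 746-dim DistilBERT embeddings as reported

inp = layers.Input(shape=(SEQ_LEN, EMB_DIM))

# DNN branch: dense 64 and 32, each followed by dropout 0.2 (pooling assumed).
d = layers.GlobalAveragePooling1D()(inp)
d = layers.Dropout(0.2)(layers.Dense(64, activation="relu")(d))
d = layers.Dropout(0.2)(layers.Dense(32, activation="relu")(d))
d_out = layers.Dense(1, activation="sigmoid")(d)

# RNN branch: GRU(128) -> LSTM(128) -> GRU(64) with dropout 0.5 between layers.
r = layers.Dropout(0.5)(layers.GRU(128, return_sequences=True)(inp))
r = layers.Dropout(0.5)(layers.LSTM(128, return_sequences=True)(r))
r = layers.GRU(64)(r)
r_out = layers.Dense(1, activation="sigmoid")(r)

# CNN branch: Conv1D (kernel 3, 64 filters assumed), max pooling, dense of 10 units.
c = layers.Conv1D(64, kernel_size=3, activation="relu")(inp)
c = layers.MaxPooling1D(pool_size=2)(c)
c = layers.Flatten()(c)
c = layers.Dense(10, activation="relu")(c)
c_out = layers.Dense(1, activation="sigmoid")(c)

# Concatenate the branch outputs; final dense layer with two neurons and softmax.
out = layers.Dense(2, activation="softmax")(layers.Concatenate()([d_out, r_out, c_out]))

model = Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```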

4 Experiments

4.1 Dataset and Pre-processing

The SemEval-2019 Task 9 dataset was used to evaluate the proposed architecture. The training data for Subtask A consists of 8500 sentences, of which 2085 are labelled as suggestions; it also includes a validation set and a test set of 592 and 833 samples, respectively. All sentences come from the UserVoice software suggestion forum for software developers; further details can be found in [6]. The labelled dataset was obtained from a GitHub repository where suggestion sentences were labelled 1 and non-suggestions 0. The dataset is extremely unbalanced, with far fewer suggestion sentences than non-suggestion sentences. Subtask B, on the other hand, considers the cross-domain setting: no training data is available, only validation and test data, so we used the Subtask A training data to train our model for cross-domain validation. The Subtask B data was collected from hotel reviews on the Tripadvisor website. The statistics of the dataset are shown in Table 1.

To prepare the dataset for analysis, several pre-processing steps were undertaken: elimination of tags, emojis, special characters, links, extra spaces, and words repeated in quick succession. Contractions such as 'can't' and 'he'd've,' which the tokenizer does not recognize, were expanded to their full forms. Stemming was then applied to shrink the vocabulary. As an example, the phrase 'my app has a wp7 version and a wp8 version xap in the same submission' would become 'app wp7 version wp8 version xap submission' after stop-word removal, which might not make sense without the stop words; notably, stop words were therefore left in the sentences because they provide critical contextual information.

Table 1 SemEval-2019 Task 9 dataset

Subtasks  | Domain                      | Training | Validation | Test
Subtask A | Software development forums | 8500     | 592        | 833
Subtask B | Hotel reviews               | 0        | 808        | 824


4.2 Experimental Setup

The proposed ensemble model combines the strengths of DNNs, RNNs, and CNNs for suggestion mining and is implemented using Keras with TensorFlow as the backend. The ensemble consists of three individual models, one each for the DNN, RNN, and CNN. The experimental setup includes two approaches: one for same-domain and one for cross-domain validation. Because of the high class imbalance in the dataset, the model is evaluated with metrics based on both positive and negative classes, such as precision, recall, and F1-score. Our architecture is built on top of the DistilBERT model, where the pre-processed data serves as input and the output is the contextual word embedding, a 746-dimensional vector for each word.

Subtask A. All three models are trained on the same dataset. The output of each model is concatenated and sent to a fully connected classification layer, producing a probability score indicating the likelihood that each input sentence represents a suggestion. The model is trained with the Adam optimizer and sparse categorical cross-entropy, with a learning rate of 1e-4 over 25 epochs and a batch size of 50, and then verified on same-domain data.

Subtask B. Since no training data is available for Subtask B, the proposed ensemble model is trained on the Subtask A dataset for ten epochs with a batch size of 50. The model is then validated on the Subtask B validation set, which contains cross-domain data from hotel reviews; this validation improves the model's ability to generalize across domains. After validation, the model is trained for a further ten epochs on the Subtask B validation set and tested on the given test set. The model is optimized with the same hyperparameters as in Subtask A.
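A compact sketch of the two training regimes, continuing `model`, `SEQ_LEN`, and `EMB_DIM` from the earlier listing; the arrays are random stand-ins for the embedded datasets, used only to show shapes and the fit calls.

```python
import numpy as np

def dummy(n):
    """Random stand-ins; real inputs would be the DistilBERT-embedded
    sentences and their 0/1 suggestion labels."""
    return (np.random.rand(n, SEQ_LEN, EMB_DIM).astype("float32"),
            np.random.randint(0, 2, size=n))

X_a, y_a = dummy(200)            # Subtask A training data
X_b_val, y_b_val = dummy(100)    # Subtask B validation data

# Subtask A regime: 25 epochs, batch size 50 (as reported above).
model.fit(X_a, y_a, epochs=25, batch_size=50)

# Subtask B regime: fine-tune on the Subtask B validation set before testing.
model.fit(X_b_val, y_b_val, epochs=10, batch_size=50)
```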

5 Results and Discussion

The proposed ensemble model, DRC_Net, has demonstrated strong performance in both Subtasks A and B, indicating its effectiveness in both same-domain and cross-domain scenarios. Table 2 displays the evaluation metrics for DRC_Net on the test sets. The model completed Subtask A with precision, recall, and F1-score of 0.80; in Subtask B, its precision, recall, and F1-score were 0.88, 0.87, and 0.87, respectively. The high precision and recall, which show that the model performs well on both positive and negative classes, evidence its success on the unbalanced dataset. In addition to evaluating the proposed ensemble model's performance, a comparison was made with previously published studies that utilized the same dataset; Tables 3 and 4 display the comparison results for Subtasks A and B. The proposed model, DRC_Net, was found to outperform RMDL on both subtasks.

Table 2 Results of DRC_Net on SemEval-2019 Task 9

Subtask | Precision | Recall | F1-score
A       | 0.80      | 0.80   | 0.80
B       | 0.88      | 0.87   | 0.87

Table 3 Comparison of DRC_Net with previous studies on Subtask A

Models               | Subtask | F1-score
Alekseev et al. [9]  | A       | 0.78
Liu et al. [7]       | A       | 0.74
Potamias et al. [10] | A       | 0.74
Liu et al. [8]       | A       | 0.78
Usama et al. [11]    | A       | 0.82
DRC_Net              | A       | 0.80

Bold indicates the proposed model's results

This indicates that combining the strengths of the DNN, RNN, and CNN architectures is more effective than simply taking a vote over the individual models' predictions. By concatenating the features extracted from each model, the proposed model captures both spatial and temporal dependencies and thus produces more precise predictions. These results show that on both subtask datasets of SemEval-2019 Task 9, the proposed ensemble model beat numerous state-of-the-art techniques for suggestion mining. Following the completion of trials on Subtasks A and B, the comparison shows that the proposed ensemble model outperforms the other models in terms of F1-score, recall, and precision, demonstrating that it is an effective approach for suggestion mining: it surpasses several deep learning and ensemble models and achieves results competitive with, or better than, the state of the art in both subtasks.

Table 4 Comparison of DRC_Net with previous studies on Subtask B

Models               | Subtask | F1-score
Alekseev et al. [9]  | B       | 0.48
Liu et al. [7]       | B       | 0.76
Potamias et al. [10] | B       | 0.85
Liu et al. [8]       | B       | 0.85
DRC_Net              | B       | 0.87

Bold indicates the proposed model's results


6 Limitations

The dataset used in this research is highly imbalanced and unstructured; applying appropriate balancing techniques, such as oversampling, would help address this issue and improve the model's performance. The base models used in the proposed architecture are selected randomly; while this allows for experimentation, using more specific, targeted models tailored to the task at hand could yield even better results. In this study, only contextual word embeddings from BERT are utilized. Although BERT embeddings are powerful, combining them with traditional word embeddings such as word2vec or GloVe could provide complementary information and enhance the representation of the textual data; exploring different combinations of traditional and contextual word embeddings may lead to improved results. By addressing these limitations, the proposed architecture can achieve better performance and reliability in suggestion mining tasks.

7 Conclusion

In conclusion, our proposed ensemble model DRC_Net leverages the strengths of DNNs, RNNs, and CNNs and uses BERT word embeddings to improve the performance of suggestion mining. By randomly generating three models of each type and selecting the best of each kind, we aim to achieve better performance than any single model. The proposed ensemble model has shown promising results in accurately identifying suggestions in online reviews: by combining the strengths of DNNs, RNNs, and CNNs, it learns complex representations of sequential data and captures local patterns and relationships within the input data, resulting in improved accuracy and F1-score compared with related studies. The model's strong cross-domain validation performance suggests that it can be applied to various domains and datasets. Additionally, our analysis of feature importance revealed that extracting spatial and temporal features together with contextual information played a crucial role in deciding whether or not a sentence is a suggestion. Our research highlights the effectiveness of ensemble models in improving the accuracy and robustness of deep learning models for suggestion mining tasks; however, further research is needed to explore the full potential and limitations of the proposed model. In the future, several directions could improve the proposed DRC_Net ensemble model. One is to explore different ensemble techniques, such as bagging, stacking, and boosting, to further improve the accuracy and robustness of the model. Another is to apply DRC_Net in an open-domain setting, evaluating it on a broader range of datasets and tasks to assess its effectiveness in a more general context. By exploring these avenues, researchers can continue to advance the state of the art in the field of suggestion mining and improve the performance of models like DRC_Net.

References

1. Negi S (2019) Suggestion mining from text. Dissertation, National University of Ireland, Galway
2. Negi S, Buitelaar P (2017) Suggestion mining from opinionated text. In: Sentiment analysis in social networks, pp 129–139
3. Goldberg AB, Fillmore N, Andrzejewski D, Xu Z, Gibson B, Zhu X (2009) May all your wishes come true: a study of wishes and how to recognize them. In: Proceedings of human language technologies: the 2009 annual conference of the North American chapter of the association for computational linguistics, pp 263–271
4. Ramanand J, Bhavsar K, Pedanekar N (2010) Wishful thinking: finding suggestions and 'buy' wishes from product reviews. In: Proceedings of the NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text, pp 54–61
5. Brun C, Hagege C (2013) Suggestion mining: detecting suggestions for improvement in users' comments. Res Comput Sci 70(79):171–181
6. Negi S, Daudert T, Buitelaar P (2019) SemEval-2019 task 9: suggestion mining from online reviews and forums. In: Proceedings of the 13th international workshop on semantic evaluation, pp 877–887
7. Liu F, Wang L, Zhu X, Wang D (2019) Suggestion mining from online reviews using random multimodel deep learning. In: 2019 18th IEEE international conference on machine learning and applications (ICMLA). IEEE, pp 667–672
8. Liu J, Wang S, Sun Y (2019) OleNet at SemEval-2019 task 9: BERT based multi-perspective models for suggestion mining. In: Proceedings of the 13th international workshop on semantic evaluation, pp 1231–1236
9. Alekseev A, Tutubalina E, Kwon S, Nikolenko S (2021) Near-zero-shot suggestion mining with a little help from WordNet. In: Analysis of images, social networks and texts: 10th international conference, AIST 2021, Tbilisi, Georgia, 16–18 Dec 2021
10. Potamias RA, Neofytou A, Siolas G (2019) NTUA-ISLab at SemEval-2019 task 9: mining suggestions in the wild. In: Proceedings of the 13th international workshop on semantic evaluation, pp 1224–1230
11. Rashidullah Khan UB, Akhtar N, Kidwai UT, Siddiqui GA (2022) Suggestion mining from online reviews using temporal convolutional network. J Discrete Math Sci Cryptogr 25(7):2101–2110
12. Ramesh A, Reddy KP, Sreenivas M, Upendar P (2022) Feature selection technique-based approach for suggestion mining. In: Evolution in computational intelligence: proceedings of the 9th international conference on frontiers in intelligent computing: theory and applications (FICTA 2021). Springer, Singapore, pp 541–549
13. Kowsari K, Brown DE, Heidarysafa M, Meimandi KJ, Gerber MS, Barnes LE (2017) HDLTex: hierarchical deep learning for text classification. In: 2017 ICMLA. IEEE, pp 364–371
14. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
15. Zhang Y, Wallace B (2015) A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv:1510.03820
16. Pretorius A, Bierman S, Steel SJ (2016) A bias-variance analysis of ensemble learning for classification. In: Annual proceedings of the South African statistical association conference, vol 2016, no con-1, pp 57–64
17. Madabushi HT, Kochkina E, Castelle M (2020) Cost-sensitive BERT for generalisable sentence classification with imbalanced data. arXiv:2003.11563

A CNN-Based Self-attentive Approach to Knowledge Tracing Anasuya Mithra Parthaje, Akaash Nidhiss Pandian, and Bindu Verma

Abstract A key component of modern education is personalized learning, which involves the difficult challenge of precisely measuring knowledge acquisition. Deep knowledge tracing (DKT) is a recurrent neural network-based method used for this problem; however, its predictions can be inaccurate. To address this, a convolutional neural network (CNN) is leveraged to model the DKT problem better. Our model identifies the applicable knowledge concepts (KCs) and predicts the student's conceptual mastery based on them. SAKT, a self-attention method that successfully manages data sparsity (sparse data representation), is introduced here. Our model outperforms modern knowledge tracing models, improving the average AUC by 5.56% in experiments on real-world datasets. This performance breakthrough has positive ramifications for personalized learning, improving the accuracy and efficacy of knowledge acquisition and ultimately resulting in more specialized and productive educational settings. Keywords DKT · CNN · AUC · SAKT

1 Introduction Online education has leveraged MOOCs and intelligent tutoring platforms to offer courses and exercises. Self-attentive techniques and data mining tools help forecast student performance based on KCs, which can represent an exercise, skill or concept [1]. Knowledge tracing is key to developing personalized learning that recognizes student strengths and weaknesses. Modelling students’ knowledge state over time is a challenging task due to the complexity of the human brain [2]. Knowledge tracing A. M. Parthaje (B) · A. N. Pandian · B. Verma Delhi Technological University, New Delhi, India e-mail: [email protected] B. Verma e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_6

77

78

A. M. Parthaje et al.

is a supervised sequence learning task formalized as predicting the next interaction, represented by X = (x 1 , x 2 , x 3 , …) and x 1 + 1. To enable students to adjust their practice, it is crucial for a system to identify their strengths and weaknesses. This can also help teachers and system creators suggest appropriate instructional materials and exercises. We represent the interaction between questions and answers as x t = (et , r t ), where the student attempts exercise et at timestamp t and indicates whether it is correct. In KT, the target is to predict the student’s ability to accurately respond to the following exercise, i.e. forecast p(r t+1 = 1 et+1 , X). Knowledge tracing tasks aim to predict aspects of a student’s interaction x 0 , …, x t based on their prior interactions in a given learning task [3]. In order to forecast the student’s performance based on that performance, this research suggests a method for extracting pertinent KCs from prior encounters. As KCs, we predicted how students would do on tasks using exercises. SAKT gives the previously completed exercises weights when estimating the student’s performance on a specific exercise. Across all datasets, SAKT outperforms cutting-edge KT techniques on average by 4.43% on the AUC. Due to the key element of SAKT, self-attention, our model is also orders of magnitude faster than RNN-based models. The rest of this work is structured as follows. In Sect. 2, modelling strategies applied to students are examined. Section 3 presents the suggested model. The datasets we used for our studies are described in Sect. 4, along with the findings. This paper’s conclusion and future study directions are covered in Sect. 5.

2 Related Works

Within machine learning (ML) research, the area of deep learning (DL) has only recently begun to gain momentum. ML systems use DL algorithms to discover multiple levels of representation, and they have shown empirical success in AI applications such as NLP and computer vision. One line of work proposes a knowledge tracing model called deep knowledge tracing with dynamic student classification (DKT-DSC): at every time step, the model divides students into groups based on their capacity for learning. The DKT architecture utilizes recurrent neural networks (RNNs) to process this information and predict student performance [4]. Classifying students via long-term memories improves knowledge tracing with RNNs, which are among the best approaches to knowledge tracing today. Recent deep learning models such as deep knowledge tracing (DKT) [5] summarize the student's knowledge state using RNNs, while dynamic key-value memory networks (DKVMNs) [5] exploit memory-augmented neural networks (MANs): the learning algorithm captures the correlation between a student's knowledge state and the underlying knowledge content using two matrices, a key and a value. The parameters of the DKT model are difficult to interpret [6]; DKVMN is easier to comprehend because it explicitly maintains a key-value representation matrix and a KC representation matrix.

Fig. 1 An encoder/decoder connected by attention

All of these deep learning models rely on RNNs and therefore cannot generalize well to sparse data [7]. The Transformer [8] is a method based on a pure attention mechanism. The skills that students learn in the KT task are interconnected, and how well a student succeeds on a specific exercise depends on performance on earlier, similar activities. This research proposes a technique for predicting student performance on exercises by extracting relevant KCs from past interactions. SAKT, which assigns weights to previously completed exercises, outperforms the best KT methods by 4.43% on average across all datasets and, because it uses self-attention, it is faster than RNN-based models. The encoder converts data into features, and the decoder produces context vectors that are interpreted by the deep CNN model shown in Fig. 1. LSTM networks summarize input sequences in internal state vectors, which are fed into the first cell of the decoder network.

3 Proposed Method

By analysing a student's previous interactions, we predict whether he/she will provide the correct response to the next exercise e_{t+1}. Figure 3 illustrates how the problem can be transformed into a sequential model. The model takes as input the interactions x_1, x_2, x_3, …, x_{t-1} together with the exercise sequence one position ahead, e_1, e_2, e_3, …, e_t; the correct responses to these exercises are the output. The inputs are converted to one-hot embeddings, a feature extractor picks out the relevant data features, and the outputs are labelled with the correctness of the responses. The attention is multi-head, learning relevant information in various representative sub-spaces.


3.1 Architecture for the Proposed Model

3.1.1 Embedding Layer

As a first step, the tuples containing questions and answers are converted from one-hot encodings into embeddings of real-valued vectors, as shown in Fig. 2. The model processes the input sequence X = (e_1, e_2, e_3, …, e_t) into outputs y = (y_1, y_2, y_3, …, y_t), where n is the maximum length the model can process. Because the model works with fixed-length sequence inputs, if t is less than n, we repeatedly pad the sequence on the left with a question-answer pair. Figure 3 depicts the model's general structure. When t exceeds n, the sequence is divided into subsequences of length n, and the model relies on all of these subsequences as inputs. The interaction embedding matrix, which has latent dimension d, is trained so that for each element of the sequence an embedding M_{s_i} is obtained. A similar calculation is done for the exercises, so that every exercise e_i is embedded along the same lines. We added residual connections after the self-attention layer as well as the feed-forward layer in order to train a deeper network structure [9]. In addition, we applied layer normalization and dropout to each layer.
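To make the embedding and padding step concrete, the following is a minimal sketch in PyTorch; the module name, the encoding of an interaction (e_t, r_t) as the single id e_t + r_t * E, and all dimensions are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

E, n, d = 100, 50, 64  # number of exercises, max sequence length, latent dim (assumed)

class InteractionEmbedding(nn.Module):
    def __init__(self, num_exercises, dim):
        super().__init__()
        # one id per (exercise, response) pair, plus id 0 reserved for padding
        self.interaction = nn.Embedding(2 * num_exercises + 1, dim, padding_idx=0)
        self.exercise = nn.Embedding(num_exercises + 1, dim, padding_idx=0)

    def forward(self, exercises, responses):
        # encode the tuple (e_t, r_t) as a single id: e_t + r_t * E
        ids = exercises + responses * E
        return self.interaction(ids), self.exercise(exercises)

def left_pad(seq, max_len, pad_value=0):
    # repeatedly pad on the left so every input has fixed length n
    return [pad_value] * (max_len - len(seq)) + seq

emb = InteractionEmbedding(E, d)
ex = torch.tensor([left_pad([3, 7, 12, 9], n)])   # exercise ids (1-indexed)
rs = torch.tensor([left_pad([1, 0, 1, 1], n)])    # correctness of each response
m_si, e_i = emb(ex, rs)                           # each of shape (1, n, d)
```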

Fig. 2 a The SAKT network [12] estimates attention weights of previous elements at each timestamp by extracting keys, values and queries from the embedding layer. An attention weight is determined for the query element and its corresponding key elements x_{t,j}. b The embedding layer embeds the current question at timestamp t + 1 together with the student's past interactions, the prior interactions appearing as interaction embeddings in the key and value space [12]


Fig. 3 Proposed KTA model

3.1.2 Prediction Layer

Using the data on whether students were able to answer exercises correctly or not, we create a prediction layer by passing the learned representation F through a fully connected network with Sigmoid activation, as shown in Eq. (1):

p = \sigma(F W + b)    (1)

Here, p represents the likelihood that the student answers exercise e_n correctly, and \sigma(z) = 1/(1 + e^{-z}).
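As a concrete illustration of Eq. (1), the prediction head reduces to a single linear layer followed by a sigmoid; the latent dimension below is an assumption.

```python
import torch
import torch.nn as nn

d = 64                                  # latent dimension (assumed)
head = nn.Linear(d, 1)                  # learns W and b of Eq. (1)

F = torch.randn(1, 50, d)               # learned representation from the feature extractor
p = torch.sigmoid(head(F)).squeeze(-1)  # (1, 50): probability of a correct answer per step
```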

3.1.3 Network Training

Due to the fixed sequence length of the self-attention model, every input sequence X = (x_1, x_2, …, x_t) is converted before being fed to our model, knowledge tracing with attention (KTA). If a sequence is of variable length, the model repeatedly pads X on the left when it is shorter than l, and partitions X into subsequences of length l when it exceeds l. Training minimizes the negative log-likelihood of the observed sequence, i.e. the cross-entropy loss between the predictions p and the observed responses r, as presented in Eq. (2):

L = -\sum_{i} \left[ r_i \log(p_i) + (1 - r_i) \log(1 - p_i) \right]    (2)
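Equation (2) is the standard binary cross-entropy; a minimal sketch with illustrative tensors is:

```python
import torch
import torch.nn as nn

# BCELoss computes the mean of -[r*log(p) + (1-r)*log(1-p)], matching Eq. (2).
loss_fn = nn.BCELoss()

p = torch.tensor([0.9, 0.2, 0.7])   # model predictions for three interactions
r = torch.tensor([1.0, 0.0, 1.0])   # observed correctness labels
print(loss_fn(p, r).item())
```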

3.1.4 Feature Extraction

When the features of a new sample are learned through a previously trained network, those representations are used for feature extraction in neural networks; a newly trained classifier is then applied to the features. The embedded vectors are fed into the feature extractor to capture latent dependencies between inputs. The feature extractor is made up of N identical blocks, each containing two sub-layers. The first is a multi-head self-attention mechanism [8]: based on how similar each item in the input sequence is to the others, the global relationship is extracted using scaled dot-product attention [8]. The model calculates attention h times in parallel, which is known as multi-head attention because it obtains relevant information from different representative sub-spaces.
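The scaled dot-product attention named above can be sketched as follows; the causal mask and all shapes are illustrative assumptions consistent with a knowledge-tracing setting, where a timestamp may only attend to earlier interactions.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # pairwise similarity
    if mask is not None:
        # causal mask: position t may only attend to positions <= t
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

n, d = 50, 64
q = k = v = torch.randn(1, n, d)
causal = torch.tril(torch.ones(n, n))                   # lower-triangular mask
out, attn = scaled_dot_product_attention(q, k, v, causal)
```

Multi-head attention simply runs h such computations in parallel on learned projections of q, k, and v and concatenates the results.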

3.1.5 Position Encoding

The position encoding layer of the self-attention neural network allows us to encode the sequence order, much as a convolutional or recurrent neural network does implicitly. This layer is particularly important for this problem because a student's knowledge depends on the order of interactions and progressively changes with time, and a knowledge state should not exhibit abrupt transitions at a particular time instance [5]. A position embedding matrix P \in R^{n×d} is learned during training; the i-th row of P is added to the interaction embedding of the i-th element of the interaction sequence.
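A learned position embedding can be sketched in a few lines; the dimensions are the same illustrative ones used above.

```python
import torch
import torch.nn as nn

n, d = 50, 64
position = nn.Embedding(n, d)              # learned matrix P in R^{n x d}

x = torch.randn(1, n, d)                   # interaction embeddings
idx = torch.arange(n).unsqueeze(0)         # positions 0 .. n-1
x = x + position(idx)                      # row i of P is added to element i
```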

3.1.6 Prediction and Loss

The final decision is made using a Sigmoid function in the prediction stage; the prediction and optimization processes are not elaborated further here [5]. The embedding layer outputs the embedded interaction input matrix and the embedded exercise matrix E.

4 Experimentations

4.1 Dataset

A synthetic dataset and four real-world datasets produced from the ASSISTments online tutoring platform were used to evaluate the proposed model. One dataset contains 4,417 students with 328,291 interactions covering 124 skills. The prediction results are visualized using some of the students


Table 1 Ablation study

Architecture | Statics | ASSISTment 2015 | ASSISTment 2009 | Synthetic
Block        | 0.819   | 0.822           | 0.837           | 0.826
2 block      | 0.845   | 0.853           | 0.840           | 0.827
No PE        | 0.832   | 0.849           | 0.842           | 0.827
No RC        | 0.834   | 0.857           | 0.847           | 0.823
Dropout      | 0.840   | 0.851           | 0.845           | 0.832
Single       | 0.85    | 0.845           | 0.828           | 0.823
Predict      | 0.854   | 0.854           | 0.853           | 0.836

Bold in the original table marks the proposed algorithm's results

in this dataset [9]. The ASSIST2015 dataset of 19,917 student responses for 100 skills contains a total of 708,631 question-answering exchanges; because of its larger number of students, it contains fewer interactions per skill and student than ASSIST2009. The ASSISTment Challenge (ASSISTChall) dataset was made available for the 2017 ASSISTment data mining competition; it has 102 skills, 942,816 interactions and 686 students, giving it a higher average record count per student, and with a density of 0.81 it is the densest dataset used. The synthetic dataset was generated by simulating the answering trajectories of 4,000 virtual students, each given a set of 50 exercises of varying difficulty drawn from five virtual concepts. According to Table 1, our proposed model achieves excellent results on all datasets except Simulated-5. On ASSIST2015, KTA exceeds DKT+ by 10%. Compared with other models, our model also achieves notable improvements in the F1 score.

4.2 Evaluation Methodology

A binary classification setting is used for the prediction task, i.e. correct versus incorrect answers to exercises, and the area under the curve (AUC) metric is used to compare performance. This paper compares our proposed KT model with DKT [10], DKT+ [5] and DKVMN [12], the state-of-the-art KT methods described in the introduction. The model was trained on 80% of the dataset and then tested on the remaining 20%. Hidden state dimensions d = 50, 100, 150 and 200 were used for the proposed method, and the hyperparameters reported in the competing papers were used for the baselines, with the same procedure for weight initialization and optimization. TensorFlow was used for the SAKT implementation with the ADAM optimizer. The ASSISTChall dataset was processed in batches of 256 and the other datasets in batches of 128. A dropout rate of 0.2 was used both for the datasets with more records, such as ASSIST2015 and ASSISTChall, and for the rest. The maximum sequence length n was set in proportion to the number of exercise tags per student: ASSISTChall and STATICS use n = 500, ASSIST2009 uses n = 100, and the synthetic and ASSIST2015 datasets use n = 50. Attention-based knowledge tracing also lends itself to interpretable analyses.
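A hedged sketch of this evaluation protocol (80/20 split, AUC over binary correctness) using scikit-learn; the arrays are placeholders for real model predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder interaction-level data: in practice `probs` would come from the
# trained KTA model rather than a random generator.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)   # observed correctness r
probs = rng.random(1000)                 # predicted p(r = 1)

# 80/20 split of the interactions; the model is trained on the first portion.
idx_train, idx_test = train_test_split(np.arange(1000), test_size=0.2,
                                       random_state=0)
print("test AUC:", roc_auc_score(labels[idx_test], probs[idx_test]))
```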

5 Results and Discussion

There may be drawbacks to the proposed CNN-based self-attentive knowledge tracing method. Given the complexity of CNNs and self-attention mechanisms, it can be difficult to comprehend which features the model considers most influential. Scalability and generalisability may be issues, and the model's performance may be significantly influenced by the calibre and volume of training data. Additional factors to consider are the ability to capture intricate sequential connections and a possible overemphasis on exercise order. Furthermore, the datasets used with our model lack lengthy sequences, so the advantage of capturing long sequences is not exploited: all questions appear once and have the same length, so data dependencies are low.

SAKT outperforms competing approaches by 3.16% on ASSIST2009 and 15.87% on ASSISTment2015 due to its attention mechanism. The proposed method performs similarly to DKT on ASSISTChall but better by 2.16% on STATICS2011. Visualizing attention weights helps identify the exercises pertinent to a prediction, so we calculate key and query attention weights across all sequences, normalize the attention layer weights, and let each element of the resulting relevance matrix represent the influence of a relevant exercise. The synthetic dataset is analysed because its hidden concepts are known. Without self-attention, only the previous exercise affects subsequent exercise predictions, and the default architecture performs significantly worse without attention blocks. Additional self-attention blocks increase the number of model parameters, but in our case this did not improve performance and complicated the model (Table 1). Residual connections have little impact on model performance; removing them even improves performance on the ASSISTment 2015 dataset. To regularize the model, especially with smaller datasets, we use dropout. Multiple attention heads capture different sub-spaces, and using just one head results in worse performance on all datasets. GPU training is significantly faster for SAKT (1.4 s/epoch) than for DKT+ (65 s/epoch), DKT (45 s/epoch) and DKVMN (26 s/epoch), as seen in Fig. 4; an NVIDIA Titan V GPU was used for the experiments.


Fig. 4 Training and testing efficiency

6 Conclusion and Future Work

In this paper, by examining the pertinent exercises from a student's prior interactions, we predict the student's performance on the next exercise from his/her interaction history, without using any RNNs. To capture global dependency relationships directly, the KTA model computes the similarity between input items regardless of their distance in the sequence. We have extensively tested our model on multiple real-world datasets and find that it offers better predictions than existing models while being faster than RNN-based methods by an order of magnitude.

References

1. Self J (1990) Theoretical foundations for intelligent tutoring systems. J Artif Intell Educ 1(4):3–14
2. Chang H-S, Hsu H-J, Chen K-T (2015) Modeling exercise relationships in e-learning: a unified approach. In: Proceedings of the 8th international conference on educational data mining (EDM), pp 247–254
3. Corbett AT, Anderson JR (1994) Knowledge tracing: modeling the acquisition of procedural knowledge. User Model User-Adap Interact 4:253–278
4. Minn S, Lee JY, Yoon B, Rajendran J (2018) Deep knowledge tracing and dynamic student classification for knowledge tracing. In: 2018 IEEE international conference on data mining (ICDM), pp 933–938
5. Yeung C-K, Yeung D-Y (2018) Addressing two problems in deep knowledge tracing via prediction-consistent regularization. In: Proceedings of the fifth annual ACM conference on learning at scale, pp 97–106
6. Khajah M, Lindsey RV, Mozer MC (2016) How deep is knowledge tracing? arXiv:1604.02416
7. Kang W-C, McAuley J (2018) Self-attentive sequential recommendation. In: 2018 IEEE international conference on data mining (ICDM), pp 197–206
8. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
9. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
10. Xiong X, Lange K, Xing EP (2016) Going deeper with deep knowledge tracing. In: International educational data mining society, pp 231–238
11. Poole B, Lahiri S, Raghu M, Sohl-Dickstein J, Ganguli S (2016) Exponential expressivity in deep neural networks through transient chaos. In: Advances in neural information processing systems, pp 3360–3368
12. Zhang J, Liu Y, Zhang L, Xu H (2017) Dynamic key-value memory networks for knowledge tracing. In: Proceedings of the 26th international conference on World Wide Web, pp 765–774

LIPFCM: Linear Interpolation-Based Possibilistic Fuzzy C-Means Clustering Imputation Method for Handling Incomplete Data Jyoti, Jaspreeti Singh, and Anjana Gosain

Abstract Dealing with missing values has been a major obstacle in machine learning. The occurrence of missing data is a significant problem that often results in a noticeable reduction in data quality. Therefore, effective handling of missing data is essential. This paper introduces a missing value imputation approach that utilizes possibilistic fuzzy c-means clustering and proposes a method called LIPFCM by combining the advantages of linear interpolation and fuzzy clustering techniques. The performance of the LIPFCM method is compared with five state-of-the-art imputation techniques using four commonly used real-world datasets from the UCI repository. The experimental results indicate that our proposed method performs significantly better than the existing imputation methods based on RMSE and MAE for these datasets. Furthermore, the robustness of the proposed approach has been experimentally validated on different missing ratios to analyze the impact of missing values. Keywords Missing values · Imputation · LI · PFCM · Incomplete data · Fuzzy clustering · MVI · LIPFCM

1 Introduction

Recently, missing values are emerging as a significant concern in the field of data mining. It may occur in a database due to several reasons like lack of data, tools defect, unresponsiveness, data entry errors, undetermined value, tools scarcity, data

inconsistency, shortage of time, etc. [1–3]. Therefore, the decision-making process turns out to be ineffective, since a comprehensive understanding of the data is unavailable. Missing values leading to incomplete data can result in reduced statistical power and biased outcomes [4, 5]. Furthermore, incomplete data may lead to incorrect conclusions during the analysis of a study. Thus, the treatment of missing values is crucial to attain accurate and reliable outcomes.

Initially, the data may not be clean, and duplication of some information may be observed, which can further reduce the quality of data and make accurate results harder to achieve. Therefore, data pre-processing needs to be considered for the enhancement of data quality. During the pre-processing stage, missing values, outliers, and noisy values are treated, which helps in recovering from data impurities so that the data can be used for further analysis. A very simple method to tackle incomplete data is simply deleting the records with missing values [2, 6], but it is only applicable when the number of affected observations is very small; a significant data loss results if the size of the data is large, which in turn may reduce the performance and usefulness of the data [7]. For this reason, it is considered the worst method for handling missing data. Thus, it becomes essential to impute missing values rather than simply deleting them to ensure complete and accurate data.

Completion of incomplete datasets using missing value imputation (MVI) techniques is done by imputing values in place of the missing ones [1, 8]. MVI techniques can be categorized into two groups: single imputation and multiple imputation [1, 6, 9]. Single imputation (SI) involves estimating only one value for missing data [9], while multiple imputation (MI) techniques allow the use of more than one estimated value [1, 6]. Among the various MI methods, the most efficient is the fuzzy clustering imputation method [2, 3]. The linear interpolation (LI) method is a simple method that uses correlation between variables [10, 11]. Possibilistic fuzzy c-means (PFCM) deals with noise and outliers better than the fuzzy c-means (FCM) and possibilistic c-means (PCM) fuzzy clustering methods [12]. In this paper, the merits of LI (an SI technique) and PFCM (an MI technique) for missing value imputation have motivated a new hybrid imputation method called LIPFCM for dealing with incomplete data. The main contribution of this research is as follows:

• A new approach, LIPFCM, has been proposed for the imputation of missing values.
• It is experimentally validated that the proposed LIPFCM approach shows better imputation performance in comparison with other state-of-the-art imputation methods.
• The proposed method shows its robustness to different percentages of missing values.

The rest of the paper is organized as follows. Section 2 briefly reviews the related work. The proposed LIPFCM imputation method is described in Sect. 3. The experimental framework is explained in Sect. 4. Section 5 illustrates the experimental results


and discussion, comparing LIPFCM with other imputation methods. Section 6 concludes the paper and discusses future work.

2 Related Work

Researchers have conducted a number of surveys addressing the effect of missing value imputation methods on performance measures [13–16]. Apart from these, many researchers have proposed different imputation techniques based on fuzzy clustering for handling missing values. Based on FCM algorithms, four different strategies for incomplete data were first proposed by Hathaway and Bezdek [17], namely the optimal completion strategy (OCS), whole data strategy (WDS), nearest prototype strategy (NPS), and partial data strategy (PDS). A kernel-based fuzzy clustering approach, commonly known as kernel fuzzy c-means (KFCM), was introduced by Zhang et al. [18] for clustering incomplete data. The fuzzy possibilistic c-means algorithm was optimized using Support Vector Regression and a Genetic Algorithm (SVRGA) to minimize error while imputing missing values [19]. Several researchers exploited attribute type [20], attribute weight [21], and attribute correlation [22] for imputing missing values and clustering incomplete data. Hu et al. introduced an approach that reduces the influence of outliers using similarity measures for processing incomplete data [23]. Based on fuzzy c-means, an alternative data-filling approach for the incomplete soft set (ADFIS) was proposed in [24]; experiments on four real datasets concluded that ADFIS performs better than the other methods in terms of accuracy. Hybrid fuzzy c-means and majority-vote techniques were proposed to achieve high accuracy in predicting missing values of microarray data, which is typically used in the medical field [25]; the performance of the proposed technique was compared against three imputation methods, namely zero, kNN, and FCM, on four real datasets, and it proved to be the best. In [26], an advanced and enhanced FCM algorithm for health care was suggested.

3 Proposed Methodology

LI imputation is a simple yet effective way of imputing missing values, as it uses the correlation between variables to predict them [10, 11]. PFCM handles noise and outliers better than the FCM and PCM fuzzy clustering methods when handling incomplete data [12]. In this paper, a method called linear interpolation-based possibilistic fuzzy c-means (LIPFCM) has been proposed for the imputation of missing values, which


exploits a fuzzy clustering approach and a linear interpolation algorithm. The flowchart of the proposed method is shown in Fig. 1. The steps of the proposed method are discussed below.

Step 1. Missing values simulation

From a whole dataset D, an incomplete dataset D_f is generated by randomly removing a certain percentage of values subject to the following constraints:

1. Each attribute must contain at least one value.
2. Each record must contain at least one value.

Step 2. Apply LI imputation to complete the data

The LI imputation method is applied to complete the dataset [10]. LI fits a straight line between two datapoints and uses the straight-line equation to impute missing values. Let y_1, y_2, and y_3 be three datapoints with coordinates (y_{11}, y_{12}), (y_{21}, y_{22}), and (y_{31}, y_{32}), respectively. If datapoint y_2 has a missing value, the missing value y_{22} at y_{21} is computed using Eq. (1):

y_{22} = y_{12} + \frac{y_{32} - y_{12}}{y_{31} - y_{11}} (y_{21} - y_{11})    (1)
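A minimal sketch of Eq. (1) for a single feature column, using NumPy's np.interp to fit the straight line between the nearest known neighbours; the function name and data are illustrative.

```python
import numpy as np

def linear_interpolate(x, y):
    """x: positions; y: values with np.nan marking missing entries."""
    y = y.copy()
    known = ~np.isnan(y)
    # np.interp evaluates the straight line of Eq. (1) between surrounding points
    y[~known] = np.interp(x[~known], x[known], y[known])
    return y

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([4.0, np.nan, 8.0, 10.0])
print(linear_interpolate(x, y))   # -> [ 4.  6.  8. 10.]
```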

Step 3. Apply PFCM on the complete dataset

The PFCM algorithm groups the data items of a dataset X = {x_1, x_2, …, x_k} into fuzzy clusters v = {v_1, v_2, …, v_c} by minimizing the objective function

Fig. 1 Flowchart depicting the step-by-step process of the proposed LIPFCM method


shown in Eq. (2) [12]:

J_{PFCM} = \sum_{j=1}^{c} \sum_{i=1}^{k} \left( a u_{ij}^{m} + b t_{ij}^{\eta} \right) d_{ij}^{2} + \sum_{j=1}^{c} \gamma_j \sum_{i=1}^{k} \left( 1 - t_{ij} \right)^{\eta}    (2)

where c, m, and k represent the number of clusters, the fuzzifier (a constant), and the number of data items, respectively, and u_{ij} and t_{ij} denote the fuzzy and possibilistic memberships of datapoint x_i in cluster j, updated as shown in Eqs. (4) and (5). The fuzzy memberships satisfy

\sum_{j=1}^{c} u_{ij} = 1, \quad u_{ij} \in [0, 1], \quad i = 1, 2, …, k.    (3)

d_{ij} denotes the Euclidean distance between datapoint x_i and the cluster center v_j (d_{ij} = ||x_i - v_j||). Each membership is updated as follows:

u_{ij} = \left[ \sum_{s=1}^{c} \left( \frac{d_{ji}}{d_{si}} \right)^{2/(m-1)} \right]^{-1} \quad \forall i, j,    (4)

t_{ij} = \frac{1}{1 + \left( \frac{b}{\gamma_j} d_{ij}^{2} \right)^{1/(\eta - 1)}}.    (5)

The cluster centers are updated using Eq. (6):

v_j = \frac{\sum_{i=1}^{k} (a u_{ij}^{m} + b t_{ij}^{\eta}) x_i}{\sum_{i=1}^{k} (a u_{ij}^{m} + b t_{ij}^{\eta})}.    (6)

Step 4. Update the previously imputed value with a new value

In this step, each previously imputed missing feature is updated with a new value. Let x_{lp} be the missing p-th feature value of the l-th record in dataset X; it is recalculated from the cluster centers using Eq. (7):

x_{lp} = \frac{\sum_{i=1}^{c} (a u_{li}^{m} + b t_{li}^{\eta}) v_{ip}}{\sum_{i=1}^{c} (a u_{li}^{m} + b t_{li}^{\eta})}.    (7)

Step 5. Stopping criteria

The stopping criterion [27, 28] is specified as ||v^{(k+1)} - v^{(k)}|| < \varepsilon, where \varepsilon = 0.00001.
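Steps 3-5 can be sketched as the following PFCM update loop over a completed matrix X; the parameter values, the initialization, and the helper name are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def pfcm(X, c=3, a=1.0, b=1.0, m=2.0, eta=2.0, eps=1e-5, max_iter=100):
    rng = np.random.default_rng(0)
    V = X[rng.choice(len(X), c, replace=False)]          # initial cluster centers
    gamma = np.ones(c)                                   # illustrative; often set from data
    for _ in range(max_iter):
        D = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U = 1.0 / np.sum((D[:, :, None] / D[:, None, :]) ** (2/(m-1)), axis=2)  # Eq. (4)
        T = 1.0 / (1.0 + ((b / gamma) * D**2) ** (1/(eta-1)))                   # Eq. (5)
        W = a * U**m + b * T**eta
        V_new = (W.T @ X) / W.sum(axis=0)[:, None]                              # Eq. (6)
        if np.linalg.norm(V_new - V) < eps:                                     # Step 5
            return U, T, V_new
        V = V_new
    return U, T, V

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
U, T, V = pfcm(X)
print(V)
```

Step 4 would then recompute each missing entry x_{lp} as the membership-weighted average of the cluster-center values, per Eq. (7).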

Table 1 Description of datasets

Name  | No. of records | No. of attributes | No. of class
Iris  | 150            | 4                 | 3
Glass | 214            | 9                 | 2
Seed  | 210            | 7                 | 3
Wine  | 178            | 13                | 3

4 Experimental Framework

4.1 Dataset Description

The performance of the proposed LIPFCM is assessed on four well-known, widely used datasets: Iris, Glass, Seed, and Wine, obtained from the UCI machine learning repository [29]. These datasets differ from each other in size and variables. A brief description of the datasets is presented in Table 1.

4.2 Missing Value Simulation

A specific portion of the data in each dataset is removed randomly to create missing data for experimentation. Missing ratios of 1, 3, 5, 7, and 10% of missing values are injected into each original dataset for performance comparison. The simulated missing values are then imputed using the imputation methods mean, LI, KNNI, FKMI, LIFCM, and our proposed LIPFCM.
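An illustrative sketch of this simulation step, enforcing the two constraints from Sect. 3 (at least one value per record and per attribute); the function is an assumption for illustration.

```python
import numpy as np

def inject_missing(X, ratio, seed=0):
    rng = np.random.default_rng(seed)
    Xm = X.astype(float).copy()
    n_cells = int(round(ratio * X.size))
    candidates = list(np.ndindex(X.shape))
    rng.shuffle(candidates)
    removed = 0
    for r, c in candidates:
        if removed == n_cells:
            break
        row_ok = np.sum(~np.isnan(Xm[r, :])) > 1   # keep one value per record
        col_ok = np.sum(~np.isnan(Xm[:, c])) > 1   # keep one value per attribute
        if row_ok and col_ok:
            Xm[r, c] = np.nan
            removed += 1
    return Xm

X = np.arange(20.0).reshape(5, 4)
print(inject_missing(X, ratio=0.10))
```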

4.3 Evaluation Criteria

The performance of LIPFCM is compared using two well-known evaluation criteria, root mean squared error (RMSE) and mean absolute error (MAE). If N is the number of artificially generated missing values, and A_i and P_i (1 ≤ i ≤ N) are the actual and predicted values of the i-th missing value, respectively, then RMSE and MAE measure the average difference between actual and predicted values [30, 31], as given in Eqs. (8) and (9). These values range from 0 to ∞, where a smaller value denotes better imputation performance.

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (P_i - A_i)^2}    (8)

MAE = \frac{1}{N} \sum_{i=1}^{N} |P_i - A_i|    (9)
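Both criteria are direct to compute; a small sketch of Eqs. (8) and (9) with illustrative values:

```python
import numpy as np

def rmse(actual, predicted):
    return np.sqrt(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2))

def mae(actual, predicted):
    return np.mean(np.abs(np.asarray(predicted) - np.asarray(actual)))

A = [4.0, 8.0, 10.0]   # true values of the simulated missing entries
P = [4.2, 7.5, 10.1]   # values produced by an imputation method
print(rmse(A, P), mae(A, P))
```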

5 Experimental Results and Discussion

We compare LIPFCM with five existing techniques, namely mean [6, 9], LI [10], KNNI [32], FKMI [30], and LIFCM [27], for five different missing ratios. There are altogether 20 missing combinations, generated by combining the four datasets with the five missing ratios. The quantitative performance of mean, LI, KNNI, FKMI, LIFCM, and LIPFCM on all datasets, based on RMSE and MAE for the 20 missing combinations, is presented in Tables 2 and 3, respectively, with the best results among the six techniques highlighted. As displayed in Table 2, the RMSE of LIPFCM with a missing ratio of 1% is 0.127 for the Iris dataset, whereas the RMSEs of mean, LI, KNNI, FKMI, and LIFCM are 0.308, 0.263, 0.376, 0.248, and 0.178, respectively. It can be observed that LIPFCM performs significantly better than the other techniques on all missing combinations for the Iris, Glass, Seed, and Wine datasets in terms of RMSE and MAE.

The graphical representation of RMSE and MAE for the Iris, Glass, Seed, and Wine datasets is illustrated in Figs. 2 and 3. It is clearly observed from Tables 2 and 3 and from Figs. 2 and 3 that the proposed LIPFCM MVI method gives better results than the mean, LI, KNNI, FKMI, and LIFCM imputation methods on the four datasets. The experimental results show that the proposed method is more efficient for real-world datasets with different ratios of missing values. Even though the experiments on the four datasets with different missing ratios show the best results, there are still some limitations of this study: the proposed algorithm was applied only to incomplete datasets with a lower percentage of missingness, and small datasets were utilized in the experiments. Furthermore, the proposed technique may not be suitable for nonlinear datasets, and the coincident cluster problem [12] may also persist while handling incomplete data.

6 Conclusion and Future Work

Missing values occur frequently in numerous datasets in data mining, and MVI methods are extensively employed to address them. In this paper, a framework for imputing missing values, called LIPFCM, is proposed through the hybridization of linear interpolation and a fuzzy clustering technique. The proposed method has been compared with five other high-quality existing methods on four publicly available real-world datasets using two evaluation criteria: RMSE and MAE. The experimental results show that the LIPFCM method performs significantly better than the other five state-of-the-art imputation algorithms based on RMSE and MAE; for example, the MAE of mean, LI, KNNI, FKMI, LIFCM, and LIPFCM with a missing ratio of 1% is 0.187, 0.143, 0.348, 0.148, 0.145, and 0.120, respectively, for the Iris dataset. The proposed method is therefore efficient in imputing missing values for all the datasets. In the future, further research can be conducted to explore its suitability on a more extensive range of datasets and applications to test its generalizability.

Table 2 Comparison of performance of the proposed method with other imputation methods on the datasets in terms of RMSE

Dataset | Missing ratio (%) | Mean  | LI    | KNNI  | FKMI  | LIFCM | LIPFCM
Iris    | 1                 | 0.308 | 0.263 | 0.376 | 0.248 | 0.178 | 0.127
Iris    | 3                 | 0.301 | 0.389 | 0.313 | 0.309 | 0.408 | 0.245
Iris    | 5                 | 0.600 | 0.550 | 0.620 | 0.527 | 0.448 | 0.409
Iris    | 7                 | 0.608 | 0.469 | 0.486 | 0.636 | 0.556 | 0.393
Iris    | 10                | 0.720 | 0.616 | 0.629 | 0.520 | 0.594 | 0.417
Glass   | 1                 | 0.084 | 0.083 | 0.050 | 0.100 | 0.052 | 0.029
Glass   | 3                 | 0.080 | 0.094 | 0.116 | 0.084 | 0.074 | 0.060
Glass   | 5                 | 0.106 | 0.146 | 0.162 | 0.109 | 0.120 | 0.079
Glass   | 7                 | 0.181 | 0.170 | 0.187 | 0.147 | 0.109 | 0.097
Glass   | 10                | 0.192 | 0.171 | 0.198 | 0.138 | 0.117 | 0.116
Seed    | 1                 | 0.126 | 0.119 | 0.105 | 0.084 | 0.079 | 0.046
Seed    | 3                 | 0.103 | 0.118 | 0.093 | 0.158 | 0.135 | 0.069
Seed    | 5                 | 0.114 | 0.105 | 0.106 | 0.150 | 0.110 | 0.042
Seed    | 7                 | 0.122 | 0.102 | 0.117 | 0.113 | 0.133 | 0.102
Seed    | 10                | 0.148 | 0.117 | 0.157 | 0.142 | 0.150 | 0.117
Wine    | 1                 | 0.165 | 0.191 | 0.177 | 0.143 | 0.142 | 0.142
Wine    | 3                 | 0.213 | 0.194 | 0.220 | 0.204 | 0.227 | 0.182
Wine    | 5                 | 0.091 | 0.130 | 0.244 | 0.095 | 0.123 | 0.085
Wine    | 7                 | 0.095 | 0.177 | 0.230 | 0.106 | 0.176 | 0.094
Wine    | 10                | 0.108 | 0.174 | 0.237 | 0.135 | 0.187 | 0.097

The best result among the six imputation methods is denoted by bold values in the original table.


Table 3 Comparison of performance of the proposed method with other imputation methods on the datasets in terms of MAE

Dataset | Missing ratio (%) | Mean  | LI    | KNNI  | FKMI  | LIFCM | LIPFCM
Iris    | 1                 | 0.187 | 0.143 | 0.348 | 0.148 | 0.145 | 0.120
Iris    | 3                 | 0.247 | 0.238 | 0.243 | 0.167 | 0.176 | 0.134
Iris    | 5                 | 0.430 | 0.361 | 0.556 | 0.355 | 0.318 | 0.302
Iris    | 7                 | 0.418 | 0.422 | 0.478 | 0.360 | 0.366 | 0.285
Iris    | 10                | 0.537 | 0.633 | 0.686 | 0.480 | 0.390 | 0.315
Glass   | 1                 | 0.037 | 0.042 | 0.030 | 0.041 | 0.016 | 0.011
Glass   | 3                 | 0.021 | 0.043 | 0.051 | 0.026 | 0.018 | 0.017
Glass   | 5                 | 0.056 | 0.076 | 0.093 | 0.045 | 0.029 | 0.019
Glass   | 7                 | 0.094 | 0.081 | 0.101 | 0.058 | 0.042 | 0.021
Glass   | 10                | 0.102 | 0.083 | 0.105 | 0.087 | 0.052 | 0.023
Seed    | 1                 | 0.095 | 0.073 | 0.052 | 0.033 | 0.047 | 0.025
Seed    | 3                 | 0.083 | 0.075 | 0.057 | 0.112 | 0.106 | 0.033
Seed    | 5                 | 0.087 | 0.079 | 0.075 | 0.109 | 0.096 | 0.031
Seed    | 7                 | 0.101 | 0.085 | 0.081 | 0.094 | 0.102 | 0.076
Seed    | 10                | 0.125 | 0.097 | 0.127 | 0.111 | 0.115 | 0.083
Wine    | 1                 | 0.128 | 0.137 | 0.148 | 0.093 | 0.096 | 0.092
Wine    | 3                 | 0.159 | 0.166 | 0.169 | 0.157 | 0.187 | 0.109
Wine    | 5                 | 0.084 | 0.129 | 0.180 | 0.073 | 0.100 | 0.070
Wine    | 7                 | 0.079 | 0.134 | 0.177 | 0.074 | 0.123 | 0.059
Wine    | 10                | 0.072 | 0.148 | 0.198 | 0.095 | 0.138 | 0.075

Fig. 2 Comparison of performance analysis of the proposed imputation method with other imputation methods on a Iris, b Glass, c Seed, and d Wine datasets in terms of RMSE (the lower the better)

Fig. 3 Comparison of performance analysis of the proposed imputation method with other imputation methods on a Iris, b Glass, c Seed, and d Wine datasets in terms of MAE (the lower the better)

References

1. Jyoti, Singh J, Gosain A (2022) Handling missing values using fuzzy clustering: a review. In: International conference on innovations in data analytics. Springer Nature Singapore, Singapore, pp 341–353
2. Di Nuovo AG (2011) Missing data analysis with fuzzy C-Means: a study of its application in a psychological scenario. Expert Syst Appl 38(6):6793–6797
3. Azim S, Aggarwal S (2014) Hybrid model for data imputation: using fuzzy c means and multilayer perceptron. In: 2014 IEEE international advance computing conference (IACC). IEEE, pp 1281–1285
4. Zhang Y, Thorburn PJ (2022) Handling missing data in near real-time environmental monitoring: a system and a review of selected methods. Futur Gener Comput Syst 128:63–72
5. Rani S, Solanki A (2021) Data imputation in wireless sensor network using deep learning techniques. In: Data analytics and management: proceedings of ICDAM 2021. Springer Singapore, pp 579–594
6. Rioux C, Little TD (2021) Missing data treatments in intervention studies: what was, what is, and what should be. Int J Behav Dev 45(1):51–58
7. Kwak SK, Kim JH (2017) Statistical data preparation: management of missing values and outliers. Korean J Anesthesiol 70(4):407–411
8. Goel S, Tushir M (2019) A semi-supervised clustering for incomplete data. In: Applications of artificial intelligence techniques in engineering. Springer, Singapore, pp 323–331
9. Nijman SW, Leeuwenberg AM, Beekers I, Verkouter I, Jacobs JJ, Bots ML, Asselbergs FW, Moons KG, Debray TP (2022) Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review. J Clin Epidemiol 142:218–229
10. Noor MN, Yahaya AS, Ramli NA, Al Bakri AM (2014) Filling missing data using interpolation methods: study on the effect of fitting distribution. Trans Tech Publications Ltd.
11. Huang G (2021) Missing data filling method based on linear interpolation and lightgbm. In: Journal of physics: conference series, vol 1754, no 1. IOP Publishing, p 012187
12. Pal NR, Pal K, Keller JM, Bezdek JC (2005) A possibilistic fuzzy c-means clustering algorithm. IEEE Trans Fuzzy Syst 13(4):517–530
13. Hasan MK, Alam MA, Roy S, Dutta A, Jawad MT, Das S (2021) Missing value imputation affects the performance of machine learning: a review and analysis of the literature (2010–2021). Inf Med Unlocked 27:100799
14. Gond VK, Dubey A, Rasool A (2021) A survey of machine learning-based approaches for missing value imputation. In: 2021 third international conference on inventive research in computing applications (ICIRCA). IEEE, pp 1–8
15. Das D, Nayak M, Pani SK (2019) Missing value imputation–a review. Int J Comput Sci Eng 7(4):548–558
16. Lin WC, Tsai CF (2020) Missing value imputation: a review and analysis of the literature (2006–2017). Artif Intell Rev 53:1487–1509
17. Hathaway RJ, Bezdek JC (2001) Fuzzy c-means clustering of incomplete data. IEEE Trans Syst Man Cybern Part B (Cybern) 31(5):735–744
18. Zhang DQ, Chen SC (2003) Clustering incomplete data using kernel-based fuzzy c-means algorithm. Neural Process Lett 18(3):155–162
19. Saravanan P, Sailakshmi P (2015) Missing value imputation using fuzzy possibilistic c means optimized with support vector regression and genetic algorithm. J Theoret Appl Inf Technol 72(1)
20. Furukawa T, Ohnishi SI, Yamanoi T (2013) A study on a fuzzy clustering for mixed numerical and categorical incomplete data. In: 2013 international conference on fuzzy theory and its applications (iFUZZY). IEEE, pp 425–428
21. Li D, Zhong C (2015) An attribute weighted fuzzy c-means algorithm for incomplete datasets based on statistical imputation. In: 2015 7th international conference on intelligent human-machine systems and cybernetics, vol 1. IEEE, pp 407–410
22. Mausor FH, Jaafar J, Taib SM (2020) Missing values imputation using fuzzy C means based on correlation of variable. In: 2020 international conference on computational intelligence (ICCI). IEEE, pp 261–265
23. Hu Z, Bodyanskiy YV, Tyshchenko OK, Shafronenko A (2019) Fuzzy clustering of incomplete data by means of similarity measures. In: 2019 IEEE 2nd Ukraine conference on electrical and computer engineering (UKRCON). IEEE, pp 957–960
24. Sadiq Khan M, Al-Garadi MA, Wahab AW, Herawan T (2016) An alternative data filling approach for prediction of missing data in soft sets (ADFIS). Springerplus 5(1):1–20
25. Kumaran SR, Othman MS, Yusuf LM, Yunianta A (2019) Estimation of missing values using hybrid fuzzy clustering mean and majority vote for microarray data. Proced Comput Sci 163:145–153
26. Purandhar N, Ayyasamy S, Saravanakumar NM (2021) Clustering healthcare big data using advanced and enhanced fuzzy C-means algorithm. Int J Commun Syst 34(1):e4629
27. Goel S, Tushir M (2021) Linear interpolation-based fuzzy clustering approach for missing data handling. In: Advances in communication and computational technology: select proceedings of ICACCT 2019. Springer, Singapore, pp 597–604
28. Goel S, Tushir M (2020) A new iterative fuzzy clustering approach for incomplete data. J Stat Manag Syst 23(1):91–102
29. Dua D, Graff C. UCI machine learning repository. http://archive.ics.uci.edu/ml
30. Li D, Deogun J, Spaulding W, Shuart B (2004) Towards missing data imputation: a study of fuzzy k-means clustering method. In: International conference on rough sets and current trends in computing. Springer, Berlin, Heidelberg, pp 573–579
31. Rahman MG, Islam MZ (2016) Missing value imputation using a fuzzy clustering-based EM approach. Knowl Inf Syst 46(2):389–422
32. Beretta L, Santaniello A (2016) Nearest neighbor imputation algorithms: a critical evaluation. BMC Med Inform Decis Mak 16(3):197–208

Experimental Analysis of Two-Wheeler Headlight Illuminance Data from the Perspective of Traffic Safety Aditya Gola, Chandra Mohan Dharmapuri, Neelima Chakraborty, S. Velmurugan, and Vinod Karar

Abstract The goal of this study is to determine how headlight illumination affects motorized two-wheeler vehicle visibility and safety on the road. Fifteen subjects were considered in this study, each riding a different kind of two-wheeler covering a range of lighting technologies and ages. The subjects were asked to rate the visibility from their vehicles, and the study measured headlight illuminance for both vertical and horizontal light distributions at varied forward distances. The study revealed that the age of the two-wheeler has a huge bearing on the light output from the headlight, which can be attributed to a range of factors such as the exterior polycarbonate cover becoming hazier with aging and handling, decreased reflectivity, and misalignment of the light reflector/optics with respect to the light source. Further, the technology of the headlight light source has a big impact on how much light is produced; LED technology produces roughly three times as much light as halogen-based technology. In terms of lux values, angular spread, focusing distances, and non-uniform angular spread, there is a significant variation in the light output measured across all 15 vehicles, pointing to either workmanship issues with the headlight assembly, the light design itself, or the effect of headlight aging or inconsistent headlight fitment into the two-wheeler. These results shed light on such variable headlight performance and the need for effective headlight technology, which can assist riders, automakers, and policymakers in enhancing road visibility and safety for motorized two-wheeler vehicles.


Keywords Motorized two-wheelers · Headlight · Headlight illuminance · Light angular spread

1 Introduction

In recent years, the use of two-wheeled motorized vehicles has increased tremendously in urban areas for reasons such as ease of use, lower cost, and smaller space consumption. They are among the most commonly used vehicles in cities, particularly in developing countries. However, riding a two-wheeler can be risky due to several factors, including poor road conditions, unpredictable weather, and reckless driving. It is therefore crucial for riders to take safety measures to reduce the risk of road crashes, associated fatalities, and serious injuries. One such safety feature is the headlight, which plays a vital role in enhancing visibility and reducing the likelihood of collisions.

One of the primary causes of road crashes involving two-wheeled vehicles is the failure of other road users to see them, often due to the small size of two-wheeled vehicles and the lack of proper lighting. Among the purposes served by two-wheeler headlights are increased visibility, improved situational awareness, reduced blind spots, improved braking distance, and reduced risk of collisions.

The efficiency of the headlight in increasing visibility is greatly influenced by its intensity, expressed in lumens, which describe the total quantity of light the bulb emits. In general, the higher the lumens, the brighter the light and the better the visibility. Illuminance, on the other hand, refers to the amount of light that hits a specific surface, expressed in lux. In low-light situations, such as at dawn or dusk or in poorly lit places, a brighter headlight can increase visibility. However, employing high beam headlights in well-lit regions can cause glare and reduce other road users' vision, while employing low beam headlights in dimly lit regions can reduce the rider's sight and raise the danger of road crashes. A higher illuminance means that more light falls on the road surface, which can help riders spot potential hazards like potholes, debris, or animals.

Along with brightness and intensity, the headlight's direction is equally important for improving visibility and situational awareness. Headlights must be properly positioned to point in the proper direction; incorrectly positioned headlights can produce glare, reduce visibility, and result in road crashes. The distance, spread, and field of view over which the headlight of a motorized two-wheeler should illuminate the road can vary with factors such as the type of headlight, the design of the motorcycle, and the intended use of the vehicle. In terms of spread and field of view, the headlight needs to illuminate a wide enough area to provide good visibility for the rider. The spread and field of view depend on the type of headlight and the design of the motorcycle,


but they should be sufficient to allow the rider to see the road and any potential hazards on either side of the vehicle, without causing inconvenience or danger to other road users. The primary objective of this study is to investigate the correlation between headlight illumination and the visibility and safety of motorized two-wheeler vehicles on the road. Through an experimental study encompassing a diverse group of two-wheeler models with varying lighting technologies and ages, we aim to study the intricate relationship between headlight illumination and its impact on vehicle visibility and safety. The paper is organized in the following way: Sect. 2 covers the literature review and Sect. 3 the methodology, while Sects. 4 and 5 cover the results and discussion and the conclusion and future scope, respectively.

2 Literature Review

In developing nations, motorized two-wheelers (MTWs) are a common means of transportation, but they also present a substantial risk to riders and other road users [1]. The "Look But Fail to See" error has been linked to one of the most frequent forms of MTW road crashes: the failure of another road user to yield to an approaching motorbike on the main roadway when exiting from a side road [2, 3]. In developing nations, motorbike injuries are a serious but under-reported and growing public health issue that contributes considerably to overall traffic injuries [1]. The causes of fatal motorbike crashes have been the subject of several studies [4, 5]. Sugiyanto's analysis of the incidence of motorbike crashes among commercial motorcyclists in Adidome town found their prevalence and pattern to be high [6]. Additionally, the prevalence of protective measures and motorbike road crashes in Ado-Odo Ota, Ogun State, Nigeria, was assessed, and inadequate compliance with preventive road-safety measures was observed [7]. Compared with riders of average weight, obese motorcyclists experience different types of physical injuries and longer hospital stays. To minimize injuries, shorten hospital stays, prevent physical disability, and save societal expense by lowering the need for institutional care, safety measures including the use of suitable helmets and clothing are crucial [8].

Low motorcycle conspicuity, i.e. the rider's failure to be noticed by other road users, is regarded as a significant risk factor for motorcycle crashes; low conspicuity may be caused by the size of the motorcycle, an irregular contour, low brightness, or poor contrast with the backdrop [9]. Raising a motorized two-wheeler's lamp intensity can therefore increase its visibility and lower the likelihood of road crashes. A retrofit for managing automobile headlight illuminance has also been found to lessen glare and improve the judgment of approaching traffic. It is crucial to remember that giving the wrong signal to neighbouring vehicles can raise the likelihood of deadly collisions; consequently, intelligent headlamp intensity management systems can help increase traffic safety [10, 11].


One study examined the contribution of car headlights to the visibility of the road and of targets located on the road, both with and without street lighting. It was noted that the street lighting's contribution was sufficient to ensure appropriate visibility of the targets, and that using car headlights did not always increase target visibility on the road; instead, the glare from cross-beam headlights affected drivers [12]. To increase road safety, it is crucial to take other road users' reactions into account when deciding how bright to make headlights. Furthermore, head injuries are a leading cause of disability, and fatalities in motorbike road crashes are a big cause for concern; wearing correct clothing and helmets is therefore vital for preventing injuries, shortening hospital stays, preventing physical disability, and saving social expenditure by lowering the need for institutional care [13]. Daytime running headlights have been shown to increase motorbike detection and lower the probability of road crashes and injuries [14]. Another study showed that the angle of the lights themselves may be altered to increase the visibility of the road and of targets placed on it [15]. A further study assessed how different high beam lighting intensities affect driver visibility and traffic safety, recommending distances of 30, 60, 120, and 150 m from the driven car for the evaluation of headlight glare [14]. In addition, a study found that using high beam headlights can improve pedestrian conspicuity by enhancing visibility and illumination; to increase road safety, it is crucial to consider other road users' reactions and deploy intelligent headlamp intensity management systems [16, 17]. According to the findings of a rule discovery process, holding a full licence, daylight, and the presence of shoulders increased the risk of fatal injuries at signalized intersections, while inattentiveness, a good road surface, nighttime, the absence of shoulders, and young riders were highly likely to increase casualty fatalities at non-signalized intersections [18].

3 Methodology

The purpose of this study is to assess the safety aspects with respect to the specifications of the headlight assembly, its modes of operation, the age of the vehicle, and the functional status of the headlights. The efficiency of a two-wheeler headlight is significantly influenced by the beam angle profile and the brightness of the low and high beam operational modes. The low beam mode is intended to offer a wide, even beam pattern that lights the road ahead without upsetting oncoming vehicles with glare; to comply with regulations, the beam angle is commonly fixed at 15° [19]. The high beam, on the other hand, has a narrower beam pattern that can illuminate a longer range and is intended to offer maximum illumination in low-light or dark environments; used carelessly, however, it can irritate other motorists [17]. Higher headlamp illuminance levels have been shown in tests to increase driver visibility and road safety, making illuminance another crucial consideration [14]. The optical layout of the headlight system, which may include digital micromirrors to refocus incident light


at a certain angle, determines the form of the beam pattern. To balance the demand for visibility against the need to lessen glare for other drivers, the headlight system's beam angle profile and illumination must be properly engineered. High beam lights are directed upward to provide maximum illumination, while low beam lights are angled downward to reduce glare. Studies show that high beam lights are more effective for spotting possible hazards and preventing collisions, but they can also irritate and blind other drivers. Hence, this study examines the operational characteristics of two-wheeler headlights in both low and high beam modes, taking into account factors such as forward distance, light spread, vehicle age, and headlight technology.

3.1 Methodology

• Research design: This study uses a mixed-method research design, including both qualitative and quantitative methods. The qualitative methods include literature reviews to gather information on the safety aspects of headlights and a questionnaire answered by the subjects. The quantitative methods encompass an observational study of the usage of headlights by two-wheeled motorized vehicle riders on city roads.
• Sample selection: The observational study is conducted on a sample of 15 two-wheeled motorized vehicle riders on city roads.
• Data collection: Data collection involves the two-wheelers, the riders, and a lux meter to measure headlight illuminance. Headlight illuminance is measured in low beam and high beam operation at forward distances of 4, 6, 8, and 10 m, at vertical heights of 0 and 1 m from the ground, and at horizontal spread positions of 0 m, 2 m left, and 2 m right from the approximate center of the headlight (the full measurement grid is enumerated in the sketch after this list).
• Questionnaire to riders.
• Data analysis.
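An illustrative enumeration of the measurement grid described above; every combination of beam mode, forward distance, height, and horizontal position yields one lux reading per vehicle.

```python
from itertools import product

modes = ["low", "high"]
forward_m = [4, 6, 8, 10]
height_m = [0, 1]
horizontal_m = [-2, 0, +2]        # 2 m left, center, 2 m right

grid = list(product(modes, forward_m, height_m, horizontal_m))
print(len(grid), "lux readings per vehicle")   # 2*4*2*3 = 48
for mode, d, h, x in grid[:3]:
    print(f"mode={mode}, forward={d} m, height={h} m, offset={x:+d} m")
```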

3.2 Experimental Setup

The experimental setup shown in Fig. 1 comprises the two-wheelers, a lux meter, and a mounting stand used to hold the lux meter and vary its mounting height, so that light intensity can be measured at different heights.


Fig. 1 Experimental setup

4 Results and Discussion

The subjective data obtained through the questionnaire filled in by the two-wheeler riders are listed in Table 1. The results are depicted graphically in Figs. 2, 3 and 4. Figure 2a shows the illuminance profile of two-wheeler headlights at the horizontal center at forward distances of 4, 6, 8, and 10 m at a height of 0 m from the ground in low beam mode: the headlight illuminance level goes down from 4 to 10 m for vehicles 1, 2, 5, 8, 11, 12, and 14, while it goes up from 4 to 10 m for vehicles 7, 9, 10, and 15. It also goes up for vehicles 3, 4, and 6, with exceptions at forward distances of 6 m, 8 m, and 10 m, respectively. Vehicle 13's low beam mode was not working. Figure 2b shows the corresponding profile at a height of 1 m from the ground in low beam mode: the illuminance level goes down from 4 to 10 m for vehicles 1, 3, 4, 6, 7, 8, 9, 10, 11, 12, 14, and 15, and also for vehicles 2 and 5 with an exception at a forward distance of 6 m. Figure 2c shows the profile at a height of 0 m in high beam mode: the illuminance level goes down from 4 to 10 m for vehicles 1, 2, 3, 5, 6, 8, 10, 11, 12, and 14, while it goes up for vehicle 4; it goes down for vehicles 7, 9, 13, and 15 with an exception at a forward distance of 6 m. Figure 2d shows the profile at a height of 1 m in high beam mode, where the illuminance level goes down from 4 to 10 m for all vehicles.

From Table 1 (which also provides the subjective evaluation) and Fig. 3a, b, it is seen that the headlight illuminance ranges from 13.5 lx (8-year-old vehicle with halogen headlight) to 55.9 lx (2-year-old vehicle with LED headlight) at a measurement height of 0 m from the ground in low beam mode, while it varies from 8.15 lx (11-year-old


Table 1 Subjective data obtained through the questionnaire about the two-wheeler vehicle type/model, vehicle age, original/replacement fitment, perceived road visibility, headlight working status, and light source used in the headlight (vehicle age in years Y and months M)

Rider no. | Type    | Model                      | Vehicle age | Light original or replaced | Road visibility good? | Working status of high and low modes | Light source
Rider-1   | Bike    | Hero Splendor i3s          | 3Y 5M       | Original                   | Yes                   | Both working                         | Halogen
Rider-2   | Bike    | Hero Splendor Pro 2015 BS3 | 7Y 11M      | Original                   | Yes                   | Both working                         | Halogen
Rider-3   | Scooter | Activa 6G STD              | 11M         | Original                   | Yes                   | Both working                         | Halogen
Rider-4   | Bike    | Honda Sp125-Disc           | 2Y 4M       | Original                   | Yes                   | Both working                         | LED
Rider-5   | Bike    | Hero CD Deluxe             | 11Y         | Original                   | Yes                   | Both working                         | Halogen
Rider-6   | Bike    | TVS Redeon                 | 2Y 9M       | Original                   | Yes                   | Both working                         | Halogen
Rider-7   | Bike    | Hero Splendor i3S          | 8M          | Original                   | Yes                   | Both working                         | Halogen
Rider-8   | Bike    | Hero Splendor Pro          | 11Y 1M      | Original                   | Yes                   | Both working                         | Halogen
Rider-9   | Bike    | Hero HF Deluxe             | 2Y 10M      | Original                   | Yes                   | Both working                         | Halogen
Rider-10  | Bike    | Hero Passion X Pro         | 8Y 1M       | Original                   | Yes                   | Both working                         | Halogen
Rider-11  | Scooter | Honda Activa 5G            | 4Y 4M       | Original                   | Yes                   | Both working                         | LED
Rider-12  | Bike    | Super Splendor             | 5Y          | Original                   | Yes                   | Both working                         | Halogen
Rider-13  | Bike    | Bajaj Platina 100 BS3      | 16Y         | Original                   | Yes                   | Only high beam working               | Halogen
Rider-14  | Bike    | Hero Splendor +            | 6Y 5M       | Original                   | Yes                   | Both working                         | Halogen
Rider-15  | Bike    | Hero Splendor + Xtec BSVI  | 4M          | Original                   | Yes                   | Both working                         | Halogen


Fig. 2 Illuminance profile of two-wheeler headlights at horizontal center in forward distances of 4, 6, 8, and 10 m a at height 0 m in low beam mode b at height 0 m in high beam mode c at height 1 m in low beam mode d at height 1 m in high beam mode

At this height, the headlight illuminance value in low beam mode is approximately 3 times higher at the horizontal center than at left and right, while the illuminance value in high beam mode is approximately 2–3 times higher at the horizontal center than at left and right of the horizontal center. The range of headlight illuminance varies from 13.75 lx (7-year-old vehicle with halogen headlight) to 193.31 lx (3-year-old vehicle with halogen headlight) for a height of 1 m from the ground in low beam mode, while it varies from 17.3 lx (7-year-old vehicle with halogen headlight) to 714 lx (4-year-old vehicle with LED headlight) in high beam mode. At this height, the headlight illuminance value in low beam mode is approximately 25 times higher at the horizontal center than at left and right of the horizontal center, while the illuminance value in high beam mode is approximately 35 times higher at the horizontal center than at left and right of the horizontal center. It is also seen that at a forward distance of 4 m, the intensity levels at the left and right positions from the horizontal center are almost symmetrical for most of the vehicles.

The age of the vehicle also plays a major role, mainly because its outer polycarbonate cover becomes scratchy and hazy with aging and handling. The age of vehicle nos. 5, 8, and 10 is more than 8 years, and hence their headlight intensity has gone down due to the outer polycarbonate cover becoming hazy. Vehicle no. 3 is a new scooter with a halogen headlight; its light output measured at a height of 0 m is low but high at a height of 1 m, the light being designed to focus at a certain distance.


Fig. 3 Illuminance profile of two-wheeler headlights at three horizontal positions at a forward distance of: a 4 m in low and high beams at height 0 m from the ground; b 4 m in low and high beams at height 1 m from the ground; c 6 m in low and high beams at height 0 m from the ground; d 6 m in low and high beams at height 1 m from the ground

The headlight intensity for vehicle no. 4, which is a bike with a headlight based on an LED light source, is quite high at 0 m height in both low and high beam modes but is very low at 1 m height, the light being designed to focus at a certain distance. Vehicle no. 11, with a headlight based on an LED light source, has a very high light output measured at a height of 1 m. One common observation is that the light output measured at the horizontal center is significantly higher than the values measured at left and right of the horizontal center. The light output measured for vehicle no. 7, a just 8-month-old bike with a halogen-based headlight, is considerably lower than that of the vehicles with LED headlights.

It is observed from Fig. 3c, d that the range of headlight illuminance varies from 12.02 lx (6-year-old vehicle with halogen headlight) to 58.3 lx (2-year-old vehicle with LED headlight) for a measurement height of 0 m from the ground in low beam mode, while it varies from 6.95 lx (11-year-old vehicle with halogen headlight) to 29 lx (2-year-old vehicle with LED headlight) in high beam mode. At this height, the headlight illuminance value in low beam mode is approximately 5 times higher at the horizontal center than at left and right of the horizontal center. The illuminance value in high beam mode is approximately 2 times higher at the horizontal center than at left and right of the horizontal center.


Fig. 4 Illuminance profile of two-wheeler headlights at three horizontal positions at a forward distance of: a 8 m in low and high beams at height 0 m from the ground; b 8 m in low and high beams at height 1 m from the ground; c 10 m in low and high beams at height 0 m from the ground; d 10 m in low and high beams at height 1 m from the ground

The range of headlight illuminance varies from 6.92 lx (5-year-old vehicle with halogen headlight) to 185.17 lx (11-year-old vehicle with halogen headlight) for a measurement height of 1 m from the ground in low beam mode, while it varies from 7.6 lx (7-year-old vehicle with halogen headlight) to 372 lx (4-year-old vehicle with LED headlight) in high beam mode. At this height, the headlight illuminance value in low beam mode is approximately 10 times higher at the horizontal center than at left and right of the horizontal center. The illuminance value in high beam mode is approximately 15 times higher at the horizontal center than at left and right of the horizontal center. Vehicle no. 4 shows high illuminance levels in all regions at a height of 0 m, while vehicle no. 11 provides the maximum illuminance levels among all vehicles at a height of 1 m. At a forward distance of 6 m and a height of 1 m, the illuminance levels in low beam mode are better than those at the 4 m forward distance.

It is seen from Fig. 4a, b that the observed range of headlight illuminance varies from 11.44 lx (11-year-old vehicle with halogen headlight) to 82.6 lx (2-year-old vehicle with LED headlight) for a measurement height of 0 m from the ground in low beam mode. These values vary from 6.71 lx (11-month-old vehicle with halogen headlight) to 98.6 lx (2-year-old vehicle with LED headlight) in high beam mode. At this height, the headlight illuminance


value in low beam mode is approximately 2 times higher at the horizontal center than at left and right of the horizontal center, while in high beam mode it is approximately 2 times higher at the horizontal center than at left and right of the horizontal center. The range of measured headlight illuminance varies from 3.35 lx (7-year-old vehicle with halogen headlight) to 59.13 lx (11-year-old vehicle with halogen headlight) for a height of 1 m from the ground in low beam mode. This value varies from 5.15 lx (7-year-old vehicle with halogen headlight) to 198.2 lx (4-year-old vehicle with LED headlight) in high beam mode. At this height, the headlight illuminance value in low beam mode is approximately 5 times higher at the horizontal center than at left and right of the horizontal center, while in high beam mode it is approximately 7 times higher at the horizontal center than at left and right of the horizontal center. Vehicle no. 4 shows the maximum illuminance value in all regions at a height of 0 m, while vehicle no. 11 shows the maximum illuminance value in high beam at a height of 1 m.

It is seen from Fig. 4c, d that the range of measured headlight illuminance varies from 8.39 lx (5-year-old vehicle with halogen headlight) to 63.7 lx (2-year-old vehicle with LED headlight) for a height of 0 m from the ground in low beam mode, while it varies from 7.5 lx (11-month-old vehicle with halogen headlight) to 92.4 lx (2-year-old vehicle with LED headlight) in high beam mode. At this height, the headlight illuminance value in low beam mode is approximately 2 times higher at the horizontal center than at left and right of the horizontal center, while the measured illuminance value in high beam mode is approximately 2 times higher at the horizontal center than at left and right of the horizontal center. The range of measured headlight illuminance varies from 4.01 lx (7-year-old vehicle with halogen headlight) to 36.1 lx (8-year-old vehicle with halogen headlight) for a height of 1 m from the ground in low beam mode, while it varies from 5.14 lx (7-year-old vehicle with halogen headlight) to 132.9 lx (4-year-old vehicle with LED headlight) in high beam mode. At this height, the headlight illuminance value in low beam mode is approximately 3 times higher at the horizontal center than at left and right of the horizontal center, while the measured illuminance value in high beam mode is approximately 4 times higher at the horizontal center than at left and right of the horizontal center.

Broadly, it is observed that the age of a two-wheeler has an impact on the light output coming from the headlight, mainly because the outer polycarbonate cover becomes hazier due to aging and handling, the reflectivity of the light reflector goes down, and the reflector becomes misaligned with respect to the light source and its mechanical fitment in the bike. The headlight light source technology plays a significant role in light output, with LED technology giving almost three times the light output of halogen-based technology. There is significant variation in the measured light output in terms of illuminance values in lux, their angular spread, and their focusing distances, pointing either to workmanship issues with respect to the headlight assembly or to inconsistency in its fitment in the two-wheeler. The paper introduces an innovative research perspective by examining the effects of headlight penetration in both low and high beam operational modes on various motorized two-wheelers with different ages and headlight technologies. Further exploration is needed to delve into the implications of headlamp technology on both fellow drivers and the surrounding environment.
This necessitates conducting extensive investigations encompassing longer forward


distances, larger spatial gaps, wider horizontal angular coverage, varying heights, and diverse ambient light conditions.

5 Conclusion and Future Scope The significance of headlamp lighting in ensuring the visibility and safety of motorized two-wheeler vehicles on the road has been highlighted in this study. The results demonstrate the important influence of headlamp technology and bike age on light output, angular spread, and illuminance values. The necessity for effective headlamp technology is highlighted by the finding that LED technology produces nearly three times as much light as halogen-based technology. The study also found a sizable variance in light output among the 15 vehicles, suggesting poor workmanship in the headlamp assembly or light design, aging of the headlights, or inconsistent headlight fitment into the two-wheeler. The findings of the study emphasize the significance of precise assembly and uniform headlamp placement on two-wheelers. The results of this study can help drivers, automakers, and policymakers improve road visibility and safety for motorized two-wheeler vehicles. The findings have important ramifications for both public health and traffic safety, highlighting the necessity of effective headlight technology to lower collision rates, particularly in low-light conditions. The effects of headlamp technology on other drivers and the surroundings need to be further investigated at longer forward ranges and larger distances, at wider horizontal angular spreads and different heights, and under various ambient light conditions.

References
1. Hassan O, Shaker R, Eldesouky R, Hasan O, Bayomy H (2014) Motorcycle crashes: attitudes of the motorcyclists regarding riders' experience and safety measures. J Community Health 39. https://doi.org/10.1007/s10900-014-9883-1
2. Lee YM, Sheppard E, Crundall D (2015) Cross-cultural effects on the perception and appraisal of approaching motorcycles at junctions. Transp Res Part F: Traffic Psychol Behav 31:77–86, ISSN 1369-8478. https://doi.org/10.1016/j.trf.2015.03.013
3. Brown ID (2002) A review of the 'looked but failed to see' accident causation factor. Psychology
4. Soehodho S (2017) Public transportation development and traffic accident prevention in Indonesia. IATSS Research 40:76–80
5. Suthanaya PA (2016) Analysis of fatal accidents involving motorcycles in low income region (case study of Karangasem Region, Bali-Indonesia). Int J Eng Res Afr 19:112–122
6. Konlan KD, Doat AR, Mohammed I, Amoah RM, Saah JA, Konlan KD, Abdulai JA (2020) Prevalence and pattern of road traffic accidents among commercial motorcyclists in the Central Tongu District, Ghana. Sci World J 2020:10, Article ID 9493718. https://doi.org/10.1155/2020/9493718
7. Afelumo OL, Abiodun OP, Sanni F (2021) Prevalence of protective measures and accident among motorcycle riders with road safety compliance in a Nigerian semi-urban community. Int J Occup Safety Health 11:129–138. https://doi.org/10.3126/ijosh.v11i3.39764


8. Oliveira A, Petroianu A, Gonçalves D, Pereira G, Alberti L (2015) Characteristics of motorcyclists involved in accidents between motorcycles and automobiles. Rev Assoc Med Bras 1992(61):61–64. https://doi.org/10.1590/1806-9282.61.01.061
9. Wells S, Mullin B, Norton R, Langley J, Connor J, Lay-Yee R, Jackson R (2004) Motorcycle rider conspicuity and crash related injury: case-control study. BMJ (Clinical research ed.) 328:857. https://doi.org/10.1136/bmj.37984.574757.EE
10. Sukumaran A, Narayanan P (2019) A retrofit for controlling the brightness of an automotive headlight to reduce glare by using embedded C program on a PIC microcontroller. Int J Recent Technol Eng 8:4240–4244. https://doi.org/10.35940/ijrte.C5150.098319
11. Vrabel J, Stopka O, Palo J, Stopkova M, Droździel P, Michalsky M (2023) Research regarding different types of headlights on selected passenger vehicles when using sensor-related equipment. Sensors 23(4):1978. https://doi.org/10.3390/s23041978
12. Bacelar A (2004) The contribution of vehicle lights in urban and peripheral urban environments. Light Res Technol 36(1):69–76. https://doi.org/10.1191/1477153504li105oa
13. Yousif MT, Sadullah AFM, Kassim KAA (2020) A review of behavioural issues contribution to motorcycle safety. IATSS Research 44(2):142–154, ISSN 0386-1112. https://doi.org/10.1016/j.iatssr.2019.12.001
14. Prasetijo J, Jawi ZM, Mustafa M, Zadie Z, Majid H, Roslan M, Baba I, Zulkifli AFH (2018) Impacts of various high beam headlight intensities on driver visibility and road safety. J Soc Automot Eng Malaysia 2:306–314. https://doi.org/10.56381/jsaem.v2i3.96
15. Chhirolya V, Sachdeva P, Gudipalli A (2019) Design of a modular beam control system for vehicles. Int J Smart Sens Intell Syst 12:1–6. https://doi.org/10.21307/ijssis-2019-008
16. Sewall A, Borzendowski S, Tyrrell R, Stephens B, Rosopa P (2016) Observers' judgments of the effects of glare on their visual acuity for high and low contrast stimuli. Perception 45. https://doi.org/10.1177/0301006616633591
17. Balk SA, Tyrrell RA (2011) The (in)accuracy of estimations of our own visual acuity in the presence of glare. Proc Hum Factors Ergon Soc Ann Meet 55(1):1210–1214. https://doi.org/10.1177/1071181311551252
18. Tamakloe R, Das S, Aidoo EN, Park D (2022) Factors affecting motorcycle crash casualty severity at signalized and non-signalized intersections in Ghana: insights from a data mining and binary logit regression approach. Accid Anal Prev 165:106517, ISSN 0001-4575. https://doi.org/10.1016/j.aap.2021.106517
19. Tsai CM, Fang YC (2011) Optical design of adaptive automotive headlight system with digital micro-mirror device. In: Proceedings SPIE 8170, Illumination Optics II, 81700A, 21 Sept 2011. https://doi.org/10.1117/12.896394

Detecto: The Phishing Website Detection Ashish Prajapati, Jyoti Kukade, Akshat Shukla, Atharva Jhawar, Amit Dhakad, Trapti Mishra, and Rahul Singh Pawar

Abstract Phishing attacks are among the most prevalent types of cybercrime that target people and businesses globally. Phishing websites mimic real websites to obtain sensitive user data like usernames, passwords, and credit card numbers. To identify phishing websites, many researchers employ machine learning algorithms. These algorithms use supervised learning techniques to classify websites into the phishing or legitimate categories, relying on features such as URL length, domain age, SSL certificate, and content similarity to determine whether a URL is real or fake. In recent years, authors have published papers on classifying websites by features using a support vector machine, achieving 95% accuracy, and have also classified phishing websites using a URL identification strategy or the random forest algorithm. The dataset contains a collection of URLs of 11,000+ websites; each has 30 parameters and a class label identifying whether it is a phishing website or not. To achieve the highest level of accuracy, we propose a model using 32 features extracted from phishing websites and various machine learning classifiers. Every website has distinct characteristics that are categorized by the trained models. We evaluated 7 classifiers, including Naïve Bayes, logistic regression, random forest, decision tree, and gradient boosting, and achieved 97.4% accuracy.

A. Prajapati (B) · J. Kukade · A. Shukla · A. Jhawar · A. Dhakad · T. Mishra · R. S. Pawar
Medi-Caps University, Indore, India
e-mail: [email protected]
J. Kukade
e-mail: [email protected]
A. Jhawar
e-mail: [email protected]
A. Dhakad
e-mail: [email protected]
T. Mishra
e-mail: [email protected]
R. S. Pawar
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_9


Keywords Phishing · Legitimate · Machine learning · Cybercrime · Supervised learning

1 Introduction Phishing attacks are serious hazards to individuals, businesses, and governments, since they are a common and sophisticated threat in the digital world. The usage of the internet increases security threats and cybercrimes [1]. Phishing is a fraudulent activity where cyber criminals create fake websites that mimic legitimate ones, intending to steal personal information like banking or social media ID passwords, credit card numbers, or other personal data. Phishing attacks pose a serious risk to both people and companies, and detecting them is a key task for cybersecurity. Machine learning techniques have shown promising results in detecting phishing websites. These techniques involve training models on large datasets of phishing and legitimate websites and then using the models to classify new websites as either phishing or legitimate. When a person falls for the scam by putting their trust in the fake page, the phisher succeeds. In recent studies, researchers have focused more on phishing attempts to prevent harm to unintentional web users [2].

Social networking, communication, finance, marketing, and service delivery have all undergone revolutionary changes as a result of the internet, and these internet facilities are being used by an increasing number of people. Communication technology is developing to suit human requirements, yet opponents are also developing new ways to obstruct communication. These adversaries deceive the user by using malicious software or phishing websites to steal crucial information. One of the deceptive methods used in the online world is phishing: the scammer sends a temptation that looks like a real website and waits for users to become victims. A common phishing attack tactic uses a phishing website to trick people into visiting fraudulent websites by mimicking the domain and designs of trustworthy websites like Flipkart, SBI, and Amazon [3]. Some common features that can be used to train detection models include URL length, presence of subdomains, use of HTTP or HTTPS, presence of certain keywords or phrases in the URL, and characteristics of the website's content.

Phishing is an illegal attempt made by attackers to get users' personal information by setting up phony websites. Users who submit information linked to transactions, user IDs, passwords, etc. into these false websites run the risk of that information being misused by the attacker, which might result in loss of money and personal data. As a result of ongoing technological advancements and the substantial influx of data utilized by numerous businesses on a daily basis, several online enterprises, including those in the financial sector, are facing reputational damage due to the proliferation of fraudulent websites. It would be incredibly advantageous for everyone if these websites were detected early on. Due to the dynamic nature of phishing efforts, there is no single approach for phishing removal; hence, more efficient and improved methods for detecting them are required. According to


the literature review, the majority of current machine learning techniques have flaws like a high rate of false alarms, a low detection rate, and the inability of classification models and some hybridized techniques to produce highly effective and efficient detection of phishing sites [4]. Yet, since phishing websites are made using so many cutting-edge methods, finding them is challenging. Although many methods have been proposed for the identification of websites, many of them fall short of producing 100% accurate findings, and several new phishing websites may be created in a matter of minutes. Machine learning-based classifiers can maintain their resistance to zero-hour phishing attempts while achieving very accurate classification [5]. Overall, the development of effective phishing detection techniques is an important step toward enhancing online security, and the ongoing research in this area is expected to lead to even more sophisticated and accurate methods for detecting and preventing phishing attacks. Main contributions of this study:

• We proposed a system using 32 features extracted from website URLs which is capable of detecting phishing websites with a high precision and an accuracy of 97.40%. It performs with high precision even on simple websites.
• We introduce age of domain, IFrame redirection, and disabling right-click as extra features to classify websites as legitimate or phishing.
• We have proposed a novel approach to ensemble machine learning that makes use of multi-threading to execute ensemble-based machine learning models in parallel. Parallel processing throughout the training and testing phases speeds up procedures, making it possible to identify phishing URLs instantly (a minimal sketch of this idea follows the list).
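To make the third contribution concrete, here is a minimal sketch of multi-threaded ensemble training with a hard majority vote. It is an illustration under stated assumptions, not the authors' exact implementation: the file name, label column, and 0/1 label encoding are assumptions.

```python
# Sketch: train several scikit-learn classifiers in parallel threads and
# combine their predictions by majority vote. Model choice mirrors the paper;
# file name, label column, and 0/1 labels are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("phishing.csv")                   # assumed file name
X, y = df.drop(columns=["class"]), df["class"]     # assumed label column
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# An odd number of members avoids voting ties.
models = [
    GradientBoostingClassifier(),
    RandomForestClassifier(),
    DecisionTreeClassifier(),
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
]

def fit(model):
    return model.fit(X_tr, y_tr)

# Train the ensemble members concurrently.
with ThreadPoolExecutor() as pool:
    fitted = list(pool.map(fit, models))

# Hard majority vote over the individual predictions (0/1 labels assumed).
votes = np.stack([m.predict(X_te) for m in fitted])
majority = (votes.sum(axis=0) > len(fitted) / 2).astype(int)
print("ensemble accuracy:", (majority == y_te.to_numpy()).mean())
```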

2 Literature Review

| S. No. | Author | Publisher | Year | Problem addressed | Approach results | Limitation |
|---|---|---|---|---|---|---|
| 1 | Shouq Alnemari et al. [6] | Applied Sciences | 2023 | A better approach to automate the detection of phishing URLs | Used an ensemble technique integrating neural network, random forest, and SVM and obtained 96% accuracy | A limited number of features and classifiers are used to train the model |
| 2 | Mausam et al. [7] | IJSRED | 2022 | Implementation of sequential ML algorithms to detect phishing attacks | Three ML algorithms, XGBoost, RF, and KNN, are used, and RF produced 96.75% accuracy | Only 10 features are extracted |
| 3 | Sönmez et al. [1] | ISDFS | 2018 | Phishing attacks classification | The strategy consists of categorizing websites and extracting features from websites. Six activation functions were utilized in the extreme learning machine (ELM), which outperformed the SVM and NB in accuracy (95.34%) | Achieves 95.34% accuracy |
| 4 | Zuhair and Selamat [2] | Int. J. Intell. Syst. Technol. Appl. | 2016 | Phishing detection | Hybrid phishing detection | Fewer feature comparisons with different classifiers |
| 5 | Aydin and Baykal [8] | IEEE Conf. Commun. Network Security | 2015 | Framework for feature extraction, adaptable and straightforward, with fresh tactics | The dataset and outside service providers produced 133 features | The result is produced by comparing Naïve Bayes and SMO |
| 6 | Parekh et al. [9] | ICICCT | 2018 | Use URL identification to identify phishing sites | Eight features out of a total of 31 are considered for parsing. The accuracy level for the random forest approach was 95% | It obtained an accuracy level of 95% |
| 7 | Zhang et al. [10] | International Journal of Engineering Research and Technology (IJERT) | 2017 | The word embedding semantic characteristics, semantic features, and multi-scale statistical features are mined by the phishing detection model to efficiently detect phishing | To obtain statistical aspects of web pages, 11 features were extracted and divided into 5 types. The model is learned and tested using AdaBoost, bagging, random forest, and SMO | Only eleven features are extracted |
| 8 | Jeeva et al. [11] | Human-centric Computing and Information Sciences | 2016 | Combining length, slash number, point number, and position attributes with transport layer security aspects | The rules produced by the apriori algorithm reached a 93% accuracy rate | They discovered a 93% accuracy rate |
| 9 | Gautam et al. [12] | Springer | 2018 | Association data mining approach | The 16 characteristics were extracted by them, and their accuracy was 92.67% | This is inadequate, and thus the suggested algorithm can be improved for a higher detection rate |
| 10 | Sonowal [11] | SN Computer Science | 2020 | Detected phishing emails | The BSFS technique weighed the accuracy of 97.31% | Accuracy 97.31% |
| 11 | Fadheel et al. [13] | IEEE | 2017 | Detect phishing websites | To help with phishing identification, 19 of the site's original 30 characteristics have been chosen | Only 19 features are used |
| 12 | Shima [14] | ICIN | 2018 | Applying a neural network model for the URL domain | A neural network model is used to automatically extract information without any specialized knowledge of the URL domain | A minimal set of features is used |

2.1 Methodologies for Phishing Website Detection The proposed method puts a strong emphasis on boosting the accuracy of spoofed website detection using various supervised learning techniques. The data was obtained from Kaggle. The dataset consists of 32 features and 11,056 occurrences. The dataset is then partitioned into sections based on entropy, and the partitioned dataset is employed to check the correctness of the refined dataset. The best attributes for each leaf node are identified using correlation and a working model. The prediction model is trained via ensemble learning, which makes use of numerous learning models. By employing numerous models when making predictions, it is possible to avoid having one model dominate the results; the final prediction is therefore computed as the majority of votes over the outputs of all models (Fig. 1).

2.2 Dataset In this model, we have blended generated datasets with phishing datasets acquired from a variety of online sources, including Kaggle. We obtained the phishing dataset from Kaggle to test and train the model in the ratio of 20–80, respectively. The dataset, which has 32 columns and 11,056 rows, includes information from both phishing and legitimate websites, of which 31 columns are independent features and 1 is the dependent feature. Features listed in the dataset include long URL, short URL, domain age, HTTPS, page rank, etc. (Fig. 2).
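A correlation heatmap like the one in Fig. 2 can be produced with a few lines of pandas/seaborn; the file name below is an illustrative assumption.

```python
# Sketch: dataset correlation heatmap in the style of Fig. 2.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("phishing.csv")        # 11,056 rows x 32 columns (assumed file name)
plt.figure(figsize=(14, 12))
sns.heatmap(df.corr(), cmap="coolwarm", center=0)
plt.title("Feature correlation heatmap")
plt.tight_layout()
plt.show()
```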


Fig. 1 Proposed methodologies

2.3 Feature Extraction To distinguish between genuine and fake websites, several attributes may be extracted from a website. The effectiveness of systems for identifying phishing websites depends on the quality of the characteristics that are retrieved; more information on these characteristics and their significance is provided in [14]. The features are grouped into four categories: address bar grounded features, abnormal grounded features, HTML and JavaScript grounded features, and domain grounded features.

Address bar grounded features refer to techniques that attackers use to manipulate the URL in the address bar of the web browser. Some of the features in this category include using the IP address instead of the domain name, using long URLs to hide suspicious parts, using URL shortening services, redirecting using "//", and adding prefixes or suffixes separated by a hyphen to the domain. Other features in this category include subdomains, HTTPS, domain registration length, favicon, using non-standard ports, and the existence of the "HTTPS" token in the domain part of the URL.

Abnormal grounded features refer to techniques that attackers use to hide or obfuscate the true nature of a website. Some of the features in this category include the request URL, the URL of anchor tags, server form handlers (SFH), and submitting information to email or abnormal URLs.

122

A. Prajapati et al.

Fig. 2 Dataset classification heatmap

HTML and JavaScript grounded features refer to techniques that attackers use to manipulate the HTML and JavaScript code of a website. Some of the features in this category include website forwarding, status bar customization, disabling right-click, using pop-up windows, and IFrame redirection. Domain grounded features refer to techniques that attackers use to manipulate the domain name and its associated properties. Some of the features in this category include the age of the domain, DNS records, website traffic, page rank, Google index, the number of links pointing to the page, and statistical reports-based features (Fig. 3).
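As a rough illustration of the address bar grounded features described above, the following sketch derives a few of them from a raw URL. The threshold values and 0/1 encodings are common conventions in this literature and are assumptions, not the paper's exact rules.

```python
# Sketch: extract a few address-bar grounded features from a URL.
# Thresholds and encodings are illustrative assumptions.
import re
from urllib.parse import urlparse

def address_bar_features(url: str) -> dict:
    parsed = urlparse(url)
    domain = parsed.netloc
    return {
        # IP address used instead of a domain name
        "has_ip": int(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}(:\d+)?", domain))),
        # Long URLs are often used to hide the suspicious part
        "long_url": int(len(url) >= 54),
        # "@" makes browsers ignore everything before it
        "has_at_symbol": int("@" in url),
        # "//" appearing after the protocol suggests redirection
        "double_slash_redirect": int(url.rfind("//") > 7),
        # Hyphenated prefixes/suffixes are rare in legitimate domains
        "prefix_suffix": int("-" in domain),
        # Literal "https" token inside the domain part of the URL
        "https_token_in_domain": int("https" in domain.lower()),
        # Rough subdomain count (dots beyond the registrable domain)
        "num_subdomains": max(domain.count(".") - 1, 0),
    }

print(address_bar_features("http://paypal-secure.example.com//login"))
```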

2.4 Machine Learning Algorithm The decision tree is a specific kind of machine learning technique used for classification and regression analysis. The model predicts the value of the target variable based


Fig. 3 Feature importance

on a number of input variables. A tree-like representation of decisions and their results is created using the decision tree algorithm. About 95.9% accuracy is produced via the decision tree (Fig. 4). To increase prediction precision, random forest mixes many decision trees. It works effectively for model training and produces results with an accuracy of 96.7%.

Fig. 4 Decision tree


Fig. 5 Random forest

Moreover, it contributes to increased precision, decreased overfitting, and the capacity to handle both categorical and numerical data (Fig. 5).

Naive Bayes classifier is an algorithm based on probability in machine learning used for classification tasks. It uses Bayes' theorem, which describes the probability of an event based on prior outcomes or evidence. It is a fast and simple algorithm that is helpful for handling a large dataset with high-dimensional features. Naïve Bayes gives an accuracy of 60.5%.

p(c|x) = p(x|c) p(c) / p(x)

p(c|x) ∝ p(x1|c) × p(x2|c) × · · · × p(xn|c) × p(c)   (1)

Logistic regression is used in machine learning for binary classification problems. It is a statistical model and a supervised learning algorithm that utilizes one or more input factors to predict the probability of a binary outcome. As the equation below illustrates, the algorithm's hypothesis is bounded between 0 and 1:

0 ≤ hθ(x) ≤ 1   (2)

A supervised learning technique called support vector machine (SVM) is utilized for outlier identification, classification, and regression. We use it for its high accuracy, ability to handle high-dimensional data, and robustness to outliers. It produces results with an accuracy of 96.4%.


Fig. 6 Gradient boosting classifier

A machine learning approach called gradient boosting classifier is employed for classification and regression issues. It is an ensemble learning technique that turns several weak models into a potent model. With about 97.4% accuracy, it is the model with the best performance; it determines whether a website is real or phishing and achieves the highest accuracy (Fig. 6). A nonparametric machine learning method called K-nearest neighbors (KNN) is used to solve classification and regression problems; it generates predictions based on how closely fresh input data points resemble those in the training set. It produces an accuracy of 95.6% (Fig. 7).
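To make the evaluation protocol concrete, here is a minimal sketch that trains the best-performing model and reports the four metrics used in Table 1. The file name and label column are illustrative assumptions, not the paper's exact code.

```python
# Sketch: train the gradient boosting classifier and report the metrics
# from Table 1. File name and label column are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("phishing.csv")
X, y = df.drop(columns=["class"]), df["class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier().fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("accuracy :", accuracy_score(y_te, pred))
print("f1_score :", f1_score(y_te, pred))
print("recall   :", recall_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
```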

3 Result To ensure the highest level of accuracy, this model has been evaluated and trained using a variety of machine learning classifiers and various ensemble techniques. After all algorithms have returned their results, each algorithm states its estimated accuracy. Every algorithm is contrasted with the others to analyze which offers the highest accuracy rate (Table 1). An earlier study [15] used an ensemble technique and achieved 87% accuracy, as shown in Fig. 8.


Fig. 7 K-nearest neighbors

Table 1 Comparison table

| S.N | ML model | Accuracy | f1_score | Recall | Precision |
|---|---|---|---|---|---|
| 1 | Gradient boosting classifier | 0.974 | 0.977 | 0.994 | 0.986 |
| 2 | Random forest | 0.967 | 0.971 | 0.991 | 0.991 |
| 3 | Support vector machine | 0.964 | 0.968 | 0.980 | 0.965 |
| 4 | Decision tree | 0.959 | 0.963 | 0.991 | 0.993 |
| 5 | K-nearest neighbors | 0.956 | 0.961 | 0.991 | 0.993 |
| 6 | Logistic regression | 0.934 | 0.941 | 0.943 | 0.927 |
| 7 | Naïve Bayes classifier | 0.605 | 0.454 | 0.292 | 0.997 |

Fig. 8 Accuracy of all models bar graph


Fig. 9 Accuracy comparison

Our model has performed better, with the highest accuracy: the gradient boosting algorithm has given the best results, with a final accuracy of 97.4%. For easier understanding, an accuracy comparison graph shows the accuracy of each algorithm; Fig. 9 displays the final algorithm accuracy comparison of our model.

4 Limitation Since phishing attacks and cyber risks are continually changing, this study may not have considered recent developments in detection techniques or new trends. Depending on the particular context and features of the phishing attempts, the efficiency and usefulness of the studied detection approaches may change. Moreover, the evaluation of detection algorithms is substantially influenced by the quality and availability of datasets, which might not completely reflect the variety of phishing cases. Finally, while the goal of this research is to suggest areas for future study, it does not offer complete solutions to all the problems relating to phishing website detection. To overcome the limitations found and provide more reliable and effective strategies to counter the increasing danger posed by phishing attacks, additional study is required.


5 Conclusion Phishing attacks are becoming more sophisticated, making it challenging to identify phishing websites. The detection of phishing websites is essential for protecting sensitive information from being stolen by cybercriminals. Various techniques and methodologies can be used for phishing website detection, including machine learning algorithms, blacklisting, and heuristic analysis. However, these techniques have their limitations, and new techniques need to be developed to detect advanced phishing attacks. In order to prevent sensitive information from being taken, it is crucial to take the required precautions. Phishing attacks can result in severe financial losses and identity theft.

References
1. Alnemari S, Alshammari M (2023) Detecting phishing domains using machine learning. Appl Sci 13(8):4649
2. Mausam G, Siddhant K, Soham S, Naveen V (2022) Detection of phishing websites using machine learning algorithms. Int J Sci Res Eng Dev 5:548–553
3. Pujara P, Chaudhari MB (2018) Phishing website detection using machine learning: a review. Int J Sci Res Comput Sci Eng Inf Tech 3(7):395–399
4. Somesha M, Pais AR, Srinivasa Rao R, Singh Rathour V (2020) Efficient deep learning techniques for the detection of phishing websites. Sādhanā 45:1–18
5. Yang R, Zheng K, Wu B, Wu C, Wang X (2021) Phishing website detection based on deep convolutional neural network and random forest ensemble learning. Sensors 21(24):8281
6. Taha A (2021) Intelligent ensemble learning approach for phishing website detection based on weighted soft voting. Mathematics 9(21):2799
7. Mehanović D, Kevrić J (2020) Phishing website detection using machine learning classifiers optimized by feature selection. Traitement du Sig 37:4
8. Sönmez Y, Tuncer T, Gökal H, Avci E (2018) Phishing web sites features classification based on extreme learning machine. In: 6th international symposium on digital forensic and security (ISDFS 2018), pp 1–5
9. Zuhair H, Selamat A, Salleh M (2016) Feature selection for phishing detection: a review of research. Int J Intell Syst Technol Appl 15(2):147–162
10. Aydin M, Baykal N (2015) Feature extraction and classification phishing websites based on URL. In: 2015 IEEE conference on communications and network security (CNS 2015), pp 769–770
11. Jeeva SC, Rajsingh EB (2016) Intelligent phishing URL detection using association rule mining. Hum-centric Comput Inf Sci 6(1):1–19
12. Zhang X, Zeng Y, Jin X, Yan Z, Geng G (2017) Boosting the phishing detection performance by semantic analysis
13. Gautam S, Rani K, Joshi B (2018) Detecting phishing websites using rule-based classification algorithm: a comparison. In: Information and communication technology for sustainable development: proceedings of ICT4SD 2016, vol 1. Springer, Singapore, pp 21–33


14. Sonowal G (2020) Phishing email detection based on binary search feature selection. SN Comput Sci 1(4):191
15. Barraclough PA, Hossain MA, Tahir MA, Sexton G, Aslam N (2013) Intelligent phishing detection and protection scheme for online transactions. Expert Syst Appl 40(11):4697–4706

Synergizing Voice Cloning and ChatGPT for Multimodal Conversational Interfaces Shruti Bibra, Srijan Singh, and R. P. Mahapatra

Abstract Conversational AI systems have gained a lot of attention in recent years because they are capable of interacting with users in a natural and emotional way. Designing a personalized and human-like chat experience remains a difficulty. This paper delves into the possibilities of bringing together two technologies, voice cloning and ChatGPT, to create more seamless, natural, and intriguing multimodal conversational interactions. Voice cloning is the process of replicating the voice of a user. ChatGPT, on the other hand, provides contextual and human-like text-based responses. With the help of these two technologies, we are creating an intuitive and natural conversational experience that better reflects the user's communication style. From the analysis of our proposed system, we find that this model substantially improves conversational AI. This research provides valuable insights into the potential of multimodal dialogue networks and opens the door for further innovations in the field. Keywords ChatGPT · Voice cloning · Multimodal dialogue networks

1 Introduction Advancements in natural language processing have made conversational interfaces increasingly popular and widely used in various applications. However, the current state-of-the-art conversational systems still struggle to generate coherent and engaging responses consistently. One promising solution to this problem is the integration of voice cloning technology with language models such as ChatGPT, which can potentially enhance the quality and naturalness of the conversational output. Our proposed system works in two parts: voice cloning and ChatGPT. The voice cloning system is a neural network-based system for text-to-speech (TTS) synthesis. We have created our own ChatGPT model called Quicksilver using Python and OpenAI's API.

S. Bibra (B) · S. Singh · R. P. Mahapatra
SRM Institute of Science and Technology, Ghaziabad, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_10


These two technologies combine to create a more natural and advanced AI chatbot. The voice technology models face a number of issues, and we strive to minimize them. These problems begin with ethical concerns: unauthorized impersonation of a person's voice can be misused for malicious purposes. We eliminate this issue by making our targeted user consensually say a meaningless sentence directly into the system's microphone, something a person would not do in normal circumstances, and then cloning the voice. This ensures that no voice is replicated from a person's previous audio or any other source. Voice cloning has been flourishing for a long time now; however, it has rarely been integrated with other technologies. Through this model, we aim to synergize the two technologies. Finally, most AI voice assistants contain only limited contextual data. Through our system, we are able to give "a brain to a cloned voice". The combination of these two technologies can potentially result in a more humanlike and engaging conversational interface. In this research paper, we investigate the synergies between voice cloning and ChatGPT for the development of multimodal conversational interfaces that can produce more natural and engaging responses. Our study aims to shed light on the potential of this integration and identify the challenges and opportunities that arise from the use of these technologies in combination. The insights from our research can inform the development of more sophisticated conversational systems that can provide a more natural and personalized experience to the users.

2 Related Works The hybrid approach combining voice cloning and ChatGPT is not yet a popular technique; however, advancements have been made in both fields individually. Methods to build a three-stage deep learning system that performs real-time voice cloning have existed for some time [1]. Neural network-based speech synthesis has been shown to produce high-quality speech for large numbers of speakers. [2] introduces a neural voice cloning system that takes only a few audio samples as input. Speaker encoding and speaker adaptation are the two strategies taken into consideration. Both methods work well, even when there are not enough cloned audios, in terms of the speech's authenticity and similarity to the actual speaker. Trying to create a speaker voice different from the one learned is expensive and time-consuming, because additional data must be acquired and the model retrained; this is the fundamental limitation of single-speaker TTS models. [3] seeks to get around these restrictions by developing a system that can model a multi-speaker acoustic space. As a result, voices that sound like various target speakers can be produced even if those speakers were not heard during the training phase. A variety of chatbots are now powered by ChatGPT, a web-based chat platform that allows for personal, context-sensitive conversations. [4] provides a highly immersive and engaging user experience by seamlessly combining cutting-edge computer vision, speech processing, and natural language processing technology (Fig. 1).

Fig. 1 Flowchart of the proposed system

3 The Proposed System Our proposed system is a conversational AI that can be used as a chatbot as well as a voice assistant. It is not just an integration of ChatGPT and a voice cloning model but a carefully designed system that has the ability to communicate with the user more naturally and intuitively. The core concept is using the voice of the targeted user to train our voice cloning model, which provides a cloned voice of the user. On the other hand, the question asked to ChatGPT provides a response. This response, which is in the form of text, is converted to speech integrated with the cloned voice created before. The architecture of the model is simple and easy to understand.

3.1 Voice-Enabled ChatGPT The user gives the audio input to the voice interface. The voice interface converts this audio into text using a speech-to-text (STT) API; here, we have used the IBM Watson API. The text input created goes to the OpenAI API, where it first undergoes tokenization. The input text is tokenized into individual words and punctuation marks. Each token is encoded into a numerical vector representation that captures its meaning and context. The GPT model, which comprises numerous layers of feedforward and self-attention networks, processes the encoded input and predicts the most likely response based on the input and its own training data. The predicted response is decoded from the numerical vector representation back into natural language text. Hence, the response is obtained in the form of text. The text that has been received is saved in a text file (.txt format) so that it is ready to be used by our voice cloning model (Fig. 2).

Fig. 2 Flowchart of voice-enabled ChatGPT

The process of system-wide implementation entails steps like installing prerequisites, setting up the environment so that the project can run smoothly, installing libraries, and obtaining API keys. As previously stated, Python 3 and Google Colab were used for all required codings and implementations. The installations required are:

1. Gradio
2. IBM Watson STT
3. OpenAI
4. Whisper

The Python pip install command was used to complete the installation described above.
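In a Colab notebook this corresponds to a single cell along the following lines; the PyPI package names are assumptions based on the libraries listed above.

```python
# Colab cell: install the libraries listed above (assumed PyPI names).
!pip install gradio ibm-watson openai openai-whisper
```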

3.2 Voice Cloning Model Our voice cloning technique uses, as a functional prototype, the Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis study. This model is based upon a three-layered LSTM for text-to-speech synthesis. It has a progressive three-stage pipeline that is capable of cloning an unknown voice from a few seconds of sample speech. These three stages are:

1. Speaker encoder: creates an embedding from a single speaker's brief utterance. The embedding is a meaningful representation of the speaker's voice, such that similar voices are close to one another in latent space.
2. Synthesizer: creates a spectrogram from text, conditioned on the embedding of a speaker. This model is a WaveNet-free version of the well-known Tacotron 2.
3. Vocoder: infers an audio waveform from the spectrograms produced by the synthesizer.


Fig. 3 Flowchart of voice cloning model

The voice interface picks up the user’s voice whose voice has to be copied. The speaker encoder picks up this voice. A brief reference utterance from the speaker to be copied is given to the speaker encoder at the moment of inference. The synthesizer receives as input a sentence that has been transformed into a phoneme sequence and produces an embedding that is used to condition the synthesizer. The speech waveform that makes up the cloned voice is created by the vocoder using the synthesizer’s output. The voice cloning model reads the saved text file it obtained from the ChatGPT model before delivering the desired response in the voice of the targeted user (Fig. 3). The system-wide implementation entails steps like installing prerequisites, setting up the project’s environment, retrieving datasets, encoding, and implementing encoder modules, synthesizer modules, and vocoder modules, as well as implementing them. As previously stated, Python 3 was used for all required codings and implementations. The mandatory installations that are required for the working of the project included: 1. 2. 3. 4. 5. 6.

TensorFlow Numpy Encoder Vocoder Pathlib Synthesizer.inference

The Python pip install command was used to complete the installation described above.
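A minimal inference sketch in the spirit of the open-source SV2TTS toolchain this pipeline builds on is shown below. The module and function names follow that project, and the model and file paths are assumptions; this is an illustration, not the exact implementation.

```python
# Sketch: clone a voice and speak a ChatGPT response, following the
# open-source SV2TTS pipeline (encoder -> synthesizer -> vocoder).
# Module/function names follow that project; file paths are assumptions.
from pathlib import Path

import numpy as np
import soundfile as sf
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Load the three pretrained stages.
encoder.load_model(Path("saved_models/encoder.pt"))
synthesizer = Synthesizer(Path("saved_models/synthesizer.pt"))
vocoder.load_model(Path("saved_models/vocoder.pt"))

# 1. Embed the target user's consented reference utterance.
wav = encoder.preprocess_wav(Path("reference.wav"))
embed = encoder.embed_utterance(wav)

# 2. Synthesize a mel spectrogram for the ChatGPT response text.
text = Path("response.txt").read_text().strip()
specs = synthesizer.synthesize_spectrograms([text], [embed])

# 3. Invert the spectrogram to a waveform and save it.
generated = vocoder.infer_waveform(specs[0])
sf.write("cloned_response.wav", generated.astype(np.float32), synthesizer.sample_rate)
```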

4 Methodology The methodology of this research on synergizing voice cloning and ChatGPT for multimodal conversational interfaces involves a combination of data collection, training, integration, and evaluation, with a focus on identifying and addressing the gaps in the existing literature. The ultimate goal of this research is to create more effective and natural multimodal conversational interfaces that can better meet the needs of users.


4.1 Voice-Enabled ChatGPT The methodology for creating a voice-enabled ChatGPT using OpenAI involves leveraging OpenAI's pre-trained models for speech recognition and speech synthesis, as well as their GPT-3 model as a starting point for the ChatGPT model architecture. This approach can significantly reduce the time and resources required to develop a high-quality voice-enabled conversational AI system. A minimal end-to-end sketch of the pipeline follows the list below.

1. Voice input: Digital audio in .wav format is taken as input.
2. Speech-to-Text (STT) API: The STT API applies noise reduction and filtering to improve audio quality and remove background noise. It then splits the audio input into smaller chunks to improve processing efficiency and accuracy, and applies a speech recognition algorithm (Hidden Markov Models) to the audio chunks to transcribe the spoken words into text. Further, it combines the transcribed text from each audio chunk into a complete transcript. After that, it applies post-processing techniques such as punctuation and capitalization normalization and spelling correction to improve the accuracy and readability of the final transcript. Finally, it returns the transcript as the output of the API.
3. OpenAI API: It makes use of GPT-3, the third version of the Generative Pre-trained Transformer, a neural network machine learning model. A large corpus of text data was used to pre-train ChatGPT using an unsupervised learning method. The model learns to anticipate missing words in a given text during pre-training, which aids in understanding the context and connections between different words. After pre-training, ChatGPT is fine-tuned on particular activities, such as text production or question answering. During fine-tuning, the model is trained on a smaller dataset tailored to the task at hand, enabling it to pick up on nuances and patterns unique to that task. ChatGPT uses a range of NLP techniques, such as tokenization, word embeddings, and language modeling, to process and understand natural language input. Beam search is a decoding algorithm used by ChatGPT to generate responses to user input; it involves generating multiple possible responses and selecting the one with the highest probability, based on the model's predictions. Finally, ChatGPT is able to generate responses that are contextually relevant to the user's input, thanks to its ability to understand the relationships between different words and phrases (Fig. 4).
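The sketch below wires steps 1–3 together with the IBM Watson STT SDK and the OpenAI Python library. API keys, the service URL, the model name, and file names are illustrative assumptions, not the project's exact values.

```python
# Sketch: transcribe a voice query with IBM Watson STT, send it to the
# OpenAI API, and save the reply for the voice cloning stage.
# Keys, URL, model name, and file names are illustrative assumptions.
import openai
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import SpeechToTextV1

# 1. Speech-to-text (IBM Watson).
stt = SpeechToTextV1(authenticator=IAMAuthenticator("WATSON_API_KEY"))
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")
with open("query.wav", "rb") as audio:
    result = stt.recognize(audio=audio, content_type="audio/wav").get_result()
question = result["results"][0]["alternatives"][0]["transcript"]

# 2. Generate a reply with the OpenAI API (model name assumed).
openai.api_key = "OPENAI_API_KEY"
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}],
)
reply = chat["choices"][0]["message"]["content"]

# 3. Save the reply as plain text for the voice cloning model.
with open("response.txt", "w") as f:
    f.write(reply)
```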

Fig. 4 Voice cloning block diagram

5 Voice Cloning
1. Speaker encoder: The speaker encoder is the first module that needs training. The preprocessing, audio training, and visualization models are all included because it manages the auditory input given to the system. The speaker encoder is a 3-layer LSTM with 768 hidden nodes and a 256-unit projection layer. Since the publications do not state what a projection layer is, we assume it is simply a densely connected layer with 256 outputs per LSTM that is applied iteratively to each LSTM output. For quick prototyping, simplicity, and a reduced training burden, 256-unit LSTM layers can be employed directly rather than building the speaker encoder from scratch. The input in this case is a 40-channel log-mel spectrogram with a 10 ms step and a 25 ms window width. The output (a 256-element vector) is the L2-normalized hidden state of the last layer. Our method additionally includes a pre-normalization ReLU layer that aims to make the embedding sparse and more interpretable.
2. Synthesizer: The Google Tacotron 2 model synthesizer is utilized without WaveNet. Tacotron is a recurrent sequence-to-sequence system that predicts mel spectrograms from text. The input text is first mapped to a vector of character embeddings, after which standard layers are added to form a single encoder block. These frames pass through bidirectional LSTMs to produce the encoder output frames. A speaker embedding is associated with each frame produced by the Tacotron encoder. The attention function processes the encoder output frames to produce the decoder input frames. Our solution does not validate the input text's pronunciation, and the characters are given exactly as they are; however, there are some cleaning procedures. All characters are converted to ASCII, all whitespace is normalized, and all letters are lowercased. Abbreviations and numbers are replaced with their full-text form. Punctuation is permitted, although it does not appear in the corpus.
3. Vocoder: The vocoder modules are trained last, since the modules are trained in the order encoder, synthesizer, vocoder. Tacotron 2 uses WaveNet as its vocoder. The vocoder model used here is based on WaveRNN and is an open-source PyTorch implementation, although it includes a number of design choices by its author, fatchord; "Alternative WaveRNN" is the name of this architecture. In each training phase, the mel spectrogram and its related waveform are separated into the same number of segments. Segments t and t−1 of the mel spectrogram serve as model inputs, sized so that each segment of the waveform has the same length. The mel spectrogram is upsampled to match the target waveform's length while the number of mel channels is kept constant. As the mel spectrogram is converted to a waveform, a ResNet-like model takes the spectrogram as input and generates features that condition the upsampling layers. The resulting vector is repeated to match the length of the waveform segment. This conditioning vector is then divided into four equal parts along the channel dimension. The first part is concatenated with the upsampled spectrogram and the waveform segment of the preceding time step. A skip connection modifies the resulting vector in certain layers. Two GRU layers are followed by a high-density layer.


fatchord design choices. “Alternative WaveRNN” is the name of this architecture. In each training phase, the mel spectrogram and its related waveform are separated into the same number of segments. Segments t and t-1 of the simulated spectrogram serve as design inputs. It ought to be created so that each segment of the waveform is the same length. The number of mel channels is kept constant as the mel spectrogram is upsampled to fit the target waveform’s length. As the mel spectrogram is converted to a waveform, models like ResNet use the spectrogram as an input to generate features that alter the layers. To change the length of the waveform segment, the resulting vector is repeated. Then, this adjustment vector is divided into four equal parts, each of which corresponds to a channel dimension. The first portion of this division is concatenated with the upsampling spectrogram and waveform segment of the preceding time step. The resulting vector changes in certain ways when there is a skip connection. A high-density layer comes after two GRU layers.

6 Result and Discussion With the error-free working and execution of the project, the system was able to successfully provide the response of the ChatGPT in the voice of the targeted user by cloning his/her voice. We have evaluated are results on the basis of word error rate, naturalness and the speed of response. Word Error Rate (WER) is a metric used to evaluate the accuracy of speech recognition systems, machine translation systems, and other natural language processing (NLP) models. It measures the difference between the words in the predicted output and the words in the reference (i.e., the ground truth). human speech. Naturalness is a measure of the degree to which synthesized speech sounds like it are produced by a human speaker, both in terms of sound quality and prosody (i.e., the rhythm, intonation, and stress patterns of speech). Since our model uses OpenAI API, it would have the same results as that of ChatGPT with a slight drop in speed. We have compared our paper with AI voice assistants (Table 1). The results of WER of Google assistant and Siri have been obtained from [5, 6], respectively, whereas we see the evaluation of naturalness for Google and Siri in [7, 8], respectively. A detailed analysis has been provided by IBM Watson [9] for WER of our model and [10] for naturalness. Table 1 Final MOS results

Source

Word error rate (%)

Naturalness

Google voice assistant

4.9

3.5

Siri voice assistant

6.4

4.17

This paper

6.5

3.7
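For reference, WER can be computed with a standard edit-distance recursion over word sequences; the sketch below is a generic implementation, not the exact scoring tool used in the evaluations above.

```python
# Sketch: word error rate via Levenshtein distance over word sequences,
# WER = (substitutions + deletions + insertions) / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one deletion and one substitution over a 6-word reference.
print(wer("turn on the living room lights", "turn on living room light"))
```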


7 Conclusion This paper has successfully developed a framework for the integration of voice cloning with ChatGPT. Despite a few complications, the results are competent. The model's ability to synthesize voices is very good. ChatGPT provides excellent results, but the speed of response can always be improved. Beyond the scope of the project, there are still ways to improve certain frameworks and implement some of the recent advances made in this area at the time of writing. While it has been agreed that our proposed system is an improved design of an AI voice assistant, we can also confirm the prediction that future development in the same technical area will lead to better and more sophisticated models. Therefore, this approach proved to be an attempt to understand, implement, and innovate on the expertise gained during the research. We anticipate that this framework will soon be available in more potent forms.

8 Future Scope The current paper has laid the foundation for a number of potential areas of future development and improvement. The voice cloning model can be expanded to include more languages and more accents; ChatGPT can also be made inclusive of more languages. The project can be integrated with edge computing and, possibly, be made to work offline as well. This research should also become an indispensable part of mobile interfaces, operating systems, etc. It can be further optimized by training the voice cloning model to be more natural and seamless. ChatGPT should become inclusive of images and videos, giving out visual representations as well. Customization of the proposed system can also be done according to the specific needs and requirements of various businesses and individuals, by training the model on a particular dataset as per the requirements. Overall, the future scope of the project is promising and offers ample opportunities for growth, innovation, and impact.

References
1. Jia Y, Zhang Y, Weiss RJ, Wang Q, Shen J, Ren F, Chen Z, Nguyen P, Pang R, Moreno IL, Wu Y (2018) Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In: 32nd conference on neural information processing systems (NeurIPS 2018), Montréal, Canada
2. Arik SO, Chen J, Peng K, Ping W, Zhou Y (2018) Neural voice cloning with a few samples. In: 32nd conference on neural information processing systems (NeurIPS 2018), Montréal, Canada
3. Ruggiero G, Zovato E, Di Caro L, Pollet V (2021) Voice cloning: a multi-speaker text-to-speech synthesis approach based on transfer learning. arXiv preprint arXiv:2102.05630
4. Alnuhait D, Wu Q, Yu Z (2023) FaceChat: an emotion-aware face-to-face dialogue framework. arXiv preprint arXiv:2303.07316


5. Besacier L et al (2019) Speech command recognition on a low-cost device: a comparative study. IEEE Access
6. Liu et al. A comparative study on Mandarin speech recognition: Alexa, Google Assistant, and Siri. In: 19th annual conference of the international speech communication association
7. Besacier L, Castelli E, Gauthier J, Karpov A. Naturalness and intelligibility of six major voice assistants: Siri, Google Assistant, Cortana, Bixby, Alexa, and Mycroft. In: Proceedings of the 19th annual conference of the international speech communication association (Interspeech), pp 1303–1307
8. Apple. Deep learning for Siri’s voice: on-device deep mixture density networks for hybrid unit selection synthesis. Apple Mach Learn Res
9. IBM Watson. Speech to text API documentation. Accessed 8 Mar 2023
10. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI Research

A Combined PCA-CNN Method for Enhanced Machinery Fault Diagnosis Through Fused Spectrogram Analysis Harshit Rajput, Hrishabh Palsra, Abhishek Jangid, and Sachin Taran

Abstract This research introduces a novel strategy to improve the accuracy and resilience of machinery malfunction identification by combining multimodal spectrogram fusion and deep learning techniques. The proposed approach involves the division of the dataset into two equal portions, followed by the application of the continuous wavelet transform (CWT) and Short-Time Fourier Transform (STFT) separately, resulting in two sets of 2D images. A Principal Component Analysis (PCA) fusion approach is then employed to merge these images, extracting the most relevant and complementary characteristics. Subsequently, a convolutional neural network (CNN) model is applied to the fused spectrogram, enabling classification and learning of intricate and abstract features. The suggested approach offers several advantages, including enhanced feature extraction, improved accuracy, faster processing, robustness to noise and artifacts, and transferability. To illustrate its efficiency, the Case Western Reserve University (CWRU) dataset, comprising vibration signals from various fault states in rotating machinery, is utilized. Experimental results demonstrate that the proposed method surpasses existing approaches in machinery failure diagnostics, achieving a high classification accuracy.

Keywords Spectrogram · CWRU dataset · Dataset splitting · Continuous wavelet transform (CWT) · Short-time Fourier transform (STFT) · Principal component analysis (PCA) · Convolutional neural network (CNN)

H. Rajput (B) · H. Palsra · A. Jangid · S. Taran Department of Electronics and Communication Engineering, Delhi Technological University, Delhi 110042, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_11


1 Introduction
Asset maintenance is crucial to preserve practicality and prevent defects; a lack of effective maintenance can reduce manufacturing capacity. There are two traditional approaches to maintenance: corrective and preventive. Corrective maintenance is reactive and leads to maximum exploitation of machinery, while preventive maintenance is systematic but can be economically costly. Predictive maintenance is an advancement over both strategies and is required for smart machines. Rolling elements are critical components, and early failure detection is essential for minimizing downtime and waste rates and for maintaining product quality. Industry is pacing toward Industry 4.0, the automation of machines by providing them with human-like neural capabilities, i.e., building smart machines by employing machine learning or deep learning techniques.

In the approach suggested by Zhang et al., one-dimensional vibration signals are transformed into time–frequency images using STFT, which are then fed into an STFT-CNN for learning and identifying fault features [1]. A novel method for fault identification under varied working conditions based on STFT and a transfer deep residual network (TDRN) is described in the research put forth by Du et al.; by combining with transfer learning, the TDRN can create a link between two dissimilar working environments, leading to excellent classification accuracy [2]. In order to address the issue of uneven data distribution in the field of diagnosing rolling bearing faults, Han et al. offer a data augmentation method integrating the continuous wavelet transform and a deep convolution generative adversarial network (DCGAN). The technique assesses image quality and variety while expanding the time–frequency image samples of fault categories using DCGAN; according to experimental findings, the proposed method is more accurate than the conventional one-dimensional data expansion method [3]. Wang et al. suggested a hybrid approach employing variational modal decomposition (VMD), CWT, CNN, and a support vector machine (SVM) for diagnosing rolling bearing faults: after preprocessing using VMD, CWT is used to create two-dimensional time–frequency images, CNN is utilized to extract features, and SVM is used for defect identification. The approach is validated with high accuracy using spindle device failure tests and the CWRU dataset. For better visualization and feature extraction, time–frequency images can be acquired using CWT [4].


2 Proposed Methodology
2.1 Dataset
The Case Western Reserve University (CWRU) dataset collects rolling-element bearing vibration signals from various operating circumstances. Rotating machinery frequently makes use of rolling-element bearings, and their failure may result in costly downtime and repairs; early detection of bearing faults is therefore crucial to prevent catastrophic failures and reduce maintenance costs. The dataset includes vibration signals obtained from four types of bearing faults: inner race, outer race, rolling element, and combination faults. Each fault type is simulated by introducing a defect at a specific location on the surface of the bearing. The vibration signals are recorded using accelerometers placed on the motor casing and sampled at a rate of 48 kHz, i.e., 48,000 samples per second [5]. To improve the accuracy and reliability of the machine learning models, data collection under actual working conditions is crucial. In the CWRU dataset, the load applied to the bearings during the experiments is 1 horsepower (1 HP), which corresponds to approximately 750 W; this load level is representative of typical operating conditions in industrial applications (Fig. 1).

Fig. 1 a Ball bearing system experimental platform for the CWRU bearing test rig [6, 7], the REB’s component parts, and b its cross-sectional view


2.2 Data Preprocessing
Two of the most popular data preprocessing methods are the continuous wavelet transform (CWT) and the short-time Fourier transform (STFT).
CWT: CWT is a time–frequency analysis technique that allows the decomposition of a signal into different frequency components over time. The two-dimensional image generated through the convolution is commonly known as a scalogram or wavelet spectrogram; in this representation, the x-axis represents time, the y-axis represents frequency, and the color or intensity corresponds to the amplitude of the wavelet coefficients. The fundamental benefit of CWT is that it can analyze signals with non-stationary time–frequency content; its main drawbacks are that it can be computationally expensive and can suffer from edge effects and cross-term interference [8].
STFT: STFT, on the other hand, is a method for breaking a signal down into its frequency components over time using a series of overlapping windows. The STFT method is useful for analyzing non-stationary signals, such as those generated by rotating machinery, where the frequency content may change over time due to the presence of faults or other operating conditions [9].
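As an illustration of these two preprocessing steps, the sketch below (assuming SciPy and PyWavelets are available; the window length, overlap, wavelet, and scale range are illustrative choices, not the paper’s settings) computes an STFT image and a CWT scalogram for a 48 kHz vibration signal.

```python
import numpy as np
import pywt
from scipy import signal

fs = 48_000                      # CWRU sampling rate: 48 kHz
x = np.random.randn(fs)          # placeholder for 1 s of vibration data

# STFT: overlapping windows -> complex time-frequency matrix
f, t, Zxx = signal.stft(x, fs=fs, nperseg=256, noverlap=192)
stft_image = np.abs(Zxx)         # magnitude spectrogram (frequency x time)

# CWT: Morlet-wavelet scalogram over a range of scales
scales = np.arange(1, 129)
coeffs, freqs = pywt.cwt(x, scales, "morl", sampling_period=1 / fs)
cwt_image = np.abs(coeffs)       # scalogram (scale x time)
```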

2.3 Fusion
PCA is a statistical method used to identify the most significant components of the input data and to reduce its dimensionality. In the context of spectrogram fusion, PCA can be used to identify the common features across the two spectrograms and to generate a new spectrogram that captures these features. The fundamental principle of PCA is to transform the input data into a new coordinate system so that it can be represented with fewer dimensions while still preserving the most crucial information. The transformation is carried out by calculating the eigenvalues and eigenvectors of the input data’s covariance matrix: the eigenvectors represent the data’s principal components, and the eigenvalues give each principal component’s variance [10]. To use PCA for spectrogram fusion, we follow these steps [11]:
• Preprocess the input data: The input data should be preprocessed to remove any noise, artifacts, or irrelevant features. This can be done by applying suitable filters, normalization, or other preprocessing techniques.
• Compute the CWT and STFT spectrograms: The CWT and STFT spectrograms are computed separately for the preprocessed input data.
• Shuffle the time–frequency signals: The time–frequency signals obtained from STFT and CWT are stored and shuffled according to their respective labels.
• Center the matrix: The first step in centering is to take a time–frequency signal X with dimensions n × m, where n is the number of frequency components and m is the number of time points:

$$X_{\text{centered}} = X - \left(\frac{1}{m}\sum_{j=1}^{m} X_{ij}\right)_{i=1}^{n} \mathbf{1}_m^{\top},$$

where $\mathbf{1}_m$ is a column vector of ones with length m.
• Compute the covariance matrix: The covariance matrix of the concatenated spectrogram matrix is computed; it represents the correlation between the different elements of the spectrogram matrix. The covariance matrix C of $X_{\text{centered}}$ is computed as
$$C = \frac{1}{m}\, X_{\text{centered}}\, X_{\text{centered}}^{\top}.$$
• Determine the eigenvalues and eigenvectors: The covariance matrix’s eigenvalues and eigenvectors are calculated. The eigenvectors represent the principal components of the spectrogram matrix, while the eigenvalues give each principal component’s variance. They can be obtained from the matrix decomposition $C = V D V^{\top}$, where V is a matrix of eigenvectors and D is a diagonal matrix of eigenvalues.
• Order the principal components: The principal components $[PC_1, PC_2, \ldots, PC_k]$ are ordered by their corresponding eigenvalues in descending order: $[\lambda, I] = \mathrm{sort}(\mathrm{diag}(D), \text{'descend'})$ and $V = V(:, I)$.
• Generate the fused spectrogram: The fused spectrogram is generated by projecting the CWT and STFT spectrograms onto the selected eigenvectors:
$$Y = V^{\top} X_{\text{centered}}.$$
The resulting matrix Y represents the time–frequency signal projected onto the selected principal components, with the most important patterns or features highlighted. The PCA-based fusion algorithm has the advantage of being simple, fast, and efficient; it can be used to identify the common features across the CWT and STFT spectrograms and to generate a new spectrogram that captures these features [12]. A compact sketch of these steps follows.
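A minimal NumPy sketch of the fusion steps above; the function name and the choice to stack the two images along the frequency axis are assumptions for illustration, not the authors’ code.

```python
import numpy as np

def pca_fuse(stft_image: np.ndarray, cwt_image: np.ndarray, k: int) -> np.ndarray:
    """Fuse two n x m time-frequency images by projecting onto the
    top-k principal components of their concatenation."""
    X = np.concatenate([stft_image, cwt_image], axis=0)   # stack along frequency axis
    X_centered = X - X.mean(axis=1, keepdims=True)        # center each row
    C = (X_centered @ X_centered.T) / X.shape[1]          # covariance matrix
    eigvals, V = np.linalg.eigh(C)                        # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]                     # sort eigenvalues descending
    V = V[:, order[:k]]                                   # keep top-k eigenvectors
    return V.T @ X_centered                               # fused k x m spectrogram
```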


Fig. 2 Data flow in the suggested model

2.4 CNN
After applying the PCA fusion algorithm to the CWT and STFT spectrograms to obtain a fused spectrogram, a CNN model can be used for further feature extraction and classification. A CNN is a type of deep neural network that excels at image and pattern recognition tasks. It is made up of many layers of neurons, the processing units, which are arranged into convolutional, pooling, and fully connected layers (Fig. 2). The fundamental principle of a CNN is to learn a hierarchy of features that describe the input data at various levels of abstraction: local features such as edges and corners are extracted by the first few layers, whereas more global features such as shapes and textures are extracted by the subsequent layers. To apply a CNN to the fused spectrogram obtained from the PCA fusion algorithm, we can follow these steps [13].
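The step list referenced above does not survive in the extracted text. As a stand-in, here is a minimal hedged sketch of a small Keras CNN; the layer sizes, input shape, and training configuration are assumptions for illustration, not the authors’ architecture.

```python
import tensorflow as tf

def build_cnn(input_shape=(64, 64, 1), num_classes=4):
    # Small convolutional stack: early layers capture local features,
    # deeper layers capture more abstract ones, then a dense classifier.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```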


3 Results and Discussion
From Table 1, the proposed method has an average accuracy of 99.63%, while the CWT and STFT methods reach 98.94% and 99.49%, respectively. The confusion matrix is a tool for evaluating the performance of a classification model; it provides detailed information on the number of correct and incorrect predictions for each class. Figure 3 shows the confusion matrix obtained by the CWT method at its maximum accuracy of 99.30%, Fig. 4 the confusion matrix obtained by the STFT method at its maximum accuracy of 99.60%, and Fig. 5 the confusion matrix obtained by the proposed method at its maximum accuracy of 99.70%. The overall advantages of splitting the CWRU dataset, applying the PCA fusion algorithm, and passing the result through a CNN model include:
– Improved feature extraction: By splitting the dataset and applying different preprocessing techniques, we can extract more diverse and complementary features from the input data. The PCA fusion algorithm combines the features extracted by CWT and STFT, which can improve the overall accuracy and robustness of the classification.
– Increased accuracy: By using a CNN model, we can learn more complex and abstract features from the fused spectrogram, which can improve the accuracy of the classification. CNNs are particularly effective for image and pattern recognition tasks and have been shown to achieve state-of-the-art results in many applications.

Table 1 Accuracies for the different methodologies

Methods employed | Average accuracy (%) | Best accuracy (%) | Worst accuracy (%)
CWT | 98.94 | 99.30 | 98.50
STFT | 99.49 | 99.60 | 99.38
Proposed | 99.63 | 99.70 | 99.56

Fig. 3 Best accuracy confusion matrix for CWT


Fig. 4 Best accuracy confusion matrix for STFT

Fig. 5 Best accuracy confusion matrix for proposed method

– Faster processing: The PCA fusion algorithm can reduce the dimensionality of the input data and provide a more compact representation of the features, which can speed up the processing time of the CNN model.
– Robustness to noise and artifacts: The PCA fusion algorithm and the CNN model can improve the robustness of the classification to noise and artifacts in the input data. The PCA fusion algorithm reduces the effect of noise and artifacts by combining information from multiple sources, while the CNN model learns to distinguish relevant features from irrelevant ones.

4 Conclusion
This paper proposes a new method, with an average accuracy of 99.63%, for fault diagnosis using a combination of CWT and STFT data preprocessing techniques. The method involves splitting the dataset into two halves and passing each half to either CWT or STFT, respectively. The resulting 2D spectrogram images are fused using PCA and passed to a CNN for classification. The proposed method is compared to existing methods that use either CWT or STFT alone, and experimental results on the CWRU bearing dataset show that it outperforms them in terms of diagnostic accuracy: the fusion of CWT and STFT provides a more comprehensive representation of the data and improves the fault diagnosis accuracy. The proposed method was validated at the 48 kHz sampling rate and 1 HP load of the CWRU dataset. Other wavelet transform and fusion methods can also be investigated and analyzed; for example, PCA can be compared with methods like discrete wavelet transform (DWT) fusion or independent component analysis (ICA) fusion.

References
1. Zhang Q, Deng L (2023) An intelligent fault diagnosis method of rolling bearings based on short-time Fourier transform and convolutional neural network. J Fail Anal Prev 1–17
2. Du Y, Wang A, Wang S, He B, Meng G (2020) Fault diagnosis under variable working conditions based on STFT and transfer deep residual network. Shock Vib 2020:1–18
3. Han T, Chao Z (2021) Fault diagnosis of rolling bearing with uneven data distribution based on continuous wavelet transform and deep convolution generated adversarial network. J Braz Soc Mech Sci Eng 43(9):425
4. Wang J, Wang D, Wang S, Li W, Song K (2021) Fault diagnosis of bearings based on multisensor information fusion and 2D convolutional neural network. IEEE Access 9:23717–23725
5. Yuan L, Lian D, Kang X, Chen Y, Zhai K (2020) Rolling bearing fault diagnosis based on convolutional neural network and support vector machine. IEEE Access 8:137395–137406
6. Li SY, Gu KR (2019) Smart fault-detection machine for ball-bearing system with chaotic mapping strategy. Sensors 19(9):2029
7. Case Western Reserve University bearing data center (2019). https://csegroups.case.edu/bearingdatacenter/home. Accessed 22 Dec 2019
8. Sharma P, Amhia H, Sharma SD (2022) Transfer learning-based model for rolling bearing fault classification using CWT-based scalograms. In: Pandian AP, Palanisamy R, Narayanan M, Senjyu T (eds) Proceedings of third international conference on intelligent computing, information and control systems. Springer Nature Singapore, Singapore, pp 565–576
9. Yoo Y, Jo H, Ban S-W (2023) Lite and efficient deep learning model for bearing fault diagnosis using the CWRU dataset. Sensors 23(6):3157
10. Hong M, Yang B (2013) An intelligent fault diagnosis method for rotating machinery based on PCA and support vector machine. Measurement 46(9):3090–3098
11. Wang J, Zhao X, Xie X, Kuang J (2018) A multi-frame PCA-based stereo audio coding method. Appl Sci 8(6). https://www.mdpi.com/2076-3417/8/6/967
12. Gupta V, Mittal M (2019) QRS complex detection using STFT, chaos analysis, and PCA in standard and real-time ECG databases. J Inst Eng (India): Ser B 100(03)
13. Yang S, Yang P, Yu H, Bai J, Feng W, Su Y, Si Y (2022) A 2DCNN-RF model for offshore wind turbine high-speed bearing-fault diagnosis under noisy environment. Energies 15(9):3340

FPGA-Based Design of Chaotic Systems with Quadratic Nonlinearities Kriti Suneja, Neeta Pandey, and Rajeshwari Pandey

Abstract This paper presents a systematized methodology to implement chaotic systems with quadratic nonlinearities on a digital platform using the Runge–Kutta 4 (RK4) numerical method. Field programmable gate arrays (FPGAs), because of their flexibility, reconfigurability, and parallelism, have been used for the implementation, using the Verilog hardware description language (HDL) and state machine control. The synthesis results based on the Xilinx Artix device 7a200tffv1156-1 and the simulation results using the inbuilt simulator of the Vivado design suite are presented. The simulation results have been validated by Python-based numerical simulations as well. The implemented chaotic systems have been evaluated based on hardware utilization and time delay.

Keywords Chaotic system · Quadratic nonlinearity · Field programmable gate array · Synthesis · Simulation

1 Introduction
Since electronic system design witnessed a paradigm shift from the analog to the digital domain in the decade of the eighties, digitized chaotic systems have their own advantages. A field programmable gate array (FPGA) is an integrated circuit (IC) consisting mainly of three blocks, all programmable: logic blocks, interconnects, and input–output blocks. Each logic cell contains basic circuit elements, such as lookup tables (LUTs), onto which combinational logic can be mapped, and flip-flops (FFs) for designing sequential logic. However, the composition of these blocks differs across FPGA families and packages. Some FPGA devices offer additional hardware resources for flexible design capability, such as memory blocks and digital signal processing (DSP) blocks.

K. Suneja (B) · N. Pandey · R. Pandey
Delhi Technological University, Delhi 110042, India
e-mail: [email protected]
R. Pandey
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_12

Because of their reprogrammable nature, FPGAs are useful as prototypes for application-specific integrated circuit (ASIC) applications. FPGA-based design of chaotic systems finds applications in embedded engineering areas such as image encryption [1, 2], text encryption [3], random number generation [4], secure communication [5], and cryptography [6]. The digital design of chaotic systems can be done on different types of digital platforms, including application-specific integrated circuits (ASICs), digital signal processors (DSPs), and FPGAs. ASICs are capable of providing better performance than their counterparts, but at the cost of time and money in the production of prototypes; moreover, in order to bring down the cost, ASIC-based applications require mass production, which is intolerant of even minute errors. DSP chips are the favorite candidate of engineers for implementing complex mathematical operations and processes, but their sequential manner of processing is not favorable for the concurrency requirements of chaotic systems. FPGAs provide the desired flexibility, concurrency, low cost, and sufficient resources for the implementation of chaotic systems; thus, in the prototyping phase, they stand out for rapid and low-cost design. An extensive literature survey suggests that different combinations of hardware description languages (HDLs) and FPGA families have been used to design chaotic systems. For instance, in [7–9] the Virtex FPGA family has been used to map chaotic systems, while Artix is used in [10], Zynq in [11], Kintex in [12], and Altera Cyclone in [13]. Though Artix has fewer resources than Kintex and Virtex in the 7 series, all of them have sufficient resources to implement a chaotic system; thus, the choice of family does not affect performance unless the resources are depleted. Among HDLs, VHDL has been chosen in [7, 8, 10, 11], while Verilog is used in [9, 12, 13]. This work presents a systematic approach to implement chaotic systems with quadratic nonlinearities on an FPGA-based digital platform using the Runge–Kutta 4 (RK4) numerical method, because it utilizes a weighted average of the slopes at four points, providing better accuracy than the lower-order RK methods. The remainder of the paper is organized as follows: Sect. 2 explains the design methodology used for the digitization of chaotic systems, Sect. 3 briefs the design flow in FPGA and contains the synthesis and simulation results, and finally Sect. 4 concludes the paper.

2 Design Methodology
2.1 Mathematical Representation
Chaotic systems are represented by their constituent differential characteristic equations. In order to map those differential equations onto an FPGA board, existing numerical methods, such as Euler, improved Euler (also known as Heun), and the fourth-order Runge–Kutta (RK4) method [14, 15], are used to discretize the differential


equations. Out of the existing numerical methods, we have chosen the RK4 method because of its higher degree of accuracy in providing solutions [16]. It uses four intermediate slopes $K_1$, $K_2$, $K_3$, and $K_4$ to determine the solution from the previous sample: $K_1$ corresponds to the beginning of the interval, $K_2$ and $K_3$ to points near its middle, and $K_4$ to its end. The three chaotic differential equations, corresponding to the three state variables x, y, and z, are thus discretized using the RK4 method as represented by Eqs. (1)–(6):

$x(n+1) = x(n) + \frac{h}{6}\left[K_{x1} + 2K_{x2} + 2K_{x3} + K_{x4}\right]$  (1)
$y(n+1) = y(n) + \frac{h}{6}\left[K_{y1} + 2K_{y2} + 2K_{y3} + K_{y4}\right]$  (2)
$z(n+1) = z(n) + \frac{h}{6}\left[K_{z1} + 2K_{z2} + 2K_{z3} + K_{z4}\right]$  (3)

where

$K_{x1} = f_x[x(n),\, y(n),\, z(n)]$  (4a)
$K_{x2} = f_x[x(n) + h\tfrac{K_{x1}}{2},\, y(n) + h\tfrac{K_{y1}}{2},\, z(n) + h\tfrac{K_{z1}}{2}]$  (4b)
$K_{x3} = f_x[x(n) + h\tfrac{K_{x2}}{2},\, y(n) + h\tfrac{K_{y2}}{2},\, z(n) + h\tfrac{K_{z2}}{2}]$  (4c)
$K_{x4} = f_x[x(n) + hK_{x3},\, y(n) + hK_{y3},\, z(n) + hK_{z3}]$  (4d)
$K_{y1} = f_y[x(n),\, y(n),\, z(n)]$  (5a)
$K_{y2} = f_y[x(n) + h\tfrac{K_{x1}}{2},\, y(n) + h\tfrac{K_{y1}}{2},\, z(n) + h\tfrac{K_{z1}}{2}]$  (5b)
$K_{y3} = f_y[x(n) + h\tfrac{K_{x2}}{2},\, y(n) + h\tfrac{K_{y2}}{2},\, z(n) + h\tfrac{K_{z2}}{2}]$  (5c)
$K_{y4} = f_y[x(n) + hK_{x3},\, y(n) + hK_{y3},\, z(n) + hK_{z3}]$  (5d)
$K_{z1} = f_z[x(n),\, y(n),\, z(n)]$  (6a)
$K_{z2} = f_z[x(n) + h\tfrac{K_{x1}}{2},\, y(n) + h\tfrac{K_{y1}}{2},\, z(n) + h\tfrac{K_{z1}}{2}]$  (6b)
$K_{z3} = f_z[x(n) + h\tfrac{K_{x2}}{2},\, y(n) + h\tfrac{K_{y2}}{2},\, z(n) + h\tfrac{K_{z2}}{2}]$  (6c)
$K_{z4} = f_z[x(n) + hK_{x3},\, y(n) + hK_{y3},\, z(n) + hK_{z3}]$  (6d)

where $K_{xi}$, $K_{yi}$, and $K_{zi}$, i = 1 to 4, represent the intermediate slopes of the variables x, y, and z, respectively; $f_x$, $f_y$, and $f_z$ represent the differential equations corresponding to a given chaotic system; and h is the step size, i.e., the interval between consecutive samples. The set of equations (1)–(6) is implemented as follows. The digital design of the chaotic system has two paths: a control path, which controls the flow of the operations, and a datapath, which implements all the algebraic operations. The control path consists of one initial (default) state, which initializes the state variables, and one final (idle) state, which waits for the next set of instructions; these two states are denoted S0 and S6, respectively. Besides these, five other states are required to evaluate Eqs. (1)–(6). The state diagram, consisting of seven states in total, representing the control path is shown in Fig. 1. Three state bits are required to encode these states: S0 (000), S1 (001), S2 (010), S3 (011), S4 (100), S5 (101), and S6 (110). The functions of the seven states S0–S6 of the control path [17] are:
S0: The initial and default state. In this state, the initial values are assigned to the state variables x, y, and z; the process then passes to the next state S1 unconditionally.
S1: In this state, the increments $K_{x1}$, $K_{y1}$, and $K_{z1}$, based on the slopes at the beginning of the interval, are calculated using (4a), (5a), and (6a). The process then jumps to the next state S2 unconditionally.
S2: The increments based on the slopes near the midpoint of the interval, $K_{x2}$, $K_{y2}$, and $K_{z2}$, are calculated from $K_{x1}$, $K_{y1}$, and $K_{z1}$ using (4b), (5b), and (6b), followed by an unconditional transition to the next state S3.

Fig. 1 State transition graph of the finite state machine


S3: The increments based on slopes again near the midpoint, but at a different point than the previous one, $K_{x3}$, $K_{y3}$, and $K_{z3}$, are calculated from $K_{x2}$, $K_{y2}$, and $K_{z2}$ using (4c), (5c), and (6c), followed by an unconditional transition to the next state S4.
S4: The increments based on the slopes at the end of the interval, $K_{x4}$, $K_{y4}$, and $K_{z4}$, are calculated from $K_{x3}$, $K_{y3}$, and $K_{z3}$ using (4d), (5d), and (6d), followed by an unconditional transition to the next state S5.
S5: In this state, the next chaotic samples x, y, and z are generated using (1)–(3). In the next clock cycle, if the counter’s count $C_p$ is less than the user-defined integer N, which represents the number of required samples, the process jumps to S1 to calculate the next solution; otherwise it jumps to S6, where it stays waiting.
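Since the design is validated against Python-based RK4 simulations (Sect. 3.2), a compact Python version of the S0–S5 loop is sketched below, using the Lorenz equations of Table 2 as $f_x$, $f_y$, $f_z$. The initial values (0, 2, 0) follow a Q16.16 fixed-point reading of the constants in the Table 1 pseudocode (an assumption), and h = 2^(−7) as chosen in Sect. 3.

```python
# A minimal Python sketch (not the authors' HDL) of the RK4 update in Eqs. (1)-(6).
def lorenz(x, y, z, a=10.0, b=8.0 / 3.0, c=28.0):
    return a * (y - x), c * x - x * z - y, x * y - b * z

def rk4_step(x, y, z, h=2.0 ** -7, f=lorenz):
    kx1, ky1, kz1 = f(x, y, z)                                            # state S1
    kx2, ky2, kz2 = f(x + h * kx1 / 2, y + h * ky1 / 2, z + h * kz1 / 2)  # state S2
    kx3, ky3, kz3 = f(x + h * kx2 / 2, y + h * ky2 / 2, z + h * kz2 / 2)  # state S3
    kx4, ky4, kz4 = f(x + h * kx3, y + h * ky3, z + h * kz3)              # state S4
    x += h / 6 * (kx1 + 2 * kx2 + 2 * kx3 + kx4)                          # state S5
    y += h / 6 * (ky1 + 2 * ky2 + 2 * ky3 + ky4)
    z += h / 6 * (kz1 + 2 * kz2 + 2 * kz3 + kz4)
    return x, y, z

state = (0.0, 2.0, 0.0)        # initial values assigned in state S0 (assumed scaling)
samples = []
for _ in range(50_000):        # N = 50,000 samples, as in the FSM counter loop
    state = rk4_step(*state)
    samples.append(state)
```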

3 Results
Eleven chaotic systems [18–30], including some popularly known systems such as Rössler and Lorenz, have been designed using the methodology described in Sect. 2 on a common FPGA platform, in order to compare them and choose the best fit for digital applications. To synthesize these systems, both the control path and the datapath have been entered into the Xilinx tool using Verilog HDL; the top module is shown in Fig. 2 and the pseudocode in Table 1. It consists of three 32-bit output signals $x_n$, $y_n$, and $z_n$ for the chaotic system. The clock signal ‘clk’ and the step size ‘h’ have been taken as input signals. The value of h has been chosen as $2^{-7}$; it is taken as a power of two because division and multiplication by powers of two in binary logic can simply be implemented using right and left shift operations, respectively. A counter ‘$C_p$’ has been defined, which increments by 1 every time the samples’ values are calculated until it reaches the parameter N = 50,000; N varies depending on the number of samples required in an application. Two tasks are declared: product, for the multiplication operation in the datapath, and F_det, to find the values of the time derivatives of the state variables for given inputs. All intermediate slopes $K_i$ are evaluated using a case statement, and the product and F_det tasks are used in the datapath. Note that when configuring the FPGA for a chaotic system, only the definition of the F_det task evaluating xdot, ydot, and zdot changes in accordance with the characteristic equations, while the remaining code stays unchanged for all chaotic systems. This makes it easier for the user to implement any new chaotic system on the FPGA quickly. Eleven chaotic systems [18–30] having quadratic-type nonlinearity have been designed using the above methodology. The name/reference of each chaotic system, its three-dimensional characteristic equations, parameter values, and number of arithmetic operations are tabulated in Table 2. These chaotic systems have been synthesized as well as simulated; the results obtained are provided and discussed as follows.


Fig. 2 Top module of RK4 method in Verilog

Table 1 Pseudocode to implement the proposed FSM in Xilinx

1. Module RK4(xn, yn, zn, wn, h, clk); Parameter N = 50,000; Reg [31:0] Cp = 0
2. Task product (input [31:0] a, b, output reg [31:0] c)
3. Task F_det (input [31:0] x, y, z, w, output reg [31:0] xdot, ydot, zdot, wdot)
4. Always @(posedge clk) Case (state)
   S0: begin xn = 32'h0000_0000; yn = 32'h0002_0000; zn = 32'h0000_0000
   S1: begin if (Cp < N) F_val(xn, yn, zn, wn, Kx1, Ky1, Kz1, Kw1); State = S2; End

3.1 Synthesis Results
The target device used is the Xilinx Artix-7 FPGA family’s 7a200tffv1156-1, on which 134,600 slice LUTs and 740 DSP blocks are available. The percentage utilization of the available resources and the total delay, including logic and net delay, for each chaotic system are summarized in Table 2. It is evident from Table 2 that an increase in the number of operations in the characteristic equations results in an increase in the hardware requirements as well; the total delay, however, varies from system to system depending on the logic operations as well as the net delay. Among the implemented chaotic systems, the comparative analysis favors the Pehlivan system because of its lower hardware requirements and the Rössler chaotic system because of its lower delay. Since there is an ample amount of hardware resources available on the Artix device, where each of these chaotic systems uses less than 10% of the resources, hyperchaotic systems can also be implemented on the same device. Delay is the critical parameter when the hardware requirements are within limits, so from the results obtained we recommend the use of the Rössler chaotic system for FPGA-based applications.


Table 2 Chaotic systems’ characteristic equations and synthesis results

S. No. | Chaotic system (references) | Characteristic equations | No. of operations (no. of product terms) | % Utilization of slice LUTs | % Utilization of DSP blocks | Total delay (ns)
1 | Rössler [18] | ẋ = −y − z; ẏ = x + ay; ż = b + z(x − c); a = 0.2, b = 0.2, c = 5.7 | 7 (2) | 4.08 | 8.65 | 30.745
2 | Lorenz [19] | ẋ = a(y − x); ẏ = cx − xz − y; ż = xy − bz; a = 10, b = 8/3, c = 28 | 9 (5) | 4.33 | 12.97 | 42.250
3 | Pehlivan [20] | ẋ = y − x; ẏ = ay − xz; ż = xy − b; a = b = 0.5 | 6 (3) | 3.63 | 10.81 | 40.310
4 | [21] | ẋ = a(y − x); ẏ = xz − y; ż = b − xy − cz; a = 5, b = 16, c = 1 | 8 (4) | 4.10 | 11.89 | 41.637
5 | [22] | ẋ = a(y − x) + yz; ẏ = cx − y − xz; ż = xy − bz; a = 35, b = 8/3, c = 25 | 11 (6) | 4.81 | 16.22 | 41.391
6 | MACM [23] | ẋ = −ax − byz; ẏ = −x + cy; ż = d − y² − z; a = 2, b = 2, c = 0.5, d = 4 | 11 (5) | 3.78 | 10.81 | 47.978
7 | [24, 25] | ẋ = a(y − x); ẏ = cx − xz; ż = xy − bz; a = 35, b = 3, c = 35 | 9 (5) | 4.29 | 14.05 | 42.242
8 | Li [26] | ẋ = a(y − x); ẏ = xz − y; ż = b − xy − cz; a = 5, b = 16, c = 1 | 8 (4) | 4.10 | 11.89 | 41.637
9 | Rabinovich [27, 28] | ẋ = hy − ax + yz; ẏ = hx − by − xz; ż = xy − dz; a = 4, b = 1, d = 1, h = 6.75 | 14 (8) | 4.61 | 12.97 | 45.204
10 | Chen [29] | ẋ = a(y − x); ẏ = (c − a)x + cy − xz; ż = xy − bz; a = 35, b = 3, c = 28 | 11 (6) | 4.50 | 15.14 | 39.717
11 | Lü [30] | ẋ = a(y − x); ẏ = cy − xz; ż = xy − bz; a = 36, b = 3, c = 20 | 9 (5) | 4.29 | 14.05 | 40.431

3.2 Simulation Results
In the FPGA-based design flow, functional verification of the implemented system using simulation results is a necessary step to validate the design. To validate the FPGA-based results of the design, all 11 chaotic systems have also been simulated in Python using the RK4 numerical method; the results, in the form of time series for two of the systems, Lorenz and Rössler, are shown together with the Xilinx simulation results in Fig. 3. The simulation results from Xilinx Vivado are in line with the simulation results from Python, thus confirming the feasibility of these chaotic systems on FPGA.


Fig. 3 Numerical simulation and Xilinx Vivado simulation results of a Lorenz, b Rössler


4 Conclusion
In this paper, the FPGA digital circuit design of eleven chaotic systems using the RK4 numerical method in the Verilog hardware description language has been proposed. The advantages of the proposed methodology are its field programmability and the easy implementation of a new chaotic system in comparison with analog counterparts. All considered chaotic systems have been synthesized and compared in terms of the percentage utilization of hardware resources on the target FPGA device Artix-7 and the total time delay. While the Pehlivan chaotic system outperforms in terms of hardware utilization, the Rössler chaotic system is the best fit for lower delay requirements. The simulation results have also been validated by Python-based numerical simulations. Based on this design methodology, these chaotic systems can further be used for digital applications.

References
1. Paliwal A, Mohindroo B, Suneja K (2020) Hardware design of image encryption and decryption using CORDIC based chaotic generator. In: 2020 5th IEEE international conference on recent advances and innovations in engineering (ICRAIE), Jaipur, India, pp 1–5. https://doi.org/10.1109/ICRAIE51050.2020.9358354
2. Tang Z, Yu S (2012) Design and realization of digital image encryption and decryption based on multi-wing butterfly chaotic attractors. In: 2012 5th international congress on image and signal processing, Chongqing, China, pp 1143–1147. https://doi.org/10.1109/CISP.2012.6469744
3. Negi A, Saxena D, Suneja K (2020) High level synthesis of chaos based text encryption using modified hill cipher algorithm. In: 2020 IEEE 17th India Council international conference (INDICON), New Delhi, India, pp 1–5. https://doi.org/10.1109/INDICON49873.2020.9342591
4. Gomar S, Ahmadi M (2019) A digital pseudo random number generator based on a chaotic dynamic system. In: 2019 26th IEEE international conference on electronics, circuits and systems (ICECS), Genoa, Italy, pp 610–613. https://doi.org/10.1109/ICECS46596.2019.8964861
5. Suchit S, Suneja K (2022) Implementation of secure communication system using chaotic masking. In: 2022 IEEE global conference on computing, power and communication technologies (GlobConPT), New Delhi, India, pp 1–5. https://doi.org/10.1109/GlobConPT57482.2022.9938303
6. Yang T, Wu CW, Chua LO (1997) Cryptography based on chaotic systems. IEEE Trans Circ Syst I Fundam Theor Appl 44(5):469–472. https://doi.org/10.1109/81.572346
7. Tuna M, Alçın M, Koyuncu I, Fidan CB, Pehlivan I (2019) High speed FPGA-based chaotic oscillator design. Microprocess Microsyst 66:72–80
8. Tuna M, Fidan CB (2016) Electronic circuit design, implementation and FPGA-based realization of a new 3D chaotic system with single equilibrium point. Optik 127(24):11786–11799
9. Chen S, Yu S, Lü J, Chen G, He J (2018) Design and FPGA-based realization of a chaotic secure video communication system. IEEE Trans Circ Syst Video Technol 28(9):2359–2371. https://doi.org/10.1109/TCSVT.2017.2703946
10. Nuñez-Perez JC, Adeyemi VA, Sandoval-Ibarra Y, Pérez-Pinal FJ, Tlelo-Cuautle E (2021) FPGA realization of spherical chaotic system with application in image transmission. Math Probl Eng, Article ID 5532106, 16 p
11. Schmitz J, Zhang L (2017) Rössler-based chaotic communication system implemented on FPGA. In: 2017 IEEE 30th Canadian conference on electrical and computer engineering (CCECE), pp 1–4. https://doi.org/10.1109/CCECE.2017.7946729
12. Tolba MF, Elwakil AS, Orabi H, Elnawawy M, Aloul F, Sagahyroon A, Radwan AG (2020) FPGA implementation of a chaotic oscillator with odd/even symmetry and its application. Integration 72:163–170
13. Shi QY, Huang X, Yuan F, Li YX (2021) Design and FPGA implementation of multi-wing chaotic switched systems based on a quadratic transformation. Chin Phys 30(2):020507-1–020507-10
14. Koyuncu I, Özcerit A, Pehlivan I (2014) Implementation of FPGA-based real time novel chaotic oscillator. Nonlinear Dyn 7:49–59
15. Garg A, Yadav B, Sahu K, Suneja K (2021) An FPGA based real time implementation of Nosé–Hoover chaotic system using different numerical techniques. In: 2021 7th international conference on advanced computing and communication systems (ICACCS), Coimbatore, India, pp 108–113. https://doi.org/10.1109/ICACCS51430.2021.9441923
16. Cartwright JHE, Piro O (1992) The dynamics of Runge–Kutta methods. Int J Bifurcation Chaos 2:427–449
17. Sadoudi S, Tanougast C, Azzaz MS et al (2013) Design and FPGA implementation of a wireless hyperchaotic communication system for secure real-time image transmission. J Image Video Proc 2013:43. https://doi.org/10.1186/1687-5281-2013-43
18. Rössler OE (1976) An equation for continuous chaos. Phys Lett A 57(5):397–398
19. Lorenz EN (1963) Deterministic non-periodic flows. J Atmos Sci 20:130–141
20. Pehlivan I, Uyaroğlu Y (2010) A new chaotic attractor from general Lorenz system family and its electronic experimental implementation. Turkish J Electr Eng Comput Sci 18(2):171–184. https://doi.org/10.3906/elk-0906-67
21. Li XF, Chlouverakis KE, Xu DL (2009) Nonlinear dynamics and circuit realization of a new chaotic flow: a variant of Lorenz, Chen and Lü. Nonlinear Anal Real World Appl 10(4):2357–2368
22. Qi G, Chen G, Du S, Chen Z, Yuan Z (2005) Analysis of a new chaotic system. Physica A: Stat Mech Appl 352(2–4):295–308
23. Méndez-Ramírez R, Cruz-Hernández C, Arellano-Delgado A, Martínez-Clark R (2017) A new simple chaotic Lorenz-type system and its digital realization using a TFT touch-screen display embedded system. Complexity 6820492
24. Yang Q, Chen G (2008) A chaotic system with one saddle and two stable node-foci. Int J Bifur Chaos 18:1393–1414
25. Liu Y, Yang Q (2010) Dynamics of a new Lorenz-like chaotic system. Nonlinear Anal Real World Appl 11(4):2563–2572
26. Li XF, Chlouverakis KE, Xu DL (2009) Nonlinear dynamics and circuit realization of a new chaotic flow: a variant of Lorenz, Chen and Lü. Nonlinear Anal Real World Appl 10:2357–2368
27. Pikovski AS, Rabinovich MI, Trakhtengerts VY (1978) Onset of stochasticity in decay confinement of parametric instability. Soviet Physics JETP 47:715–719
28. Kocamaz UE, Uyaroğlu Y, Kizmaz H (2014) Control of Rabinovich chaotic system using sliding mode control. Int J Adapt Control Signal Process 28(12):1413–1421
29. Chen G, Ueta T (1999) Yet another chaotic attractor. Int J Bifurcat Chaos 9:1465–1466
30. Lu J, Chen G (2002) A new chaotic attractor coined. Int J Bifurcat Chaos 12:659–661. https://doi.org/10.1142/S0218127402004620

A Comprehensive Survey on Replay Strategies for Object Detection Allabaksh Shaik and Shaik Mahaboob Basha

Abstract Object detection is the task of predicting the precise location and the type of the objects in a scene. The development of the Convolutional Neural Network (CNN) gave rise to great advances in object detection. The most popular object detectors are YOLO and Faster RCNN (region-based CNN). The primary limitation of these object detectors is their lack of capability to continually gain knowledge of new objects in a dynamic world. Humans learn continually while keeping the ability to retain old knowledge; however, every deep network has a limited capacity to learn and cannot exactly replicate the way humans perform continual learning. This is primarily due to a phenomenon called catastrophic forgetting, in which previously learnt data cannot be retained while learning a new task. The issue of continual learning has been extensively studied in image classification applications, and these studies are essential in resolving object detection problems. Incorporating continual learning strategies into the existing deep learning-based object detectors will be very useful in applications like retail, autonomous driving, and surveillance. Various recent research findings rely on knowledge refinement to constrain the representation to hold older information; this rigid limitation is disadvantageous for learning new knowledge. Among the various techniques that exist in the literature, the replay-based approach is the closest to the way humans perform continual learning to retain previous knowledge. This article surveys and analyzes the state-of-the-art replay techniques and compares them to identify the most suitable technique for object detection on edge devices.

Keywords Convolutional neural network · Object detector · Continual learning · Catastrophic forgetting · Replay strategies · Object detectors

A. Shaik (B) Jawaharlal Nehru Technological University Anantapur, Ananthapuramu, Andhra Pradesh, India
e-mail: [email protected]
Sri Venkateswara College of Engineering Tirupati, Affiliated to Jawaharlal Nehru Technological University Anantapur, Ananthapuramu, Andhra Pradesh, India
S. M. Basha N.B.K.R. Institute of Science and Technology, Affiliated to Jawaharlal Nehru Technological University Anantapur, Vidyanagar, Ananthapuramu, Andhra Pradesh, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_13


1 Introduction
In recent times, varied research efforts have sought practical approaches to speed up the progress of deep learning techniques, and many advanced techniques with exceptional outcomes can be observed. LeCun et al. developed the Convolutional Neural Network (CNN) [1], which augmented research advancements in object detection, and continuous refinements of deep learning procedures with experiential results have been identified in the literature. Object localization is the recognition of all the objects in an image together with their precise locations. Object identification and localization using deep learning techniques have developed rapidly and effectively, and object detection methods are widely used in various fields like the military, radar sensing, and image processing. Exact object identification with respect to features like dimensions, postures, and viewpoints is a challenging area of research, and for the past few years enormous research has been carried out using Machine Learning (ML) and Deep Learning (DL) techniques.

Xiao et al. proposed an object detection technique [2] which has a number of associations with object classification and semantic segmentation, explained with reference to Fig. 1. Object classification involves determining the category to which objects in an image belong. In contrast, object detection not only identifies the object categories but also accurately locates them using rectangular bounding boxes. Semantic segmentation focuses on predicting the object category for each individual pixel, without distinguishing between different instances of the same object. Instance segmentation goes beyond semantic segmentation by not only predicting the object categories for each pixel but also distinguishing between different instances of objects.

Figure 2 illustrates the fundamental components utilized in object detection. The region selector employs a sliding-window approach, with windows of varying sizes and aspect ratios traversing the image, first from left to right and then from top to bottom, with a fixed step size. The sliding window is used to crop image blocks, which are then resized to a consistent dimension. Techniques for extracting attributes are described in HOG [3], Haar [4], and SIFT [5]. To recognize the type of object from the extracted attributes, classifiers are proposed in SVM


Fig. 1 a Object classification, b object detection, c semantic segmentation, d instance segmentation


Fig. 2 Basic architecture of traditional object detection algorithm

Fig. 3 Conventional object detection based on DCNNs

[6] and AdaBoost [7]. Figure 3 presents an overview of various object detection methods. Deep Convolutional Neural Networks (DCNNs) have undergone significant evolution over the years, leading to remarkable advancements in object detection. In the past, object detection applications have frequently been used for investigating the evolution of DCNNs and their backbone networks. Subsequently, network frameworks and the reported loss functions for object detection are examined and compared. Future research will focus on improving object detection models’ ability to generalize to unseen classes with limited or no training data, allowing for better adaptability in novel scenarios, and object detection models will be designed to learn continuously from new data, adapting to evolving environments without catastrophic forgetting. A minimal sketch of the classical sliding-window region selector described above follows.
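The sketch below illustrates the traditional region selector; the window sizes, step size, and image size are illustrative assumptions, not values from the surveyed papers.

```python
# A minimal sketch of the classical sliding-window region selector:
# fixed step size, multiple window sizes, left-to-right then top-to-bottom
# traversal, cropping blocks for a downstream feature extractor
# (e.g., HOG) and classifier (e.g., SVM).
import numpy as np

def sliding_windows(image: np.ndarray, window_sizes, step: int):
    h, w = image.shape[:2]
    for win_h, win_w in window_sizes:
        for top in range(0, h - win_h + 1, step):        # top to bottom
            for left in range(0, w - win_w + 1, step):   # left to right
                yield (top, left, win_h, win_w), image[top:top + win_h,
                                                       left:left + win_w]

image = np.zeros((240, 320), dtype=np.uint8)             # placeholder grayscale image
for box, crop in sliding_windows(image, [(64, 64), (96, 128)], step=16):
    pass  # each crop would be resized to a fixed size, then HOG + SVM applied
```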


2 Object Detectors
Object detection encompasses two main tasks: object localization and object classification. Deep Convolutional Neural Networks (DCNNs) have been widely used in object detection, and detectors can be categorized into two types. One type is the two-stage object detection architecture, which separates the tasks of object localization and classification: it first generates region proposals and then classifies them. The key advantage of the two-stage approach is its high accuracy; however, it suffers from slower detection speed. Notable examples of two-stage object detection architectures include RCNN [8], SPPNet [9], Fast RCNN [10], Faster RCNN [11], Mask RCNN [12], and RFCN [13]. The other type is the one-stage object detection architecture, which directly locates and classifies objects using DCNNs without dividing the process into separate stages. A one-stage detector produces the class probabilities and the location coordinates of an object in a single step, eliminating the need for a region proposal process, which makes it simpler than the two-stage approach. The primary advantage of one-stage detectors is their ability to quickly identify objects in a scene; however, they often exhibit lower accuracy than two-stage architectures. Examples of one-stage object detection models include OverFeat [14], the YOLO series [15–17], SSD [18], DSSD [19], FSSD [20], and DSOD [21]. Table 1 presents the performance characteristics of some traditional two-stage and one-stage object detectors, while Fig. 6 illustrates the evolution of object detection milestones. For a comprehensive overview of milestone object detection designs, refer to Table 1, which summarizes their features, properties, and weaknesses. The subsequent sections delve into the details of the two-stage and one-stage object detection architectures and also highlight open-source object detection platforms.


Table 1 Chart of highlights, properties, and milestone object detection architectures

Method | Highlights and properties | Shortcomings
RCNN [8] | Uses Deep Convolutional Neural Networks (DCNNs) to extract image features; selects 2000 proposals using the selective search algorithm; uses a Support Vector Machine (SVM) to classify regions; uses a bounding-box regressor to refine regions | Slow training with high resource consumption, and no end-to-end training
SPPNet [9] | Extracts features of the entire image with DCNNs; picks 2000 region proposals on the image and maps them to the feature maps; multi-scale images can be submitted to the DCNNs using spatial pyramid pooling | Selective search for picking region proposals is still slow; no end-to-end training
Fast RCNN [10] | Extracts the features of the entire image using DCNNs; extracts region proposals with the selective search technique and maps them to the feature maps; down-samples the region proposal data to obtain fixed-size feature maps | Selective search to extract region proposals is still a slow process; end-to-end training is not fully available
Faster RCNN [11] | Switches the selective search algorithm to the region proposal network (RPN); end-to-end training is achievable since the RPN shares the feature maps with the backbone network | Poor performance for large scale changes of objects, a problem for applications that need to respond quickly to changes, like real-time games
Mask RCNN [12] | Uses an ROI align pooling layer instead of the ROI pooling layer, improving detection precision; combines object detection and segmentation training for better detection accuracy; relevant for small object detection | Detection speed cannot keep up with real-time requirements
YOLO [15] | A novel single-stage detection network that can detect objects at high speed and meet real-time demands | Low detection accuracy for dense or small objects
YOLOV2 [16] | Employs multi-dataset joint training; makes use of a new backbone network, DarkNet19; generates anchor boxes using the k-means clustering technique | Training is complex
YOLOV3 [17] | Uses a new backbone network, DarkNet53; employs feature fusion at many levels to increase the precision of multi-scale detection | Performance declines with increasing IoU
SSD [18] | Makes use of a multi-layer detection system; employs a multi-scale anchors mechanism at many levels | Not well suited for detecting small objects
DSSD [19] | Uses a multi-layer detection mechanism; up-sampling via deconvolution instead of simple linear interpolation improves image resolution | Detection speed decreases relative to SSD

Fig. 4 R-convolutional neural network architecture

information. Another issue is that each area proposal is fully separate and does not make use of DCNN feature sharing. This means that extracting them all will require a significant amount of resources. The last convolutional layer spatial pyramid pooling [10] is introduced after the cropping/warping phase of the RCNN is eliminated. Cropping or distorting an image can result in missing object information. In order to produce a 21-dimensional fixedlength feature vector for the fully connected (FC) layer [9], an image of any size can be input into the DCNNs. This feature vector is then used by the FC layer to forecast the next pixel in the image. SPPNet test performance is 10–100 times faster than RCNN since the complete feature map is shared. Because SPPNet and RCNN rely significantly on end-to-end training, there is no way to completely execute the model without first performing extensive data preprocessing. In SPPNet, network accuracy [10] is constrained since the convolutional layer cannot be trained further during fine-tuning. Girshick et al. described about the object detection using Regional convolutional neural networks and the proposed method gave an avenue for future researchers to develop effficient algorithms that offer better accuracy rates and speed [8]. Numerous industries, including high-end robotics and automation, biometric and face identification, and medical imaging, use object detection. Based on how the tasks of classification and bounding-box regression are carried out, the majority of object detectors may be generally divided into two groups of object detectors. Dai et al. suggested that to recognize objects within the bounding box faster than RCNN, object detection is carried out using region-based fully convolutional


For the purpose of locating the target item within the bounding box, RCNN employs the selective search approach. Girshick et al.'s work focused on Fast RCNN [10], a fast deep learning algorithm that uses an RoI pooling layer to reduce computation, memory usage, and training time. The RoI pooling layer converts region features of different sizes into fixed-size feature vectors, which allows the network to focus on the most relevant areas. In Fast RCNN, a single DCNN pass computes a feature map of the whole image; the selective search method then proposes candidate regions, which are mapped onto this feature map. The RoI pooling layer converts the different feature regions into fixed-size feature vectors and feeds them into the fully connected (FC) layer. Finally, bounding-box regression precisely estimates object position, while the Softmax layer predicts item categories, as shown in Fig. 5. Using a multi-task loss, Fast RCNN trains classification and bounding-box regression concurrently, so the two tasks share convolution information; this multi-task training replaces the stage-wise SVM + bounding-box regression training. The benefits of Fast RCNN over RCNN/SPPNet resulting from these advances are as follows: Fast RCNN performs better in terms of accuracy than RCNN/SPPNet; training of the detector is end-to-end due to the multi-task loss; compared to SPPNet, which updates just the fully connected (FC) layer, Fast RCNN training can update all network layers; hard disk storage is not required for feature caching; and training and testing are quicker than for RCNN/SPPNet.
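The RoI pooling step described above can be illustrated with a minimal numpy sketch; the feature-map shape, region coordinates, and the 7 × 7 output grid are illustrative assumptions rather than details taken from the surveyed papers.

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=(7, 7)):
    """Max-pool an RoI window of a (C, H, W) feature map to a fixed (C, 7, 7) grid."""
    C, H, W = feature_map.shape
    x1, y1, x2, y2 = roi                                  # RoI in feature-map coordinates
    ys = np.linspace(y1, y2, out_size[0] + 1).astype(int)
    xs = np.linspace(x1, x2, out_size[1] + 1).astype(int)
    out = np.empty((C, *out_size))
    for i in range(out_size[0]):
        for j in range(out_size[1]):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)  # guard against empty bins
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out

fmap = np.random.rand(256, 38, 50)             # hypothetical backbone feature map
fixed = roi_pool(fmap, roi=(5, 4, 30, 20))     # -> (256, 7, 7), fed to the FC layers
```

Whatever the size of the proposed region, the output is always the same fixed-size vector, which is what allows a single set of FC layers to handle all proposals.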

Fig. 5 Fast RCNN architecture

Fig. 6 Faster RCNN architecture


The selective search technique takes a long time to examine all region proposals in an image and map them into feature maps. Fast RCNN required about 2.3 s to make predictions at test time, and roughly 2 s of that time was used to build 2000 RoIs. As a result, the conventional region proposal methodologies are the bottleneck of the object detection architecture. Ren et al. created a region proposal network (RPN) in the Faster RCNN [11] to resolve this issue. The RPN extracts region proposals directly from the whole-image feature maps, sharing the detection network's convolutional features (Fig. 6).

YOLO is the popular single-stage detector and is widely used in the literature for real-time object detection. Unlike two-stage detectors, all the YOLO versions V1, V2, V3, etc., operate in a single stage with no intermediate computation of probable region proposals. Redmon et al. proposed single-stage detectors that execute in a single step: YOLO V1 [15] computes both the classification label and the bounding-box coordinates directly from the feature maps produced by the backbone feature extractor. However, all these object detectors suffer from the setback of catastrophic forgetting; they tend to forget the information gained from formerly learnt classes when refined for new classes. In YOLO, the given picture is divided into a grid of N × N cells, and every cell predicts confidences for 'n' bounding boxes, so the predicted result is encoded as a tensor of size N × N × (n × 5 + p), where p is the number of class probabilities. A detected bounding box has five attributes, namely the confidence score, width, height, and center coordinates (x, y). A number of limitations also exist in YOLO V1, related to the closeness of objects in the image: if objects appear in a group, it cannot find the small objects, and the key concern is localization error when locating objects in a given image; YOLO V1 sometimes fails, for example by identifying a human as an airplane (Fig. 7). Shafiee et al. focused on YOLO V2 [16], which supersedes YOLO and achieves a strong balance between running speed and precision. For higher accuracy, YOLO V2 integrates batch normalization into each convolution layer, which raises mAP by about two percentage points. A high-resolution classifier is also employed: the classification network is fine-tuned at a higher input resolution so that its filters adapt and the detector performs well.
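As a toy illustration of the YOLO V1 output encoding described above, the following sketch decodes one grid cell of an N × N × (n × 5 + p) prediction tensor; the grid size, box count, class count, and confidence threshold are illustrative assumptions, and the random tensor stands in for a real network output.

```python
import numpy as np

N, B, C = 7, 2, 20                       # grid size, boxes per cell, classes (YOLO V1-style)
pred = np.random.rand(N, N, B * 5 + C)   # stand-in for the network's output tensor

def decode_cell(pred, i, j, conf_thresh=0.5):
    """Turn the raw predictions of grid cell (i, j) into image-relative boxes."""
    boxes = []
    class_probs = pred[i, j, B * 5:]
    for b in range(B):
        x, y, w, h, conf = pred[i, j, b * 5: b * 5 + 5]
        if conf >= conf_thresh:
            cx, cy = (j + x) / N, (i + y) / N   # (x, y) are offsets inside the cell
            boxes.append((cx, cy, w, h, conf, int(class_probs.argmax())))
    return boxes

detections = [box for i in range(N) for j in range(N) for box in decode_cell(pred, i, j)]
```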

Fig. 7 YOLO architecture


For objects with various aspect ratios, YOLO has weak generalization ability, which YOLO V2 solves by introducing anchors; each grid cell can anticipate several scales and aspect ratios. To discover suitable prior bounding boxes automatically, YOLO V2 uses the k-means clustering technique (see the sketch below), which can increase detection accuracy. By restricting the predicted offsets relative to the grid cell arrangement to a range between 0 and 1, YOLO V2 resolves the training instability of the anchor-offset approach. Redmon et al. proposed the enhanced YOLO V3 [17] for object detection, which is employed in many aspects of human life, including health, education, and many others; as these sectors developed, the one-stage model had to be improved. In YOLO V3, the objectness score is calculated using logistic regression. Instead of the Softmax layer used in YOLO V2, YOLO V3 employs a logistic classifier for each class and uses DarkNet53, which has 53 convolution layers; YOLO V3 is thus much deeper than YOLO V2, whose DarkNet19 backbone has nineteen convolution layers. YOLO V3 addressed YOLO V2's issues and struck a balance between speed and accuracy. Both the RCNN series and YOLO exhibit a trade-off between accuracy and speed: RCNN has outstanding accuracy but reduced speed for object detection, while YOLO detects objects fast but is weak on small objects. Liu et al. presented the single-shot multibox detector (SSD) [18], which takes into account the benefits of both Faster RCNN and YOLO. As the backbone network for feature extraction, SSD uses VGG16, and Fig. 8 depicts the extraction of SSD network hierarchical features. SSD adapts to detecting multi-scale objects by using the anchor mechanism in conjunction with multi-scale feature maps, as in Faster RCNN. SSD512 outperforms Faster RCNN with VGG16 in accuracy while being about three times faster, and SSD300 runs at 59 frames per second, outperforming YOLO with substantially better detection quality [18]. DSSD [19] uses a deeper network, ResNet101, as the backbone to improve the SSD's low-level feature maps; feature fusion on the low-level feature maps is achieved by incorporating deep deconvolution and skip-connection modules.
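A minimal sketch of the k-means anchor clustering used by YOLO V2 with the d = 1 − IoU distance; the cluster count, the mean-based centroid update, and the random (w, h) data below are simplifying assumptions for illustration.

```python
import numpy as np

def iou_wh(box, anchors):
    """IoU between one (w, h) box and an array of (w, h) anchors, all anchored at the origin."""
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    return inter / (box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter)

def kmeans_anchors(boxes, k=5, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each ground-truth box to the nearest anchor under d = 1 - IoU
        assign = np.array([np.argmin(1 - iou_wh(b, anchors)) for b in boxes])
        for a in range(k):
            if np.any(assign == a):
                anchors[a] = boxes[assign == a].mean(axis=0)  # centroid update
    return anchors

boxes = np.random.default_rng(1).uniform(0.05, 1.0, size=(500, 2))  # (w, h) pairs
print(kmeans_anchors(boxes))
```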

Fig. 8 Single-shot multibox detector architecture


Similarly, FSSD [20], which builds on SSD, fuses low-level features into more expressive representations.

3 Continual Learning and Catastrophic Forgetting

Research on training object detection models to continuously learn from fresh data over time, as reviewed by Menezes et al. [22], is known as continual learning for object detection. Traditional object detection algorithms are frequently trained on a fixed dataset with access to all training data at once. However, in real-world scenarios, new object classes, variations, or environments may emerge after the initial training, requiring the model to adapt and learn from these new examples. Continual learning approaches aim to address this challenge by enabling object detection models to be trained incrementally on new data while preserving the knowledge gained from preceding data. The goal is to avoid catastrophic forgetting, in which the model forgets earlier learned data in favor of fresh data. Several techniques and strategies are used in continual learning for object detection, such as regularization, replay and memory, generative models, knowledge distillation, task-freezing, and fine-tuning, which modern researchers can utilize to structure their incremental object detector studies. In this manner, our contribution is a brief and methodical recap of the key solutions to the issue of continuously learning and identifying new object instances. In the area of continuous object detection, CL methods are combined to address memory loss and knowledge transferability between object detection tasks, and a broad understanding of both subjects is necessary in order to identify prospects in the area and understand the findings of this analysis.

The task ID is essential in determining the types of classes and distributions that can be encountered during testing in classification tasks; this determines whether a task-specific approach is required or a more general strategy is sufficient. As such, van de Ven and Tolias' convention of three typical task scenarios has been widely adopted in the continual learning (CL) literature. Task-incremental learning assumes that the model possesses knowledge of the task ID during both training and testing, which enables a task-specific solution. Domain-incremental learning, on the other hand, does not provide the task ID during testing but retains the task structure: normally, class labels are kept, although the distribution of the data may change. In class-incremental learning, the model must infer the task ID because it is assumed not to be provided at test time; the model must gradually incorporate more classes and increase the variety of its predictions. Task-free or task-agnostic CL adds a scenario in which task labels are not provided during either training or testing, making it the most challenging scheme: to deal with the changing data distribution, the model must operate without knowledge of task boundaries.

The major techniques are separated into three families: parameter isolation, regularization, and episodic replay. In parameter isolation, some of the network's parameters are frozen, and a new layer is added whenever a new task is presented to the network.


This enables increased network capacity without training from scratch. Regularization prevents the network from overfitting to either the new or the old classes, which aids the learning ability of the object detection architectures. The final technique is based on the replay mechanism, where data corresponding to previously learnt knowledge is constantly replayed to the deep networks, which helps avoid the problem of catastrophic forgetting.

McCloskey and Cohen [23] and Ratcliff [24] describe catastrophic forgetting, an issue that affects neural networks, along with other learning systems, comprising both biological and artificial intelligence systems: a learning system may forget how to perform an earlier task when it is trained on another. A well-supported model of biological learning in humans suggests that neocortical neurons acquire knowledge using a procedure that is prone to catastrophic forgetting, and that this neocortical learning procedure is complemented by a system that replays memories stored in the hippocampus in order to continually reinforce tasks that have not been carried out recently [25]. As artificial intelligence researchers, the lesson we may draw from this is that it is acceptable for our learning algorithms to suffer from forgetting, provided they are paired with complementary algorithms that decrease the information loss. Designing such complementary algorithms relies on understanding the characteristics of the forgetting experienced by our current primary learning algorithms. The work in [26] investigates the extent to which catastrophic forgetting affects a variety of learning algorithms and neural network activation functions. Neuroscientific evidence suggests that the relationship between the old and new tasks strongly influences the outcome of two successive learning experiences; consequently, [26] considers three distinct types of relationship between tasks: one in which the tasks are functionally identical but with distinct input formats, one in which the tasks are similar, and one in which the tasks are dissimilar.

4 Replay Strategies for Object Detection

Kudithipudi et al. [27] described how humans can learn about new objects from limited experience without forgetting information about old objects, using a wide variety of mechanisms, including neurogenesis, episodic replay, meta-plasticity, neuro-modulation, context-dependent perception and gating, hierarchical distributed systems, and many more. Among these multiple paradigms of continual learning, there is strong evidence in support of episodic replay for memory consolidation in the biological brain (Fig. 9). Even though episodic replay has been considerably explored in the context of image classification, very few works exist in the space of object detection.

Fig. 9 Replay-based continual learning framework

Figure 9 shows the generic block diagram of a replay-based continual learning framework. Replay-based techniques always associate a memory with the object detection network, where either the instances or the feature maps corresponding to key instances are replayed to the object detection network at regular intervals to avoid the catastrophic forgetting phenomenon. Both the classification and the bounding-box regression modules operate on the instance-level features. Binary cross-entropy is typically used for the classification loss; however, focal loss tends to minimize the effects of class imbalance. The focal loss is given by Eq. (1):

$$L_{\mathrm{cls}} = \sum_{i=1}^{N} FL\left(d_t^i, d_p^i\right), \qquad (1)$$

where $d_t^i$ indicates the ground-truth one-hot vector and $d_p^i$ the predicted one-hot vector for the $i$-th sample. For bounding-box regression, the typical loss function is given by Eq. (2):

$$L_{\mathrm{loc}} = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L1}\left(t_i^u - v_i\right), \qquad (2)$$

where $\{x, y, w, h\}$ indicates the coordinates of the bounding box. The smooth-L1 loss in Eq. (2) is more robust than the L2 loss.
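The two losses in Eqs. (1) and (2) can be sketched in a few lines of numpy; the α = 0.25 and γ = 2 parameters below are the common defaults from the focal loss literature, not values taken from the surveyed papers, and the toy inputs are illustrative.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Eq. (1): summed focal loss; p and y are (N, C) predicted probs and one-hot targets."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.sum(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

def smooth_l1(t, v):
    """Eq. (2): summed smooth-L1 loss over the (x, y, w, h) regression targets."""
    d = np.abs(np.asarray(t) - np.asarray(v))
    return float(np.sum(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)))

y = np.array([[1, 0], [0, 1]])            # one-hot ground truth for two samples
p = np.array([[0.9, 0.1], [0.3, 0.7]])    # predicted probabilities
print(focal_loss(p, y), smooth_l1([0.5, 0.5, 1.0, 1.0], [0.4, 0.6, 1.2, 0.9]))
```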

This paper attempts to review and present a comparative analysis of modern replay-based approaches in object detection.

In Shin et al.'s paper "Continual Learning with Deep Generative Replay" [28], episodic replay is employed in order to continue learning. This technique replays old data from memory together with new data at the same time; one of the key drawbacks is that the old samples need to be stored in memory, which cannot be scaled to larger datasets. Deep Generative Replay eliminates the need for storing key samples: the model is trained using generated pseudo-data, which can replay the knowledge of old objects. The network adopts a twofold design, called a Scholar Model, that combines a Deep Generative Model (the "generator") and a Task-Solving Model (the "solver"); the generator is trained using the Generative Adversarial Networks (GANs) framework, and together the two models form a generator–solver pair. The generator produces fake data for the desired targets, and when a new task is presented, both the generator and the solver are updated. The resulting input–target pairs produced by the generator–solver pair can be used to teach other models; in this way, the approach retains knowledge, overcomes catastrophic forgetting, and does not need to revisit the past data.
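A toy sketch of the generator–solver replay loop just described; here the "generator" is a per-class Gaussian model and the "solver" a nearest-centroid classifier, deliberate simplifications standing in for the GAN generator and detection network of [28].

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianGenerator:
    """Stand-in 'generator': one Gaussian per class, sampled to produce pseudo-data."""
    def fit(self, X, y):
        self.stats = {c: (X[y == c].mean(0), X[y == c].std(0) + 1e-6) for c in np.unique(y)}
        return self
    def sample(self, n):
        keys = list(self.stats)
        return np.array([rng.normal(*self.stats[keys[rng.integers(len(keys))]]) for _ in range(n)])

class CentroidSolver:
    """Stand-in 'solver': nearest-centroid classifier."""
    def fit(self, X, y):
        self.centroids = {c: X[y == c].mean(0) for c in np.unique(y)}
        return self
    def predict(self, X):
        cs = list(self.centroids)
        d = np.stack([np.linalg.norm(X - self.centroids[c], axis=1) for c in cs], axis=1)
        return np.array(cs)[d.argmin(axis=1)]

def learn_task(generator, solver, X_new, y_new, n_replay=200):
    """Mix real new-task data with generated pseudo-data labeled by the old solver."""
    if generator is not None:
        X_old = generator.sample(n_replay)
        y_old = solver.predict(X_old)          # old solver supplies the pseudo-labels
        X_new = np.vstack([X_new, X_old])
        y_new = np.concatenate([y_new, y_old])
    return GaussianGenerator().fit(X_new, y_new), CentroidSolver().fit(X_new, y_new)

# task 1 (classes 0/1), then task 2 (class 2) learned without revisiting task-1 data
X1 = rng.normal(0, 1, (200, 4)) + np.arange(2).repeat(100)[:, None]
y1 = np.arange(2).repeat(100)
gen, sol = learn_task(None, None, X1, y1)
gen, sol = learn_task(gen, sol, rng.normal(5, 1, (100, 4)), np.full(100, 2))
```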


In their study titled "Take goods from shelves: A dataset for class-incremental object detection" [29], Hao et al. made a valuable contribution by proposing a dataset specifically designed for class-incremental object detection. The dataset includes three coarse classes, consisting of 38,000 top-quality images, and 24 fine-grained classes, and this work represents an advanced approach to class-incremental techniques. The researchers adopted Faster RCNN (FRCNN) as the base model and made several modifications to enhance its detection capabilities without sacrificing the knowledge learned from previous classes. Specifically, they modified the classification part of the model in a class-incremental fashion while keeping the regression part unchanged. Additionally, the authors of [29] introduced a novel technique involving knowledge distillation applied to the FRCNN branch, further improving the model's performance. By leveraging these innovations, the authors successfully addressed the challenges of class-incremental object detection, providing a valuable resource for future research in this field. The strategy applied here is an image-level exemplar management strategy, used to avoid forgetting in a class-incremental learning model. Even though this approach does not directly use replay techniques, a brief description of the work has been included in this section because it contributes a useful dataset for continual learning.

Shieh et al.'s paper "Continual Learning for Autonomous Driving" [30] describes a continual learning framework for a one-stage object detection network that effectively combines old and new data through a memory-based replay technique. In this technique, a portion of previously seen data is stored in a memory buffer, and the replayed information along with the new data is used to avoid the catastrophic forgetting problem; each batch of data fed into the network contains both old and new classes. This approach has been validated on a modified version of the Pascal VOC 2007 dataset, making it suitable for continual learning with a YOLO network as the base detection model. Augmented or expanded images are stored in memory, which incurs some loss in accuracy.

Acharya et al.'s paper "RODEO: Replay for online object detection" [31] addresses object detection, a localization task that entails predicting bounding boxes and class labels for all objects in a scene. The majority of deep learning systems for detection are trained offline, i.e., they cannot be continually updated with new object classes, and if the trained deep networks are fine-tuned, they suffer from catastrophic forgetting. In this work, RODEO replays compressed feature representations of objects from a fixed memory buffer while fine-tuning the pretrained deep network. The feature representations are taken from an intermediate layer of the CNN backbone and compressed to reduce storage requirements. During training, the RODEO framework combines a random subset of samples from its replay buffer with the new input.
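A minimal sketch of the buffer-plus-new-input training pattern that RODEO and similar methods rely on; the buffer capacity, feature dimensionality, and batch composition below are illustrative assumptions, and the reservoir-style replacement is one common way to keep the buffer a uniform sample of the stream.

```python
import numpy as np

rng = np.random.default_rng(0)

class ReplayBuffer:
    """Fixed-capacity memory; RODEO stores compressed mid-network features in such a buffer."""
    def __init__(self, capacity):
        self.capacity, self.items, self.seen = capacity, [], 0
    def add(self, x):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(x)
        else:
            j = rng.integers(self.seen)     # reservoir replacement keeps a uniform sample
            if j < self.capacity:
                self.items[j] = x
    def sample(self, n):
        n = min(n, len(self.items))
        if n == 0:
            return []
        idx = rng.choice(len(self.items), size=n, replace=False)
        return [self.items[i] for i in idx]

buffer = ReplayBuffer(capacity=512)
for step in range(2000):                    # streaming detection examples
    x_new = rng.random(64)                  # stand-in for one compressed feature vector
    batch = buffer.sample(31) + [x_new]     # random replay subset mixed with the new input
    # train_step(batch) would update the plastic layers of the detector here
    buffer.add(x_new)
```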


Yang et al. paper work on “One-Shot Replay: Boosting Incremental Object Detection via Retrospection of One Object [32]” focuses to generate new data. We need to generate synthetic samples with objects of old and new classes by copying the stored one-shot object of each old class and pasting them on new samples or clean background randomly. The synthetic samples are feed as input into the dual network, where the old model gives the knowledge of old classes in the features and outputs. To minimize the storage of old data, the authors proposed to store only one object for each old class. They make use of copy–paste to perform replay for incremental learning, which replays objects of old classes by augmenting new samples (creating new samples using augmentation). This approach initially selects a cropped object from memory and resizes it with random width and height in a range and then search for a position in the new sample for pasting the object, where the IOUs between the object and the ground truths of the new sample should be lower than a threshold. The search time is made sure to have a certain upper limit. The advantage with copy–paste technique is that the memory usage of instance level information will be far less than the whole image. Also, this approach will not increase the training set which will not eventually increase the number of forward steps and the time consumed for training. Kim et al. paper “Continual Learning on Noisy Data Streams through a SelfPurified Replay” [33], the proposed self-purified replay is used under continual learning and noisy label classification setting. The technique involves continual learning by means of replay-based technique. This simple procedure surpasses previous techniques with respect to performance and memory efficiency. The replaying of a noisy buffer intensified the forgetting procedure. The reason is because of the deceptive mapping of earlier knowledge. Self-Purified Replay (SPR) is used to tackle noisy labeled continual learning. The authors of [33] introduced noisy labels for catastrophic forgetting. Filtering noise from the input data stream before storing it in the replay buffer is crucial. SPR ensures the maintenance of a clean replay buffer. Mirza et al. paper “Domain Incremental through statistical corrections (DISC)” [34] described that it is extremely challenging to build the autonomous vehicle switch which will have the ability to adapt to new weather conditions. During the process of adapting these vehicles to new weather conditions, they tend to forget the information previously learnt. The approach proposed in this paper is referred as DISC which can incrementally learn new tasks without the need for retraining or large storage space to store previously learnt samples. The weather changes in DISC are captured in the form of statistical first and second-order moments which consume very less storage space. However, these statistical parameters capture only the global weather changes and may not be easily adaptable for the domain shifts within local regions such as object’s appearance. Chen et al. in their paper titled “Rehearsal balancing priority assignment network (RBPAN)” [35] proposed a continual learning detector for remote sensing image applications with very high resolution (VHR). 
Due to the inherent class imbalance problem in many datasets, the network tends to be biased to a certain class and this in turn leads to rehearsal imbalance where the samples corresponding to certain classes are given higher priority during the rehearsal. The authors of this work propose RBPAN which uses the entropy reservoir sampling technique to maintain

Fig. 10 Comparison of various sampling strategies [36] in terms of mIoU on BDD, Cityscapes, and their average (strategies compared: class-balanced buffer, class-balanced samples, GSS, RSS)

The proposed network in [35] assigns adaptive weights to the images during the replay procedure, boosting the importance of minority classes while decreasing the weights assigned to majority classes.

Kalb et al. [36] consider semantic segmentation, another method for identifying objects in images or videos. The authors in [36] proposed an improved replay technique and showed that, within a class-incremental learning framework, maintaining a uniform distribution of classes in the memory buffer prevents the new classes of objects from biasing the network, whereas in a domain-incremental learning setting, sampling features uniformly from the different domains tends to decrease the representation shift and thus avoids the problem of catastrophic forgetting. A comparative analysis of the various sampling techniques is presented in Fig. 10. The experiments were performed on the BDD and Cityscapes datasets, which are popular autonomous driving datasets, using mIoU as the metric for comparison across the various sampling strategies. The relevant quantities can be computed using the following equations:

$$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}, \qquad (3)$$

$$\text{Accuracy} = \frac{\text{Number of predictions detected correctly}}{\text{Total number of predictions}}, \qquad (4)$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad (5)$$

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad (6)$$

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad (7)$$

$$\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad (8)$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

The equations above compute the IoU, accuracy, sensitivity, precision, true-positive rate, and false-positive rate used in the calculation of mean average precision (mAP) and mean intersection over union (mIoU) for the various strategies; the mean is computed across all instances and images to report the mIoU for the comparison across the popular sampling strategies. Among all the approaches compared in Fig. 10, the RSS technique [36] performs the best, and this signifies the need for a balanced, uniform sampling technique in a domain-incremental scenario.
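A minimal sketch of the class-balanced buffer idea behind the sampling strategies compared in Fig. 10; the capacity, class count, and per-class reservoir replacement below are illustrative assumptions rather than the exact procedure of [36].

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

class ClassBalancedBuffer:
    """Keeps roughly equal per-class counts in a fixed-size memory via per-class reservoirs."""
    def __init__(self, capacity, n_classes):
        self.per_class = capacity // n_classes
        self.mem = defaultdict(list)
        self.seen = defaultdict(int)
    def add(self, x, c):
        self.seen[c] += 1
        if len(self.mem[c]) < self.per_class:
            self.mem[c].append(x)
        else:
            j = rng.integers(self.seen[c])   # reservoir replacement within class c
            if j < self.per_class:
                self.mem[c][j] = x

buf = ClassBalancedBuffer(capacity=300, n_classes=3)
for _ in range(10000):
    c = int(rng.choice([0, 0, 0, 0, 1, 2]))  # heavily imbalanced stream
    buf.add(rng.random(8), c)
print({c: len(v) for c, v in buf.mem.items()})  # ~100 per class despite the imbalance
```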

5 Conclusions

This paper provides a comprehensive review of replay-based techniques utilized within the framework of continual object detection. Experience replay is one of the popular accounts of memory retention within the human brain, and inspired by it, several approaches have been proposed for continual learning in image classification networks, while there has been little focus on object detection networks, an important computer vision problem spanning several safety-critical applications where new data continually evolves. In an ideal scenario, an external memory is used in conjunction with the deep networks to store the knowledge about previously learnt tasks. However, edge devices have only a small amount of memory, and methods that, like plain experience replay, store raw samples cannot scale to bigger datasets. Generative or one-shot replay techniques combined with balanced sampling strategies will therefore be ideal for edge devices with access to very limited memory.


References

1. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) CNN-LeNet. In: IEEE international phoenix conference on computers and communications
2. Xiao Y, Tian Z, Yu J, Zhang Y, Liu S, Du S, Lan X (2020) A review of object detection based on deep learning. Multimedia Tools Appl 23729–23791
3. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05), vol 1, pp 886–893
4. Lienhart R, Maydt J (2002) An extended set of Haar-like features for rapid object detection. In: International conference on image processing, vol 1, pp I–I
5. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vision 60:91–110
6. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
7. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition (CVPR 2001), vol 1, pp I–I
8. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587
9. He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904–1916
10. Girshick R (2015) Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision, pp 1440–1448
11. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, vol 28
12. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: Proceedings of the IEEE international conference on computer vision, pp 2961–2969
13. Dai J, Li Y, He K, Sun J (2016) R-FCN: object detection via region-based fully convolutional networks. In: Advances in neural information processing systems, vol 29
14. Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2013) OverFeat: integrated recognition, localization and detection using convolutional networks. In: International conference on learning representations (ICLR 2014)
15. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
16. Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7263–7271
17. Redmon J, Farhadi A (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767
18. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) SSD: single shot multibox detector. In: Computer vision—ECCV 2016: 14th European conference, proceedings, part I, pp 21–37
19. Fu CY, Liu W, Ranga A, Tyagi A, Berg AC (2017) DSSD: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659
20. Li Z, Zhou F (2017) FSSD: feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960
21. Shen Z, Liu Z, Li J, Jiang YG, Chen Y, Xue X (2017) DSOD: learning deeply supervised object detectors from scratch. In: Proceedings of the IEEE international conference on computer vision, pp 1919–1927
22. Menezes AG, de Moura G, Alves C, de Carvalho AC (2023) Continual object detection: a review of definitions, strategies, and challenges. Neural Netw


23. McCloskey M, Cohen NJ (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In: Psychology of learning and motivation, vol 24. Academic Press, pp 109–165
24. Ratcliff R (1990) Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychol Rev 97(2):285
25. McClelland JL, McNaughton BL, O'Reilly RC (1995) Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol Rev 102(3):419
26. Goodfellow IJ, Mirza M, Xiao D, Courville A, Bengio Y (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211
27. Kudithipudi D, Aguilar-Simon M, Babb J, Bazhenov M, Blackiston D, Bongard J, Brna AP, Chakravarthi Raja S, Cheney N, Clune J, Daram A (2022) Biological underpinnings for lifelong learning machines. Nat Mach Intell 4(3):196–210
28. Shin H, Lee JK, Kim J, Kim J (2017) Continual learning with deep generative replay. In: Advances in neural information processing systems, vol 30
29. Hao Y, Fu Y, Jiang YG (2019) Take goods from shelves: a dataset for class-incremental object detection. In: Proceedings of the 2019 international conference on multimedia retrieval, pp 271–278
30. Shieh JL, Haq QMU, Haq MA, Karam S, Chondro P, Gao DQ, Ruan SJ (2020) Continual learning strategy in one-stage object detection framework based on experience replay for autonomous driving vehicle. Sensors 20(23):6777
31. Acharya M, Hayes TL, Kanan C (2020) RODEO: replay for online object detection. arXiv preprint arXiv:2008.06439
32. Yang D, Zhou Y, Hong X, Zhang A, Wang W, Yang D (2023) One-shot replay: boosting incremental object detection via retrospecting one object. In: AAAI
33. Kim CD, Jeong J, Moon S, Kim G (2021) Continual learning on noisy data streams via self-purified replay. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 537–547
34. Mirza MJ, Manasa M, Possegger H, Bischof H (2022) An efficient domain-incremental learning approach to drive in all weather conditions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3001–3011
35. Chen X, Jiang J, Li Z, Qi H, Li Q, Liu J, Zheng L, Liu M, Deng Y (2023) An online continual object detector on VHR remote sensing images with class imbalance. Eng Appl Artif Intell 117:105549
36. Kalb T, Mauthe B, Beyerer J (2022) Improving replay-based continual semantic segmentation with smart data selection. In: 2022 IEEE 25th international conference on intelligent transportation systems (ITSC), pp 1114–1121

Investigation of Statistical and Machine Learning Models for COVID-19 Prediction

Joydeep Saggu and Ankita Bansal

Abstract The development of technology has a significant impact on every aspect of life, whether it is in the medical industry or any other profession. By making decisions based on the analysis and processing of data, artificial intelligence has demonstrated promising outcomes in the field of health care. The most crucial action is early detection of a life-threatening illness to stop its development and spread. There is a need for a technology that can be utilized to detect the virus because of how quickly it spreads. With the increased use of technology, we now have access to a wealth of COVID-19-related information that may be used to learn crucial details about the virus. In this study, we evaluated and compared various machine learning models with the traditional statistical model. The results of the study concluded the superiority of machine learning models over the statistical model. The models have depicted the percentage improvement of 0.024%, 0.103%, 0.115%, and 0.034% in accuracy, MSE, R2 score, and ROC score, respectively. Keywords Machine learning · Computational intelligence · COVID-19 · Statistical algorithm · K-nearest neighbors · Logistic Regression · Decision Tree · Random Forest · XGBoost · Support Vector Machine

1 Introduction

The novel coronavirus first surfaced in Wuhan, China, in December 2019 [1], and on December 31, 2019, it was reported to the World Health Organization. On February 11, 2020, the WHO officially named the disease COVID-19 and declared it a threat to the entire world. Clinical studies demonstrate the existence of asymptomatic carriers in the community as well as the age groups most afflicted [2].


An infected person develops symptoms in 2–14 days. Fever, dry cough, and weariness are listed by the World Health Organization as symptoms and indicators of mild-to-moderate disease, whereas dyspnea, high fever, and fatigue may occur in severe cases. Person-to-person transmission of the virus is expected to occur mostly through direct contact and respiratory droplets [3]. According to the WHO, the incubation period of this virus ranges from 2 to 10 days in most instances. People with major illnesses such as diabetes, asthma, and heart disease are more likely to contract the virus and develop severe disease. The fast spread of the virus, which has killed hundreds of thousands of people, has necessitated the development of a technology that may be used to detect the infection. However, infections can be reduced to some extent by practicing good hygiene. Moreover, it has been observed that early detection of this disease can help in the containment of the virus. Tools such as machine learning (ML) software, datasets, and classification algorithms are crucial for creating the COVID-19 predictive model. Employing ML to detect COVID-19 has served in the monitoring and prevention of infectious patients and has helped in various circumstances where ML has come as an aid for detecting COVID-19 more efficiently than statistical models like Linear and Logistic Regression, thus reducing the dependency on hospitals, where RT-PCR tests were the standard method of concluding whether an individual has COVID or not [4, 5]. This project intends to compare the accuracies of various ML algorithms such as K-nearest neighbors, Decision Tree, Random Forest, XGBoost, and Support Vector Machine (SVM) against the statistical model of Logistic Regression and then utilize the best of them to determine an approach that forecasts whether or not an individual has COVID based on the data presented to the model. The contribution of the work can be summarized as: (i) evaluation of ML models for COVID-19 prediction, and (ii) comparison of ML models with statistical models for COVID-19 prediction. Following this section, Sect. 2 describes the work in the literature. The methodology of the work is discussed in Sect. 3. A brief description of each model is given in Sect. 4. The results are discussed in Sect. 5, followed by the conclusion in Sect. 6.

2 Related Work

Millions of lives could be saved by a reliable and thorough diagnosis of COVID-19, which would also provide a wealth of data for training ML models. In this context, ML may offer useful inputs, particularly for formulating diagnoses based on clinical literature, radiographic pictures, etc. According to the study in [6], a Support Vector Machine (SVM) algorithm can successfully distinguish COVID-19 patients from other patients in 85% of cases. In that study, COVID-19 test results from the Israeli government database were analyzed; the data were collected between March 2020 and November 2021 at what served as one of the primary COVID testing facilities in the nation during the first several weeks of the outbreak. A task committee created to address the COVID-19 situation carried out the study, which evaluated the efficacy of a few ML techniques (neural networks, gradient-boosted trees, Random Forests, Logistic Regression, and SVM) for predicting COVID positivity.


The study in [7] used a number of classifiers, including Logistic Regression, multilayer perceptron (MLP), and XGBoost; over 91% of COVID-19 patients were correctly categorized. An ML algorithm was created and tested for COVID-19 diagnosis in the work in [8]; based on lab features and demographics, a few ML models were tested and then aggregated to perform the final categorization, and the created method exhibited a 0.093 sensitivity and a 0.64 specificity. The models in [9] predict COVID-19 with 91% and 89% accuracy, respectively. Additionally, in 98% of cases, the requirement for an ICU or semi-ICU was predicted [10]. Since there is not a lot of research on text-based diagnosis and prediction, and most of the analysis is done on image recognition for COVID-19, we employed ML models to categorize clinical reports as either COVID-positive or COVID-negative.

3 Methodology

3.1 Data Collection

After the WHO declared the coronavirus pandemic a public health emergency, hospitals and researchers made data on the epidemic available to the public. We procured a dataset from Kaggle.com with 5,861,480 rows and ten columns. This dataset contains ten features/variables that are binary in nature and could be determinants in the prediction of COVID-19, as well as one class attribute that defines whether COVID-19 is found, as shown in Table 1, which gives a concise description of the columns of the dataset used in our analysis.

3.2 Data Preprocessing

Data preprocessing is the process of converting raw data into an understandable format. Data from the real world might contain noise, have missing values, or be in an incompatible format, rendering ML models unable to use it directly. Data preparation is an important stage in which we clean the data and prepare it to be compatible with, or suitable for use in, an ML model. The key phases in our data preparation are as follows.

Removing features: Since test_date and test_indication = other would not have significance for the prediction of the target variable, i.e., corona_result, we remove both features (removed features: test_date, test_indication_other). We also performed a Chi-Square test on the dataset, since our dataset is completely categorical, to see if we could further remove any unimportant variables that may not contribute to the detection of our target variable, corona_result. However, the test confirmed that there were no such variables.
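A minimal sketch of such a Chi-Square feature screen using scikit-learn; the synthetic DataFrame below merely stands in for the Kaggle dataset, so the data values (and the random generation) are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
# synthetic stand-in for the Kaggle dataset: binary symptom features plus the target
cols = ["cough", "fever", "sore_throat", "shortness_of_breath", "head_ache", "corona_result"]
df = pd.DataFrame(rng.integers(0, 2, size=(1000, len(cols))), columns=cols)

X, y = df.drop(columns="corona_result"), df["corona_result"]
scores, p_values = chi2(X, y)
for name, p in zip(X.columns, p_values):
    print(f"{name}: p = {p:.3f}")   # a high p-value would flag a feature as uninformative
```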


Table 1 Description of parameters considered in our dataset

Parameter | Description
Cough | Whether a patient has cough or not
Fever | Whether a patient has fever or not
Sore_throat | Whether a patient has sore throat or not
Shortness_of_breath | Whether a patient suffers from shortness of breath or not
Head_ache | Whether a patient suffers from headache or not
Corona_result | Whether a patient is COVID positive or negative
Age_60_and_above | Whether a patient's age is above or below 60 years
Gender | The patient's gender
Test_indication | The test indication, further classified into 'Other', 'Abroad', and 'Contact with Confirmed'

Binary variables: sore_throat, cough, shortness_of_breath, fever, head_ache, corona_result, age_60_and_above, gender. Non-binary variable: test_indication

Undersampling the data: We performed data undersampling after removing the features, as it was noticed that the data were abundant for negative COVID cases. We used RandomUnderSampler() from the imblearn.under_sampling library, setting our sampling strategy to 0.6. The class distribution before and after undersampling of the dataset is graphically demonstrated in Fig. 1.

Splitting the dataset: Splitting the dataset is the next step in the preprocessing of ML data: the training and testing datasets for an ML model should be separated. We split the data 70:30, which means that we preserve 30% of the data for testing and use the remaining 70% to train the model.

Fig. 1 Graph of corona_result versus count before (left) and after (right) undersampling the data
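A small sketch of the undersampling and 70:30 split just described, using the stated sampling_strategy = 0.6; the synthetic feature matrix and the random_state values are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(10000, 8)))      # stand-in binary features
y = pd.Series((rng.random(10000) < 0.1).astype(int))       # imbalanced: ~10% positives

# sampling_strategy=0.6 resamples the majority class until minority:majority = 0.6
rus = RandomUnderSampler(sampling_strategy=0.6, random_state=42)
X_res, y_res = rus.fit_resample(X, y)

# 70:30 train/test split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.3, random_state=42, stratify=y_res)
```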


3.3 Performance Metrics

The following parameters are considered in order to draw a comparison between the performance of the Logistic Regression model and all the other ML models under consideration.

Accuracy: It evaluates a model's percentage of true predictions:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (1)$$

where TP, TN, FP, and FN denote true-positive, true-negative, false-positive, and false-negative predictions, respectively.

Mean Squared Error: It determines the average squared difference between predicted and actual values:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2, \qquad (2)$$

where n is the number of data points, $Y_i$ is the actual target value, and $\hat{Y}_i$ is the model's predicted value.

R2 Score: It is the proportion of the variance in the dependent variable that can be predicted from the independent variables; it ranges from 0 to 1, and the larger the value, the stronger the predictive power:

$$R^2 = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}}, \qquad (3)$$

where SSR (sum of squared residuals) is the variation in the predicted values that the model cannot explain, and SST (total sum of squares) is the total variation in the dependent variable.

ROC Score: It evaluates the effectiveness of a classification model by comparing the rate of true positives to the rate of false positives at various threshold settings, with greater scores suggesting stronger discrimination ability. The ROC score is typically calculated as the Area Under the ROC Curve (AUC).
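All four metrics are available in scikit-learn; a minimal sketch with toy labels (the y_true/y_pred/y_prob values below are illustrative):

```python
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score, roc_auc_score

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0]               # hard class predictions
y_prob = [0.1, 0.9, 0.4, 0.2, 0.8, 0.3]   # predicted probability of the positive class

print("Accuracy:", accuracy_score(y_true, y_pred))
print("MSE:     ", mean_squared_error(y_true, y_pred))
print("R2 score:", r2_score(y_true, y_pred))
print("ROC AUC: ", roc_auc_score(y_true, y_prob))   # area under the ROC curve
```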


4 Algorithms Used

Our major goal is to assess the effectiveness of the Logistic Regression statistical model against the supervised ML algorithms, which include KNN, Decision Tree, Random Forest, XGBoost, and SVM.

4.1 Statistical Model

Logistic Regression is a statistical technique used to investigate the relationship between a binary dependent variable and one or more independent variables [11]. It is often used to forecast the likelihood of an event occurring based on historical data. Logistic Regression generates a probability value between 0 and 1 that may be used to categorize fresh data points; the model employs a logistic function to map the input variables to the output probability. Hyperparameters: C = 1; penalty = 'l1'; solver = 'saga'.
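A minimal sketch of fitting this model with the stated hyperparameters in scikit-learn, reusing the X_train/X_test split from the preprocessing sketch above; max_iter is an added convergence safeguard, not a value from the paper.

```python
from sklearn.linear_model import LogisticRegression

# hyperparameters as stated above; max_iter raised only to ensure saga converges
log_reg = LogisticRegression(C=1, penalty="l1", solver="saga", max_iter=1000)
log_reg.fit(X_train, y_train)                 # split from the preprocessing sketch
probs = log_reg.predict_proba(X_test)[:, 1]   # probability of being COVID positive
preds = (probs >= 0.5).astype(int)            # default 0.5 decision threshold
```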

4.2 Machine Learning Algorithms

Predictive modeling is evolving with the development of computer technology and can now be done more effectively and economically than in the past. In order to identify the best configuration for each classification method, we employ GridSearchCV in our project. In Table 2, we list the algorithms along with the hyperparameters tuned for the best performance.

5 Result Analysis

From Table 3, we can see that the ML algorithms outperform the statistical Logistic Regression method on the various performance metrics. The former also have lower MSE values than the latter, which demonstrates less scope for error. This is also shown graphically in Fig. 2, where the average of the ML algorithms performs better than statistical Logistic Regression. Table 4 shows the difference between the metrics of the average of the ML models and the statistical Logistic Regression model. Among the ML models (Table 3), SVM and XGBoost have the highest accuracy of 97.75% each, but the former has an MSE value of 2.252 while the latter has 2.254; hence, SVM is the best-performing ML algorithm. Therefore, the average performance of the ML algorithms is better than Logistic Regression, and among the ML models, SVM performs best for predicting the target variable in the dataset, i.e., corona_result.


Table 2 Different machine learning models and the hyperparameters used

KNN | Classifies new data points based on the majority vote of their k-nearest neighbors in the training set | k = 5
Decision Tree | Generates a flowchart-like tree structure to make judgments by recursively partitioning the data according to decision criteria [12] | —
Random Forest | Based on ensemble learning; the dataset is divided into subsets applied to different Decision Trees to implement strong learners [13, 14] | n_estimators = 200, max_depth = 8
XGBoost | Combines numerous weak predictive models stage by stage to generate a strong predictive model, with the goal of minimizing total prediction error | eta = 0.2, gamma = 0.5, max_depth = 5, n_estimators = 200
SVM | Divides data points into classes by locating an ideal hyperplane with the greatest margin between them [15, 16] | —
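A sketch of the GridSearchCV tuning mentioned above for one of the models (XGBoost); the candidate grids are illustrative values built around the tuned settings in Table 2, and learning_rate is scikit-learn's name for XGBoost's eta.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# candidate grids built around the tuned values in Table 2 (illustrative, not exhaustive)
param_grid = {
    "learning_rate": [0.1, 0.2, 0.3],   # XGBoost's eta
    "gamma": [0.0, 0.5, 1.0],
    "max_depth": [3, 5, 8],
    "n_estimators": [100, 200],
}
search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                      param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)            # split from the preprocessing sketch
print(search.best_params_, search.best_score_)
```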

Table 3 Comparing performance of Logistic Regression with other machine learning models

Model            | Algorithm           | Accuracy | MSE   | R2 score | ROC score
Statistical      | Logistic Regression | 97.72    | 2.279 | 90.26    | 98.151
Machine learning | KNN                 | 97.74    | 2.259 | 90.35    | 98.179
Machine learning | Decision Tree       | 97.74    | 2.258 | 90.36    | 98.181
Machine learning | Random Forest       | 97.74    | 2.255 | 90.36    | 98.189
Machine learning | XGBoost             | 97.75    | 2.254 | 90.37    | 98.182
Machine learning | SVM                 | 97.75    | 2.252 | 90.38    | 98.193

Bold indicates the proposed model's results

6 Conclusion

With the increased rise of COVID-19, it is essential to investigate various machine learning models for accurate prediction of COVID-19. As machine learning has shown tremendous positive outcomes in the healthcare field by making predictions through the analysis and processing of data, we incorporate these models into the detection of COVID-19. We used the statistical Logistic Regression model and compared it with the stronger ML models, including KNN, Decision Tree, Random Forest, Extreme Gradient Boost, and SVM. The ML models outperformed the statistical model, depicting percentage improvements of 0.024%, 0.103%, 0.115%, and 0.034% in accuracy, MSE, R2 score, and ROC score, respectively. Therefore, the authors suggest the use of ML models to predict and diagnose COVID-19. As future work, the authors plan to investigate more ML models, including ensemble learners, on a number of open-source COVID-19 datasets; this would lead to more generalizable results and hence more verified and stable conclusions.


Fig. 2 Graphs comparing the different evaluation metrics between Logistic Regression and an average of all the machine learning algorithms

Table 4 Percentage improvement of ML algorithms over the statistical model

Accuracy: 0.024% | Mean squared error: 0.103% | R2 score: 0.115% | ROC score: 0.034%

References

1. Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian JH, Pei YY, Yuan ML, Zhang YL, Dai FH, Liu Y, Wang QM, Zheng JJ, Xu L, Holmes EC, Zhang YZ (2020) A new coronavirus associated with human respiratory disease in China
2. Gautret P, Lagier JC, Parola P, Meddeb L, Mailhe M, Doudier B, Courjon J, Giordanengo V, Vieira VE, Dupont HT (2020) Hydroxychloroquine and azithromycin as a treatment of Covid-19. Int J Antimicrob Agents 56(1):105949
3. Lai CC, Shih TP, Ko WC, Tang HJ, Hsueh PR (2020) Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (Covid-19). Int J Antimicrob Agents 55(3):105924


4. Garcia S, Luengo J, Sáez JA, Lopez V, Herrera F (2012) A survey of discretization techniques. IEEE Trans Knowl Data Eng 25(4):734–750
5. Muhammad I, Yan Z (2015) Supervised machine learning approaches. ICTACT J Soft Comput 5(3)
6. Medscape Medical News (2020) The WHO declares public health emergency for novel coronavirus
7. Batista AFM, Miraglia JL, Donato THR, Filho ADPC (2020) COVID-19 diagnosis prediction in emergency care patients. medRxiv
8. Mondal MRH, Bharati S, Podder P, Podder P (2020) Data analytics for novel coronavirus disease. Inform Med Unlocked 20:100374
9. Schwab P, Schutte AD, Dietz B, Bauer S (2020) Clinical predictive models for COVID-19: systematic study. J Med Internet Res 22(10):e21439
10. Goodman-Meza D, Rudas A, Chiang JN, Adamson PC, Ebinger J (2020) A machine learning algorithm to increase COVID-19 inpatient diagnostic capacity. PLoS ONE 15(9):e0239474
11. Connelly L (2020) Logistic regression. Medsurg Nurs 29(5):353–354
12. Patel BR, Rana KK (2014) A survey on decision tree algorithm for classification. Int J Eng Dev Res 2(1)
13. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
14. Hastie T, Tibshirani R, Friedman J (2009) Random forests. In: The elements of statistical learning. Springer, pp 587–604
15. Wang H, Xiong J, Yao Z, Lin M, Ren J (2017) Research survey on support vector machine. In: Proceedings of the 10th EAI international conference on mobile multimedia communications, pp 95–103
16. Rahman MM, Islam MD, Manik MD, Hossen M, Al-Rakhami MS (2021) Machine learning approaches for tackling novel coronavirus (Covid-19) pandemic. SN Comput Sci 2(5):1–10
17. Sun Y, Koh V, Marimuthu K, Ng OT, Young B, Vasoo S, Chan M (2020) Epidemiological and clinical predictors of COVID-19. Clin Infect Dis 71(15):786–792

SONAR-Based Sound Waves' Utilization for Rocks' and Mines' Detection Using Logistic Regression

Adrija Mitra, Adrita Chakraborty, Supratik Dutta, Yash Anand, Sushruta Mishra, and Anil Kumar

Abstract SONAR, short for sound navigation and ranging, uses sound waves to identify objects underwater. It is usually utilized for two tasks: rock detection and mine detection. Rock detection entails utilizing SONAR to identify the presence of rocks or other underwater impediments that might endanger boats or ships; this is often accomplished by analyzing the sound waves that bounce back from the bottom and using machine learning algorithms to find patterns that signal the presence of rocks. Mine detection, on the other hand, is a more difficult process that entails recognizing and locating underwater explosive devices; this is often accomplished by combining SONAR with additional sensing technologies, such as magnetic or acoustic sensors, after which machine learning techniques are used to analyze the data and detect patterns that indicate the existence of mines. Based on the input features, logistic regression can predict one of two outcomes and is frequently used for binary classification; it is capable of classifying SONAR data as rock or mine. To train the logistic regression model, a dataset of rock and mine examples is gathered and preprocessed to extract key characteristics, which are then normalized. The model is then trained to learn a decision boundary that divides the two classes, after which the trained algorithm can predict whether new SONAR data should be classified as rock or mine. Depending on the properties of the dataset and the task at hand, other machine learning algorithms, such as support vector machines or neural networks, may be more effective. The recorded training and testing accuracies using logistic regression were 96.2% and 91.5%, respectively. Keywords Sound waves · Rocks · Mines · Machine learning · Logistic regression



1 Introduction

This article covers the use of machine learning to distinguish SONAR returns from rocks versus mines. SONAR is an acoustic technique used to find underwater objects and gauge their size and direction; the SONAR device detects and analyzes sound waves produced or reflected by the object. There are three categories of SONAR systems. In an active SONAR system, an acoustic projector emits a sound wave that a target object reflects back; a receiver picks up the reflected signal and examines it in order to determine the target's range, heading, and relative velocity. Passive systems are, in essence, receivers that pick up noise emitted by the target (such as a ship, submarine, or torpedo); with this method of observation, waveforms can be inspected to identify features as well as direction and distance. The third type of SONAR equipment is an acoustic communication system, which requires a projector and a receiver at both ends of the acoustic link.

The extraction of rich minerals from the crust of the globe, including its oceans, is known as mining. A mineral is an inorganic substance that is found in nature and has distinct chemical properties, physical features, or molecular structure, with a few significant exceptions. When evaluating mineral reserves, profit must be taken into account: the ore reserve relates only to the quantities that can be profitably extracted, whereas the mineral inventory refers to the entire quantity of minerals in a given deposit. Figure 1 shows SONAR being used to distinguish between rocks and mines.

SONAR signals and targets can be recognized using machine learning and deep learning algorithms. Machine learning enables the analysis of SONAR waves and target detection; it is a branch of artificial intelligence that provides guidelines for improving the data usage of machines. Receiving data as input, recognizing characteristics, and predicting fresh patterns are the three stages of machine learning. Principal component analysis, logistic regression, support vector machines, k-nearest neighbors (KNN), C-means clustering, and other ML approaches are commonly used in this field. Logistic regression is a statistical model in which a binary dependent variable is modelled using a logistic function of one or more independent variables, often called features or predictors. The goal of logistic regression is to find the best-fit parameters of the logistic function so that it accurately predicts the probability of the binary outcome given the input features. In the case of SONAR, logistic regression can be used to classify the data as indicating the presence of either a rock or a mine: the input features are derived from the SONAR data, such as the frequency and amplitude of the sound waves, and the logistic regression model learns a decision boundary that separates the two classes based on these features. To make predictions, the logistic regression model calculates the probability of the binary outcome (rock or mine) given the input features; if the probability is greater than a certain threshold, typically 0.5, the model predicts the positive outcome (mine), and if it is less than the threshold, the model predicts the negative outcome (rock). This threshold can be adjusted to prioritize either precision or recall, depending on the application's needs. Overall, logistic regression is a powerful tool for binary classification tasks and has proven to be effective in various applications, including SONAR rock and mine detection.


Fig. 1 Representing SONAR usage by submarines to detect the difference between rocks and mines

Overall, logistic regression is a powerful tool for binary classification tasks and has proven to be effective in various applications, including SONAR rock and mine detection.
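To make the thresholding rule concrete, here is a minimal sketch in Python; the weights, bias, and feature vector are hypothetical stand-ins for a trained model, not values from this study.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued score into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    # Probability that the SONAR return is a mine, given the input features
    p_mine = sigmoid(np.dot(weights, features) + bias)
    # Raising the threshold favours precision; lowering it favours recall
    return "M" if p_mine >= threshold else "R"
```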

2 Literature Review

In [1], the classification of SONAR targets into rocks and mines is discussed using Meta-Cognitive Neural Network (MCNN) and Extreme Learning Machine (ELM) classifiers, with the aim of achieving acceptable efficiency in classifying underwater SONAR targets using advanced neural networks. In [2], researchers have tested a range of methods for identifying and excluding noisy data, usually referred to as random noise, in the training dataset. In essence, these methods identify data examples that confuse the training model and lower classification accuracy; they typically look for data abnormalities and analyze how these affect classification. In [3], numerous machine learning methods are analyzed, and various approaches to the detection of network intrusions are suggested. Firewalls, antivirus programmes, and other network intrusion detection systems are some of the various systems that make up the network security system. The primary goal of an intrusion detection system is to identify unauthorized system activity such as copying and modification. In [4], a basic case study established a machine learning technique for the classification of rocks and minerals using a big, intricate, and highly spatial SONAR dataset. In [5], by combining neural networks and online learning, Online Multiple Kernel Learning (OMKL), a technique created by Ravi et al., aims to build a kernel-based prediction function from a series of specified kernels. Here, SVM and NN algorithms were


used to separate the SONAR data. In [6], ocean mines are identified as the primary threat to the safety of large ships and other marine life. A mine is a self-contained explosive device used to destroy submarines or ships. Due to several factors, such as variations in operating conditions and target shapes, it is difficult to identify and classify SONAR images of underwater objects. In [7], if a target object is within the sound pulse's range, the pulse will reflect off the target and send an echo in the direction of the SONAR transmitter; the transmitter measures the time delay between the emission of the pulse and the reception of the associated echo. In [8], it is noted that in recent years the DL area has rapidly grown and been successfully applied to a wide range of conventional applications. More importantly, DL has outperformed well-known ML techniques in a number of sectors, including cybersecurity, natural language processing, bioinformatics, robotics and control, and the study of medical data. In [9], choosing a subset of features for constructing a model in a learning and statistics system is referred to as feature selection. Local search algorithms can assist in reducing the number of attributes by using sequential search methods. Artificial neural networks are a well-known artificial intelligence technology that can depict and capture complex relationships between data input and output. In [10], underwater mines are described as a vital military tactic for protecting any country's maritime borders. They consist of an explosive charge, a sensing mechanism, and a fully autonomous device. Mines from earlier generations had to come into direct contact with a ship to detonate. In contrast, newly built mines are equipped with advanced sensors that often recognize different fusions of magnetic and acoustic signals.

3 Proposed Work

The proposed model is illustrated in Fig. 2. SONAR, which stands for sound navigation and ranging, is beneficial for exploring and charting the ocean, since sound waves travel farther in water than radar and light waves do. NOAA scientists primarily employ SONAR to make nautical charts, identify underwater navigational hazards, and locate and map objects on the seafloor, such as shipwrecks [11–13]. SONAR employs sound waves to provide vision in the water. In this study, we use SONAR to transmit signals and receive their reflections from rocks and metal cylinders, which allows us to determine whether or not a mine is present. We adapted the machine learning model to this SONAR data. The raw SONAR data cannot be used directly for modelling, so it first goes through a procedure known as data preprocessing. Preprocessing improves a dataset's accuracy and dependability by removing missing or inconsistent data values that are the result of either human or computer error; the data become consistent as a result. Multi-processing is the preprocessing method used for this project, since it allows two or more processors to operate simultaneously on the same dataset [14]. The same mechanism then stores this dataset. In a single computer system, data are divided


Fig. 2 Diagrammatic representation of the proposed model

into frames, and each frame is processed in parallel by two or more CPUs. After data processing, the dataset must be split into training and testing subsets. This phase is necessary because the processed dataset may be large; most of it will be used for training and the remainder for testing. Given that this is a binary problem, either we detect a rock or a mine, and since the logistic regression model works best in binary situations, we chose to employ this model. Logistic regression is one of the most frequently employed machine learning algorithms in the supervised learning category. It is used to predict a categorical dependent variable from a specified set of independent variables. This model is trained on the available dataset. The well-trained logistic regression model helps identify how a mine's features differ from those of a rock. The final prediction result uses two abbreviations: R stands for rock and M stands for mine [15].
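A minimal sketch of this training pipeline with scikit-learn, assuming the usual 60-feature SONAR CSV with an R/M label in the final column; the file name is a placeholder:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder file: 60 spectral features per row, label "R"/"M" in column 60
df = pd.read_csv("sonar_data.csv", header=None)
X, y = df.drop(columns=60), df[60]

# Hold out a small test split; stratifying keeps the R/M ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("training accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("testing accuracy:", accuracy_score(y_test, model.predict(X_test)))
```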

4 Implementation Analysis

The tested model achieved an accuracy of 91.5%, whereas the trained model achieved an accuracy of 96.2%, as shown in Fig. 3. The model is continually trained and tested, and the process is then repeated with more datasets to assess the model's correctness. Since the accuracy kept varying within the range mentioned above, we ultimately averaged it to the reported metrics.

Figure 4 shows a heat map, which is a graphic representation of data that uses a system of colour coding to represent different values. Heat maps can be used for a


Fig. 3 Accuracies of trained and tested models using Logistic Regression Classifier

wide range of statistics, although they are most often used to show user behaviour on certain websites or web page themes. The heat map shows the association between the numerous variables that make up the dataset [16]. The intensity of the colours conveys the values: warmer colours indicate higher values, whereas cooler colours indicate lower values. There were 58 rows in total, so each value is displayed separately. A colour scale from −0.4 to 1.0 displays the strength of all the correlated values in the dataset.

Figure 5 shows the box plot representation of a particular feature from the dataset. The figure clearly conveys how the values are concentrated towards the 0.01–0.04 range. Deeper into the dataset, the values vary widely, ranging from as low as 0.07 to as high as 0.14. The scatter plot of the first two features in the dataset is shown in Fig. 6. The large number of points in the range 0–0.05 shows how densely populated the dataset is within this range; it also shows how minute the differences are between the detection of a rock and a mine in this dataset.

Significant practical benefits of the model include that it works well with linearly separable datasets and offers a good level of accuracy for many common datasets [17–19]. It makes no assumptions about the distributions of the classes in feature space. It is also substantially simpler to set up and train than other machine learning and AI applications [20].
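These plots can be reproduced with pandas, seaborn, and matplotlib; a minimal sketch (the file name and the ten-column slice for the heat map are illustrative choices, not the paper's exact configuration):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("sonar_data.csv", header=None)  # same placeholder file as above

# Heat map of pairwise correlations for the first ten features
sns.heatmap(df.iloc[:, :10].corr(), cmap="coolwarm")
plt.show()

# Box plot of a single feature, then a scatter plot of the first two features
df[0].plot.box()
plt.show()
df.plot.scatter(x=0, y=1)
plt.show()
```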


Fig. 4 Heat map representation of a feature of the proposed model

Fig. 5 Box plot representation of a feature of the proposed model

5 Conclusion and Future Scope Our research on “Underwater mine and rock prediction by evaluation of machine learning algorithms” identifies rocks and mines on the ocean floor. Naval mines are an effective tool for halting ships and restricting naval operations, but they have serious detrimental impacts on the economy and ecology. The two established methods for


Fig. 6 Scatter plot representation of first two features from the proposed model

locating mines are SONAR waves and manual labour. Given the increased risk, using SONAR signals has proven to be the more efficient strategy. The collected data are stored in a CSV file. We may explore and understand the nature of the prediction system by using a variety of machine learning approaches. We can confirm and evaluate the accuracy of algorithms through analysis, and we can use the results to create a system that works better. In addition to rocks, the ocean floor contains a number of undesirable elements that could affect the accuracy of our model's predictions; there are also plastic wastes, radioactive wastes, and various other kinds of mines. Such a crucial calculation should have an accuracy of about 85–90%. For our machine learning algorithm to accurately identify the kind of substance encountered, much more research and innovation are needed. A big data Hadoop architecture will be used to handle increasingly complicated data in future studies. The work was primarily concerned with the SONAR backend; frontend development calls for familiarity with the Flask or Django frameworks. With that in place, the frontend can be built and deployment considered.

References
1. Lepisto L, Kunttu I, Visa AJE (2005) Rock image classification using color features in Gabor space. J Electron Imag 14(4). Article ID 040503
2. Fong S, Deb S, Wong R, Sun G (2014) Aquatic sonar signals recognition by incremental data sluice mining with conflict analysis. Int J Distrib Sens Netw 10(5):635834
3. Ali SF, Rasool A (2020) SONAR data classification using multi-layer perceptrons. Int J 5(11)


4. Hossain MM, Paul RK (2019) Prediction of underwater surface target through SONAR: a case study of machine learning. Int J Inform Technol 11(1):51–57. https://doi.org/10.1007/978-981-15-0128-9_10
5. Siddhartha JB, Jaya T, Rajendran V (2018) RDNN for classification and prediction of rock/mine in underwater acoustics. J Appl Sci Comput 5(1):1–5
6. Padmaja V, Rajendran V, Vijayalakshmi P (2016) Study on metal mine detection from underwater sonar images using data mining and machine learning techniques. Int J Adv Res Electr Electron Instrum Eng 5(7):6329–6336. https://doi.org/10.1007/s12652-020-01958-4
7. Khare A, Mani K (2020) Prediction of rock and mineral from sound navigation and ranging waves using artificial intelligence techniques. Int J Comput Intell Res 16(4):625–635
8. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, Santamaría J, Fadhel MA, Al-Amidie M, Farhan L (2021) Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. Neural Comput Appl 33(19):14173–14192
9. Abdul-Qader B (2016) Techniques for classification sonar: rocks vs. mines. J Comput Sci Technol 16(3):75–80
10. https://ieeexplore.ieee.org/abstract/document/10011104
11. Hożyń S (2018) A review of underwater mine detection and classification in sonar imagery. Arch Min Sci 63(1):149–164
12. Tripathy HK, Mishra S (2022) A succinct analytical study of the usability of encryption methods in healthcare data security. In: Next generation healthcare informatics. Springer Nature Singapore, Singapore, pp 105–120
13. Raghuwanshi S, Singh M, Rath S, Mishra S (2022) Prominent cancer risk detection using ensemble learning. In: Cognitive informatics and soft computing: proceeding of CISC 2021. Springer Nature Singapore, Singapore, pp 677–689
14. Mukherjee D, Raj I, Mishra S (2022) Song recommendation using mood detection with Xception model. In: Cognitive informatics and soft computing: proceeding of CISC 2021. Springer Nature Singapore, Singapore, pp 491–501
15. Sinha K, Miranda AO, Mishra S (2022) Real-time sign language translator. In: Cognitive informatics and soft computing: proceeding of CISC 2021. Springer Nature Singapore, Singapore, pp 477–489
16. Mishra Y, Mishra S, Mallick PK (2022) A regression approach towards climate forecasting analysis in India. In: Cognitive informatics and soft computing: proceeding of CISC 2021. Springer Nature Singapore, Singapore, pp 457–465
17. Patnaik M, Mishra S (2022) Indoor positioning system assisted big data analytics in smart healthcare. In: Connected e-health: integrated IoT and cloud computing. Springer International Publishing, Cham, pp 393–415
18. Periwal S, Swain T, Mishra S (2022) Integrated machine learning models for enhanced security of healthcare data. In: Augmented intelligence in healthcare: a pragmatic and integrated analysis. Springer Nature Singapore, Singapore, pp 355–369
19. De A, Mishra S (2022) Augmented intelligence in mental health care: sentiment analysis and emotion detection with health care perspective. In: Augmented intelligence in healthcare: a pragmatic and integrated analysis, pp 205–235
20. Dutta P, Mishra S (2022) A comprehensive review analysis of Alzheimer's disorder using machine learning approach. In: Augmented intelligence in healthcare: a pragmatic and integrated analysis, pp 63–76

A Sampling-Based Logistic Regression Model for Credit Card Fraud Estimation Prapti Patra, Srijal Vedansh, Vishisht Ved, Anup Singh, Sushruta Mishra, and Anil Kumar

Abstract One of the most frequent problems we face today is credit card fraud, and the most definite reason behind it is the phenomenal increase in online transactions. Such fraud cases commonly arise from unauthorized money transactions in everyday life. Hence, to detect such fraudulent activities, we can use a credit card fraud assessment model. In this paper, we propose our approach to detecting such frauds. Our study mainly addresses the application of predictive techniques to this domain. The algorithms that we have used are logistic regression, decision tree classifier, and random forest classifier. The derived results are evaluated using accuracy, precision, recall, and F1-score. We have used all three algorithms for both undersampling and oversampling cases. The logistic regression technique generates the optimum result, giving the best accuracy, precision, recall, and F1-score. Thus, it can be inferred to be the best alternative for detecting credit card fraud.

Keywords Credit card · Fraud detection · Credit card fraud · Logistic regression · Decision tree

1 Introduction

Fraud detection in credit cards involves tracking the activity of cardholders so as to estimate and prevent unauthorized transactions and objectionable behavior. In the present world, credit card fraud is on the rise, especially in the corporate and finance industries. Our population is highly dependent on the Internet today, and that is one of the main reasons for online fraudulent transactions, although offline transactions are subject to similar fraud as well. We have data mining techniques to detect these

P. Patra · S. Vedansh · V. Ved · A. Singh · S. Mishra (B)
Kalinga Institute of Industrial Technology, Deemed to Be University, Bhubaneswar, India
e-mail: [email protected]
A. Kumar
Tula's Institute, Dehradun, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_16


Fig. 1 Graph depicting growth of Internet users over time

frauds, but the result is not very accurate. Hence, we need more promising methods to minimize such credit card fraud, and we can do that with the help of efficient machine learning algorithms [1]. As shown in Fig. 1, with the growing number of Internet users, finance companies issue credit cards to individuals. The card user must pay back the amount spent, plus any extra charge agreed by both parties. Predictive techniques are designed to assess all valid transactions and flag the ambiguous ones. Professionals investigate the flagged records and contact the cardholders to verify whether a transaction is legitimate or fraudulent [2]. We have used three algorithms.

Logistic Regression It is a statistical method used in prediction-based classification domains, where the objective is to estimate which category an input belongs to. The logistic regression algorithm uses a logistic function to calculate the probability of the input belonging to each category.

Decision Tree Classifier This method is applied to classification-based problems. It is a type of supervised technique which utilizes a hierarchical framework to categorize samples. The tree is made up of nodes and branches, where a node denotes a test on a variable and a branch denotes the result of that test. At each node, a decision is made based on the value of a feature or attribute, and the decision leads to the next node in the tree until a classification decision is made at the final node.


Fig. 2 Rough architecture diagram for fraud detection

Random Forest Classifier It is a popular ensemble technique in machine learning. It is a supervised learning algorithm that constructs many decision trees and integrates their estimations to generate the ultimate outcome. During the training phase, the algorithm constructs a forest of decision trees by repeatedly selecting a random subset of the data and features and then growing a decision tree on that subset. The method then combines the predicted values of all trees in the forest to provide the overall estimation [3]. Figure 2 shows a rough architecture of the fraud detection system. The main contributions of the paper are as follows:
. Our objective is to detect fraudulent credit card transactions with predictive methods.
. This study makes use of three predictive techniques, namely logistic regression, decision tree classifier, and random forest classifier.
. It was observed that logistic regression provided the best accuracy on undersampled data with 95.78%, whereas on oversampled data, the random forest classifier provides the best accuracy with 99.99%.

2 Literature Review

Researchers have introduced several new techniques for credit card fraud analysis, involving computationally intelligent techniques and cognitive units. Listed below are some related works in this regard. In 2019, Jain et al. [4] researched a few fraud detection techniques such as SVM, ANN, Bayesian networks, KNN, and fuzzy logic systems. The authors inferred that the KNN, tree, and vector-based algorithms had an average accuracy rate, while the fuzzy logic system and regression methods had the least precision among all methods. On the other hand, neural networks, Naive


Bayes, fuzzy systems, and KNN algorithms had a higher prediction degree. Multilevel regression, vector methods, and cluster trees gave predictive performance at a middle level. However, a few methods, including neural networks and Bayesian models, performed well on different metrics but were costly to train. A significant demerit of all these models was that they did not produce identical results in all types of environments: they provided good outcomes on one sample set and inferior outcomes on other data. For instance, KNN and SVM algorithms performed well with small datasets, whereas logistic regression and fuzzy logic systems showed better efficiency with the original unprocessed dataset. In 2019, Naik et al. [5] performed an analysis of four algorithms, namely Naive Bayes, AdaBoost, logistic regression, and J48. Naive Bayes utilizes Bayes' theorem to calculate the probability of occurrence of an activity. Logistic regression is similar to linear regression, but it is typically used for classification tasks, while linear regression is commonly used for predicting or forecasting values. J48 is an algorithm used for creating a decision tree and solving classification problems. It is an extension of the ID3 algorithm and a popular learning method which operates with both categorical and continuous variables. AdaBoost is designed for binary classification and is primarily utilized for improving the performance of decision trees. It is often used in fraud detection, such as classifying transactions as fraudulent or non-fraudulent. The researchers found that AdaBoost and logistic regression have almost similar efficiency; however, the AdaBoost algorithm is more suitable for detecting credit card fraud due to its faster processing time. In 2019, the authors in [6] introduced two significant algorithmic techniques: the whale optimization algorithm (WOA) and the synthetic minority oversampling technique (SMOTE). The primary objective of these techniques is to enhance convergence speed and resolve the data-skewing concern. The SMOTE technique addresses the problem of class imbalance by generating synthetic transactions that are re-sampled to validate dataset effectiveness. The WOA technique is then applied to optimize the synthesized transactions. This algorithmic approach improves the reliability, efficiency, and convergence speed of the system. In 2018, the authors in [7] investigated decision trees, random forest, SVM, and logistic regression on a highly skewed dataset. They evaluated performance based on metrics such as accuracy, sensitivity, specificity, and precision. The results showed an accuracy of 97.7% for logistic regression, 95.5% for decision trees, 98.6% for random forest, and 97.5% for the SVM classifier. The authors confirmed that random forest outperformed the others and had the highest accuracy among the algorithms for detecting fraud. They also found that the SVM algorithm suffered from a data-skewing issue and did not produce good outcomes for determining credit card fraud. In a related domain, Yu and Wang [8] proposed an outlier detection concept to detect suspicious records in a dataset. Their method treats fraudulent points as a separate zone in the feature space, which can either appear independently or be part of a small group of clustered data points. According to the findings, the approach achieves an accuracy of 89.4% when the outlier limit is predefined as 12.


3 Proposed Model

Figure 3 is a flowchart showing the methodology of our fraud detection system. We collect the sample dataset from the customer transaction database and train the three models that we are using, namely logistic regression, decision tree classifier, and random forest classifier [9]. When a user performs a transaction, it is passed into the decision function of the fraud detection algorithm, and the output is compared and analyzed. If found legitimate, the transaction is approved; if found fraudulent, the respective bank is alerted for verification. A sketch of this decision step is given below.
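A minimal sketch of that decision function, assuming any fitted scikit-learn classifier; the threshold and the return values are illustrative placeholders:

```python
def screen_transaction(transaction_features, model, threshold=0.5):
    # Probability that this transaction is fraudulent (class 1)
    p_fraud = model.predict_proba([transaction_features])[0][1]
    if p_fraud >= threshold:
        return "ALERT_BANK"  # flagged as fraud: hold and verify with the bank
    return "APPROVE"         # looks legitimate: let the transaction proceed
```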

Fig. 3 Proposed methodology of fraud detection


Fig. 4 Steps of project planning for the proposed work

Figure 4 shows the basic steps of project planning and execution. First, we collect the input data samples and split them into training and testing sets. Next, we prepare the data, choose a model, and train it on the dataset. We then deploy the model and evaluate its performance by testing it. Finally, we use the model on the testing data to make accurate predictions [10]. A sketch of these steps appears below.
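A minimal sketch of the undersampling and model-comparison steps; the file name "creditcard.csv" and the "Class" label column are assumptions based on the Kaggle credit card fraud dataset:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("creditcard.csv")  # assumed label column: "Class" (1 = fraud)
fraud = df[df["Class"] == 1]
legit = df[df["Class"] == 0].sample(len(fraud), random_state=0)  # undersampling
balanced = pd.concat([fraud, legit])

X, y = balanced.drop(columns="Class"), balanced["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("decision tree", DecisionTreeClassifier()),
                  ("random forest", RandomForestClassifier())]:
    clf.fit(X_tr, y_tr)
    print(name)
    # Reports precision, recall, and F1-score per class, plus overall accuracy
    print(classification_report(y_te, clf.predict(X_te)))
```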

4 Result Analysis and Discussion

We aggregated the data samples from Kaggle [11], a widely used website for downloading datasets. A full cross-validation has been performed to validate the performance of the algorithms. For undersampling, after averaging the results of the runs, it is observed that logistic regression provides the best results on the data, producing accuracy, precision, recall, and F1-score of 95.78%, 95.46%, 94.78%, and 95.22%, respectively. Table 1 highlights the overall analysis using the classifiers.

Table 1 Performance results using credit fraud data samples (values in %)

Algorithms used       Accuracy   Precision   Recall   F1-score
Logistic regression   95.78      95.46       94.78    95.22
Decision tree         90.32      93.68       87.25    90.35
Random forest         94.736     93.77       90.196   94.84

The bar chart underlying Fig. 5 compares logistic regression with and without the sampling method (values in %):

Metric      With sampling   Without sampling
Accuracy    95.78           92.77
Precision   95.46           93.56
Recall      94.78           93.11
F1-score    95.22           92.85

Fig. 5 Performance analysis with respect to the use of the sampling method

In Fig. 5, the effectiveness of logistic regression is validated with respect to the use of a sampling method on real-time credit card transaction data, and its performance is calculated based on several metrics. The dataset is undersampled, yielding two data distributions. The effectiveness is examined based on the accuracy, precision, recall, and F1-score [12–14]. It is noted that the use of the sampling approach for classification enhances the effectiveness of prediction.

5 Conclusion and Future Scope

This study shows the comparative performance of logistic regression, decision tree, and random forest. The alarming increase in credit card fraud has been addressed by the fraud-control systems of all banks, so a machine learning-based fraud detection system is used to provide both accuracy and transparency in assessing these frauds. All three classifiers are trained on real-time credit card transactions, which will help reduce at least 40–50% of total fraud losses [15–18]. Given the flexibility of this study, various models can be combined as units and their outputs embedded to enhance the final result's efficiency. To further refine this model, additional algorithms can be integrated, as long as their output matches the others' format. This modular approach allows for increased versatility and flexibility in the project. Another opportunity for improvement lies in the dataset. As demonstrated previously, the algorithms' precision improves as the dataset's size increases; therefore, increasing the dataset's size is likely to enhance the model's ability to detect fraud and reduce false positives. However, gaining the necessary support from banks is essential to achieving this goal [11].


References
1. Tripathy HK, Mishra S (2022) A succinct analytical study of the usability of encryption methods in healthcare data security. In: Next generation healthcare informatics. Springer Nature Singapore, Singapore, pp 105–120
2. Raghuwanshi S, Singh M, Rath S, Mishra S (2022) Prominent cancer risk detection using ensemble learning. In: Cognitive informatics and soft computing: proceeding of CISC 2021. Springer Nature Singapore, Singapore, pp 677–689
3. Mukherjee D, Raj I, Mishra S (2022) Song recommendation using mood detection with Xception model. In: Cognitive informatics and soft computing: proceeding of CISC 2021. Springer Nature Singapore, Singapore, pp 491–501
4. Jain Y, Tiwari N, Dubey S, Jain S (2019) A comparative analysis of various credit card fraud detection techniques. Int J Recent Technol Eng 7(5S2):402–407, ISSN: 2277-3878
5. Naik H, Kanikar P (2019) Credit card fraud detection based on machine learning algorithms. Int J Comput Appl 182(44):8–12
6. Mafarja MM, Mirjalili S (2017) Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing 260:302–312
7. Khare N, Yunus S (2021) Credit card fraud detection using machine learning models and collating machine learning models. Int J Pure Appl Math 118(20):825–838, ISSN: 1314-3395. https://doi.org/10.30534/ijeter/2021/02972021
8. Yu W, Wang N (2009) Research on credit card fraud detection model based on distance sum. Int Joint Conf Artif Intell 2009:353–356
9. Credit card fraud detection (2018) A realistic modeling and a novel learning strategy. IEEE Trans Neural Netw Learn Syst 29(8)
10. Nadim A, Sayem IM, Mutsuddy A, Chowdhury MS (2019) Analysis of machine learning techniques for credit card fraud detection. IEEE
11. Mishra N, Mishra S, Tripathy HK (2023) Rice yield estimation using deep learning. In: Innovations in intelligent computing and communication: first international conference, ICIICC 2022, Bhubaneswar, Odisha, India, Dec 16–17, 2022, Proceedings. Springer International Publishing, Cham, pp 379–388
12. Chakraborty S, Mishra S, Tripathy HK (2023) COVID-19 outbreak estimation approach using hybrid time series modelling. In: Innovations in intelligent computing and communication: first international conference, ICIICC 2022, Bhubaneswar, Odisha, India, Dec 16–17, 2022, Proceedings. Springer International Publishing, Cham, pp 249–260
13. Verma S, Mishra S (2022) An exploration analysis of social media security. In: Predictive data security using AI: insights and issues of blockchain, IoT, and DevOps. Springer Nature Singapore, Singapore, pp 25–44
14. Singh P, Mishra S (2022) A comprehensive study of security aspects in blockchain. In: Predictive data security using AI: insights and issues of blockchain, IoT, and DevOps. Springer Nature Singapore, Singapore, pp 1–24
15. Swain T, Mishra S (2022) Evolution of machine learning algorithms for enhancement of self-driving vehicles security. In: 2022 international conference on advancements in smart, secure and intelligent computing (ASSIC). IEEE, pp 1–5
16. Sahoo S, Mishra S (2022) A comparative analysis of PGGAN with other data augmentation technique for brain tumor classification. In: 2022 international conference on advancements in smart, secure and intelligent computing (ASSIC). IEEE, pp 1–7
17. Mohapatra SK, Mishra S, Tripathy HK (2022) Energy consumption prediction in electrical appliances of commercial buildings using LSTM-GRU model. In: 2022 international conference on advancements in smart, secure and intelligent computing (ASSIC). IEEE, pp 1–5
18. Stolfo SJ, Fan DW, Lee W, Prodromidis A, Chan PK (2000) Cost based modeling for fraud and intrusion detection: results from the JAM project. Proc DARPA Inf Survivability Conf Exposition 2(2000):130–144


19. Deepti DP, Sunita MK, Vijay MW, Gokhale JA, Prasad SH (2010) Comput Sci Netw Secur 10(8)

iFlow: Powering Lightweight Cross-Platform Data Pipelines Supreeta Nayak, Ansh Sarkar, Dushyant Lavania, Nittishna Dhar, Sushruta Mishra, and Anil Kumar

Abstract With the advent of ML applications cutting across sectors, data preprocessing for the training and proper functioning of ML models has risen in importance. This research paper addresses that need by proposing iFlow, a software tool for the easy creation of cross-platform data flow pipelines based on the Python programming language. The tool leverages the default file system of the user's operating system, enabling faster and real-time inflow and outflow of data for easier and more convenient data processing. The project plan emphasizes modularity and extensibility, with a focus on the automation of data pipelines, as well as the development of associated UI components for a better user experience. The paper highlights the potential applications of iFlow in the field of machine learning pipelines, positioning it as a lightweight and open-source MLOps framework for the future.

Keywords iFlow · Data pipelines · Data processing · Cross-platform · Lightweight

S. Nayak · A. Sarkar · D. Lavania · N. Dhar · S. Mishra (B)
Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, India
e-mail: [email protected]
S. Nayak
e-mail: [email protected]
A. Sarkar
e-mail: [email protected]
D. Lavania
e-mail: [email protected]
N. Dhar
e-mail: [email protected]
A. Kumar
DIT University, Dehradun, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_17


1 Introduction

With the advent of data science and associated analytics for deriving conclusions on modern-day problems, sophisticated new tools have been developed by software engineers around the world to automate data flow pipelines, allowing for faster and real-time inflow and outflow of data. These pipelines are generally used as stand-alone real-time feeds for procuring insights into data being transferred between (both inter and intra) systems. However, such tools often end up being platform-specific (mostly Linux-based) and require a high initial setup effort, increasing the time needed to get them up and running. The proposed software being developed as a part of this minor project (further referred to as "iFlow" in this document) allows for the easier creation of cross-platform pipelines based on the Python programming language and leveraging the default file system exposed by the user's OS. "iFlow" shall provide an easy and convenient way to set up such data (pre)processing pipelines, which would be both lightweight and extensible to a wide variety of other possible use cases in the future. The inherent complexity involved in the project has been handled by taking specific design decisions related to the frameworks being used; further implementation details can be found as we proceed through this document. The project plan has been created keeping in mind both modularity and extensibility, which shall allow us to enhance support well into the future and possibly add more features and types of pipelines.

In the real world, one of the major use cases of data pipelines can be seen in the setting up of machine learning pipelines, which comes under the umbrella of an emerging field better known as MLOps. These pipelines allow the automation of the data cleaning steps and the passing of the processed data to subsequent pipelines defined in the workflow. Currently, the development of "iFlow" is focused solely on the automation of data pipelines and the development of associated UI components for a better and smoother user experience, but in the future, we plan to develop it further as an open-source, lightweight, and cross-platform MLOps framework.

The following sections of this research paper are structured to provide the reader with a more in-depth understanding of the working of the system. The "Literature Review" makes the reader aware of the various steps that have already been taken in this field of study and academic research. The "Proposed Model" forms the bulk of this paper and details every aspect of the entire system and how all the different components come together to act as an easy and efficient developer tool for setting up data flow and preprocessing pipelines. The "Results" section, on the other hand, focuses on how we plan to implement the various components of the system, as well as the expected advantages obtained as a result of the various design decisions taken along the journey of developing the paper. The major contributions that we aim to make in the analysis are summarized as follows.


. Proposed “iFlow”, a lightweight and cross-platform software tool for easy creation of data flow pipelines in Python, leveraging the user’s OS file system for real-time inflow and outflow of data. . Emphasized “modularity and extensibility”, focusing on automation of data pipelines and development of UI components. Detailed proposed model for an easy and efficient developer tool for data flow and preprocessing pipelines. . Positioned iFlow as a “lightweight and open-source” MLOps framework for easier automation of data cleaning steps and processing of data in machine learning pipelines. . “Cross-platform tool” for easier creation of pipelines, lightweight, and extensible for a variety of future use cases. Specific design decisions related to frameworks are used to handle inherent complexity and enable faster data processing.

2 Literature Review

Before diving deeper and attempting to create our very own data pipeline and preprocessing framework, it is necessary to understand the tools already available in the market for the same purpose, in order to tackle the problems faced by modern-day developers while developing such streams for feeding and training machine learning models. This literature review section attempts to summarize all such works of research and condense the matter discussed in them. Machine learning and data pipelines are rapidly evolving fields, with researchers proposing various approaches to improve efficiency, scalability, and performance. One of the proposed approaches is the use of distributed computing technologies, as demonstrated by Bui et al. [1] in their data pipeline architecture that can handle large volumes of data with low latencies. Li et al. [2] took this approach further by introducing an automated pipeline optimization framework that uses a genetic algorithm to efficiently search for the best pipeline configuration based on performance metrics. However, integrating data pipelines and machine learning workflows efficiently in real-world scenarios remains a challenge. Islam et al. [3] proposed a conceptual architecture to seamlessly address this challenge. They identified the challenges and opportunities of implementing such an architecture in real-world scenarios. Cruz et al. [4] introduced a pipeline architecture that provides efficient integration and deployment of machine learning workflows, highlighting its benefits in terms of scalability, reusability, and easy integration. Another important aspect of data pipelines for machine learning is the choice of framework. Sivakumar et al. [5] compared different data pipeline frameworks based on factors such as ease of use, scalability, and performance. Furthermore, Onu et al. [6] discuss the challenges and opportunities of building an efficient and effective data pipeline for machine learning workflows. To ensure the reliability and performance of data pipelines, it is important to monitor them. Taranu et al. [7] provide a comprehensive review of existing research in pipeline monitoring and identify key challenges and opportunities in applying machine learning to this field.


Overall, these research papers provide valuable insights into various approaches for improving the efficiency and scalability of machine learning pipelines while identifying key challenges and opportunities in this rapidly evolving field. Data preprocessing is an essential step in machine learning tasks, and researchers have proposed various approaches to improve the efficiency and scalability of data preprocessing pipelines. One such approach is the use of a modular pipeline architecture, where each module performs a specific task such as data cleaning, transformation, or feature extraction. The pipeline employs parallelization techniques to improve processing speed [8, 9]. Another proposed approach is the use of cross-platform data preprocessing frameworks that leverage machine learning algorithms and cloud computing resources. The frameworks use deep neural networks (DNNs) to preprocess time series data or principal component analysis (PCA) and artificial neural networks (ANNs) to improve classification accuracy [10–12]. The pipeline architecture also supports cross-platform processing through the use of Apache Arrow as a cross-platform data format. Additionally, the pipeline employs Apache Spark for distributed processing and utilizes several optimization techniques, including caching and parallelization, to improve processing speed [13]. The proposed models ensure that the data can be processed efficiently on different platforms without the need for data format conversion or data movement, making the pipeline portable and allowing for seamless data preprocessing across different platforms, including Windows, Linux, and macOS [14]. Data mining primitives have increasingly been used in Customer Relationship Management (CRM) software. Open-source big data software stacks have emerged as an alternative to traditional enterprise database stacks. A large-scale industrial CRM pipeline is described that incorporates data mining and serves several applications using Kafka, Storm, HBase, Mahout, and Hadoop MapReduce [15]. MLCask is an end-to-end analytics system that supports Git-like version control semantics for machine learning pipelines. The system enables multiple user roles to perform branching and merging operations, while also reducing storage consumption and improving efficiency through reusable history records and pipeline compatibility information [16]. Data exploration through visualization is a crucial step for scientists to analyze and validate hypotheses [17]. Pipeline61 is a framework that supports the building of data pipelines across multiple environments by reusing the existing code of deployed jobs and providing version control and dependency management to deal with typical software engineering issues [18]. Apache StreamPipes is a graphical tool for pipeline management that utilizes container management tools like Kubernetes to manage and execute complex stream processing pipelines for big data. The proposed architecture and evaluation provide insights into the dependencies and interplay of the technologies involved in managing and executing big data stream processing pipelines [19]. In [20], the proposed data pipeline framework can improve the quality of Automatic Identification System (AIS) data and provide a foundation for various maritime management applications. The framework includes data collection, preprocessing, visualization, trajectory reconstruction, and storage, utilizing Apache Kafka for data streaming. 
The DFSR approach utilizes both data features and service associations to automatically generate machine learning pipelines for data


analysis, reducing the level of expertise required for domain workers and making automated decisions in data analysis more accessible [21]. The papers cited above, along with their concise summaries, aim to give readers the background required to understand the need for a new, more customizable, lightweight, and cross-platform framework like "iFlow", and to justify the development effort that creating such a framework entails.

3 Proposed Methodology

iFlow v1.0.0 is focused solely on the creation and manipulation of data via data processing pipelines, workflows, and connectors through a Command Line Interface (CLI). The architecture for this first implementation of iFlow consists of four major components denoted by four distinct colors. The sections below elaborate on these major components and the functions they perform in the framework. Figure 1 shows the overall workflow model.

Fig. 1 Zoomed out architecture of iFlow at a glance


Data Source Manager The Data Source Manager allows the user to smoothly interact with datasets as well as their precise locations on the user's file system via the CLI. Users can either manually add folders containing .csv files under the datasets directory, or they can directly download raw datasets from Kaggle via the public API. This remote access and download of datasets is handled by the Kaggle API wrapper functions contained in the KIM, or Kaggle Interaction Module, which provides a convenient CLI-based system that developers can use to manage remote datasets by obtaining structured local copies [22].

Implementation: The Data Source Manager is implemented via an interactive Command Line Interface (CLI) developed in Python that uses the Python "requests" library for making calls to the public Kaggle Dataset API for fetching and downloading resources in the backend. It is also involved in managing the directory structure by making sure that the files and directories being created are consistent with the packages and modules installed by iFlow. Every task that needs to be performed by iFlow is represented by a script that is run on a .csv file represented as a 2D matrix and stored in temporary files (which may or may not be the case, depending on the options provided in the configuration files) during transit from one scheduled workflow job to another.

Script Source Manager The Script Source Manager is somewhat similar to the NPM package registry used by NodeJS to distribute, manage, and maintain packages at a global scale. It represents a marketplace that contains various modules and script packages created by third-party independent users and developers, which anyone can use to add particular functionality to a project being developed with the "iFlow" framework. Users can create their own custom scripts for data preprocessing and make those scripts available on the marketplace for global use, leading to a strong developer ecosystem and troubleshooting community.

Implementation: The Script Source Manager represents an API interface that allows the uploading of new scripts (adding new scripts to the marketplace), downloading scripts via the CLI for usage in a project (installing modules), as well as making changes to an existing uploaded package by the developer who owns it [23, 24]. This entire system would be represented by a well-documented API ecosystem accompanied by a built-in admin interface, developed using the Django Rest Framework (DRF), making it ideal for both scalability and ease of use for system admins. The Script Source Manager has wrapper functions defined in it that call the above-mentioned DRF-based API endpoints in the backend. Once the script manager fetches the required scripts or modules from the API, it passes the data to the Data Source Manager, which then decides where and how to structure the storage of the script files.
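As one illustration of the KIM wrapper functions described above, the sketch below uses the official kaggle Python client rather than raw "requests" calls; the helper name and directory layout are hypothetical, and credentials are expected in ~/.kaggle/kaggle.json.

```python
from pathlib import Path

from kaggle.api.kaggle_api_extended import KaggleApi  # official Kaggle client

def fetch_dataset(dataset_ref: str, datasets_root: str = "datasets") -> Path:
    # dataset_ref uses the usual "<owner>/<dataset-slug>" Kaggle identifier
    target = Path(datasets_root) / dataset_ref.replace("/", "_")
    target.mkdir(parents=True, exist_ok=True)
    api = KaggleApi()
    api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
    api.dataset_download_files(dataset_ref, path=str(target), unzip=True)
    return target  # structured local copy under the datasets/ directory
```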


iFlow Source Files As already mentioned in the architecture, these files are used by "iFlow" for running the entire framework and enabling the user to interact with all the other components of the system. These files can be thought of as utility code that is frequently required or consumed by the other modules present in the system [25]. They may include functions that deal with the creation, updating, deletion, or any other kind of management of the underlying file system. They can also include network-based utility functions used for making specific API calls in a secure and session-oriented manner. Other possible auxiliary or utility functions include those concerned with the encryption and decryption of files (intermediate iFlow files, if the data requires confidentiality), compression, and more, discovered as progress is made on the development of the framework.

Implementation: The iFlow Source Files do not have a specific implementation language. They are represented by a mix of configuration files (either .yml, .csv, or .txt files) that come together or are utilized by other modules as already mentioned. The scripts are written in Python (.py files) and are responsible for parsing the configuration and data files in order to carry out useful functions [26].

Config Files The configuration files form the heart of "iFlow". These files are used to create and define pipelines, workflows, and connectors, which give iFlow modularity and code reusability (not to mention code shareability via the "iFlow Developer Marketplace"). The four types of config files used by iFlow, shown in Fig. 2, are as follows:
1. Jobs/Tasks: These are the smallest quantum or token of iFlow and define the script or code that is to be run on a particular piece of data.
2. Pipelines: These are a collection of jobs (scripts and Python commands) defined using YAML and form the building blocks of workflows. A pipeline can have multiple jobs, and each job processes the data and passes the modified or transformed data to the next stage or job.
3. Connectors: These are logical units that are used to glue or connect pipelines together. Whenever the data encounters a connector, the logical code inside the connector is executed to decide which pipeline should receive the data next

Fig. 2 Sample workflow using iFlow and the various constituent components


[27]. They allow for the creation of dynamic workflows based on certain data properties. Connectors are defined by YAML in conjunction with references to scripts that contain the Boolean logic based on which decisions are taken at the connectors.
4. Workflows: These refer to the entire system formed by connecting multiple pipelines together with connectors. Workflows are used to accomplish a particular data processing task. Different workflows can be created based on the end users of the final data.
The above points aim at introducing the reader to the vocabulary used in the iFlow documentation as well as in this paper as a whole [28]. The following subsection elaborates on the various definitions and schemas used to create Jobs, Pipelines, Connectors, and Workflows in order to provide a complete understanding of the framework for developers.

iFlow Schemas: Every config file that represents a Job/Task, Pipeline, Connector, or Workflow in iFlow is represented by a YAML (.yml) file that conforms to a specification defined in the official iFlow master issues on GitHub under the "Master Issue/Schemas" heading. The following section provides a quick developer-level description of the schema that we have developed over time for iFlow, keeping performance, ease of parsing, and ease of definition by a user in mind.

Job/Task Schema: Name, Description, and Script.

Pipeline Schema: Name, Description, and Jobs. Jobs further consist of the following options (a sketch of a conforming pipeline file follows the list):

. execute (required): The name of the task that will be carried out.
. in (optional): A list of the filenames that will serve as the job's input.
. out (optional): A list of the filenames that will serve as the job's output.
. encr (optional): A Boolean value expressing whether or not the input data is encrypted.
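As referenced above, the following is a hypothetical pipeline definition conforming to these fields, embedded in Python and parsed with PyYAML; the exact key names in the official iFlow schema may differ:

```python
import yaml  # PyYAML

PIPELINE_YML = """
name: clean-sonar-data
description: Drop empty rows, then normalise every column.
jobs:
  - execute: drop-empty-rows
    in: [raw/sonar.csv]
    out: [tmp/no_empty.csv]
  - execute: normalise-columns
    in: [tmp/no_empty.csv]
    out: [clean/sonar.csv]
    encr: false
"""

pipeline = yaml.safe_load(PIPELINE_YML)
for job in pipeline["jobs"]:
    # Each job transforms its input files and hands the output to the next job
    print("would run:", job["execute"])
```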

Connector Schema Core YAML Structure: The Core YAML Structure defines the configuration options for creating a connector in a data processing pipeline framework. The following are the fields in the Core YAML Structure and their descriptions: Name, Description, and Script. Add On Branch for Intrinsic Branching (Not to be used now): The Add On Branch for Intrinsic Branching provides configuration options for defining branches for a connector. These branches are used for intrinsic branching, which is not recommended for use at the moment. The following are the fields in the Add On Branch for Intrinsic Branching and their descriptions: Branches, Branch, Assert, and Transfer.


Workflow Schema
Recursive Workflow Declaration: The Recursive Workflow Declaration is used to represent a workflow with branching and sub-flows. It uses the "pipeline-exec" and "connector-exec" commands to execute pipelines and connectors, respectively.
Linear Workflow Declaration: The Linear Workflow Declaration is used to represent a workflow without branching. It also uses the "pipeline-exec" and "connector-exec" commands to execute pipelines and connectors, respectively.
The recursive workflow schemas are easier for the user to define and provide a better developer experience due to their more natural representation. On the other hand, the recursive nature of the schema can lead to a recursive hell that makes it difficult to model larger and more complex recursive or branching relations between pipelines via connectors. Therefore, in the case of highly complex workflows, the linear schema provides a more systematic and maintainable approach for defining branchings (a minimal linear example is sketched at the end of this section).

Tech Stack Used for Implementation
In terms of the tech stack, the majority of the codebase shall be written in Python. This includes the source files for iFlow as well as the servers for the marketplace (written using the Django framework). The configuration files will be written in YAML, and all other libraries used shall be documented as the project proceeds and takes shape. We will follow various software conventions such as semantic commit messages and proper Git collaboration conventions, as well as ensuring automated code coverage and testing by setting up appropriate CI/CD pipelines where required. APIs shall be documented using "Swagger", framework-specific documentation integrated with CI/CD shall be maintained on "Docusaurus", and tickets shall be raised on "GitHub Issues".
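As promised above, a minimal sketch of a linear workflow declaration; the step keys mirror the "pipeline-exec" and "connector-exec" commands named earlier, while the concrete pipeline and connector names are hypothetical:

```python
import yaml  # PyYAML

WORKFLOW_YML = """
name: sonar-preprocess-flow
description: Clean the data, then route it to the next pipeline.
steps:
  - pipeline-exec: clean-sonar-data
  - connector-exec: route-by-row-count  # Boolean script decides the next pipeline
  - pipeline-exec: feature-extraction
"""

workflow = yaml.safe_load(WORKFLOW_YML)
for step in workflow["steps"]:
    (kind, target), = step.items()  # each step is a single-key mapping
    print(f"{kind} -> {target}")
```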

4 Result Analysis and Discussion

The Django REST Framework includes support for serialization, authentication, permissions, pagination, and filtering as some of its core features. Additionally, it supports several document types, including YAML, XML, and JSON. It is critical to concentrate on constructing a simple and unified API architecture when implementing APIs using the Django REST Framework. This is possible by adhering to RESTful principles, which include using HTTP methods and status codes appropriately, offering clear and simple documentation, and making sure that API endpoints are logically organized. Python is designed to be inherently cross-platform, meaning that it supports any operating system such as Windows, macOS, and Linux. This is because Python code is first compiled into platform-independent bytecode, which is then interpreted by the Python interpreter on the target platform.
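To make the marketplace API concrete, here is a minimal sketch of what a script endpoint could look like under the Django REST Framework inside a configured Django app; the Script model and its fields are hypothetical stand-ins for iFlow's actual schema:

```python
# models.py / views.py of a hypothetical "marketplace" Django app
from django.db import models
from rest_framework import serializers, viewsets

class Script(models.Model):
    name = models.CharField(max_length=100, unique=True)
    description = models.TextField()
    source = models.TextField()  # body of the preprocessing script

class ScriptSerializer(serializers.ModelSerializer):
    class Meta:
        model = Script
        fields = ["id", "name", "description", "source"]

class ScriptViewSet(viewsets.ModelViewSet):
    # Exposes list/retrieve/create/update/destroy via standard HTTP verbs
    queryset = Script.objects.all()
    serializer_class = ScriptSerializer
```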


This was the major influence on choosing Python as our primary programming language. In addition, since Python is cross-platform, developers can write code once and run it on any system without making significant changes. This avoids wasted time and effort and guarantees that the program behaves consistently on all platforms. The comparative analysis of the proposed iFlow and other existing approaches is given in Table 1.

Along with being cross-platform, Python is renowned for its lightweight architecture, which makes it a popular option for applications where efficient resource use is crucial. As a dynamically typed language, Python has a simpler and more streamlined syntax and does not need explicit variable declarations, which makes code easier to write and read. It also has a small footprint, meaning that running it consumes fewer system resources, which makes it a good option for resource-constrained environments or low-powered devices. Overall, Python's lightweight construction and cross-platform portability make it an excellent choice for building iFlow as a cross-platform application. This is further helped by the fact that Python comes packaged with a wide variety of packaging, testing, coverage, and load-testing tools, such as “pip”, “pytest”, “codecov”, and “locust”, respectively, that allow for in-house testing and reduced developer expense.

To test the scalability of the Django REST Framework, on top of which the majority of the iFlow marketplace and script manager is built, we used the “locust” framework provided by Python to load test our API endpoints (a sketch of such a locustfile follows Table 1). We simulated the usage of iFlow by a carefully controlled and increasing user base requesting resources from the server. Our findings are summarized in terms of two major factors: latency (response time) and error rate; based on both, we present a measure of how scalable iFlow is. Since iFlow takes an entirely different approach to the concept of data pipelines, simplifying them to an entirely new level, no preexisting benchmark studies or direct competitors for the framework were found in the market.

Table 1 Comparison between iFlow and other similar frameworks

| Framework | iFlow | Other similar frameworks |
|---|---|---|
| Language | Python | Varies |
| Cross-platform | Yes | Mostly platform-specific (e.g., Linux-based) |
| Leveraging OS file system | Yes | Varies |
| Lightweight | Yes | Varies |
| Extensible | Yes | Varies |
| Emphasis on automation | Yes | Varies |
| Emphasis on UI | Yes | Varies |
| Potential application in MLOps | Yes | Varies |
| Open source | Yes (planned) | Varies |
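The load tests were driven by locustfiles of roughly the following shape; this is a minimal sketch, and the endpoint paths and wait times are assumptions, since the exact routes are not listed here.

```python
from locust import HttpUser, between, task


class IFlowUser(HttpUser):
    """Simulates one marketplace user issuing periodic requests."""
    wait_time = between(1, 3)  # seconds of think time between tasks

    @task(3)
    def browse_pipelines(self):
        self.client.get("/api/pipelines/")  # hypothetical endpoint

    @task(1)
    def fetch_script(self):
        self.client.get("/api/scripts/1/")  # hypothetical endpoint
```

A run such as `locust -f locustfile.py --users 1000 --spawn-rate 10` then ramps up the simulated user base while recording response times and failure counts.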


Figure 3 shows two test runs (Run #1 and Run #2), representing data for a total of 500 and 1000 users, respectively. The number of failures recorded in both cases was 0, which indicates that the server/framework can handle at least a thousand concurrent users in its native, non-optimized state, by virtue of the Django REST Framework. The response times, however, undergo a sudden spike in Run #2, indicating that prolonged periods of such load might not be optimal for the server. In both cases, response times decrease drastically once the load becomes more or less constant. In Fig. 4, two test runs (Run #3 and Run #4) represent data for a total of 1500 and 2000 users, respectively. The number of failures recorded in both cases was nonzero, which indicates that the server/framework cannot handle more than a thousand concurrent users in its native, non-optimized state. The response times once again spike in both Run #3 and Run #4, again suggesting that prolonged periods of such load might not be optimal for the server; in both cases, response times decrease drastically once the load stabilizes. From the situations analyzed above, we conclude that in its native single-threaded Django application state, iFlow is capable of handling up to 1000 concurrent users. For a single-instance application, this performance is significant.

Fig. 3 Locust tests for the iFlow Django Rest Framework-based server, Run #1 and Run #2


Fig. 4 Locust tests for the iFlow Django Rest Framework-based server, Run #3 and Run #4

To enable higher scalability of iFlow, it is necessary to run it in containerized form so that load can be balanced across multiple instances. The user progression for all four test cases was considered over a constant number of iterations with different user-base sizes and is displayed in Fig. 5. The scalability of iFlow therefore depends to a large extent on the method used to deploy it in a multi-instance environment with the help of containerized services.

Fig. 5 User progression graphs during load testing


5 Conclusion and Future Work

The proposed software tool, iFlow, offers a quick and simple method for establishing Python-based cross-platform data flow pipelines by utilizing the operating system's native file system. To support a wide range of potential use cases in the future, a strong emphasis has been placed on the modularity and extensibility of the project design. There are also plans to expand it further into an open-source, lightweight, and cross-platform MLOps framework. Overall, iFlow has the potential to revolutionize the field of data processing by making it easier and more convenient for developers to set up cross-platform data flow pipelines. With its modularity and extensibility, the software tool is well positioned to be a valuable addition to the MLOps toolkit and beyond.




Developing a Deep Learning Model to Classify Cancerous and Non-cancerous Lung Nodules Rishit Pandey, Sayani Joddar, Sushruta Mishra, Ahmed Alkhayyat, Shaid Sheel, and Anil Kumar

Abstract The detection of lung nodules is critical for enhancing patient outcomes, as lung cancer is a major contributor to cancer-related deaths worldwide. Medical image analysis has benefited greatly from deep analytics approaches, more specifically CNNs. In this study, we utilized a dataset of chest CT scans to train a ConvNet model that automatically classifies lung nodules as cancerous or non-cancerous. The model performed well on this task, achieving a high level of accuracy. The outcomes of this study suggest that CNNs have the potential to give more precise results in nodular tumour diagnosis and screening. Keywords Lung cancer · Deep learning · Classification · Accuracy rate · Machine learning

R. Pandey · S. Joddar · S. Mishra (B) Kalinga Institute of Industrial Technology, Deemed to Be University, Bhubaneswar, India e-mail: [email protected] R. Pandey e-mail: [email protected] S. Joddar e-mail: [email protected] A. Alkhayyat Faculty of Engineering, The Islamic University, Najaf, Iraq S. Sheel Medical Technical College, Al-Farahidi University, Baghdad, Iraq e-mail: [email protected] A. Kumar Tula’s Institute, Dehradun, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_18


1 Introduction

A form of nodule that originates in the lung cells, resulting in uncontrolled growth of abnormal cells in the lung tissue that can form tumours and spread to other areas, is called lung cancer [1]. If not detected and treated early, lung cancer can be fatal, making it a prevalent and prominent cause of cancer-related deaths worldwide. Signs of a lung tumour include chronic coughing, chest pain, difficulty breathing, fatigue, and unintended weight loss. While smoking is the main cause of lung cancer, exposure to second-hand smoke, radon, asbestos, and other environmental factors can also increase the likelihood of developing the disease. Lung cancer can be diagnosed by examining a sample of lung cells in a laboratory. Pulmonary nodules, which are abnormal growths in the lungs, can also be detected through medical imaging such as CT scans. Although pulmonary nodules are typically non-cancerous, they can indicate the presence of cancer in some cases. CT scans are superior to other medical imaging techniques such as X-rays because they produce more accurate and less noisy results. Early prediction and treatment of lung cancer are crucial for reliable diagnosis and increased chances of survival. Lung cancer has four stages, defined by how far the cancer has propagated within the lungs and to other organs [2]:

- First stage: the cancer is limited to the lung and has not reached lymphatic nodes or other organs. This is further categorized into I-A, where the nodule is smaller than 3 cm, and I-B, where the nodule is larger than 3 cm.
- Second stage: the malignancy has spread to adjacent lymphatic nodes or lung tissues. Stage II is further subdivided into II-A, where the nodule is smaller than 3 cm and has spread to nearby lymphatic nodes, and II-B, where the tumour is larger than 5 cm and has spread to nearby lymphatic nodes, or is between 3 and 5 cm and has spread to nearby lymphatic nodes.
- Third stage: the malignancy affects the lymphatic nodes in the mediastinum or nearby structures such as the chest wall, diaphragm, or oesophagus. This is further subdivided into III-A, where the cancer has spread to lymphatic nodes on the same side of the chest as the primary tumour, or to nearby structures such as the chest wall or diaphragm, and III-B, where it has spread to lymphatic nodes on the opposite side of the chest or to structures such as the heart, major blood vessels, or trachea.
- Fourth stage: the malignancy has propagated to other organs in the body, such as the liver, brain, and bones.

The survival rate differs at each stage of cancer; the earlier the diagnosis, the better the chance of survival, although there is no guaranteed cure for cancer yet. Deep machine analytics, a domain of computational intelligence that applies neural network variants to learn from large numbers of instances, has shown promise in detecting lung cancer. There are several approaches to using deep learning for identifying lung cancer, including computed tomography (CT) scans and X-rays. In CT scans, deep learning algorithms can be trained to identify lung


nodules and other abnormalities that may indicate the presence of cancer. Convolutional neural networks (CNNs) or other deep learning algorithms capable of detecting patterns in medical images may be employed for this. This research focuses on a deep learning-based convolutional neural network model to identify and categorize lung tumours as cancerous (malignant) or non-cancerous (benign and normal). This may help radiologists and other healthcare workers arrive at more accurate and efficient diagnoses, which is particularly important in early-stage lung cancer, as earlier detection can improve outcomes considerably.

2 Related Work

In recent decades, there has been substantial progress in image recognition techniques. These methods have been widely utilized in different areas, including medical imaging, pattern recognition, video processing, robot vision, and more. One significant advance in medical image analysis has been the detection of cancer using deep learning methodologies. In particular, convolutional neural networks (ConvNets/CNNs) [1] have demonstrated favourable outcomes in diagnosing cancer, with some studies reporting accuracy levels similar to those of human radiologists. The authors in [2] put forth a CNN-driven approach, presently under review, to identify lung risks in the initial phases and facilitate prompt treatment. They built the model using Python 3's TensorFlow-Keras libraries. Initially, the dataset comprised 1097 images, but the researchers augmented it to 8461 images. The model yielded a remarkable accuracy of 99.45%. Sushruta et al. [3] conducted a review of deep learning techniques employed in lung cancer research, particularly in detecting lung nodules from chest radiographs and computed tomography scans using a smart IoT module. Their study revealed two key challenges in this domain: first, there is a pressing need for more rigorous testing of deep learning algorithms in actual medical practice to establish their practical utility; second, future research must incorporate heterogeneity into its scenarios, since real-world applications must handle diverse types of patients. In 2018, Asuntha et al. [4] presented a novel approach for identifying cancerous lung nodules from input lung images, classifying the lung cancer, and assessing the extent of the disease. Their research incorporated advanced deep learning techniques for locating cancerous lung tumours. The authors used a combination of techniques to extract features from medical images, including wavelet transform features, histogram of oriented gradients, scale-invariant feature transform, local binary patterns, and Zernike moments. A fuzzy particle swarm optimization approach was then used to select the most appropriate attributes for classification. The selected features were classified using deep learning, with a novel FPSOCNN model designed to reduce the computational complexity of the CNN. The researchers tested their approach on a dataset from Arthi Scan Hospital and found that their FPSOCNN model performed better than other


methods. Overall, their approach shows promise for improving the accuracy and efficiency of medical image analysis [5]. In 2019, S. Bhatia et al., from the Department of Computer Science and Information Systems at BITS Pilani, developed a method for detecting lung nodule malignancy in CT scans using residual deep learning. They first created a preprocessing pipeline to identify the areas of the lung that are prone to cancer, and then retrieved attributes from these areas using UNet and ResNet models. The retrieved attributes were input to a residual deep learning model to categorize the images as either cancerous or non-cancerous. This technique has the potential to increase the precision of lung cancer detection and provide a more efficient way of screening patients for the disease. They then used classifiers such as XGBoost and random forest on the extracted features, and the individual outputs were ensembled to predict cancerous cells. Their proposed method achieved an accuracy of 84% on the LIDC-IDRI dataset [6]. In 2020, N. Kalaivani et al. from Sri Krishna College of Engineering and Technology and SACS MAVMM Engineering College proposed a deep neural network (DenseNet) and adaptive boosting algorithm-based model to classify lung nodules as normal or malignant from CT scan imaging. They used a dataset of 201 lung images, split in the ratio 85:15 for training and testing; in their experiments, the model achieved an accuracy of 90.85% [7]. In 2022, researchers from Bharath Institute of Higher Education and Research, Chennai, led by N. Sudhir Reddy, conducted a study aimed at identifying early-stage malignancy in lung nodules using deep learning techniques. They found convolutional neural networks to be the best way of analysing medical images, classifying lung nodules, extracting attributes, and predicting lung cancer. To predict the growth of malignant tissue in CT imaging data, they used the improved dial's loading algorithm (IDLA). The implementation of IDLA for lung malignancy diagnosis and prediction involves four stages: extortion localization, machine vision, AI-enabled bioinformatics, and clinical CT image determination. They used a CNN with 2D convolutional layers, including input, convolutional, rectified linear unit (ReLU), pooling, and dense layers. Their proposed IDLA achieved an accuracy of 92.81% [8]. In 2019, I. M. Nasser et al. developed an artificial neural network (ANN) model for detecting the presence or absence of lung cancer in humans. The ANN was trained to recognize lung cancer using various input variables, including symptoms like wheezing, fatigue, chest pain, coughing, shortness of breath, swallowing difficulty, yellow fingers, anxiety, chronic disease, and allergy. The training, validation, and testing dataset used in the experiment was called “survey lung cancer”. The results show that the ANN model attained a detection accuracy of 96.67% in identifying the presence or absence of lung nodule malignancy [9]. In 2018, W. Rahane et al. discussed the prevalence of lung cancer in India and the relevance of identifying it early as a means of treating the patient. The study introduces a system for lung cancer detection that integrates machine learning and image analysis techniques. The system can classify CT images and blood samples to determine the presence of lung cancer.
The CT images are first categorized as normal or abnormal, and the abnormal images are segmented to isolate the tumour area. The system then extracts features from the images and applies SVM

Developing a Deep Learning Model to Classify Cancerous …

229

and image processing techniques to classify the images. The purpose of the study is to improve the accuracy of lung cancer classification and staging [10]. In 2020, A. Elnakib et al. presented a CADe system for the early detection of lung nodules from LDCT images. The proposed system included contrast enhancement of the raw data, extraction of deep learning features from various networks, refinement of the extracted features using a genetic algorithm, and testing of different classifiers to identify lung nodules. The system achieved an accuracy as high as 96.25%, a sensitivity of 97.5%, and a specificity of 95% using the 19-layer Visual Geometry Group (VGG19) architecture and a support vector machine classifier on 320 LDCT images extracted from 50 subjects in the I-ELCAP database. The proposed system surpassed other state-of-the-art approaches and demonstrated significant potential for early detection of lung nodules [11]. In 2018, Suren Makaju et al. highlighted the importance of early diagnosis and treatment of lung cancer and the challenges doctors face in accurately interpreting CT scan images to identify cancerous cells. The research addresses the limitations and drawbacks of several automated detection systems that involve image processing and machine learning techniques. The authors suggest a new model for finding malignant nodules in lung CT scan images that uses watershed segmentation for identification and SVM for categorization into malignant or benign. The proposed model achieves an accuracy of 92% for detection and 86.6% for classification, an improvement over the existing best model. Even so, the proposed system cannot classify the cancer into different stages, and the authors suggest further improvements in pre-processing and the elimination of false objects to increase accuracy. The paper concludes that future work can focus on implementing classification into different stages and enhancing the accuracy of the proposed system [12]. In his research, Mokhled S. Al-Tarawneh noted that in medical fields, image processing methods are widely used to enhance images and detect abnormalities in target images, particularly for cancers like lung and breast cancer, where time is crucial. That research project seeks to enhance image quality and accuracy using minimal pre-processing techniques such as Gaussian rules and Gabor filters. An enhanced region of interest is discovered and employed for feature extraction after segmentation. The image's normality is then compared using general characteristics, with pixel percentage and mask-labelling serving as the major features for reliable image comparison [13]. In 2019, Radhika P.R. et al. conducted a comparison of the detection of cancerous lung nodules using machine learning algorithms. Their paper focused on early detection of lung cancer through the analysis of various classification algorithms, including naïve Bayes, support vector machines, decision trees, and logistic regression, with the main aim of evaluating the performance of these algorithms in predicting lung cancer. In 2022 [14], a study reviewed 65 papers that focused on predicting different diseases using data science algorithms, with the goal of identifying scope for future refinement in detecting lung cancer in medical technology. Each approach was studied and its drawbacks brought forth. The study also examined the nature of the data used for predicting diseases, whether benchmark or manually collected.
Finally, research directions were identified to help future researchers accurately detect lung cancer


patients at an early stage without any errors, based on the various methodologies used [15].

3 Proposed Method

We propose a CNN-based model consisting of several convolutional and max pooling layers, followed by flattening and dense layers that produce the required output. The workflow model is displayed in Fig. 1.

Fig. 1 Workflow model representation


Fig. 2 Lung nodules scans before and after enhancement

3.1 Dataset

We have taken the IQ-OTH/NCCD lung cancer dataset [16] from Kaggle, which has three directories and one file. The directories are the benign, malignant, and normal cases, respectively, and the file is a text file describing the same. There are 120 files in the benign case, 561 in the malignant case, and 416 in the normal case. After collecting the data, pre-processing is applied; the pre-processed scans are shown in Fig. 2. For data pre-processing, we have resized the images to 256 × 256 so as to obtain a homogeneous input size for the model. For image enhancement, we have used CLAHE [1]. CLAHE is an abbreviation for contrast-limited adaptive histogram equalization, an image processing technique that improves an image's contrast. The popular technique of histogram equalization redistributes the pixel intensity values to improve contrast; however, it has the disadvantage of exaggerating noise in the image [18, 19]. To counteract this, CLAHE applies histogram equalization to small, local regions of the image instead of the entire image. This adapts contrast enhancement to the unique features of each region by constraining the amplification of contrast based on the amount of data available in each region. The technique is beneficial because contrast enhancement is flexible rather than uniform across the entire image, preventing noise over-enhancement and preserving overall image brightness. To increase the dataset size, we applied data augmentation. After this, we moved the benign and normal cases to the non-cancerous folder and the malignant ones to the cancerous folder.
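A minimal sketch of this resize-and-enhance step, assuming OpenCV, is shown below; the clip limit, tile grid size, and file names are illustrative choices, not values reported above.

```python
import cv2

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
img = cv2.resize(img, (256, 256))  # homogeneous input size used above

# CLAHE: histogram equalization applied per local tile, with the contrast
# amplification clipped to avoid over-enhancing noise.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)
cv2.imwrite("scan_clahe.png", enhanced)
```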

3.2 Model Architecture

Our model employs a number of convolutional layers (each layer utilizes a set of kernels to extract features, such as edges, corners, or textures, from the input image. These filters


consist of small weight matrices that slide over the input image in a window-like manner, computing a dot product at each position. This produces a feature map that indicates the presence of the particular feature in the input image) [17], max pooling layers (max pooling is a type of pooling layer used in CNNs to down-sample the feature maps while retaining important information. It partitions the feature map into non-overlapping regions and takes the maximum value from every region, which reduces the spatial size of the feature map and introduces translational invariance [18]. Max pooling is employed after a convolutional layer to reduce the number of parameters and prevent overfitting), and dense layers (dense layers, also known as fully connected layers, connect each neuron in the current layer to each neuron in the previous layer. They take the flattened input, multiply it by a weight matrix, and pass it through an activation function to introduce nonlinearity. Dense layers are commonly used in classification and regression tasks and are placed towards the end of the model to transform the output of earlier layers into a vector of predicted outputs. The number of neurons should be adjusted based on the complexity of the problem, as too many or too few neurons can lead to overfitting or underfitting. Dense layers play a crucial role in neural networks by allowing the model to learn and classify complex patterns in the input data) to carry out the detection task. Nonlinearity has been added using the rectified linear unit (ReLU). Table 1 summarizes the proposed architecture [19], listing each layer, its output shape, and its parameter count.

Table 1 Parameters of proposed model

| Layer (type) | Output shape | Param # |
|---|---|---|
| Sequential | (32, 256, 256, 3) | 0 |
| Conv 2d | (32, 255, 255, 64) | 832 |
| Max pooling 2d | (32, 127, 127, 64) | 0 |
| Conv 2d 1 | (32, 126, 126, 64) | 16,448 |
| Max pooling 2d 1 | (32, 63, 63, 64) | 0 |
| Conv 2d 2 | (32, 62, 62, 32) | 8224 |
| Max pooling 2d 2 | (32, 31, 31, 32) | 0 |
| Conv 2d 3 | (32, 30, 30, 16) | 2064 |
| Max pooling 2d 3 | (32, 15, 15, 16) | 0 |
| Flattened | (32, 3600) | 0 |
| Drop out | (32, 3600) | 0 |
| Densed | (32, 32) | 115,232 |
| Dense 1 | (32, 3) | 99 |

Total params: 141,898; trainable params: 141,898; non-trainable params: 0
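For reference, the following Keras sketch reconstructs the architecture implied by Table 1; the 2 × 2 kernel size and the dropout rate are inferred from the output shapes and parameter counts rather than stated explicitly above.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(256, 256, 3)),
    layers.Conv2D(64, 2, activation="relu"),  # -> (255, 255, 64), 832 params
    layers.MaxPooling2D(),                    # -> (127, 127, 64)
    layers.Conv2D(64, 2, activation="relu"),  # -> (126, 126, 64), 16,448 params
    layers.MaxPooling2D(),                    # -> (63, 63, 64)
    layers.Conv2D(32, 2, activation="relu"),  # -> (62, 62, 32), 8,224 params
    layers.MaxPooling2D(),                    # -> (31, 31, 32)
    layers.Conv2D(16, 2, activation="relu"),  # -> (30, 30, 16), 2,064 params
    layers.MaxPooling2D(),                    # -> (15, 15, 16)
    layers.Flatten(),                         # -> (3600,)
    layers.Dropout(0.5),                      # rate assumed
    layers.Dense(32, activation="relu"),      # 115,232 params
    layers.Dense(3, activation="softmax"),    # benign / malignant / normal
])
model.summary()  # total params: 141,898, matching Table 1
```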


ReLU is an activation function that introduces nonlinearity in neural networks. It sets any input value less than zero to zero and keeps any positive value unchanged. ReLU has several advantages over other activation functions, including faster convergence during training and better performance in deep neural networks, since it avoids the vanishing gradient problem. Three classes of lung scans (benign, malignant, and normal) were classified as cancerous or non-cancerous using the softmax activation function. The softmax function takes a vector of real-valued scores, such as the output of a fully connected layer, applies the exponential function to each element to ensure non-negative values, and then normalizes the resulting vector to sum to 1, representing a probability distribution over the classes.
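A small numerical illustration of both activation functions (a sketch, not the authors' code):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)  # negatives clipped to zero, positives unchanged

def softmax(scores):
    e = np.exp(scores - scores.max())  # shift by the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])  # e.g. scores for benign/malignant/normal
print(relu(logits))                  # [2.  0.  0.5]
print(softmax(logits))               # probabilities that sum to 1
```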

4 Results and Discussion

Our model was built using Python 3, TensorFlow, and Keras with an initial dataset size of 1097 images, later increased to 3411 using data augmentation techniques. The dataset was then divided into three parts, in the ratio 0.70 training, 0.15 validation, and 0.15 testing. These steps were taken to ensure that the model was trained on enough data and tested thoroughly to achieve the best possible results. Overall, the data augmentation used to increase the dataset size helped improve the model's accuracy, and the three-way split helped prevent overfitting and ensured that the model was robust enough to handle new data. With these steps taken, the model is expected to perform well on future datasets with similar characteristics. Our model achieved accuracy comparable to other models while using far fewer parameters and resources: 96.26% on the training set and 97.4% on the test set. The plots (Fig. 3a, b) illustrate the model's performance by showing the training and validation accuracy, as well as the training and validation loss, over the number of epochs. During training, the model's accuracy increases while its loss decreases. The validation accuracy and loss curves show how the model performs on unseen validation data, which is used to prevent overfitting. Ideally, the validation accuracy should increase and the validation loss should decrease as the number of epochs grows; if the model overfits, the validation accuracy may stop increasing and the validation loss may start rising. The plot of training and validation accuracy and loss therefore provides insight into the model's performance and helps identify issues such as overfitting. Accuracy can be assessed with different metrics used in data science. We treated the value 0 as cancerous and 1 as non-cancerous, on which basis we obtained the confusion matrix depicting the predictions, shown in Fig. 4. The classification report is summarized in Table 2.
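A sketch of the 0.70/0.15/0.15 split described above, assuming the augmented images are loaded with TensorFlow's directory loader; the directory layout and seed are assumptions.

```python
import tensorflow as tf

ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/",             # hypothetical folder with cancerous/non-cancerous subdirs
    image_size=(256, 256),
    batch_size=32,
    shuffle=True,
    seed=42,
)

n = ds.cardinality().numpy()        # number of batches
train_ds = ds.take(int(0.70 * n))   # 70% training
rest = ds.skip(int(0.70 * n))
val_ds = rest.take(int(0.15 * n))   # 15% validation
test_ds = rest.skip(int(0.15 * n))  # remaining 15% test
```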


Fig. 3 a Training accuracy versus validation accuracy, b training loss versus validation loss

Fig. 4 Confusion matrix of the model

Table 2 Summary on lung cancer classification

| Class | Precision (%) | Recall (%) | F1-score (%) | Support |
|---|---|---|---|---|
| Cancer-0 | 97 | 95 | 95 | 278 |
| Non-cancer-1 | 95 | 96 | 95 | 266 |

We have obtained a precision of 0.97 for cancerous and 0.95 for non-cancerous, recall of 0.95 for cancerous and 0.96 for non-cancerous, and an F1-score of 0.95 for both.


5 Conclusion

The CNN model proposed here has demonstrated a high level of accuracy in classifying cancerous and non-cancerous cases: 96.26% on the training set and 97.4% on the testing set. The model performs comparably to other models that require far more resources and have a higher number of parameters. The precision, recall, and F1-score for both cancerous and non-cancerous cases were also high. The model's effectiveness in detecting lung cancer early on can improve patients' prognosis and treatment options. Early-stage lung cancer can be treated with surgery, radiation therapy, chemotherapy, or a combination of these treatments; late-stage lung cancer, on the other hand, has few therapy options and can have a poor prognosis.

References

1. Swain T, Mishra S (2022) Evolution of machine learning algorithms for enhancement of self-driving vehicles security. In: 2022 international conference on advancements in smart, secure and intelligent computing (ASSIC). IEEE, pp 1–5
2. Shimazaki A, Ueda D, Choppin A, Yamamoto A, Honjo T, Shimahara Y, Miki Y. Deep learning-based algorithm for lung cancer detection on chest radiographs using the segmentation method. Sci Rep 12(1):727. https://doi.org/10.1038/s41598-021-04667-w. PMID: 35031654; PMCID: PMC8760245
3. Mishra S, Thakkar HK, Mallick PK, Tiwari P, Alamri A (2021) A sustainable IoHT based computationally intelligent healthcare monitoring system for lung cancer risk detection. Sustain Cities Soc 72:103079
4. Asuntha A, Srinivasan A (2020) Deep learning for lung cancer detection and classification. Multimedia Tools Appl 79:7731–7762
5. Bhatia S, Sinha Y, Goel L (2018) Lung cancer detection: a deep learning approach. Soft Comput Probl Solving 699–705
6. Kalaivani N, Manimaran N, Sophia DS, Devi DD (2020) Deep learning based lung cancer detection and classification. IOP Conf Ser: Mater Sci Eng 994(1):012026. https://doi.org/10.1088/1757-899X/
7. Reddy N, Khanaa V (2023) Intelligent deep learning algorithm for lung cancer detection and classification. Bull Electr Eng Inf 12(3):1747–1754. https://doi.org/10.11591/eei.v12i3.4579
8. Nasser IM, Abu-Naser SS (2019) Lung cancer detection using artificial neural network. Int J Eng Inf Syst (IJEAIS) 3(3):17–23
9. Rahane W, Dalvi H, Magar Y, Kalane A, Jondhale S (2018) Lung cancer detection using image processing and machine learning healthcare. In: 2018 international conference on current trends towards converging technologies (ICCTCT). IEEE, pp 1–5
10. Elnakib A, Amer HM, Abou-Chadi FE. Early lung cancer detection using deep learning optimization
11. Makaju S, Prasad P, Alsadoon A, Singh A, Elchouemi A (2018) Lung cancer detection using CT scan images. Procedia Comput Sci 125:107–114
12. Al-Tarawneh MS (2012) Lung cancer detection using image processing techniques. Leonardo Electron J Pract Technol 11(21):147–158
13. Radhika PR, Nair RA, G V (2019) A comparative study of lung cancer detection using machine learning algorithms. In: 2019 IEEE international conference on electrical, computer and communication technologies (ICECCT), pp 1–4. https://doi.org/10.1109/ICECCT.2019.8869001


14. Pradhan K, Chawla P (2020) Medical internet of things using machine learning algorithms for lung cancer detection. J Manage Anal 7(4):591–623. https://doi.org/10.1080/23270012.2020
15. Verma S, Mishra S (2022) An exploration analysis of social media security. In: Predictive data security using AI: insights and issues of blockchain, IoT, and DevOps. Springer Nature Singapore, Singapore, pp 25–44
16. Singh P, Mishra S (2022) A comprehensive study of security aspects in blockchain. In: Predictive data security using AI: insights and issues of blockchain, IoT, and DevOps. Springer Nature Singapore, Singapore, pp 1–24
17. Sahoo S, Mishra S (2022) A comparative analysis of PGGAN with other data augmentation technique for brain tumor classification. In: 2022 international conference on advancements in smart, secure and intelligent computing (ASSIC). IEEE, pp 1–7
18. Mohapatra SK, Mishra S, Tripathy HK (2022) Energy consumption prediction in electrical appliances of commercial buildings using LSTM-GRU model. In: 2022 international conference on advancements in smart, secure and intelligent computing (ASSIC). IEEE, pp 1–5
19. Tripathy HK, Mishra S (2022) A succinct analytical study of the usability of encryption methods in healthcare data security. In: Next generation healthcare informatics. Springer Nature Singapore, Singapore, pp 105–120

Concrete Crack Detection Using Thermograms and Neural Network Mabrouka Abuhmida, Daniel Milne, Jiping Bai, and Ian Wilson

Abstract In the field of building integrity testing, the structural integrity of concrete structures can be adversely affected by various impact actions, such as conflict and warfare. These actions can result in subsurface defects that compromise the safety of the buildings, even if the impacts are indirect. However, detecting and assessing these hidden defects typically require significant time and expert knowledge. Currently, there is a lack of techniques that allow for rapid evaluation of usability and safety without the need for expert intervention. This study proposes a non-contact method for testing the integrity of structures, utilising the unique characteristics of thermography and deep learning. By leveraging these technologies, hidden defects in concrete structures can be detected. The deep learning model used in this study is based on the pretrained ResNet50 model, which was fine-tuned using simulated data. It achieved an impressive overall accuracy of 99.93% in classifying defected concrete blocks. The training process involved two types of thermograms. The first type consisted of simulated concrete blocks that were heated and subjected to pressure. The second type involved real concrete blocks from the laboratory, which were subjected to pressure using a pressure machine. Keywords Convolution deep learning · Thermography · Concrete structures · Feature extraction

M. Abuhmida (B) · D. Milne · J. Bai · I. Wilson University of South Wales, Cardiff, UK e-mail: [email protected] D. Milne e-mail: [email protected] J. Bai e-mail: [email protected] I. Wilson e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_19


1 Introduction

This paper aims to demonstrate the abilities of an autonomous system for detecting subsurface-level cracks in thermograms of concrete structures. Such structures, which can be a safety concern, may lie in areas overlooked by human experts. By establishing a proof of concept, this research aims to highlight the potential effectiveness of an automated approach, thereby giving even non-experts an indication of a building's structural safety. The paper is divided into four sections. First, an introduction provides an overview of the topic and highlights related work and literature. Second, the methods are discussed, including the dataset creation and a description of the AI systems. Third, the findings are presented and analysed in detail. The last section is the conclusion, summarising the main findings and key takeaways, including potential areas for further investigation.

The built environment comprises diverse structures, including commercial and residential buildings, schools, hospitals, and civic institutions. These structures rely on essential infrastructure such as water, sanitation, power, communications, and transport systems, which are vital for the local population. When impact actions affect these environments, there is an increased risk of damage to these structures, potentially harming the civilian population [1]. Ensuring structural integrity is a crucial aspect of engineering, aiming to ensure that structures and their components are suitable for their intended purposes and can withstand normal operating conditions; they should also remain safe even if conditions exceed the original design specifications. This involves supporting the structure's weight and preventing deformation, breaking, and catastrophic failure throughout its expected lifespan. Like any built environment, concrete structures require testing and monitoring to assess their structural integrity [1, 2].

Non-destructive testing (NDT) techniques are used to conduct non-intrusive structural testing [1]. These are non-invasive technologies, such as ground penetrating radar (GPR), thermography, microwaves, and infrared, that allow assessment without compromising the integrity of the structure. Partially destructive testing techniques, on the other hand, are commonly used when minor damage is permissible; these methods include pull-out and pull-off tests, penetration resistance, and break-off testing. Certain destructive testing techniques remain necessary, in which samples are extracted from the structure's material for off-site laboratory analysis [2]. When rapid testing is required, destructive methods become impractical and costly [3–5]. Furthermore, both partial and destructive techniques often involve repairs, which increases the complexity of performing such tests. Considering these limitations, this study focuses on advancing non-destructive, contactless testing methods. Surface-level defects observed in concrete structures may warrant structural safety tests. However, subsurface-level defects that pose potential safety risks are not always easily detectable, even by experts. Consequently, concrete testing may be deemed


unnecessary [6]. This issue is particularly concerning as the assessment of concrete structures is often neglected beyond the areas directly affected [7]. Nevertheless, these areas may harbour subsurface-level defects that compromise structural integrity. In such cases, experts may hesitate to perform safety tests because of the cost and the perceived low probability of structural damage [8]. In addition to the practical challenges mentioned earlier, NDT techniques for concrete typically require on-site intrusive interventions from experts to carry out the tests and interpret the data. This study addresses these issues by proposing thermography as an alternative approach. Defective regions in thermal imaging exhibit features distinguishable from the surrounding concrete, enabling differentiation and identification of these areas [8–10]. A deep learning model trained to classify defected and non-defected concrete blocks is employed for this purpose. AI has proven effective in enhancing the identification process in numerous studies [11–14]. Combining the AI system with thermal imaging allows for the efficient evaluation of large sections of a structure, making this technique significantly more time effective than existing methods.

Thermography is a technique that measures temperature by detecting infrared (IR) light within the corresponding wavelength range of the electromagnetic spectrum. It uses specialised cameras or sensors to capture the infrared radiation emitted by objects and converts it into a visual representation of temperature. This non-contact method allows for temperature measurement and visualisation in various applications, including building diagnostics, industrial inspections, medical imaging, and surveillance systems [15]. Thermograms take several formats, such as greyscale or overlay; just as each pixel in an RGB image represents a colour intensity, each pixel in a thermogram represents a temperature [16]. Thermography is a contactless and safe means of collecting useful data about an object [16, 17]. The capture settings used in this work are summarised in Table 1.

Table 1 Thermal imaging capture parameters

| Description | Value |
|---|---|
| Full emissivity (1) | 100% |
| Zero emissivity (0) | 0% |
| Emissivity for concrete [18] | 0.95 |
| Distance from camera to target | 1 m |
| Room temperature | 20 °C |


1.1 Related Work

Artificial intelligence techniques have become increasingly popular in the field of image processing and object recognition. Deep learning, a subset of artificial intelligence, has made significant advances in object detection and recognition. Unlike traditional machine learning, deep learning uses different types of neural network layers, such as convolution layers, to enhance feature extraction. Deep learning models, such as deep convolutional neural networks (DCNNs), have complex structures and multiple hidden layers [16]. This allows them to abstract features from data more effectively, capturing its intricacies; the diverse convolutional layers in DCNNs also enable more robust feature extraction [15]. Deep learning models operate on raw data, allowing for end-to-end processing with minimal human intervention, which expands the potential for recognition and detection in complex scenarios. Researchers have made significant progress in pavement assessment in complex scenarios by integrating multi-sensor data and deep learning techniques. Zhou and Song [19] utilised DCNNs in combination with laser-scanned range images, incorporating depth mapping information to accurately identify cracks while mitigating the impact of oil stains and shadows on pavement analysis. Researchers have also employed transfer learning to adapt image recognition networks to pavement crack detection. Gopalakrishnan et al. [20] used the VGG16 network, pretrained on a large dataset of images, fine-tuned it for pavement distress detection, and achieved high accuracy. Guan et al. [21] used stereo vision and deep learning to implement automatic pavement detection, leveraging 3D imaging to detect cracks and potholes effectively; the depth information provided by the 3D images enabled volume measurement of potholes. Cha et al. [22] modified the Fast R-CNN architecture to accurately classify five types of concrete cracks. They achieved high accuracy for each type of crack, and their method can handle multiple cracks in the same image. Zhang et al. [23] employed transfer learning from a model such as AlexNet to classify background regions and sealed cracks. These transfer learning-based deep learning models explore new application scenarios while outperforming traditional image processing methods. However, all of these approaches are crack recognition methods; they do not address complex factors like oil markings, joints, etc. In their research, Yehia et al. [24] conducted a comparison of various non-destructive evaluation (NDE) methods to identify defects in concrete bridge decks. To carry out their experiments, they utilised a laboratory-created bridge slab as a representative sample, employing infrared thermography (IRT) and ground penetrating radar (GPR) techniques on the laboratory bridge deck slabs. Zhou and Song [19] employed a DCNN with laser-scanned range image mapping information to evaluate the effect of oil stains by classifying cracks and shadows on pavement.


Thermal imaging has also been utilised for pavement crack detection, as the distribution pattern of surface temperature correlates directly with crack profiles, serving as an indicator of crack depth [25]. Seo et al. [26] conducted experimental studies using infrared thermograms and confirmed the effectiveness of infrared thermal imagers in crack detection, particularly for different crack widths. Thermal imagers offer advantages such as real-time efficiency, cost-effectiveness, and direct compatibility with deep learning networks, making them valuable tools for practical pavement inspection.

2 Experiment Design

The data pre-processing stage involves several steps to prepare the captured videos for analysis. First, the videos are sliced into individual images, with a frame extracted at a regular interval; this interval can be adjusted to control the resulting dataset size. Next, the sliced images undergo data augmentation to increase their diversity and quantity: various transformations, such as rotation, flipping, and cropping, are applied. This augmentation helps prevent the model from becoming overly specialised to the training data, reducing overfitting. Following augmentation, the dataset is divided into three sets: training, validation, and testing. The training set allows the model to learn the classes; the validation set evaluates the model's performance and is used to fine-tune its parameters; and the testing set assesses the model's performance on unseen data, providing an unbiased measure of its effectiveness. The model's performance is assessed on the validation set, allowing for further optimisation if necessary, and the model is finally tested to estimate its ability to generalise to new, unseen data (Fig. 1). A sketch of the slicing and augmentation steps is given below.
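The sketch below illustrates the slicing and augmentation steps under stated assumptions (OpenCV, an arbitrary frame interval, and a small transform set); it is not the authors' exact pipeline.

```python
import cv2

def slice_video(path, every_n=10):
    """Extract every n-th frame from a recorded thermogram video."""
    frames, i = [], 0
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

def augment(img):
    """Rotation and flip transforms used to diversify the training set."""
    return [
        img,
        cv2.flip(img, 1),                          # horizontal flip
        cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE),  # 90 degree rotation
        cv2.rotate(img, cv2.ROTATE_180),           # 180 degree rotation
    ]
```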

Fig. 1 Experiment phases


2.1 Simulation Dataset Creation

A simulation is first generated using ABAQUS. This simulation aims to produce data that can be used to assess the ability of a deep learning model to predict the correct class of concrete defect; the main idea is to classify concrete structures as safe or unsafe. Previous research such as [24, 27, 28] has successfully demonstrated the use of thermography for detecting subsurface defects; however, no accessible dataset was available to replicate those studies. Consequently, our simulation generated a dataset of 12,020 RGB and corresponding thermograph samples with two main classes, representing defected and non-defected concrete. Figure 2 presents an example of simulated concrete block images and thermographs. To initiate the thermal analysis, every specimen is initially set to 0 °C. Subsequently, to simulate the effect of sunrise, a flux of 60 °C is applied at the rear of the panel. The simulation process generates a balanced dataset, with 501 frames for both classes, exported in video format from ABAQUS. Each step in the simulation represents one second, ensuring adequate iterations for the specimen's thermal properties to change and eventually stabilise. The exported frames present the grayscale representation of the front side of the ABAQUS model, using a fixed temperature scale; nodes located in different regions of the model possess distinct temperature values. Figure 3 displays examples of the images used to train the model, depicting the aforementioned characteristics.

Fig. 2 Visualisation of simulated RGB and thermography of concrete blocks


Fig. 3 Examples of the simulation dataset

In the case of void-free specimens, the parameter variation is comparatively smaller. Therefore, the simulation is designed to encompass 1000 steps, resulting in the export of 1001 frames showing the front of the specimen in video format from ABAQUS. This approach ensures a balanced dataset. For each simulation, the video is cropped to include the specimen, and every frame is exported accordingly.

2.2 Camera and Concrete Blocks Specifications

The FLIR E8 thermal camera was used in this study. It can capture images in either grayscale or colour, has a resolution of 640 × 480 pixels and a thermal sensitivity of < 0.05 °C, and is equipped with features such as a laser pointer and a built-in Wi-Fi module. The camera has a field of view of 25° × 19°, allowing a wide scene area to be captured, and a temperature measurement range of −20 °C to 650 °C, enabling the detection of both low- and high-temperature variations. It operates at a frequency of 9 Hz, providing enough images per second for accurate temperature analysis, with a maximum measurement deviation of ±2 °C or ±2% of the measured value. Captured thermal images are saved in BMP format and recorded videos in MP4 format; the saved images have a resolution of 640 × 512 pixels, ensuring clear and detailed thermal visualisations.

The concrete blocks used in the experiment are standard 100 × 100 × 100 mm blocks, commonly used for strength testing. The blocks were water cured for 28 days to ensure they were fully cured before the experiment, and were made using pulverised fuel ash (PFA) at varying levels of 10, 20, and 30%. PFA is a by-product of the coal-fired power industry; it is a pozzolanic material that can form cementitious compounds and is often used as a substitute for cement in concrete mixtures.


2.3 Compression-Exposed Concrete Data Collection

To collect data, a concrete specimen was first loaded into the compression machine. The thermal imaging camera was positioned on a tripod at a distance of 1 m from the specimen, with its emissivity set to 0.95 to ensure accurate temperature measurements. Recording took place while pressure was applied to the specimen using the compression machine and continued until the specimen fractured. The camera captured the dynamic thermal changes occurring throughout the experiment, allowing defects in the specimen, such as cracks and voids, to be identified. Gradual pressure was applied using the compression machine, allowing for controlled stress on the material; the experiment proceeded until visible cracks became evident on the specimen's surface, at which point the recording was promptly stopped. This entire process was repeated a total of eight times, using a different concrete specimen for each repetition. As a result, a substantial dataset was generated, consisting of thermal recordings depicting the evolution of visible cracks in the concrete specimens, with an average video duration of approximately two minutes. Irrelevant data was discarded, and the recordings were visually examined and separated into two categories, non-defected and defected, creating two distinct classes within the dataset. Frames were then extracted from the videos to form an image dataset. The images were converted to grayscale to facilitate training, enhance generalisation, and avoid local optima, and each image was normalised to the range zero to one. Blurred and unclean data were removed from the dataset to reduce inconsistencies. Figure 4 illustrates examples of the images obtained: an image of the specimen without defects before pressure was applied (Fig. 4a), and images showing cracks forming during the application of pressure (Fig. 4b–d). The minimum temperature recorded in the image is around 19 °C, while the maximum reaches approximately 63.5 °C. A total of 255 distinct grey thresholds were used in the processed image, each corresponding to a specific temperature value.

Fig. 4 Blocks thermal defects: (a) no defect, (b) defected, (c) visible defect, (d) clear defect


Fig. 5 Data augmentation

The acquired videos were influenced by various factors related to the image acquisition system, including non-uniform illumination and noise. Among these factors, one of the most significant issues was the range of the colour bar used for thermal imaging. To address this, a set of image augmentations was performed on the sliced frames, as sketched below; Fig. 5 highlights their impact.
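The paper does not list the specific augmentations used; the following is a small illustrative sketch of the kind of brightness, contrast, and flip perturbations that can compensate for colour-bar and illumination variation across frames.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img):
    """Apply simple augmentations to a normalised grayscale frame:
    random horizontal flip, brightness shift, and contrast scaling."""
    if rng.random() < 0.5:
        img = np.fliplr(img)
    img = img + rng.uniform(-0.1, 0.1)                # brightness shift
    img = (img - 0.5) * rng.uniform(0.9, 1.1) + 0.5   # contrast scaling
    return np.clip(img, 0.0, 1.0)
```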

2.4 Simulation Dataset Model

The residual network (ResNet50) is a pretrained deep model consisting of 50 layers of a specific type of convolutional neural network (CNN). The model was developed using ResNet50 and trained on the simulation dataset to learn how to classify specimens with and without voids. ResNet50 is recognised for its deep architecture [27], which benefits from residual connections to retain knowledge during training and improve network capacity, resulting in faster training [29]. Compared to other CNNs, ResNet50 has consistently demonstrated superior performance [28]. The model consists of 48 CNN layers, one max-pool layer, and one average-pool layer [24, 30]. ResNet50 is widely used for image classification tasks and performs better than other pretrained models such as VGG16 [31, 32]. The input size for the model is (256 × 256 × 1), indicating grayscale input images. The model is trained for twenty epochs, with a training time of 37.35 min. The data is split into a train-test split, with 25% reserved for testing. A batch size of eight and a learning rate of 0.001 are used. The model is compiled with the Adam optimiser, sparse categorical cross-entropy loss, and the 'accuracy' metric, which evaluates prediction correctness against the labels. The model setup hyperparameters are summarised in Table 2.


Table 2 Model hyperparameter summary

Layer (type)           | Output shape               | Param #
InputLayer             | [(Height, width, 3)]       | 0
Conv2D                 | [(Height/2, width/2, 64)]  | 9472
BatchNormalization     | [(Height/2, width/2, 64)]  | 256
block1_Conv2D          | [(Height/4, width/4, 64)]  | 4160
block1_conv2_BatchNorm | [(Height/4, width/4, 64)]  | 256
block1_2_Conv2D        | [(Height/4, width/4, 64)]  | 36,928
block1_2_BatchNorm     | [(Height/4, width/4, 64)]  | 256
block1_0_Conv2D        | [(Height/4, width/4, 256)] | 16,640
block1_3_Conv2D        | [(Height/4, width/4, 256)] | 16,640
block1_0_BatchNorm     | [(Height/4, width/4, 256)] | 1024
block1_3_BatchNorm     | [(Height/4, width/4, 256)] | 1024
block2_1_Conv2D        | [(Height/4, width/4, 64)]  | 16,448
block2_1_BatchNorm     | [(Height/4, width/4, 64)]  | 256
block2_2_Conv2D        | [(Height/4, width/4, 64)]  | 36,928
Optimizer              | optimizers.Adam(0.001)     | 0.001
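The following is a minimal Keras sketch of the training setup described above (Adam with a learning rate of 0.001, sparse categorical cross-entropy, twenty epochs, batch size eight, and a 25% test split). The arrays `images` and `labels` are assumed to hold the preprocessed frames and their 0/1 class labels; note that loading ImageNet weights would require three-channel input, so this sketch trains the backbone from scratch on single-channel frames.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

# `images` (N, 256, 256, 1) and `labels` (N,) are assumed to be prepared.
x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.25, random_state=0)

# ResNet50 backbone with global average pooling, trained from scratch here.
base = tf.keras.applications.ResNet50(
    include_top=False, weights=None, input_shape=(256, 256, 1), pooling="avg")
model = tf.keras.Sequential([base, tf.keras.layers.Dense(2, activation="softmax")])

model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=20, batch_size=8,
          validation_data=(x_test, y_test))
```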

2.5 Laboratory Dataset Model

The deep learning model initially underwent training on a dataset of simulated thermal images. This training approach involved generating computer-simulated images, enabling the model to grasp the fundamental characteristics of thermal images of concrete specimens. The laboratory dataset, a collection of 11,140 thermographs acquired during the experiment, was then used to adapt the model to the laboratory setting. These images were captured under real-world conditions, allowing the model to learn the specific features of thermal images of concrete specimens. The dataset is balanced, with 5500 thermographs for each of the defected and non-defected classes; this equal representation facilitates the model's ability to distinguish between them. The ResNet50 model and the corresponding hyperparameters described above were employed for the training process. The entire retraining procedure was completed in approximately 37.75 min.


Fig. 6 Training and validation performance—simulation

3 Results and Analysis

3.1 Simulation Dataset Model Results

The model trained on the simulated data successfully classified simulated specimens with and without voids, even when the data included unseen parameters. It achieved high accuracy on the unseen dataset: 0.9992 for void-free simulations and 0.996 for simulations with voids. The model's failure to attain 100% accuracy on the unseen dataset with voids can be attributed to the time required for heat to propagate through the specimen before subsurface cracks become visible. Overall, the model performed well, correctly classifying most images, as shown in Fig. 6. It is recommended to incorporate additional data from ABAQUS simulations, covering a broader range of parameters, to improve the model's ability to generalise; for example, the concrete material can be varied to include different concrete mixes. Given the model's already high accuracy, hyperparameter optimisation was not conducted: the potential gains from further fine-tuning were deemed insignificant, especially as this project serves as a proof of concept.

3.2 Laboratory Dataset Model Results

The model achieved a training accuracy of 100%, a validation accuracy of 0.99, a training loss of 7.0 × 10−6, a validation loss of 9.4 × 10−7, and an F1-score of 1.0, demonstrating high confidence when tested on the testing data, even without simulating the effect of the sun as in [33]. Figure 7 shows the results of the laboratory dataset model.


Fig. 7 Training and validation performance—laboratory

Fig. 8 Model image visualisation during training

Recording an RGB video alongside the thermal images is recommended to enhance visibility and facilitate visual analysis. This would enable a direct side-by-side comparison between the two, highlighting the impact and demonstrating that the cracks detected in the thermographs are genuine. Figure 8 shows a visualisation of a random image from the test dataset during the testing phase of the model. The figure displays both the predicted and actual labels for the image. In this instance, the predicted label is 1, indicating that the image is classified as 'unsafe'. This label signifies the presence of a crack that is not visible to the naked eye but is detectable from its thermal properties alone.

3.3 The Challenges of Using Thermal Images

Using thermal images for crack detection in concrete structures offers advantages but also presents challenges and limitations. Thermal imaging is sensitive to environmental factors, such as temperature and airflow, which can introduce noise and affect accuracy. Thermal cameras may also have lower resolution and struggle to detect hidden or subsurface cracks lacking significant thermal variations. Variability in crack patterns and the need for validation against ground-truth data pose further challenges, and false positives and false negatives need to be carefully considered. Cost and equipment requirements, including calibration and training, can also be limiting factors. Despite these limitations, thermal imaging can provide valuable insights for crack detection and contribute to maintenance efforts when combined with other techniques.

4 Conclusion

This study comprised two separate experimental tests. The first experiment focused on simulating the surface temperature of a concrete structure, examining the thermal changes and variations when a hidden crack is present or absent. The second experiment involved a concrete specimen in a laboratory setting, where the thermal camera captured the thermal changes as pressure was applied until cracks became visible. The data collected from both experiments was used to train two independent deep learning models, enabling them to autonomously detect hidden defects. The technique employed in these experiments proves to be highly effective for detecting minor subsurface cracks by analysing thermograms of concrete block surfaces. It is recommended to collect additional data from larger concrete blocks to enhance the investigation and improve the model's predictions; this would provide a more realistic representation of structural walls and enhance the model's ability to make predictions in diverse scenarios. Other specifications, including the distance between the concrete surface and the camera, can also be varied to enhance the laboratory experiments and contribute to a better understanding of the system's capabilities and performance.

References

1. Onyeka FC (2020) A comparative analysis of the rebound hammer and pullout as nondestructive method in testing concrete. Eur J Eng Technol Res 5(5):554–558
2. Khan AA (2002) Guidebook on nondestructive testing of concrete structures. Int Atomic Energy Agency
3. Wankhade RL, Landage AB (2013) Nondestructive testing of concrete structures in Karad region. Procedia Eng 51:8–18
4. Rende NS (2014) Nondestructive evaluation and assessment of concrete barriers for defects and corrosion, pp 290–297
5. Thiagarajan G, Kadambi AV, Robert S, Johnson CF (2015) Experimental and finite element analysis of doubly reinforced concrete slabs subjected to blast loads. Int J Impact Eng 75:162–173
6. Scuro C, Lamonaca F, Porzio S, Milani G, Olivito RS (2021) Internet of Things (IoT) for masonry structural health monitoring (SHM): overview and examples of innovative systems. Constr Build Mater 290:123092


7. Jain A, Kathuria A, Kumar A, Verma Y, Murari K (2013) Combined use of nondestructive tests for assessment of strength of concrete in structure. Procedia Eng 54:241–251
8. Wang Z (2022) Integral fire protection analysis of complex spatial steel structure based on optimised Gaussian transformation model. Comput Intell Neurosci
9. Cheng C, Shen Z (2018) Time-series based thermography on concrete block void detection. In: Construction research congress 2018, pp 732–742
10. Farrag S, Yehia S, Qaddoumi N (2016) Investigation of mix-variation effect on defect-detection ability using infrared thermography as a nondestructive evaluation technique. J Bridg Eng 21(3):04015055
11. Liu JC, Zhang Z (2020) A machine learning approach to predict explosive spalling of heated concrete. Arch Civ Mech Eng 20:1–25
12. Gupta S (2013) Using artificial neural network to predict the compressive strength of concrete containing nano-silica. Civ Eng Archit 1(3):96–102. https://doi.org/10.13189/cea.2013.010306
13. Gupta S (2015) Use of triangular membership function for prediction of compressive strength of concrete containing nanosilica. Cogent Eng 2(1):1025578
14. Hosseinzadeh M, Dehestani M, Hosseinzadeh A (2023) Prediction of mechanical properties of recycled aggregate fly ash concrete employing machine learning algorithms. J Build Eng 107006
15. Ignatov I, Mosin O, Stoyanov C (2014) Fields in electromagnetic spectrum emitted from human body. Applications in medicine. J Health Med Nurs 7:1–22
16. Miao P, Srimahachota T (2021) Cost-effective system for detection and quantification of concrete surface cracks by combination of convolutional neural network and image processing techniques. Constr Build Mater 293:123549
17. Choi H, Soeriawidjaja BF, Lee SH, Kwak M (2022) A convenient platform for real-time non-contact thermal measurement and processing. Bull Korean Chem Soc 43(6):854–858
18. Park BK, Yi N, Park J, Kim D (2012) Note: development of a microfabricated sensor to measure thermal conductivity of picoliter scale liquid samples. Rev Sci Instrum 83(10)
19. Zhou S, Song W (2020) Deep learning-based roadway crack classification using laser-scanned range images: a comparative study on hyperparameter selection. Autom Constr 114:103171
20. Gopalakrishnan K, Khaitan SK, Choudhary A, Agrawal A (2017) Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Constr Build Mater 157:322–330
21. Ramzan B, Malik MS, Martarelli M, Ali HT, Yusuf M, Ahmad SM (2021) Pixel frequency based railroad surface flaw detection using active infrared thermography for structural health monitoring. Case Stud Therm Eng 27:101234
22. Cha KH, Sahiner B, Pezeshk A, Hadjiiski LM, Wang X, Drukker K, Summers RM, Giger ML (2019) Deep learning in medical imaging and radiation therapy. Med Phys 46(1):e1–e36
23. Kaige Z, Cheng HD, Zhang B (2018) Unified approach to pavement crack and sealed crack detection using preclassification based on transfer learning. J Comput Civ Eng 32:04018001
24. Jang K, Kim N, An YK (2019) Deep learning–based autonomous concrete crack evaluation through hybrid image scanning. Struct Health Monit 18(5–6):1722–1737
25. Li Z, Yoon J, Zhang R, Rajabipour F, Srubar III WV, Dabo I, Radlińska A (2022) Machine learning in concrete science: applications, challenges, and best practices. NPJ Comput Mater 8(1):127
26. Seo H (2021) Infrared thermography for detecting cracks in pillar models with different reinforcing systems. Tunn Undergr Space Technol 116:104118
27. Qin Z, Zhang Z, Li Q, Qi X, Wang Q, Wang S (2018) DeepCrack: learning hierarchical convolutional features for crack detection. IEEE Trans Image Process 28:1498–1512
28. Rajadurai RS, Kang ST (2021) Automated vision-based crack detection on concrete surfaces using deep learning. Appl Sci 11(11):5229
29. Mascarenhas S, Agarwal M (2021) A comparison between VGG16, VGG19 and ResNet50 architecture frameworks for image classification. In: 2021 international conference on disruptive technologies for multi-disciplinary research and applications (CENTCON), vol 1, pp 96–99


30. Fan Z, Li C, Chen Y, Wei J, Loprencipe G, Chen X, Di Mascio P (2020) Automatic crack detection on road pavements using encoder-decoder architecture. Materials 13:2960
31. Islam MM, Hossain MB, Akhtar MN, Moni MA, Hasan KF (2022) CNN based on transfer learning models using data augmentation and transformation for detection of concrete crack. Algorithms 15(8):287
32. Guo M-H et al (2022) Attention mechanisms in computer vision: a survey. Comput Vis Media 8(3):331–368
33. Aggelis DG, Kordatos EZ, Strantza M, Soulioti DV, Matikas TE (2011) NDT approach for characterisation of subsurface cracks in concrete. Constr Build Mater 25(7):3089–3097. https://doi.org/10.1016/j.conbuildmat.2010.12.045
34. Wiggenhauser H (2002) Active IR-applications in civil engineering. Infrared Phys Technol 43(3–5):233–238
35. Abuhmida M, Milne D, Bai J, Sahal M (2022) ABAQUS-concrete hidden defects thermal simulation. Mendeley Data. https://doi.org/10.17632/65nbxg9pr3.1
36. Hu D, Chen J, Li S (2022) Reconstructing unseen spaces in collapsed structures for search and rescue via deep learning based radargram inversion. Autom Constr 140:104380

Wind Power Prediction in Mediterranean Coastal Cities Using Multi-layer Perceptron Neural Network Youssef Kassem , Hüseyin Çamur , and Abdalla Hamada Abdelnaby Abdelnaby

Abstract Wind energy refers to a form of energy conversion in which wind turbines convert the kinetic energy of the wind into electrical energy that can be used as a source of clean energy. Estimating wind power is therefore important for wind farm planning and design. This study aims to predict the wind power density (WPD) in Mediterranean coastal cities using a multi-layer perceptron neural network (MLPNN) model. To this end, two scenarios were proposed. In scenario 1, the developed model utilized global meteorological data (GMD) as input variables, including precipitation (PP), maximum temperature (Tmax), minimum temperature (Tmin), actual evapotranspiration (AE), wind speed at 10 m height (WS), and solar radiation (SR). In scenario 2, the input variables were geographical coordinates (GC) and GMD, with the aim of estimating the influence of GC on the accuracy of the WPD prediction. The results indicated that scenario 2 decreased the RMSE and MAE by 46%. Keywords MLPNN · Mediterranean coastal cities · WPD · Meteorological parameters · Geographical coordinates

Y. Kassem (B) · H. Çamur · A. H. A. Abdelnaby Faculty of Engineering, Mechanical Engineering Department, Near East University, 99138 Nicosia, North Cyprus, Cyprus e-mail: [email protected] H. Çamur e-mail: [email protected] A. H. A. Abdelnaby e-mail: [email protected] Y. Kassem Faculty of Civil and Environmental Engineering, Near East University, 99138 Nicosia, North Cyprus, Cyprus Near East University, Energy, Environment, and Water Research Center, 99138 Nicosia, North Cyprus, Cyprus © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_20


1 Introduction

Due to population growth, urbanization, and economic development, energy demand has been increasing significantly in developing countries. The need for energy, particularly electricity, is rising rapidly, yet many developing countries face significant challenges in meeting their energy demands. The International Energy Agency (IEA) reports that more than 1.2 billion individuals across the globe do not have access to electricity [1]. Moreover, many developing countries rely heavily on imported fossil fuels for their energy needs, which can make their energy systems vulnerable to price fluctuations and supply disruptions. Renewable energy (RE) sources therefore have significant potential to provide clean and affordable energy to developing countries. Investing in RE can help these countries reduce their dependence on imported fuels and increase their energy security [2–5]. Accordingly, RE has received growing attention as a means to reduce emissions and decrease reliance on fossil fuels for energy production. Recently, the use of RE has been increasing rapidly in many countries around the world, such as Germany, China, the USA, Denmark, Brazil, and Costa Rica [6–8]. Wind energy has several advantages over traditional energy sources. As it is powered by natural wind, wind energy is a clean and environmentally friendly source of energy [9]. It is also virtually unlimited and abundantly available worldwide, making it a promising domestic energy source in many countries. Furthermore, with ongoing advancements in wind energy technology, it has become one of the most cost-effective renewable energy sources available [10]. Generally, wind speed is a critical parameter for evaluating wind potential at a specific location because it directly influences the amount of energy that can be harnessed from the wind [9, 10]; the amount of wind power generated is proportional to the cube of the wind speed [11]. Wind speed is affected by several factors, including topography, surface roughness, and local weather patterns [11]. Accordingly, wind power estimation (WPE) plays a crucial role in the context of renewable energy, particularly in the planning and design of wind farms [12]. Wind power, which is harnessed from the kinetic energy of wind, is one of the most abundant and widely available sources of renewable energy [13]. Accurate WPE is essential for maximizing the efficiency and economic viability of wind energy projects. Besides, wind power estimation assists in selecting the appropriate wind turbine models and optimizing their placement within a wind farm [14]. By considering the estimated wind power, developers can determine the most suitable turbine size and layout configuration to maximize energy capture and overall project performance. Accurate wind power estimation also enables developers to estimate the potential energy production of a wind farm and, in turn, to assess the environmental advantages in terms of greenhouse gas emissions reduction, air pollution mitigation, and conservation of natural resources.


Recently, it has been demonstrated that artificial neural networks (ANNs) are a powerful tool for predicting wind power. Several studies have utilized ANN to predict wind speed and wind power using different meteorological/weather data [15–18]. For instance, Noorollahi et al. [15] predicted the wind speed in Iran using three ANN models; the results showed that the adaptive neuro-fuzzy inference system model stands out as a superior tool for accurately predicting wind speeds. Ghorbani et al. [16] presented a case study modeling monthly wind speed values using meteorological data in Iran. Ghanbarzadeh et al. [17] utilized air temperature, relative humidity, and vapor pressure data as input variables for an ANN model to estimate future wind speed. Kassem et al. [18] evaluated the effectiveness of various models in predicting wind power density (WPD) in Ercan, Northern Cyprus. Consequently, this study aims to predict monthly wind power density in Mediterranean coastal cities (MCCs) using a multi-layer perceptron feedforward neural network (MLPNN) model. To this end, the meteorological input variables used in this study were collected from TerraClimate for the period 2010 to 2021.

2 Material and Method

2.1 Study Area and Dataset

The Eastern Mediterranean region is known for its rich wind energy potential, particularly in the coastal cities located along the Mediterranean Sea. These cities have the advantage of being situated in a region that experiences high wind speeds due to the unique climatic conditions of the area. Additionally, the topography of the region, including the surrounding mountains and valleys, can further enhance the wind flow and increase the wind energy potential. Several studies have highlighted the potential of wind energy in cities such as Alexandria, Beirut, Haifa, and Izmir, among others. These cities have shown promising wind speed patterns, which make them suitable for the installation of wind turbines and the generation of wind energy. The utilization of wind energy in these cities can provide numerous benefits, including reducing dependence on fossil fuels, mitigating greenhouse gas emissions, and promoting sustainable development. Figure 1 shows the details regarding the selected MCCs. In general, GMD has been employed to understand the impact of weather parameters on wind power density (WPD) prediction due to the limited availability of actual weather parameter data. It should be noted that WPD is estimated using Eq. (1) [16]:

$$P = \frac{1}{2}\rho A v^{3} \quad (1)$$

where $P$ is the wind power, $\rho$ is the air density, $A$ is the swept area, and $v$ is the wind speed.
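As a quick worked example of Eq. (1) expressed per unit of swept area, the following sketch computes WPD from wind speed, assuming the standard air density of 1.225 kg/m³ (the paper does not state the density used); with v = 5.49 m/s it reproduces the maximum WPD of roughly 101.35 W/m² reported in Table 1.

```python
def wind_power_density(v, rho=1.225):
    """Wind power density in W/m^2 from Eq. (1) per unit swept area:
    P/A = 0.5 * rho * v^3, with v in m/s and rho in kg/m^3."""
    return 0.5 * rho * v ** 3

print(wind_power_density(5.49))  # ~101.35 W/m^2
```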

Fig. 1 Latitude, longitude, and elevation for all selected locations

In this study, GMD is obtained from the TerraClimate dataset. TerraClimate, developed by a team of researchers at the University of Idaho [19], is a comprehensive and widely used global gridded climate dataset that provides monthly estimates of various climate variables. It offers valuable insights into historical climate conditions across the globe and is particularly useful for climate research, impact assessment, and modeling studies. The TerraClimate dataset incorporates a wide range of data sources, including ground-based meteorological station observations, satellite measurements, and reanalysis products. These sources are integrated using advanced statistical techniques to create a consistent and high-quality dataset. The dataset covers the entire globe with a spatial resolution of 2.5 arcminutes, which translates to approximately 0.04° × 0.04°. One of the key advantages of the TerraClimate dataset is its extensive temporal coverage: it spans from 1958 to the present, providing over six decades of climate data. This long-term coverage enables the study of climate variability, trends, and changes over time, aiding in the understanding of climate dynamics and informing future projections. The TerraClimate dataset includes several essential climate variables, encompassing temperature, precipitation, vapor pressure, solar radiation, and wind speed. Each variable is provided at a monthly resolution, allowing for a detailed examination of seasonal and interannual variations [19–22]. Thus, the maximum temperature (Tmax), minimum temperature (Tmin), downward radiation (DR), wind speed (WS), actual evapotranspiration (AE), and precipitation (PP) data were collected for the period 2010–2021.

2.2 MLPNN Model

MLPNN is a specific type of artificial neural network (ANN) comprising numerous layers of interconnected nodes [23, 24]. This type of neural network follows a feedforward architecture, enabling the flow of information in a unidirectional manner [23, 24]. The key feature of the MLPNN model is its ability to learn complex nonlinear relationships between input and output data. During the training phase, the model adjusts the weights and biases associated with each neuron to minimize the difference between the predicted output and the desired output. This process, known as backpropagation and illustrated in Fig. 2, utilizes an optimization algorithm to iteratively update the model parameters and improve performance. In general, activation functions such as the linear, hyperbolic tangent, and logistic functions are commonly employed:

$$\text{Linear}(x) = x \quad (2)$$

$$\text{Hyperbolic tangent}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \quad (3)$$

$$\text{Logistic}(x) = \frac{1}{1 + e^{-x}} \quad (4)$$

where x is the input of the activation function.

Fig. 2 Flowchart of MLPNN model
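For reference, the three activation functions of Eqs. (2)–(4) can be written directly in NumPy; this is a small illustrative sketch (np.tanh is the built-in equivalent of Eq. (3)).

```python
import numpy as np

def linear(x):
    return x                                                     # Eq. (2)

def hyperbolic_tangent(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # Eq. (3)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))                              # Eq. (4)
```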

2.3 Statistical Indices (SI)

The performance evaluation of the developed models involves the utilization of several statistical metrics. In the current study, four statistical metrics were employed to estimate the performance of the model.

(a) Coefficient of Determination (R²): R-squared assesses regression model fit by measuring the proportion of variance explained by the independent variables. Values range from 0 to 1, with higher values indicating better fit.

(b) Root Mean Squared Error (RMSE): RMSE directly quantifies the deviations between the predicted values and the corresponding observed values. It measures the average magnitude of these deviations, providing a straightforward indication of how closely the model's predictions align with the actual data. A smaller RMSE signifies a better fit.

(c) Mean Absolute Error (MAE): MAE calculates the average absolute difference between the predicted and observed values in a regression model. It provides a measure of the model's accuracy without considering the direction of the errors. Like RMSE, lower MAE values indicate better performance.

(d) Nash–Sutcliffe Efficiency (NSE): NSE is commonly used in hydrological and environmental modeling. It quantifies the relative magnitude of the residual variance compared to the observed variance. NSE ranges from negative infinity to 1, with 1 indicating a perfect fit and values below zero suggesting poor performance. NSE assesses the model's ability to reproduce the mean and variability of the observed data.

The mathematical expressions for these metrics, as used in this study, are presented in Eqs. (5)–(8):

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(a_{a,i} - a_{p,i}\right)^{2}}{\sum_{i=1}^{n}\left(a_{p,i} - a_{a,\text{ave}}\right)^{2}} \quad (5)$$

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a_{a,i} - a_{p,i}\right)^{2}} \quad (6)$$

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|a_{a,i} - a_{p,i}\right| \quad (7)$$

$$\text{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(a_{a,i} - a_{p,i}\right)^{2}}{\sum_{i=1}^{n}\left(a_{a,i} - a_{a,\text{ave}}\right)^{2}} \quad (8)$$

where $a_{a,i}$ and $a_{p,i}$ are the actual and predicted values and $a_{a,\text{ave}}$ is the mean of the actual values.
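The four indices of Eqs. (5)–(8) are straightforward to compute; the following is a minimal NumPy sketch, with R² written exactly as defined in Eq. (5) (whose denominator uses the predicted values and the mean of the actual values).

```python
import numpy as np

def evaluate(actual, predicted):
    """Compute R^2, RMSE, MAE, and NSE as defined in Eqs. (5)-(8)."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    ss_res = np.sum((a - p) ** 2)
    return {
        "R2": 1.0 - ss_res / np.sum((p - a.mean()) ** 2),   # Eq. (5)
        "RMSE": np.sqrt(np.mean((a - p) ** 2)),             # Eq. (6)
        "MAE": np.mean(np.abs(a - p)),                      # Eq. (7)
        "NSE": 1.0 - ss_res / np.sum((a - a.mean()) ** 2),  # Eq. (8)
    }
```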

3 Results and Discussions

Evaluating the wind potential of a specific location is a crucial initial step in the effective planning of wind energy systems. In this paper, the influence of GC on the accuracy of WPD prediction was investigated. To achieve this objective, the proposed models were implemented and evaluated in two scenarios:

Scenario 1 (S#1): WPD = f(Tmax, Tmin, PP, AE, WS, SR)   (9)

Scenario 2 (S#2): WPD = f(GC, Tmax, Tmin, PP, AE, WS, SR)   (10)

Generally, the partitioning of data can influence the model's performance [16]. Moreover, Gholamy et al. [25] concluded that empirical models achieve optimal performance when approximately 70–80% of the data is allocated for training and the remaining 20–30% is set aside for testing. Therefore, the data were divided randomly (75% for training and 25% for testing). Table 1 displays the descriptive statistics for the selected data. In this work, a trial-and-error approach was employed to find the optimum network configuration; Table 2 lists the optimum network parameters, and Figs. 3 and 4 show the architecture models for S#1 and S#2. The scenarios' performances are compared with each other to investigate the effect of geographical coordinates on the accurate prediction of WPD. The values of R², RMSE, MAE, and NSE are tabulated in Table 3. It is found that S#2, with the combination of geographical coordinates and global meteorological data, produced the highest value of R² and the minimum values of RMSE and MAE. The scatter plots of observed and estimated data are shown in Fig. 5. In the literature [26–29], the geographical coordinates significantly impact the accuracy of predicting wind power density. The latitude and longitude of a specific location determine its proximity to prevailing wind patterns, topographical features,

Table 1 Descriptive statistics of the GMD and GC (latitude (Lat.), longitude (Long.), and elevation (El.)) data for all selected locations

Variable | Unit | Mean   | Standard deviation | Minimum | Maximum
Lat      | °    | 34.187 | 1.91               | 31.132  | 36.897
Long     | °    | 34.25  | 1.712              | 29.919  | 36.176
Alt      | m    | 183    | 359.51             | 0       | 1798
Tmax     | °C   | 25.546 | 5.762              | 12.21   | 37.39
Tmin     | °C   | 16.381 | 5.741              | 3.57    | 27.66
PP       | mm   | 50.64  | 68.13              | 0       | 444.6
AE       | mm   | 40.185 | 31.585             | 0       | 142.5
SR       | W/m² | 216.2  | 73.13              | 70.75   | 338.31
WS       | m/s  | 3.0165 | 0.682              | 0.91    | 5.49
WPD      | W/m² | 19.419 | 12.739             | 0.462   | 101.35

Fig. 3 MLPNN structure for S#1

and atmospheric conditions. These factors directly influence wind speed and direction, ultimately affecting the potential energy available for harnessing. Therefore, considering the geographical coordinates is crucial for accurately predicting the wind power density at a given site. Incorporating this information improves the precision of wind energy assessments and facilitates optimal planning and design of wind farms.

Table 2 Optimum parameters for the developed models

Scenario | Parameter             | Value
S#1      | Number of HL          | 1
S#1      | Number of units in HL | 2
S#1      | AF (HL)               | Hyperbolic tangent
S#1      | AF (OL)               | Linear
S#2      | Number of HL          | 1
S#2      | Number of units in HL | 6
S#2      | AF (HL)               | Hyperbolic tangent
S#2      | AF (OL)               | Linear

Fig. 4 MLPNN structure for S#2
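The paper does not state which software was used to train the MLPNN; the following is a minimal scikit-learn sketch of the S#2 configuration in Table 2 (one hidden layer of six units with a hyperbolic tangent activation and a linear output), using the 75/25 random split described above. The arrays X and y are assumed to hold the input variables and the monthly WPD values.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: inputs for S#2 (latitude, longitude, elevation, Tmax, Tmin, PP, AE, WS, SR)
# y: monthly wind power density (W/m^2); both are assumed to be loaded.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(6,),  # one hidden layer, six units (Table 2, S#2)
                 activation="tanh",        # hyperbolic tangent in the hidden layer
                 max_iter=5000, random_state=0))  # output activation is linear
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```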


Table 3 Values of SI for the proposed models

Statistical indicator | S#1   | S#2
R²                    | 0.993 | 0.998
RMSE [W/m²]           | 1.077 | 0.574
MAE [W/m²]            | 0.394 | 0.212
NSE                   | 0.993 | 0.998

Fig. 5 Comparison and correlation between the observed and predicted data using MLPNN

4 Conclusions

Although the results of the present study were derived from a mathematical model utilizing various gridded data, it is important to acknowledge that this study possesses certain limitations that could be explored and addressed in future research. Firstly, utilizing data from satellite measurements and various reanalyses is key to next-generation wind resource assessment and forecasting; the results should therefore be compared with data collected from reanalysis datasets such as ERA5 to confirm the accuracy of the models. Moreover, terrain analysis was not considered in this study, although previous studies [30, 31] have indicated that regions become less suitable for wind turbine installations as elevation and slope increase. Therefore, future research should focus on the site selection of wind energy power plants using GIS multi-criteria evaluation. The prediction of wind power density (WPD) is an important key for the designing and planning of wind farms. Accurate WPD predictions enable engineers and planners to make informed decisions regarding the optimal placement and layout of wind


turbines, considering factors such as wind resource availability, energy production estimates, and overall project feasibility. By accurately predicting WPD, the design and planning process of wind farms can be optimized, leading to efficient utilization of wind energy resources. Based on the value of SI, S#2 has the best predictive performance compared to S#1 for the WPD estimations in MCCs. Moreover, the findings indicate that MLPNN with the combination of geographical and global meteorological data could increase the average performance of the model by 46%. In the end, geographical coordinates are essential in wind farm planning and design in Mediterranean coastal cities. They provide crucial information for assessing wind resources, determining turbine placement, considering environmental factors, planning infrastructure connections, and monitoring the performance of wind farms. Accurate geographical coordinates enable developers to make informed decisions and optimize the design and operation of wind energy projects in these regions.

References

1. Muh E, Amara S, Tabet F (2018) Sustainable energy policies in Cameroon: a holistic overview. Renew Sustain Energy Rev 82:3420–3429
2. Seriño MNV (2022) Energy security through diversification of non-hydro renewable energy sources in developing countries. Energy Environ 33(3):546–561
3. Elum ZA, Momodu AS (2017) Climate change mitigation and renewable energy for sustainable development in Nigeria: a discourse approach. Renew Sustain Energy Rev 76:72–80
4. Urban F (2014) Low carbon transitions for developing countries. Routledge
5. Kaygusuz K (2007) Energy for sustainable development: key issues and challenges. Energy Sources Part B 2(1):73–83
6. Martinot E (2016) Grid integration of renewable energy: flexibility, innovation, and experience. Annu Rev Environ Resour 41:223–251
7. Juarez-Rojas L, Alvarez-Risco A, Campos-Dávalos N, de las Mercedes Anderson-Seminario M, Del-Aguila-Arcentales S (2023) Effectiveness of renewable energy policies in promoting green entrepreneurship: a global benchmark comparison. In: Footprint and entrepreneurship: global green initiatives. Springer Nature Singapore, Singapore, pp 47–87
8. Wu X, Tian Z, Guo J (2022) A review of the theoretical research and practical progress of carbon neutrality. Sustain Oper Comput 3:54–66
9. Kassem Y, Gökçekuş H, Zeitoun M (2019) Modeling of techno-economic assessment on wind energy potential at three selected coastal regions in Lebanon. Model Earth Syst Environ 5:1037–1049
10. Alayat MM, Kassem Y, Çamur H (2018) Assessment of wind energy potential as a power generation source: a case study of eight selected locations in Northern Cyprus. Energies 11(10):2697
11. Kassem Y, Gökçekuş H, Janbein W (2021) Predictive model and assessment of the potential for wind and solar power in Rayak region, Lebanon. Model Earth Syst Environ 7:1475–1502
12. Kassem Y, Çamur H, Aateg RAF (2020) Exploring solar and wind energy as a power generation source for solving the electricity crisis in Libya. Energies 13(14):3708
13. Gökçekuş H, Kassem Y, Al Hassan M (2019) Evaluation of wind potential at eight selected locations in Northern Lebanon using open source data. Int J Appl Eng Res 14(11):2789–2794
14. Xu Y, Li Y, Zheng L, Cui L, Li S, Li W, Cai Y (2020) Site selection of wind farms using GIS and multi-criteria decision-making method in Wafangdian, China. Energy 207:118222
15. Noorollahi Y, Jokar MA, Kalhor A (2016) Using artificial neural networks for temporal and spatial wind speed forecasting in Iran. Energy Convers Manage 115:17–25


16. Ghorbani MA, Khatibi R, Hosseini B, Bilgili M (2013) Relative importance of parameters affecting wind speed prediction using artificial neural networks. Theoret Appl Climatol 114:107–114
17. Ghanbarzadeh A, Noghrehabadi AR, Behrang MA, Assareh E (2009) Wind speed prediction based on simple meteorological data using artificial neural network. In: 2009 7th IEEE international conference on industrial informatics. IEEE, pp 664–667
18. Kassem Y, Gökçekuş H, Çamur H (2019) Analysis of prediction models for wind power density, case study: Ercan area, Northern Cyprus. In: 13th international conference on theory and application of fuzzy systems and soft computing—ICAFS-2018. Springer International Publishing, pp 99–106
19. Abatzoglou JT, Dobrowski SZ, Parks SA, Hegewisch KC (2018) TerraClimate, a high-resolution global dataset of monthly climate and climatic water balance from 1958–2015. Sci Data 5(1):1–12
20. Cepeda Arias E, Cañon Barriga J (2022) Performance of high-resolution precipitation datasets CHIRPS and TerraClimate in a Colombian high Andean basin. Geocarto Int 1–21
21. Wiwoho BS, Astuti IS (2022) Runoff observation in a tropical Brantas watershed as observed from long-term globally available TerraClimate data 2001–2020. Geoenviron Disasters 9(1):12
22. Kassem Y, Gökçekuş H, Mosbah AAS (2023) Prediction of monthly precipitation using various artificial models and comparison with mathematical models. Environ Sci Pollut Res 1–27
23. Kassem Y, Çamur H, Zakwan AHMA, Nkanga NA (2023) Prediction of cold filter plugging point of different types of biodiesels using various empirical models. In: 15th international conference on applications of fuzzy systems, soft computing, and artificial intelligence tools—ICAFS-2022. Springer Nature Switzerland, Cham, pp 50–57
24. Kassem Y (2023) Analysis of different combinations of meteorological parameters and well characteristics in predicting the groundwater chloride concentration with different empirical approaches: a case study in Gaza Strip, Palestine. Environ Earth Sci 82(6):134
25. Gholamy A, Kreinovich V, Kosheleva O (2018) Why 70/30 or 80/20 relation between training and testing sets: a pedagogical explanation. Departmental Technical Reports (CS) 1209. https://scholarworks.utep.edu/cs_techrep/1209
26. Manwell JF, McGowan JG, Rogers AL (2009) Wind energy explained: theory, design, and application. Wiley
27. Wood DH (2012) Wind energy: fundamentals, resource analysis, and economics. Springer Science & Business Media
28. Li C, Yuan Y (2013) Wind resource assessment and micro-siting: science and engineering. Springer Science & Business Media
29. Hasager CB, Nielsen M, Pena A (eds) (2016) Wind energy systems: optimising design and construction for safe and reliable operation. Woodhead Publishing
30. Zalhaf AS, Elboshy B, Kotb KM, Han Y, Almaliki AH, Aly RM, Elkadeem MR (2021) A high-resolution wind farms suitability mapping using GIS and fuzzy AHP approach: a national-level case study in Sudan. Sustainability 14(1):358
31. Shorabeh SN, Firozjaei MK, Nematollahi O, Firozjaei HK, Jelokhani-Niaraki M (2019) A risk-based multi-criteria spatial decision analysis for solar power plant site selection in different climates: a case study in Iran. Renew Energy 143:958–973

Next Generation Intelligent IoT Use Case in Smart Manufacturing Bharati Rathore

Abstract Smart manufacturing has become a significant topic of interest among manufacturing industry professionals and researchers in recent times. Smart manufacturing involves the incorporation of cutting-edge technologies, including the Internet of things, cyber-physical systems, cloud computing, and big data. This is evident in the Industry 4.0 framework. Henceforth, next generation smart manufacturing showcases a deep amalgamation of artificial intelligence (AI) tech and highly developed production technologies. It is woven into each stage of the design, production, product, and service cycle and influences the entire life cycle. The subsequent form of smart manufacturing is the central propellant for the novel industrial revolution and is expected to be the principal catalyst for the transformation and betterment of the manufacturing industry for the generations ahead. Through this research, we proposed a ‘4*S Model’ via conceptualization of smart, sensorable, sustainable, and secure concepts at various stages of manufacturing. The evolution of smart manufacturing for Industry 4.0 is an ongoing process and this research will provide insights for further developments in manufacturing. Keywords Smart manufacturing · Industry 4.0 · 4*S model

1 Introduction

Countries around the world are actively participating in the new industrial revolution by adopting advanced technologies like AI, IoT, cloud computing, CPS, and big data into their manufacturing processes through smart manufacturing. Through adopting SM, they are able to gain improved visibility into production, value, and performance in real time, boost manufacturing agility and flexibility, and enhance predictive maintenance and analytics [1]. The inculcation of SM is believed to be a crucial determinant in creating a competitive edge for the manufacturing industry of widely recognised countries on the global scale. Through SM, nations have access to smart devices, technologies, and tools that allow them to gain a comprehensive picture of their manufacturing activities, the environmental and market conditions, and the customer requirements, ultimately resulting in the highest levels of productivity, efficiency, and cost savings among others [2]. Germany has crafted the corporate initiative of 'Industrie 4.0', and the UK has introduced their 'UK Industry 2050' policy as a response to the Fourth Industrial Revolution and its use of expanded smart technologies and smart manufacturing (SM) [3]. The afore-mentioned strategies fundamentally aim to bring to the forefront the use of new technologies as a tool for revolutionising the classical manufacturing industry and its practices. In this light, Germany's Industrie 4.0 is focused on the vertical integration of highly developed technologies such as AI, big data, the IoT, and the like, whereas the UK Industry 2050 brings forth the overall vision of a brave new world in the rapidly evolving world of industrial technology. In addition, France has launched the 'New Industrial France' initiative, Japan has set forward the 'Society 5.0' plan, and Korea has initiated the 'Manufacturing Innovation 3.0' course of action to join the countries embracing the industrial revolution and its use of highly developed technologies as part of their smart manufacturing (SM) methods [1, 4]. The New Industrial France brings about the idea of an ecosystem for SM in France, Society 5.0 of Japan throws light on the full use of SM and its products, and Korea's Manufacturing Innovation 3.0 focuses on economic outcomes. All these strategies were made to enhance SM and its multiple effects and comply with the principles of the Industrial Revolution 4.0 [4, 5]. The adoption of intelligent manufacturing is considered crucial for major countries to tackle the challenges posed by the Fourth Industrial Revolution and to stay ahead in the manufacturing industry [6]. It is seen as a key strategy to gain a competitive edge. Intelligent manufacturing refers to the integration of advanced digital technologies like IoT, AI, ML, robotics, and automation to enhance manufacturing efficiency, improve product quality and innovation, and promote production agility and flexibility. Intelligent manufacturing provides a convergence of business, operational, and engineering processes, enabling a comprehensive collection of data which can be used to drive agile, predictive, and real-time decision-making [7]. As such, intelligent manufacturing can play a significant role in helping countries to develop innovative products and services, improve resource utilisation, and gain a competitive edge in the future of manufacturing [8, 9]. Since the beginning of the twenty-first century, new-generation information technology has seen an explosive uptick. Modern technologies like smartphones, tablets, cloud computing, and social media have changed the way we interact, shop, consume information, and create art. The emergence of the Internet of things, artificial intelligence, and machine learning has spurred the creation of innovative digital technologies such as autonomous vehicles, speech recognition systems, virtual reality, and robotics that are constantly evolving [10, 11].

B. Rathore (B) Birmingham City University, Birmingham B5 5JU, UK e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_21
A new method of producing goods and services, new-generation smart manufacturing makes use of cutting-edge technologies including IoT, data analytics, machine learning, and 3D printing [12]. In order to allow more effective, secure, and cost-effective operations, this new approach makes use of automated procedures including predictive maintenance, supply chain optimisation, and monitoring of numerous performance factors [13]. The Fourth Industrial Revolution, known as Industry 4.0, is the cornerstone of smart production and involves extensive automation, technological integration, and data sharing. Smart manufacturing enables manufacturers to utilise their resources more effectively and serve clients more quickly and accurately than ever before [14].

2 Literature Review

Industry 4.0, or 'smart manufacturing', is a rapidly expanding field that uses cutting-edge computing and communication technologies to enhance the efficiency of automated production processes. In recent years, there have been many fascinating advancements and studies in this area. The application of artificial intelligence (AI) and machine learning (ML) to enhance industrial processes has received attention. A few companies, for instance, are investigating how AI may be used to optimise production schedules in order to increase productivity, decrease downtime, and cut costs. Others are focusing on utilising ML to foresee equipment breakdowns in order to avoid costly repairs and extend the lifespan of the equipment. The application of collaborative robots, or 'cobots', to enhance industrial processes is another field of research. Because cobots are made to operate alongside human operators, they may take over risky or repetitive jobs, freeing up human employees to concentrate on more difficult ones. Cobots may also pick up skills from human operators, making their work more efficient and secure. Smart manufacturing has also benefited greatly from the Internet of things (IoT): real-time monitoring of equipment performance, energy utilisation, and other important parameters is possible with IoT sensors and devices, and this information may then be used to improve output while reducing waste. Smart manufacturing technologies have the potential to completely change the manufacturing sector, making it more effective, sustainable, and lucrative.

2.1 Research Objectives of This Study

• To evaluate the potential advantages of smart manufacturing for the manufacturing sector and to pinpoint the crucial components necessary for its success.
• To assess the influence of new technologies on the manufacturing sector and their potential uses for smart manufacturing, such as AI, IoT, CPS, cloud computing, and big data.
• To put forth a framework that incorporates the ideas of smart, sensorable, sustainable, and secure operations at various phases of the manufacturing process.


• To provide insights for further developments in smart manufacturing and its potential contributions to the industrial revolution and the betterment of the manufacturing industry.

2.2 Research Methodology

The research was conducted using qualitative data obtained from several secondary sources, such as journals, newspapers, publications, magazines, books, and online and offline websites. The information was collected from libraries and through online searches and was thoroughly examined and verified for accuracy.

3 Next Generation Technology Development

The Fourth Industrial Revolution encompasses the use of the industrial Internet of things, 3D printing, robotic technology, deep learning, artificial intelligence, blockchain technology, and cloud computing in manufacturing processes [15]. This combination of technologies will enable seamless communication between machines, people, and systems, as well as accelerated automation and optimization of entire processes, whether in development, management, or operations [16]. It will also bring down the cost of production while improving production output. Industry 4.0 fundamentally changes the way manufacturing processes are managed and controlled by introducing cyber-physical systems into production [17]. By using connected computers and smart sensors, cyber-physical systems allow manufacturers to simulate and visualise entire production processes in a virtual environment, improve performance, and make real-time decisions [15, 18]. Moreover, through real-time analytics and predictive maintenance, manufacturers can achieve higher precision and improved traceability. Finally, Industry 4.0 makes possible production sharing, collaboration, and open innovation across borders thanks to smart data exchange [19, 20].

4 Defining '4*S Model'

A paradigm for smart manufacturing called the '4*S Model' places an emphasis on four fundamental principles: smart, sensorable, sustainable, and secure. The term 'smart' describes the use of cutting-edge digital technology, such as cloud computing, artificial intelligence, and the Internet of things (IoT), to optimise industrial processes. Manufacturers may save costs, boost production, and improve product quality by incorporating these technologies.


The term ‘Sensorable’ highlights the significance of sensors in the production process. Manufacturers may monitor and analyse vital parameters, such as temperature, pressure, and humidity, to enhance efficiency and decrease waste by utilising sensors to capture real-time data on tools, machinery, and products. The term ‘sustainable’ emphasises the significance of socially and ecologically conscious production methods. This entails minimising waste, cutting back on carbon emissions, and encouraging moral employment practices. The word ‘secure’ emphasises the necessity of strong cybersecurity measures to guard against theft, hacking, and other security risks. Overall, the ‘4*S Model’ offers a thorough framework for incorporating smart, sensorable, sustainable, and secure technologies into the manufacturing process, allowing manufacturers to increase productivity, cut costs, and improve product quality while giving environmental and ethical considerations top priority.

4.1 Conceptualization of '4*S Model'

Figure 1 presents the proposed model, which incorporates smart, sensorable, sustainable, and secure concepts at various stages of manufacturing.

Planning Stage:

Smart: Employ AI-based planning algorithms that can optimise production schedules and reduce energy consumption [21]. By analysing data and forecasting future production requirements, AI-based planning algorithms may be used to optimise production schedules and lower energy usage [22]. These algorithms can calculate the best production schedule by taking into consideration factors including production capacity, inventory levels, demand forecasts, and energy consumption trends [23] (a small scheduling sketch follows this stage). To take into account unforeseen changes in demand or supply chain problems, this schedule may be modified in real time. AI-based algorithms can also analyse historical data on energy use to find trends that can be used to develop energy-efficient production plans [24–26].

Sensorable: Sensors can be deployed to monitor the quality of raw materials and to track inventory levels, and they can be very useful for both tasks in a manufacturing facility [27]. For example, sensors can be placed on conveyor belts or in storage bins to monitor the weight and quantity of raw materials as they are received and used in the production process. This data can be sent to a central system for analysis and used to optimise inventory management and reduce waste [28, 29].

Sustainable: Implement eco-friendly production techniques that reduce waste and minimise the carbon footprint; such techniques are crucial for any manufacturing facility [30].

Secure: Cybersecurity measures need to be taken to secure the planning data and algorithms [31]. As AI-based planning algorithms become more prevalent in manufacturing facilities, it is critical to ensure that proper cybersecurity measures are in place to protect the planning data and algorithms from cyberthreats [32].
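The paper does not specify a particular planning algorithm. As one simple, hypothetical stand-in for the schedule-optimisation step described above, the sketch below uses linear programming (via SciPy) to allocate production across shifts at minimum energy cost, subject to capacity and demand constraints; all numbers are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical example: schedule one product over four shifts to meet total
# demand at minimum energy cost. Tariffs and capacities are illustrative.
energy_cost = np.array([3.0, 2.0, 1.5, 2.5])  # cost per unit produced, per shift
capacity = np.array([100, 100, 100, 100])     # maximum units per shift
demand = 250                                  # total units required

res = linprog(c=energy_cost,
              A_eq=[np.ones(4)], b_eq=[demand],  # production must meet demand
              bounds=[(0, cap) for cap in capacity])
print(res.x)  # units per shift, favouring the cheapest-energy shifts
```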


Fig. 1 Proposed ‘4*S model’

Design Stage:

Smart: Employ 3D printing, virtual reality, and other digital tools to design and test products before production [33]. Numerous advantages for product development can come from using such digital tools throughout the design phase [34].

Sensorable: Data on product performance and usage may be gathered using sensors and utilised to inform product design [35]. Sensors may be used to gather useful information on how a product is used and how well it performs, which can then be analysed and utilised in the design process. This information can shed light on how users interact with the product, the aspects they value most, and potential areas for improvement [36–38].

Sustainable: Use long-lasting materials, and create goods that can be recycled [39]. For products to have a smaller environmental impact and to deliver long-term cost savings, sustainability must be integrated into product design [40, 41].


Delivery Stage:

Smart: Use IoT and machine learning to optimise transportation routes and delivery schedules [42].

Sensorable: Sensors can be used to track product delivery and monitor temperature and humidity levels during transportation [43].

Sustainable: Optimise delivery routes to minimise fuel consumption and reduce carbon emissions.

Secure: Implement security measures to protect product data and prevent theft during transportation [44, 45].

Maintenance Stage:

Smart: Use predictive maintenance techniques that employ machine learning algorithms to predict equipment failures before they occur (see the sketch after this list).

Sensorable: Sensors can be used to monitor equipment health and detect anomalies.

Sustainable: Use eco-friendly maintenance practices that reduce waste and energy consumption.

Secure: Implement security measures to prevent unauthorised access to maintenance data and control systems [46].
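As a hedged illustration of the maintenance-stage idea, one common approach is to run an anomaly detector over equipment sensor readings and flag unusual measurements as potential faults. The sketch below uses scikit-learn's IsolationForest on hypothetical temperature, vibration, and pressure readings; the data and parameters are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical sensor log: rows are (temperature, vibration, pressure) readings.
readings = np.random.default_rng(0).normal(loc=[70.0, 0.5, 30.0],
                                           scale=[2.0, 0.05, 1.0],
                                           size=(1000, 3))

detector = IsolationForest(contamination=0.01, random_state=0).fit(readings)

new_reading = np.array([[95.0, 1.2, 28.0]])  # unusually hot, high-vibration reading
print(detector.predict(new_reading))          # -1 flags a potential fault
```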

5 Challenges in Smart Manufacturing

According to statistics from several publications, 73% of manufacturing companies report having less than two years of experience with smart manufacturing, and 70% of them assert that they are moving slowly or not at all on their smart manufacturing roadmap. Many industrial organisations thus appear to still be in the early phases of deploying smart manufacturing technology. Despite the potential advantages of smart manufacturing, such as improved productivity, quality control, and personalization, many businesses are having trouble moving forward with their roadmaps. There are a number of explanations for why this could be the case [22, 47–49]. First, implementing smart manufacturing technologies can be expensive and time-consuming. Many organisations may not have the resources or expertise to make the necessary investments and changes to their operations [50–52]. Second, there may be a lack of understanding or awareness of the potential benefits of smart manufacturing. Some businesses might not perceive the benefit of investing in these technologies, particularly if they have had prior success using conventional production techniques [53–56]. Another deciding factor may be how challenging it is to implement smart manufacturing technologies and integrate them with existing systems. Financial constraints and a shortage of qualified staff may impede certain firms' progress

272

B. Rathore

towards smart manufacturing [57–59]. With the right knowledge, training, and investment, the potential benefits of smart manufacturing—such as increased effectiveness, productivity, and cost savings—can, nevertheless, get over the first obstacles [60]. Third, issues with data management and security can arise. Large-scale data gathering and analysis are required for smart manufacturing systems, which can be challenging to protect and manage. The amount of data produced by machines and systems has increased as a result of the adoption of smart manufacturing technologies [61, 62]. This data needs to be managed efficiently to derive insights and make informed decisions. Additionally, there is a need for secure data storage and transmission to protect sensitive information from cyberthreats. Organisations must invest in robust data management and security systems to address these challenges and ensure the smooth functioning of their smart manufacturing operations [63].

6 Advantages in Smart Manufacturing

In order to increase productivity, quality, and efficiency in the manufacturing process, 'smart manufacturing' incorporates digital technologies and data analytics [47]. The following are some benefits of smart manufacturing:

6.1 Direct Cost Savings

Businesses can gain various immediate cost-saving benefits from smart manufacturing. Here are a few illustrations:
Reduced Labour Costs: Many procedures that would typically need human work are automated by smart manufacturing technologies, allowing businesses to considerably lower their employment expenses.
Enhanced Efficiency: Real-time monitoring and process optimisation are possible with smart manufacturing technologies. This contributes to waste reduction, downtime reduction, and production efficiency improvement, all of which can result in cost savings [47].
Lower Maintenance Costs: Real-time equipment and machinery monitoring is another capability of smart manufacturing systems. By doing this, companies may spot possible faults before they develop into bigger ones, which can cut down on maintenance expenses and expensive downtime.
Reduced Energy Consumption: Smart manufacturing systems can optimise energy consumption by identifying areas where energy is being wasted and making adjustments to reduce consumption. This can lead to significant cost savings on energy bills [48].


6.2 Indirect Cost Savings

In addition to the direct cost savings of smart manufacturing, there are also several indirect cost savings that businesses can enjoy. Here are a few examples:
Improved Quality: Smart manufacturing systems can monitor processes in real time and make adjustments as needed to maintain consistent product quality. Cost reductions may result from fewer faults, reduced rework, and fewer warranty claims [49].
Enhanced Safety: By monitoring and managing hazardous processes, smart manufacturing systems may also increase worker safety. As a result, there may be a decrease in the likelihood of accidents and injuries, which can lead to lower insurance premiums and workers' compensation expenses.
Better Inventory Management: Real-time visibility into inventory levels and production schedules may be provided by smart manufacturing systems. This can assist companies in minimising inventory carrying costs, preventing stockouts and overstocks, and optimising inventory levels, all of which can result in cost savings [22, 49].
Improved Customer Satisfaction: Smart manufacturing systems can help businesses deliver products that meet or exceed customer expectations. This can lead to higher customer satisfaction, repeat business, and positive word-of-mouth referrals, all of which can result in increased revenue and profitability [22].

7 Limitations of This Study

There are a few limitations to this study that should be taken into account, even though the proposed '4*S Model' for smart manufacturing is a promising and innovative way to improve the production process. The qualitative data used in this study was gathered from secondary sources. Furthermore, the research considers only the proposed model and does not offer a thorough examination of all feasible methods for enhancing smart manufacturing. The model described in this study might not be appropriate for all industrial processes and could need further customisation and adaptation for certain industries or applications.

8 Conclusion

The term 'Industry 4.0' is frequently used to describe the trend towards geographically dispersed, Internet-connected, medium-sized smart manufacturing. This change is being fuelled by advances in IoT, cloud computing, and AI technologies as well as the growing availability of inexpensive and dependable Internet access. These innovations make it possible for factories to be more adaptive and flexible, allowing them to swiftly change their production methods in response to shifting consumer needs. Real-time data analytics and the usage of smart sensors can also help decrease waste and improve manufacturing processes. The development of new business models and revenue sources, such as the offering of value-added services or the development of new goods, is another potential outcome of the move towards smart factories. Using cutting-edge technology like IoT, AI, and machine learning, smart manufacturing is a viable strategy for changing the industrial sector. The potential advantages of smart manufacturing, such as greater efficiency, productivity, and cost savings, are enormous even if many organisations are still in the early phases of implementation. To fully realise the advantages of smart manufacturing technologies, organisations must handle challenges like data management, security, and workforce development. By overcoming these obstacles and investing in smart manufacturing, organisations can position themselves for long-term success in a rapidly changing market. The 'Proposed 4*S Model' may be improved even further by incorporating it into current frameworks, creating performance measures, applying it to specific sectors, exploring the social elements, and creating decision support systems. By addressing these research areas, the 4*S model may be modified to suit the particular challenges and potential of smart manufacturing in various sectors and environments.

References

1. Wang B, Tao F, Fang X, Liu C, Liu Y, Freiheit T (2021) Smart manufacturing and intelligent manufacturing: a comparative review. Engineering 7(6):738–757
2. Davis J, Edgar T, Porter J, Bernaden J, Sarli M (2012) Smart manufacturing, manufacturing intelligence and demand-dynamic performance. Comput Chem Eng 47:145–156
3. Tao F, Qi Q, Liu A, Kusiak A (2018) Data-driven smart manufacturing. J Manuf Syst 48:157–169
4. Yang H, Kumara S, Bukkapatnam ST, Tsung F (2019) The internet of things for smart manufacturing: a review. IISE Trans 51(11):1190–1216
5. Rathore B (2022) Textile Industry 4.0 transformation for sustainable development: prediction in manufacturing & proposed hybrid sustainable practices. Eduzone: Int Peer Rev/Refereed Multidisciplinary J 11(1):223–241
6. Kusiak A (2017) Smart manufacturing must embrace big data. Nature 544(7648):23–25
7. Ramakrishna S, Khong TC, Leong TK (2017) Smart manufacturing. Proc Manuf 12:128–131
8. Ghobakhloo M (2020) Determinants of information and digital technology implementation for smart manufacturing. Int J Prod Res 58(8):2384–2405
9. Rathore B (2023) Integration of artificial intelligence and its practices in apparel industry. Int J New Media Stud (IJNMS) 10(1):25–37
10. Qu YJ, Ming XG, Liu ZW, Zhang XY, Hou ZT (2019) Smart manufacturing systems: state of the art and future trends. Int J Adv Manuf Technol 103:3751–3768
11. Phuyal S, Bista D, Bista R (2020) Challenges, opportunities and future directions of smart manufacturing: a state of art review. Sustain Futures 2:100023
12. Kusiak A (2019) Fundamentals of smart manufacturing: a multi-thread perspective. Annu Rev Control 47:214–220


13. Zenisek J, Wild N, Wolfartsberger J (2021) Investigating the potential of smart manufacturing technologies. Proc Comput Sci 180:507–516
14. Li L, Lei B, Mao C (2022) Digital twin in smart manufacturing. J Ind Inf Integr 26:100289
15. Zhou J, Li P, Zhou Y, Wang B, Zang J, Meng L (2018) Toward new-generation intelligent manufacturing. Engineering 4(1):11–20
16. Leng J, Ye S, Zhou M, Zhao JL, Liu Q, Guo W, Cao W, Fu L (2020) Blockchain-secured smart manufacturing in industry 4.0: a survey. IEEE Trans Syst Man Cybern Syst 51(1):237–252
17. Zheng P, Wang H, Sang Z, Zhong RY, Liu Y, Liu C, Mubarok K, Yu S, Xu X (2018) Smart manufacturing systems for Industry 4.0: conceptual framework, scenarios, and future perspectives. Front Mech Eng 13:137–150
18. Namjoshi J, Rawat M (2022) Role of smart manufacturing in industry 4.0. Mater Today Proc 63:475–478
19. Mahmoud MA, Ramli R, Azman F, Grace J (2020) A development methodology framework of smart manufacturing systems (Industry 4.0). Int J Adv Sci Eng Inf Technol 10(5):1927–1932
20. Çınar ZM, Zeeshan Q, Korhan O (2021) A framework for industry 4.0 readiness and maturity of smart manufacturing enterprises: a case study. Sustainability 13(12):6659
21. Zuo Y (2021) Making smart manufacturing smarter—a survey on blockchain technology in Industry 4.0. Enterp Inf Syst 15(10):1323–1353
22. Ahuett-Garza H, Kurfess T (2018) A brief discussion on the trends of habilitating technologies for Industry 4.0 and smart manufacturing. Manuf Lett 15:60–63
23. Ludbrook F, Michalikova KF, Musova Z, Suler P (2019) Business models for sustainable innovation in industry 4.0: smart manufacturing processes, digitalization of production systems, and data-driven decision making. J Self-Gov Manage Econ 7(3):21–26
24. Valaskova K, Nagy M, Zabojnik S, Lăzăroiu G (2022) Industry 4.0 wireless networks and cyber-physical smart manufacturing systems as accelerators of value-added growth in Slovak exports. Mathematics 10(14):2452
25. Bajic B, Cosic I, Lazarevic M, Sremcev N, Rikalovic A (2018) Machine learning techniques for smart manufacturing: applications and challenges in industry 4.0. Department of Industrial Engineering and Management, Novi Sad, Serbia, p 29
26. Evjemo LD, Gjerstad T, Grøtli EI, Sziebig G (2020) Trends in smart manufacturing: role of humans and industrial robots in smart factories. Curr Robot Rep 1:35–41
27. Hopkins E, Siekelova A (2021) Internet of things sensing networks, smart manufacturing big data, and digitized mass production in sustainable industry 4.0. Econ Manage Financ Markets 16(4)
28. Saleh A, Joshi P, Rathore RS, Sengar SS (2022) Trust-aware routing mechanism through an edge node for IoT-enabled sensor networks. Sensors 22(20):7820
29. Machado CG, Winroth MP, Ribeiro da Silva EHD (2020) Sustainable manufacturing in Industry 4.0: an emerging research agenda. Int J Prod Res 58(5):1462–1484
30. Davim JP (ed) (2013) Sustainable manufacturing. John Wiley & Sons
31. Sharma R, Jabbour CJC, Lopes de Sousa Jabbour AB (2021) Sustainable manufacturing and industry 4.0: what we know and what we don't. J Enterp Inf Manage 34(1):230–266
32. Petrillo A, Cioffi R, De Felice F (eds) (2018) Digital transformation in smart manufacturing. BoD–Books on Demand
33. Abikoye OC, Bajeh AO, Awotunde JB, Ameen AO, Mojeed HA, Abdulraheem M, Oladipo ID, Salihu SA (2021) Application of internet of thing and cyber physical system in Industry 4.0 smart manufacturing. In: Emergence of cyber physical system and IoT in smart automation and robotics: computer engineering in automation. Springer International Publishing, Cham, pp 203–217
34. Maheswari M, Brintha NC (2021) Smart manufacturing technologies in industry-4.0. In: 2021 Sixth international conference on image information processing (ICIIP), vol 6. IEEE, pp 146–151
35. Bhatnagar D, Rathore RS. Cloud computing: security issues and security measures. Int J Adv Res Sci Eng 4(01):683–690
36. Vaidya S, Ambad P, Bhosle S (2018) Industry 4.0—a glimpse. Proc Manuf 20:233–238


37. Wade K, Vochozka M (2021) Artificial intelligence data-driven internet of things systems, sustainable industry 4.0 wireless networks, and digitized mass production in cyber-physical smart manufacturing. J Self-Gov Manage Econ 9(3):48–60
38. Frontoni E, Loncarski J, Pierdicca R, Bernardini M, Sasso M (2018) Cyber physical systems for industry 4.0: towards real time virtual reality in smart manufacturing. In: Augmented reality, virtual reality, and computer graphics: 5th international conference, AVR 2018, Otranto, Italy, June 24–27, Proceedings, Part II 5. Springer International Publishing, pp 422–434
39. Shin KY, Park HC (2019) Smart manufacturing systems engineering for designing smart product-quality monitoring system in the industry 4.0. In: 2019 19th International conference on control, automation and systems (ICCAS). IEEE, pp 1693–1698
40. Muthu SS (ed) (2017) Sustainability in the textile industry. Springer, Singapore
41. Lombardi Netto A, Salomon VA, Ortiz-Barrios MA, Florek-Paszkowska AK, Petrillo A, De Oliveira OJ (2021) Multiple criteria assessment of sustainability programs in the textile industry. Int Trans Oper Res 28(3):1550–1572
42. Nayyar A, Kumar A (eds) (2020) A roadmap to industry 4.0: smart production, sharp business and sustainable development. Springer, Berlin, pp 1–21
43. Kumar K, Zindani D, Davim JP (2019) Industry 4.0: developments towards the fourth industrial revolution. Springer, Cham, Switzerland
44. Kandasamy J, Muduli K, Kommula VP, Meena PL (eds) (2022) Smart manufacturing technologies for industry 4.0: integration, benefits, and operational activities. CRC Press
45. Affatato L, Carfagna C (2013) Smart textiles: a strategic perspective of textile industry. In: Advances in science and technology, vol 80. Trans Tech Publications Ltd, pp 1–6
46. Büchi G, Cugno M, Castagnoli R (2020) Smart factory performance and Industry 4.0. Technol Forecast Soc Chang 150:119790
47. Liu Y, Xu X (2017) Industry 4.0 and cloud manufacturing: a comparative analysis. J Manuf Sci Eng 139(3)
48. Osterrieder P, Budde L, Friedli T (2020) The smart factory as a key construct of industry 4.0: a systematic literature review. Int J Prod Econ 221:107476
49. Longo F, Nicoletti L, Padovano A (2017) Smart operators in industry 4.0: a human-centered approach to enhance operators' capabilities and competencies within the new smart factory context. Comput Ind Eng 113:144–159
50. Pascual DG, Daponte P, Kumar U (2019) Handbook of industry 4.0 and SMART systems. CRC Press
51. Oztemel E, Gursev S (2020) Literature review of Industry 4.0 and related technologies. J Intell Manuf 31:127–182
52. Misra S, Roy C, Mukherjee A (2021) Introduction to industrial internet of things and industry 4.0. CRC Press
53. Friedman T (2008) Hot, flat, and crowded: why we need a green revolution—and how it can renew America. Farrar, Straus and Giroux. ISBN 978-0-312-42892-1
54. Rathore B (2022) Supply chain 4.0: sustainable operations in fashion industry. Int J New Media Stud (IJNMS) 9(2):8–13
55. Porter M, Heppelmann J (2014) How smart, connected products are transforming competition. Harvard Bus Rev
56. Chavarría-Barrientos D, Camarinha-Matos LM, Molina A (2017) Achieving the sensing, smart and sustainable "everything". In: Camarinha-Matos L, Afsarmanesh H, Fornasiero R (eds) Collaboration in a data-rich world. PRO-VE 2017, vol 506. IFIP Advances in Information and Communication Technology. Springer, Cham
57. Kumar S, Rathore RS, Mahmud M, Kaiwartya O, Lloret J (2022) BEST—blockchain-enabled secure and trusted public emergency services for smart cities environment. Sensors 22(15):5733
58. Molina A, Ponce P, Ramirez M, Sanchez-Ante G (2014) Designing a S2-enterprise (smart x sensing) reference model, collaborative systems for smart networked environments. IFIP Adv Inf Commun Technol 434:384–395
59. Rosling H, Rosling O, Rosling R (2018) Factfulness. Factfulness AB. ISBN 978-1-250-10781-7


60. Bainbridge BS, Roco MC (eds) (2006) Managing nano-bio-info-cogno innovations: converging technologies in society. Springer
61. Miranda J, Cortes D, Ponce P, Noguez J, Molina JM, López EO, Molina A (2018) Sensing, smart and sustainable products to support health and well-being in communities. In: 2018 International conference on computational science and computational intelligence (CSCI'18). IEEE
62. Kang HS, Lee JY, Choi S, Kim H, Park JH, Son JY, Kim BH, Noh SD (2016) Smart manufacturing: past research, present findings, and future directions. Int J Precis Eng Manuf Green Technol 3:111–128
63. Tuptuk N, Hailes S (2018) Security of smart manufacturing systems. J Manuf Syst 47:93–106

Forecasting Financial Success App: Unveiling the Potential of Random Forest in Machine Learning-Based Investment Prediction Ashish Khanna, Divyansh Goyal, Nidhi Chaurasia, and Tariq Hussain Sheikh

Abstract In the complicated and dynamic financial market, forecasting financial investment choices is essential for assisting investors in making decisions. This research article focuses on using machine learning techniques, specifically the random forest algorithm, to predict financial investment alternatives. The performance of each algorithm, namely KNN, decision tree, logistic regression, and random forest, was evaluated based on its accuracy and precision using a testing set. The study concludes that the random forest algorithm outperformed the other algorithms with greater accuracy and is most suitable for predicting the profitability of financial investments. Additionally, we demonstrate the creation of a web application that incorporates the predictive model and enables users to enter pertinent data and obtain real-time predictions. To train and validate the random forest model, historical financial data, including market indexes, business fundamentals, and macroeconomic indicators, is gathered and preprocessed to eliminate any missing or inconsistent data points. The algorithm's collection of decision trees makes it reliable and adaptable when processing complex data. The Django-based web application's user-friendly interface allows users to enter parameters and get projections for various investment options. Assessment criteria, including accuracy, confirm that the random forest model performs adequately. The incorporation of the predictive model into the web application provides a practical tool for making data-driven investment decisions. This research contributes to the development of a user-friendly web application and an observational evaluation of the algorithm's performance, as well as to the interests of profit-seeking investors, financial experts, and analysts interested in machine learning for financial decision-making.

A. Khanna · D. Goyal · N. Chaurasia (B)
Maharaja Agrasen Institute of Technology, Guru Gobind Singh Indraprastha University, Delhi, India
e-mail: [email protected]
A. Khanna
e-mail: [email protected]
T. H. Sheikh
Department of Computer Science, Shri Krishan Chander Government Degree College, Poonch 185101, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_22


Keywords Financial investment · Machine learning · Random forest algorithm · K-nearest neighbors · Decision tree · Logistic regression

1 Introduction

The financial investment landscape is complex and uncertain, forcing investors to make educated judgments in a setting where many variables are at play. Fundamental analysis, technical indicators, and professional judgment are frequently used in traditional investment strategies. However, as a result of rapid technological improvements and the accessibility of enormous volumes of financial data, a growing number of people are interested in using machine learning techniques to improve investment decision-making. Machine learning, a branch of artificial intelligence, has the capacity to find and extract patterns from vast volumes of data [1]. It uses a variety of techniques and models to classify or predict using past data without requiring exact prior knowledge. In the context of financial investing, machine learning has the capacity to recognize relationships, pinpoint flaws, and adjust to market fluctuations. This paper's goal is to study how machine learning algorithms might be used to forecast financial investment options. We want to develop predictive models that can assist investors in making more informed decisions by utilizing historical data and significant features, such as market indexes, corporate fundamentals, and macroeconomic indicators. Accurate forecasting of investment decisions can yield rewarding insights, enabling investors to optimize portfolio allocation, control risk, and identify potentially fruitful opportunities. Data-driven decision-making is a benefit provided by machine learning technologies, which can supplement and enhance traditional investment methodologies. This study centers on assessing the performance of different machine learning algorithms [2]. Through broad experimentation and investigation, we compare the predictive accuracy, robustness, and computational efficiency of these algorithms. Moreover, we explore the effect of distinct feature sets and data preprocessing strategies on predictive performance. The remainder of this paper is organized as follows: Sect. 2 gives an overview of related work in the field of financial investment prediction using machine learning. Sects. 3 and 4 describe the underlying concepts and the methodology, including data collection, preprocessing, and the chosen machine learning algorithms. Sect. 5 presents the experimental results and performance evaluation. Finally, Sects. 6–8 discuss the findings, the limitations of the study, and potential avenues for future research.


2 Literature Review

See Table 1.

3 Concept

3.1 Financial Investment

A financial product, like a stock or a cryptocurrency, that has been purchased primarily with the hope of making money is referred to as an investment. Every investment comes with its own set of risks, rewards, and disadvantages, all of which have an impact on how and when investors decide to buy or sell assets.

3.2 Machine Learning

A subset of artificial intelligence called machine learning enables software programs to increase the precision of their predictions without having to explicitly program them. These apps can analyze previous data and produce predictions for fresh output values by using machine learning algorithms. Types of learning in machine learning: (i) supervised learning, (ii) unsupervised learning, and (iii) reinforcement learning. We have used a supervised learning approach in the model and trained it using a random forest algorithm. Assuming trees are free to grow to a maximum height of O(log n), training a random forest takes O(t · u · n log n), where t is the number of trees and u is the number of features considered for splitting. The prediction of a new sample takes O(t log n) [13] (Fig. 1).
In the flowchart of Fig. 1, the following generalized steps are taken into account:
Data preprocessing: Prepare the data containing a dataset (taken from Kaggle) of 1 lakh records by performing necessary cleaning, transformation, and normalization steps.
Fit the random forest algorithm to the training set: Train the random forest model using the training data.
Predict the test results: Use the trained model to predict the outcomes for the test set.
Assess the accuracy of the predictions: Evaluate the accuracy of the predictions by creating a confusion matrix.


Table 1 Contribution in the field of finance and machine learning

S. No. | Reference of paper | Technology/key areas | Contributions
1 | [3] | Investment, return, prediction, machine learning | Performance of an investment return prediction model is analyzed
2 | [4] | Prediction with data mining techniques, efficient market hypothesis | Multi-modal regression; different techniques like the co-integration test, Granger causality, etc., were implemented
3 | [5] | Auto-encoder; covariance matrix; dimensionality reduction; machine learning | Subperiod analysis; portfolio performance during different subperiods as defined by market volatility, inflation, and credit spread
4 | [6] | Deep learning, finance, risk prediction, machine learning | Using deep learning techniques for intelligent assessment, financial investment risk prediction can be significantly enhanced
5 | [7] | Machine learning, financial markets, economists, data analysis | Machine learning techniques are used to investigate the predictability of financial markets; they are compared with respect to accuracy and profitability
6 | [8] | Deep learning, financial risk management, artificial intelligence, taxonomy, risk analysis | A thorough review of machine learning research in financial risk management covering tasks, approaches, difficulties, and new trends
7 | [9] | Machine learning, deep learning, quantitative analysis, financial risk management | Demonstrates the use of machine learning approaches for quantitative problems, significantly speeding up fitting while retaining acceptable levels of accuracy
8 | [10] | Convolutional neural network, deep learning, artificial intelligence | The rise of artificial intelligence has propelled numerous algorithms, making them popular technologies across diverse domains
9 | [11] | Gray relational analysis, multilayer analytic hierarchy process, variational auto-encoders | In order to improve the creation of neural networks, this research investigates data augmentation using generative models
10 | [12] | Forecasting techniques, sentiment analysis | The main objective of this research is to explore diverse approaches from fields such as data mining, machine learning, and sentiment analysis


Fig. 1 Model architecture elucidating sequential execution of the process pipeline

Visualize the test set results: Display the results of the test set using appropriate visualization techniques.
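
The tail of this pipeline can be sketched in a few lines of scikit-learn. The snippet below is a minimal illustration rather than the authors' code: it substitutes a synthetic dataset for the Kaggle data (which is not reproduced here) and walks through fitting, predicting, the confusion matrix, and visualization.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed investment dataset.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)  # fit
y_pred = rf.predict(X_test)                                         # predict
print("testing accuracy:", accuracy_score(y_test, y_pred))          # assess

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)  # confusion matrix
plt.show()                                               # visualize results
```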

4 Methodology

The random forest algorithm is a widely used supervised learning technique in machine learning. It is capable of addressing both classification and regression problems [14]. This algorithm leverages ensemble learning, which involves combining multiple classifiers to effectively tackle complex problems and enhance model performance. Figure 2 depicts a dataset instance extracted from a training set and the generation of a decision tree based on the inputs. A random forest classifier consists of multiple decision trees trained on different subsets of the dataset [15]. By averaging the predictions of these trees, the algorithm enhances the accuracy of predictions. Instead of relying on a single decision tree, the random forest considers the majority vote of all trees to determine the final output [16]. Increasing the number of trees in the forest improves accuracy and mitigates overfitting concerns. Despite being highly precise, random forests are harder to interpret. Feature importance analysis, partial dependence graphs, and SHAP values can get around this restriction. These techniques improve the interpretability of the random forest algorithm by offering insights into significant factors, connections between features and forecasts, and explanations of individual predictions.

Fig. 2 Random forest algorithm flowchart

Steps implemented for building our machine learning model (a condensed sketch of these steps follows the list):
1. Import libraries such as NumPy and Pandas for data manipulation, as well as csv and warnings for handling CSV files and suppressing warnings, respectively.
2. The invest_data DataFrame is created by reading the 'invest.csv' file using pd.read_csv(); it contains a dataset of 1 lakh records with columns for gender, age, savings objective, time period for investment, purpose of investment, return rate, etc.
3. Categorical columns in the invest_data DataFrame are encoded by replacing their values with numerical equivalents using the replace() method.
4. The column 'Which investment avenue do you mostly invest in?' is dropped from the invest_data DataFrame to create the feature set X. The target variable is assigned to the y variable, which contains the values from the 'Which investment avenue do you mostly invest in?' column of the dataset.
5. The code then imports the necessary libraries for model training, including train_test_split from scikit-learn for splitting the data, accuracy_score for evaluating the model's accuracy, and DecisionTreeClassifier for the decision tree classifier.
6. The features X and the target variable y are split into training and testing sets using the train_test_split() function. The training set (X_train and y_train) is used to train the model, and the testing set (X_test and y_test) is used to evaluate the model's performance. Note: data is split into training and testing sets using train_test_split, with a test size of 0.3 and random_state 42.


7. Next, the code imports RandomForestClassifier from scikit-learn to create an instance of the random forest classifier. The random forest classifier (rf) is trained using the fit() method on the training data. The joblib library is imported to save the trained random forest classifier to the 'trained_model.sav' file using the dump() function. This file is then loaded using the joblib.load() function to retrieve the trained model for financial investment prediction.
In summary, this code reads investment data from a CSV file, encodes categorical columns, splits the data into training and testing sets, trains a random forest classifier on the training data, and saves the trained model to a file [17].
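
The condensed sketch below follows steps 1–7 as described above, using the file names quoted in the text ('invest.csv' and 'trained_model.sav'), a test size of 0.3, and random_state 42. The per-column encoding is a plausible reconstruction, since the exact replace() mappings are not listed in the paper.

```python
import warnings                                  # step 1: imports
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")                # suppress warnings

invest_data = pd.read_csv("invest.csv")          # step 2: load the records

# Step 3: encode categorical columns with numerical equivalents.
for col in invest_data.select_dtypes(include="object"):
    mapping = {value: code for code, value in
               enumerate(invest_data[col].unique())}
    invest_data[col] = invest_data[col].replace(mapping)

target = "Which investment avenue do you mostly invest in?"
X = invest_data.drop(columns=[target])           # step 4: feature set
y = invest_data[target]                          # step 4: target variable

X_train, X_test, y_train, y_test = train_test_split(  # steps 5-6
    X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier().fit(X_train, y_train)   # step 7: train
print("training accuracy:", accuracy_score(y_train, rf.predict(X_train)))
print("testing accuracy:", accuracy_score(y_test, rf.predict(X_test)))

joblib.dump(rf, "trained_model.sav")             # step 7: save the model
model = joblib.load("trained_model.sav")         # reload for later prediction
```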

5 Results

In this section, we present the results inferred with our machine learning predictive model, which suggests investment options based on user inputs [18]. In Table 2, it can be seen that logistic regression has the minimum training and testing accuracy in contrast to the others. The underlying cause is the absence of a linear relationship between the target label and the features. Consequently, logistic regression struggles to accurately predict targets, even when trained on the available data. With an increase in K-value, the K-nearest neighbors (KNN) algorithm fits a more gradual curve to the data. This occurs because a larger K-value incorporates a greater amount of data, resulting in reduced sharpness or abruptness, ultimately decreasing the overall complexity and flexibility of the model [18]. In Fig. 3, we can see that the random forest has the maximum testing accuracy. In the graph shown in Fig. 4, we can see that the random forest has the maximum training accuracy, similar to the decision tree algorithm. By calculating the proportion of correctly classified examples to all occurrences, accuracy provides a simple and intuitive evaluation of performance. It works well when working with balanced classes. However, in situations of class imbalance or variable misclassification costs, accuracy may not be sufficient on its own; in these cases, precision, recall, or the F1 score ought to be considered for a more thorough assessment of the model's performance. Accuracy on the testing data of the random forest classifier is 0.4987 (approx. 50%).

Table 2 Comparison of several machine learning algorithms on the model's prediction

Algorithms | K-nearest neighbors | Decision tree | Gaussian Naive Bayes | Random forest | Logistic regression
Training accuracy | 0.68825 | 0.9920125 | 0.5039375 | 0.9920125 | 0.119485714
Testing accuracy | 0.5031 | 0.49905 | 0.49525 | 0.4987 | 0.108033333
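
For reference, a comparison of this shape could be produced with a simple loop, reusing the X_train/X_test split from the sketch in Sect. 4; the estimators below use scikit-learn defaults, since the paper does not list hyperparameters.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    "K-nearest neighbors": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random forest": RandomForestClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)  # assumes the split from the earlier sketch
    print(f"{name}: train={clf.score(X_train, y_train):.4f}, "
          f"test={clf.score(X_test, y_test):.4f}")
```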


Fig. 3 Graph depicting the testing accuracy of each algorithm in the model

Fig. 4 Graph showing the training accuracy of several algorithms

Accuracy on the training data of the random forest is 0.9920125 (approx. 99%).
In Fig. 5, data analysis is done for female respondents, analyzing which investment avenues they mostly invest in, by creating a displot of count versus age using the seaborn library in Python [19]. In Fig. 6, the same analysis is done for male respondents, creating a displot of count versus age for males [20].

Fig. 5 Count versus age displot for which investment females mostly invest in

Fig. 6 Count versus age displot for which investment males mostly invest in
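
The Fig. 5/6 analysis can be sketched with seaborn's displot. The snippet assumes the raw (pre-encoding) invest_data frame and the column labels 'gender' and 'age'; the exact names and values in the CSV may differ.

```python
import matplotlib.pyplot as plt
import seaborn as sns

avenue = "Which investment avenue do you mostly invest in?"
# Count-versus-age displot for female respondents (Fig. 5); swap the filter
# value to "Male" for the Fig. 6 counterpart.
females = invest_data[invest_data["gender"] == "Female"]
sns.displot(data=females, x="age", hue=avenue)
plt.show()
```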

Based on the generated model, we created an application for predicting the best option for financial investment based on user inputs, giving output on a real-time basis. Figures 7 and 8 give a glimpse of the dashboard of our web app, containing user inputs like gender, age, savings objective, time period for investment, purpose of investment, return rate, and many more that fit every age group and individual with different investing goals. Based on the inputs entered, the model predicts the output and also shows some examples based on the output, as shown in Figs. 7 and 8. Hence, as shown in Figs. 9 and 10, we made the ML model work in the backend along with some frontend technologies that helped us create a UI/UX for the predictions.


Fig. 7 Generic form to take basic user details like (gender, age, purpose for investment, etc.) and predict appropriate results

Fig. 8 You get the flexibility to rank your investment option and see what is actually best for your investment goals

6 Discussions

Fig. 9 Output shows the suggestions in a particular investment option (e.g., we can see the output as cryptocurrency for the inputs provided)

Fig. 10 Point-plot depicting the trend of investing in the result-based investment option for the last 5 years, fetching real-time data from an API

The discussion in this paper centers on the findings and implications of using the random forest algorithm for financial investment option prediction, as well as the development of the web application using Django. The results illustrate the effectiveness of the random forest algorithm in accurately predicting financial investment options. The algorithm's capacity to capture patterns and relationships in historical financial data contributes to its robust performance. The integration of the predictive model into a web application built with Django improves user accessibility and convenience. Users can input their parameters and get real-time predictions, giving them valuable information for making investment decisions. The aim of this study is to create a system that suggests financial investments, allowing users to decide wisely which assets to invest in, with personalized investing options that fit their own requirements and preferences. Our goal is to develop a robust system that can predict the assets that will yield the most


profitable returns by analyzing previous financial data and harnessing the power of machine learning. The discussion also highlights the potential benefits of further algorithmic investigation, such as exploring other machine learning algorithms like KNN, decision tree, Naive Bayes, and logistic regression; comparing their performance with the random forest algorithm may lead to improved prediction accuracy [20]. Overall, this investigation illustrates the potential of machine learning and web application development in improving financial investment choice prediction and provides valuable insights for future research and development in this field.
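
A minimal sketch of the Django backend described in this section might look as follows. The view name, URL parameters, and field ordering are hypothetical (the paper does not publish its view code); only the joblib-saved 'trained_model.sav' file is taken from Sect. 4.

```python
import joblib
from django.http import JsonResponse

model = joblib.load("trained_model.sav")  # random forest trained in Sect. 4

def predict_view(request):
    # Hypothetical form fields, already numerically encoded on the client
    # side; their order must match the training feature columns.
    fields = ["gender", "age", "savings_objective",
              "time_period", "purpose", "return_rate"]
    features = [[int(request.GET.get(f, 0)) for f in fields]]
    avenue = model.predict(features)[0]
    return JsonResponse({"suggested_avenue": str(avenue)})
```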

7 Limitations

In this section, we discuss the limitations faced during dataset compilation, the data preprocessing challenges, and the numerous errors we handled to increase the model's efficiency. While this research paper explores the prediction of financial investment choices using machine learning techniques, such as the random forest algorithm, it is important to acknowledge certain limitations associated with the study:
(i) Limited generalizability to other algorithms and approaches.
(ii) Potential impact of data availability and quality on model reliability.
(iii) Possibility of overfitting or model selection bias.
(iv) Lack of comparison to established benchmarks or existing approaches.
(v) Inadequate consideration of external factors and market volatility.
(vi) Absence of longitudinal analysis to assess model stability over time.
(vii) Limited evaluation of user feedback and experience with the web application.

8 Conclusion and Future Scope

In conclusion, our study has shown how well the random forest algorithm predicts potential financial investment opportunities. The incorporation of this algorithm into a Django-built web application has given consumers an easy-to-use platform to enter their parameters and get real-time forecasts. The outcomes demonstrate the algorithm's capacity to identify links and trends in historical financial data, producing precise forecasts. This study advances the subject of machine learning for financial decision-making by demonstrating the usefulness of the random forest algorithm. In the future, investigating alternative machine learning algorithms, improving feature engineering, incorporating advanced risk assessment procedures, integrating real-time data, gathering user feedback, extending the application's scale, improving interpretability, and integrating with trading platforms will advance the


field of financial investment choice prediction using machine learning. These efforts aim to supply investors with more accurate, effective, and user-friendly tools for making informed investment choices.

References

1. Omar B, Zineb B, Cortés Jofré A, González Cortés D (2018) A comparative study of machine learning algorithms for financial data prediction. In: 2018 International symposium on advanced electrical and communication technologies (ISAECT). Rabat, Morocco, pp 1–5. https://doi.org/10.1109/ISAECT.2018.8618774
2. Dhokane RM, Sharma OP (2023) A comprehensive review of machine learning for financial market prediction methods. In: 2023 International conference on emerging smart computing and informatics (ESCI). Pune, India, pp 1–8. https://doi.org/10.1109/ESCI56872.2023.10099791
3. Ralevic N, Glisovic NS, Djakovic VD, Andjelic GB (2014) The performance of the investment return prediction models: theory and evidence. In: 2014 IEEE 12th International symposium on intelligent systems and informatics (SISY). https://doi.org/10.1109/sisy.2014.6923590
4. Pawar P, Nath S. Machine learning applications in financial markets
5. Brennan Irish MJ. Machine learning and factor-based portfolio optimization
6. Sun Y, Li J (2022) Deep learning for intelligent assessment of financial investment risk prediction. Comput Intell Neurosci 2022:11, Article ID 3062566. https://doi.org/10.1155/2022/3062566
7. Ma T, Hsu (2016) Bridging the divide in financial market forecasting: machine learners versus financial economists. Expert Syst Appl 61(C):215–234
8. Mashrur A, Luo W, Zaidi NA, Robles-Kelly A (2020) Machine learning for financial risk management: a survey. IEEE Access 8:203203–203223. https://doi.org/10.1109/ACCESS.2020.3036322
9. De Spiegeleer J, Madan DB, Reyners S, Schoutens W (2018) Machine learning for quantitative finance: fast derivative pricing, hedging and fitting. Quant Finance 18(10):1635–1643
10. Xing FZ, Cambria E, Welsch RE (2018) Natural language based financial forecasting: a survey. Artif Intell Rev 50(1):49–73
11. Das SP, Padhy S (2018) A novel hybrid model using teaching–learning-based optimization and a support vector machine for commodity futures index forecasting. Int J Mach Learn Cybern 9(1):97–111
12. Brabazon A, O'Neill M (2008) An introduction to evolutionary computation in finance. IEEE Comput Intell Mag 3(4):42–55
13. Virgolin M. Time complexity for different machine learning algorithms. https://marcovirgolin.github.io/extras/details_time_complexity_machine_learning_algorithms/
14. Chen C, Zhang P, Liu Y, Liu J (2020) Financial quantitative investment using convolutional neural network and deep learning technology. Neurocomputing 390:384–390. https://doi.org/10.1016/j.neucom.2019.09.092
15. Chen M, Chiang H, Lughofer E, Egrioglu E (2020) Deep learning: emerging trends, applications and research challenges. Soft Comput A Fusion Found Methodologies Appl 24(11):7835–7838. https://doi.org/10.1007/s00500-020-04939-z
16. Xiao Y, Huang W, Wang J (2020) A random forest classification algorithm based on dichotomy rule fusion. In: 2020 IEEE 10th International conference on electronics information and emergency communication (ICEIEC). Beijing, China, pp 182–185. https://doi.org/10.1109/ICEIEC49280.2020.9152236
17. Nalabala D, Nirupamabhat M (2021) Financial predictions based on fusion models—a systematic review. In: 2021 International conference on emerging smart computing and informatics (ESCI). Pune, India, pp 28–37. https://doi.org/10.1109/ESCI50559.2021.9397024


18. Goyal. Investment option-prediction. Available at: https://rb.gy/xd5id
19. White H (1988) Economic prediction using neural networks: the case of IBM daily stock returns. IEEE Int Conf Neural Networks II:451–458
20. Unadkat V, Sayani P, Kanani P, Doshi P (2018) Deep learning for financial prediction. In: 2018 International conference on circuits and systems in digital enterprise technology (ICCSDET). Kottayam, India, pp 1–6. https://doi.org/10.1109/ICCSDET.2018.8821178

Integration of Blockchain-Enabled SBT and QR Code Technology for Secure Verification of Digital Documents Ashish Khanna, Devansh Singh, Ria Monga, Tarun Kumar, Ishaan Dhull, and Tariq Hussain Sheikh

Abstract Incorporating blockchain technology and QR codes presents a potential solution to the issues educational bodies encounter while handling and authenticating student records in today's digital age. This research paper puts forth a method to handle the security and authenticity of academic documents by incorporating these technologies. The system proposed makes efficient use of cloud storage, namely Amazon Web Services (AWS), to store student data in a CSV format, enabling efficient addition and retrieval of student data. This method uses a distinctive tokenURI for each student, forming QR codes that act as verifiable connections to their academic records. When these QR codes are used, students are guided to link their web3 wallet address or form a new wallet, resulting in the production of a non-transferable Soulbound token (SBT) that encompasses all academic information. Additionally, the QR codes direct users to recognized non-fungible tokens (NFT) marketplaces where the SBTs can be publicly verified. This innovative system ensures the authenticity and integrity of student records, providing a reliable and decentralized means for document verification in educational institutions. This research significantly contributes to the advancement of secure and trustworthy digital credential management systems, addressing the evolving needs of the academic community.

Keywords Blockchain · QR code · Soulbound token · Web3 · Smart contract · Document verification · AWS

A. Khanna · D. Singh · R. Monga · T. Kumar (B) · I. Dhull
Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, GGSIPU, Delhi, India
e-mail: [email protected]
A. Khanna
e-mail: [email protected]
T. H. Sheikh
Department of Computer Science, Shri Krishan Chander Government Degree College Poonch, Jammu and Kashmir, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_23


1 Introduction

In recent years, the use of blockchain technology has received a lot of attention in various industries due to its potential to create secure and transparent systems. Decentralization, immutability, transparency, and security are some aspects of blockchain. One area that stands to benefit from blockchain implementation is higher education, where the need for secure document verification and credential authentication is of utmost importance, as highlighted in the study conducted by Gräther et al. [1]. Forgery of digital documents is a significant risk in the digital age, enabling the creation of fraudulent records with serious consequences. This compromises data integrity and trust and can result in legal and financial consequences [2]. Verifying digital documents takes a significant amount of time and human effort. By utilizing blockchain technology, this burden can be greatly decreased, and institutions can use QR codes to make the procedure simpler. Traditional document storage and verification methods in academic institutions involve CSV files, which despite their usefulness are prone to security breaches and unauthorized access. To overcome these issues, we propose a blockchain and QR code-based solution for a more secure and tamper-proof document verification system. Initially, student data, usually in Excel format, is migrated to a secured cloud storage service like Amazon Web Services' S3 bucket, maintaining the integrity and privacy of each student's information. Each student is assigned a unique tokenURI, encompassing pertinent details from the uploaded data. This tokenURI is subsequently translated into a QR code, acting as a digital signature linked to the students' documents, enhancing accessibility and streamlining the verification process. On receiving a document such as a degree, students scan the QR code, which directs them to provide their web3 wallet address. If not previously available, a new wallet can be created. The introduction of the wallet address allows for the minting of a unique Soulbound token (SBT) linked to each student's details and the prior tokenURI, securing the document's future verifiability. The minting of the SBT culminates in the QR code redirecting to vetted NFT marketplaces, where the SBT can be publicly scrutinized and authenticated. Parties such as potential employers or academic institutions can thus verify the document's authenticity, leveraging the unchangeable nature of blockchain technology.
Highlights of the proposed work:
- Integration of blockchain and QR codes for secure document verification.
- Enhanced data integrity and protection against manipulation and unauthorized access.
- Transparent and auditable verification process with immutable blockchain records.
- Streamlined verification procedures, reducing time and potential errors.


2 Literature Review

The rapid development of digital documents and the need for secure verification methods have led to a growing interest in the use of blockchain technology. This literature review aims to explore the existing research and advancements in integrating blockchain-enabled Soulbound tokens (SBT) and QR code technology for the secure verification of digital documents. Blockchain technology aims to create a decentralized environment where third-party control is not necessary [3]. Blockchain technology has gained significant attention due to its decentralized and tamper-resistant nature. Several studies have demonstrated its potential in document verification and authentication. For instance, Kumutha and Jayalakshmi [4] proposed a blockchain-based document verification system that ensures the immutability and integrity of documents, providing a reliable verification method. The utilization of blockchain-based solutions for the secure storage of medical data has garnered considerable attention in recent times [5]. Chen et al. [6] proposed a blockchain-based searchable encryption scheme for electronic health records (EHRs) to address the issue of data leakage and enhance patient privacy. The scheme enables different medical organizations and individuals to securely access and share EHRs stored on the blockchain, ensuring a higher level of confidence in data privacy and integrity. Various concerns persist that hinder the widespread adoption of blockchain technology in the education sector. These concerns encompass legal complexities, challenges related to immutability, and issues of scalability [7]. Alam [8] discussed how blockchain technology can help monitor student accomplishments precisely. QR code technology has emerged as a widely adopted method for encoding and storing information in a compact format. Its popularity can be attributed to its ease of use and compatibility with smartphones. For example, Wellem et al. [9] used digital signatures and a QR code-based document verification system that enables efficient and convenient verification. The study demonstrated the potential of QR codes in combating document counterfeiting and improving verification processes. Researchers have explored the integration of QR codes in document verification systems to enhance the security and accessibility of digital documents. Blockchain's decentralized and secure nature has shown promise in revolutionizing industries, especially in digital document verification [10, 11]. One approach is to use a public blockchain, where anyone can participate in the network and verify the authenticity of the Soulbound tokens. Sharma et al. [12] have designed a blockchain-based application to generate, maintain, and validate healthcare certificates. Li et al. [13] presented a workflow for credentialing, identified issues in the industry, and proposed ideal attributes for credentialing infrastructure. Their work also presented a framework for evaluating blockchain-based education projects and discussed factors hindering their adoption. Weyl et al. [14], in their whitepaper, propose the concept of Soulbound tokens that can function as persistent records for credit-relevant history,


encompassing education credentials, work history, and rental contracts. This innovative approach enables individuals to stake meaningful reputations based on these tokens. Currently, universities and institutes are leveraging blockchain technology in education primarily for managing academic degrees and evaluating learning outcomes [15, 16]. The University of Nicosia (UNIC) [17] utilizes a program that generates and stores certificates on the Bitcoin platform. Furthermore, UNIC became the first university in the world to issue academic certificates whose authenticity can be verified through the Bitcoin blockchain. Sony Global Education [18] is creating a blockchain network to store academic records securely and allow for different configurations and distribution of educational data. Blockcerts [19] is an open standard that facilitates the creation, issuance, viewing, and verification of certificates using blockchain technology. These digital records are securely registered on a blockchain, digitally signed, resistant to tampering, and easily shareable. Overall, the use of Soulbound tokens [14] for digital document verification has the potential to enhance the security and verifiability of digital documents and has been an active area of research in recent years.

3 Methodology

The proposed methodology consists of the following steps:
Step 1: Data Extraction and Formatting
- Obtain the student data from the institute in the form of an Excel file (CSV format).
- Extract relevant information for each student, such as name, student ID, program, and other necessary details.
Step 2: Uploading Data to AWS S3 Bucket
- Set up an AWS S3 bucket to store the student data securely.
- Upload the extracted student data to the S3 bucket, ensuring appropriate access controls and encryption mechanisms are implemented.
Step 3: TokenURI Generation and Soulbound Token (SBT) Creation
- For each student, generate a unique tokenURI based on their data stored in the S3 bucket.
- Utilize blockchain technology (e.g., Ethereum) to create an SBT that contains the student's tokenURI and additional metadata.
- Mint the SBT to the student's wallet address or guide them to create a new web3 wallet for receiving the SBT.
Step 4: QR Code Generation and Linking
- Create a QR code specific to each student, embedding the link to their SBT on the blockchain.


- Associate the QR code with the student's data and add it to the corresponding row in the Excel file.
Step 5: Document Verification Process
- At the time of issuing a document or degree, scan the QR code on the document using a QR code reader or a dedicated application.
- The QR code will direct the user to an interface where they can either enter their web3 wallet address or create a new wallet.
- After adding the address, the SBT will be minted to the student's wallet, containing all the details (tokenURI) generated during QR code generation.
- Verify the authenticity and integrity of the document by cross-referencing the tokenURI and associated SBT properties.
Step 6: Soulbound Token Visibility and Verification
- Once the SBT has been successfully minted, the QR code will redirect to major or verified NFT marketplaces.
- The Soulbound token will be visible on these marketplaces, allowing users to verify the details and properties of the SBT, ensuring its authenticity and legitimacy (Figs. 1, 2, and 3).

Complete Algorithm: QR_Verified_Docu_Auth
QR_Verified_Docu_Auth assists higher education institutions in integrating QR codes into document verification processes. In this algorithm, student data is extracted from an Excel file and uploaded to an AWS S3 bucket. Token URIs are generated, and Soulbound tokens (SBTs) are created for each student. SBTs are then linked to

Fig. 1 Process for institutions


Fig. 2 Process for student

QR codes specific to each student. QR codes are scanned during document issuance, prompting users to enter their web3 wallet addresses or create new ones. Students receive SBTs containing all the necessary information in their wallets. The authenticity of the documents is verified by cross-referencing the tokenURI and SBT properties. The QR codes also redirect to verified NFT marketplaces where the SBTs can be examined, ensuring their legitimacy. This algorithm provides a secure and efficient method for verifying student documents through QR codes and blockchain technology.
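
A hedged sketch of Steps 2–4 in Python is shown below: uploading a student record to S3 with boto3, deriving a tokenURI from its location, and embedding that URI in a QR code with the qrcode library. The bucket name, key layout, and metadata fields are assumptions for illustration, not the authors' production values, and the on-chain SBT minting step is omitted.

```python
import json
import boto3
import qrcode

def issue_qr_for_student(student, bucket="institute-records"):
    # Step 2: store the student's record in the AWS S3 bucket.
    key = f"students/{student['student_id']}.json"
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=json.dumps(student))

    # Step 3: the tokenURI points at the stored metadata; the SBT minted
    # on-chain later carries this URI.
    token_uri = f"https://{bucket}.s3.amazonaws.com/{key}"

    # Step 4: embed the claim link in a QR code printed on the document.
    qrcode.make(token_uri).save(f"qr_{student['student_id']}.png")
    return token_uri

# Hypothetical record; real rows come from the institute's CSV file.
issue_qr_for_student({"student_id": "2023-001", "name": "A. Student",
                      "program": "B.Tech"})
```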

4 Performance Analysis

This section evaluates the effectiveness and efficiency of the proposed algorithm. The results of the proposed system were examined to determine how it differs from the available manual methodology. This involved measuring various metrics, such as time, scalability, authentication, security, and automation, to determine the overall performance of the proposed system.


Fig. 3 Complete working

4.1 Time

Considering the average number of students in an institute [20], the proposed methodology takes only 1–2 h to complete, depending on the availability of resources, compared to traditional methods, which require more time. Uploading data to Amazon AWS takes only a few minutes. Generating QR codes and linking them to Soulbound tokens can be done within an hour, while tokenURI generation and


Soulbound token generation can take several minutes per student. QR codes provide scalability and efficiency, which helps reduce the effort needed to claim and verify documents with the help of SBTs. The verification of digital documents can be done in relatively less time compared to manual verification.

4.2 Scalability

Using QR codes to both claim and verify digital documents makes the verification process easier. The time consumed is significantly reduced compared to manual verification. The system's ability to manage massive amounts of student data makes it simpler to handle and verify information for a large number of students. The use of blockchain technology increases trust and security in document verification, further improving scalability by reducing the need for human intervention and increasing the speed and accuracy of verification. The integration of QR codes with SBTs offers an effective solution for student document verification by meeting the rising demand for verification with efficiency, accuracy, and adaptability.

4.3 Authentication and Security

Blockchain technology provides strong security through cryptographic hashing, decentralization, immutability, and consensus algorithms, which ensure that the data recorded on the blockchain is valid, distributed across various nodes, and tamper-proof. The integration of QR codes and blockchain technology enhances document authentication by providing an easier, more secure, and more reliable verification process. Each document is assigned a unique QR code. Users can quickly access the information they need about the document and check its legitimacy by scanning the QR code. A distinct token is created for each document and kept on the blockchain, so the associated data of the document is safe and cannot be modified. Data can be protected from unauthorized access by using encryption, access control methods, and secure data sharing.
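As a minimal sketch of the integrity check described above (not from the paper), the issued document can be hashed and compared against a hash recorded in the SBT's tokenURI metadata; the document_sha256 field name is an assumption for illustration.

```python
# Hash the issued document and compare it with the hash stored in the
# SBT's tokenURI metadata. The 'document_sha256' field name is assumed.
import hashlib
import json

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_document(document_path, token_metadata_json):
    expected = json.loads(token_metadata_json)["document_sha256"]
    return sha256_of(document_path) == expected
```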

4.4 Automation

Manual document verification requires a lot of time. The verifier must contact the institution and request information, or use the services of a third-party verification authority, to validate a student's individual document. With the help of Soulbound tokens, a student who holds those tokens can share a link to them which contains all relevant information about the document to be verified, which can



be automatically confirmed with the help of blockchain technology. The proposed methodology for document verification uses QR codes and blockchain technology to make the authentication process simpler. The collection and structuring of student data from the college or university's records serve as the first step in the automation process. The AWS S3 bucket is immediately populated with the extracted student data. A Soulbound token (SBT) is created and can be claimed by the student and verified by the employer.

5 Conclusion and Future Scope

The proposed methodology focuses on making the process of sharing student documents authentic and secure. The integration of QR codes and blockchain technology offers a promising solution to securely store and process student data, generate immutable Soulbound tokens (SBTs) on the blockchain, and facilitate seamless verification through QR codes. It makes the process of validating the integrity of student documents tamper-free and efficient. The methodology contributes to the growth of confidence and credibility in the higher education industry by combining the simplicity of QR codes with the transparency and immutability of the blockchain. By investigating compatibility with other blockchains and standards, implementing modern cryptographic methods, conducting customer research, and applying machine learning algorithms, the proposed methodology can be further enhanced. The results of such follow-up studies may help refine the strategy and support the continued improvement of document authentication systems in the higher education sector.

References 1. Gräther W, Kolvenbach S, Ruland R, Schütte J, Torres C, Wendland F (2018) Blockchain for education: lifelong learning passport. In: Proceedings of 1st ERCIM blockchain workshop 2018. European Society for socially embedded technologies (EUSSET) 2. Grolleau G, Lakhal T, Mzoughi N (2008) An introduction to the economics of fake degrees. J Econ Issues 42(3):673–693 3. Yli-Huumo J, Ko D, Choi S, Park S, Smolander K (2016) Where is current research on blockchain technology?—a systematic review. PLoS ONE 11(10):e0163477 4. Kumutha K, Jayalakshmi S (2022) Blockchain technology and academic certificate authenticity—a review. Exp Clouds Appl Proc ICOECA 2021:321–334 5. Mahajan HB, Rashid AS, Junnarkar AA, Uke N, Deshpande SD, Futane PR, Alkhayyat A, Alhayani B (2023) Integration of Healthcare 4.0 and blockchain into secure cloud-based electronic health records systems. Appl Nanosci 13(3):2329–2342 6. Chen L, Lee WK, Chang CC, Choo KKR, Zhang N (2019) Blockchain based searchable encryption for electronic health record sharing. Futur Gener Comput Syst 95:420–429 7. Loukil F, Abed M, Boukadi K (2021) Blockchain adoption in education: a systematic literature review. Educ Inf Technol 26(5):5779–5797



8. Alam A (2022) Platform utilising blockchain technology for eLearning and online education for open sharing of academic proficiency and progress records. In: Smart data intelligence: proceedings of ICSMDI 2022. Springer Nature Singapore, Singapore, pp 307–320 9. Wellem T, Nataliani Y, Iriani A (2022) Academic document authentication using elliptic curve digital signature algorithm and QR code. JOIV Int J Inform Vis 6(3):667–675 10. Imam IT, Arafat Y, Alam KS, Shahriyar SA (2021) DOC-BLOCK: a blockchain based authentication system for digital documents. In: 2021 third international conference on intelligent communication technologies and virtual mobile networks (ICICV). IEEE, pp 1262–1267 11. Yumna H, Khan MM, Ikram M, Ilyas S (2019) Use of blockchain in education: a systematic literature review. In: Intelligent information and database systems: 11th Asian conference, ACIIDS 2019, Yogyakarta, Indonesia, April 8–11, Proceedings, Part II 11. Springer International Publishing, pp 191–202 12. Sharma P, Namasudra S, Crespo RG, Parra-Fuente J, Trivedi MC (2023) EHDHE: enhancing security of healthcare documents in IoT-enabled digital healthcare ecosystems using blockchain. Inf Sci 629:703–718 13. Li ZZ, Joseph KL, Yu J, Gasevic D (2022) Blockchain-based solutions for education credentialing system: comparison and implications for future development. In: 2022 IEEE international conference on blockchain (blockchain). IEEE, pp 79–86 14. Weyl EG, Ohlhaver P, Buterin V (2022) Decentralized society: finding web3's soul. Available at SSRN 4105763 15. Sharples M, Domingue J (2016) The blockchain and kudos: a distributed system for educational record, reputation and reward. In: Adaptive and adaptable learning: 11th European conference on technology enhanced learning, EC-TEL 2016, Lyon, France, September 13–16, Proceedings 11. Springer International Publishing, pp 490–496 16. Chen G, Xu B, Lu M, Chen NS (2018) Exploring blockchain technology and its potential applications for education. Smart Learn Environ 5(1):1–10 17. UNIC (2018) Blockchain certificates (academic & others). https://www.unic.ac.cy/iff/blockchain-certificates/ 18. Sony Global Education. Creating a trusted experience with blockchain. https://blockchain.sonyged.com/ 19. Blockcerts. The open standard for blockchain credentials. https://www.blockcerts.org/ 20. The Times of India. 1/3rd of undergraduate students in India doing BA: survey. http://timesofindia.indiatimes.com/articleshow/97465229.cms?from=mdr&utm_source=contentofinterest&utm_medium=text&utm_campaign=cppst

Time Series Forecasting of NSE Stocks Using Machine Learning Models (ARIMA, Facebook Prophet, and Stacked LSTM) Prabudhd Krishna Kandpal, Shourya, Yash Yadav, and Neelam Sharma

Abstract It is widely recognised and acknowledged among market observers and analysts that the stock market, by its very nature, exhibits a tremendous degree of volatility, resulting in frequent and substantial fluctuations. Consequently, the ability to accurately anticipate and forecast market trends assumes paramount importance when it comes to making well-informed decisions regarding the buying and selling of stocks. To achieve such predictive capabilities, the focus of this particular research endeavour is specifically centred around leveraging advanced machine learning models, including but not limited to AutoRegressive Integrated Moving Average (ARIMA), Prophet, as well as deep learning models such as Long Short-Term Memory (LSTM). Root Mean Squared Error (RMSE) is utilised to assess the performance and efficacy of these models. Therefore, the results emanating from this meticulously conducted study contribute invaluable insights and shed light on the comparative effectiveness of different models within the realm of time series forecasting. Importantly, the prevailing body of evidence strongly supports the notion that deep learning-based algorithms, such as LSTM, hold a distinct advantage over traditional statistical methods like the ARIMA model, thereby reinforcing their superiority in this domain. Keywords Deep learning · Long Short-Term Memory (LSTM) · AutoRegressive Integrated Moving Average (ARIMA) · Prophet · Time series forecasting · Machine learning

P. K. Kandpal (B) · Shourya · Y. Yadav · N. Sharma Department of Artificial Intelligence and Machine Learning, Maharaja Agrasen Institute of Technology, Delhi, India e-mail: [email protected] N. Sharma e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_24




1 Introduction

There are numerous techniques available for addressing time series forecasting problems. While several procedures assist in drawing conclusions, they do not guarantee accurate results. To make an informed decision, it is crucial to know which method forecasts most precisely. Therefore, it is necessary to carefully evaluate the pros and cons of each method before applying it.

The ARIMA model is commonly used for stock predictions due to its simplicity, ability to identify time-dependent patterns, and provision of statistical insights. However, it has limitations in capturing nonlinear patterns, requires stationary data, does not consider exogenous variables, and is less accurate for long-term forecasts. Combining ARIMA models with advanced methodologies can improve accuracy and overcome these limitations in real-world usage. Prophet offers a user-friendly and efficient means of conducting time series analysis for stock forecasting. It provides a quick and simple solution without requiring substantial modification. However, its rudimentary assumptions and limited control might restrict its applicability in complex scenarios that demand better modelling methodologies or detailed integration of external elements. LSTM models for stock predictions offer benefits such as long-term forecasting, handling sequential data, incorporating exogenous factors, and capturing complex patterns. However, they require a large amount of data, are computationally demanding, are highly susceptible to overfitting, and lack interpretability. Despite these shortcomings, LSTM models are often used in combination with various methods to enhance accuracy for stock forecasting purposes.

This paper follows a clear and organised structure. It begins by introducing the dataset used in the research and providing the necessary background information. The different models utilised are then listed and explained, offering a comprehensive overview of the methodologies employed. The proposed methodology is presented in detail, describing the specific approach taken in the research. Subsequently, the paper presents the results obtained from implementing these models, emphasising the outcomes of the analysis. A thorough examination and analysis of these results are conducted to gain a deeper understanding of the findings. Finally, the paper concludes by summarising the findings of the proposed model, discussing their implications, and acknowledging the limitations and potential for future research in the field.

The motivations of this paper can be summarised as follows:

1. This study aims to compare the performance of three popular forecasting models, namely ARIMA, Facebook Prophet, and LSTM, in predicting stock prices. By analysing the results across multiple companies with varying levels of stability and volatility, the study aims to provide insights into the strengths and weaknesses of each model.

2. Accurate stock predictions are crucial for financial decision-making. The study aims to emphasise the importance of accurate forecasting for long-term predictions and highlights how advanced models like LSTM can enhance accuracy.



3. The study also aims to provide insights into the robustness and adaptability of these models by examining their performance in both stable and volatile market conditions. Overall, this study contributes to the field of stock price prediction by comparing the performance of different forecasting models and highlighting the potential of deep learning-based algorithms. The findings have practical implications for researchers and practitioners, paving the way for further advancements in the field of time series forecasting.

2 Literature Review Siami-Namini et al. have suggested that forecasting time series data presents a formidable challenge, primarily due to the ever-evolving and unpredictable nature of economic trends and the presence of incomplete information. Notably, the increasing volatility witnessed in the market over recent years has raised significant concerns when it comes to accurately predicting economic and financial time series. Consequently, it becomes imperative to evaluate the precision and reliability of forecasts when employing diverse forecasting methodologies, with a particular focus on regression analysis. This is crucial since regression analysis, despite its utility, possesses inherent limitations in its practical applications [1]. Pang et al. have discovered that neural networks have found extensive applications across a range of fields, including pattern recognition, financial securities, and signal processing. Particularly in the realm of stock market forecasting, neural networks have garnered considerable acclaim for their effectiveness in regression and classification tasks. Nevertheless, it is important to note that conventional neural network algorithms may encounter challenges when attempting to accurately predict stock market behaviour. One such obstacle arises from the issue of random weight initialisation, which can lead to a susceptibility to local optima and, consequently, yield incorrect predictions [2]. This study aims to accurately predict the closing price of various NSE stocks using machine learning methods like ARIMA, Prophet, and deep learning models, namely the LSTM model. Stock market forecasting has traditionally relied on linear models such as AutoRegressive (AR), AutoRegressive Moving Average (ARMA), and AutoRegressive Integrated Moving Average (ARIMA). However, a notable limitation of these models is their specificity to a particular time series dataset. In other words, a model that performs well for forecasting the stock market behaviour of one company may not yield satisfactory results when applied to another company. This can be attributed to the inherent ambiguity and unpredictable nature of the stock market, which inherently carries a higher level of risk compared to other sectors. Consequently, this inherent complexity and risk associated with stock market prediction significantly contribute to the difficulty of accurately forecasting stock market trends [3]. There are several reasons why deep learning models have come to be significantly successful in comparison to traditional machine learning and statistical



models, and their usage has been on the rise for several decades. Models such as the LSTM model possess the ability to take into account the temporal dependencies present in time series data. Secondly, these models are very successful in extracting features from raw data, eliminating the need for manual feature extraction. Moreover, deep learning models are capable of accommodating both univariate and multivariate time series data, even with irregular and unevenly spaced data points. With the latest advancements in parallel computing and GPUs, deep learning models can be trained and optimised on large-scale data. Saiktishna et al. [4] focus on the utilisation of the FB Prophet model for historical analysis of stock markets and time series forecasting. It explores the techniques, conclusions, and limits of previous research in this field. The evaluation emphasises FB Prophet's ability to capture market patterns and seasonality, as well as future research possibilities. It contains useful information for scholars and practitioners who want to use FB Prophet for stock market study and forecasting. Numerous publications in the literature have made an effort to investigate the hybrid modelling of financial time series movement using various models. He et al. [5] utilised a hybrid ARMA and CNN-LSTM model to accurately predict the financial market by applying it to three different time series with different levels of volatility. They showed that further optimisation of machine learning and deep learning models remains possible, given the rapid development in the aforementioned fields. Fang et al. [6] proposed a novel dual-LSTM approach, consisting of two LSTM layers with batch normalisation, which addressed the problem of sharp point changes by capturing significant profit points using an adaptive cross-entropy loss function, enhancing the model's prediction capabilities. Gajamannage et al. [7] have emphasised the importance of real-time forecasting and presented its importance in risk analysis and management. They have put forth a sequentially trained dual-LSTM model which has addressed the issue of semi-convergence in a recurrent LSTM setup and has validated their results on various diverse financial markets. Patil [8] highlights the use of machine learning approaches for stock market forecasting, including ARIMA, support vector machines, random forest, and recurrent neural networks (RNNs). The paper explores the strengths and limitations of these models, emphasising the importance of feature engineering and selection in improving prediction accuracy. It also includes case studies and empirical research to show how these models may be used in stock price forecasting. It is evident from the above examples that LSTM is a robust model that is excellent for the purpose of time series analysis and forecasting. It has displayed its capabilities in various other fields, including energy consumption forecasting, wind speed forecasting, carbon emissions forecasting, and aircraft delay forecasting.



3 Dataset Description The historical stock data used in this research study have been sourced from Yahoo Finance. The dataset encompasses the stock prices of four prominent companies: Reliance Industries, Tata Steel LLC, ICICI Bank, and Adani Enterprise. The selection of Reliance Industries, Tata Steel LLC, ICICI Bank, and Adani Enterprise for this research study was based on specific criteria. Reliance Industries, Tata Steel LLC, and ICICI Bank were chosen due to their stable stock performance and a general upward trend observed over time. These stocks have demonstrated consistent growth and are considered relatively stable investments. In contrast, Adani Enterprise was included in the dataset because it is known for its high volatility, with stock prices being heavily influenced by market reports and external factors. By including a mix of stocks with different characteristics, such as stability and volatility, we aim to accurately assess the capabilities of our predictive models in handling various stock market scenarios and understanding their effectiveness in different market conditions. Within the dataset, two crucial components play a pivotal role in our modelling approach: the ‘Close’ and ‘Date’ variables. The ‘Close’ variable signifies the last recorded price at which a particular stock was traded during the regular hours of a trading session. It serves as the target variable in our predictive models, as we aim to forecast future stock prices based on historical trends and patterns. On the other hand, the ‘Date’ variable acts as the predictor variable, providing temporal information to aid in the prediction process. For the purpose of this research, the dataset spans from the inception of each respective company until May 1, 2023, thereby covering a substantial period of historical data (Fig. 1).

Fig. 1 Sample dataset of Reliance Industries obtained from Yahoo Finance



Table 1 Number of time series observations

Stock               | Train (90%) | Test (10%) | Total
Reliance Industries | 6182        | 687        | 6869
Tata Steel LLC      | 6184        | 688        | 6872
ICICI Bank          | 4656        | 518        | 5174
Adani Enterprise    | 4658        | 518        | 5176

4 Data Preparation The financial time series datasets were divided into two parts: a training dataset and a test dataset. The training dataset consisted of 90% of each dataset and was used to train the models. The remaining 10% of each dataset was allocated to the test dataset to evaluate the accuracy of the models. The number of time series observations for each dataset is provided in Table 1.
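A minimal sketch of this chronological split, assuming the Yahoo Finance data is loaded into a pandas DataFrame with a 'Close' column:

```python
# 90/10 chronological train/test split as described above.
import pandas as pd

def split_series(df: pd.DataFrame, train_frac: float = 0.9):
    close = df["Close"].to_numpy()
    cut = int(len(close) * train_frac)
    # Temporal order is preserved: the test set is the most recent 10%.
    return close[:cut], close[cut:]
```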

5 Assessment Metric

We have utilised the Root Mean Square Error (RMSE) to evaluate the precision of our model's predictions. RMSE measures the differences or residuals between the predicted and actual values. By employing RMSE, we have been able to compare prediction errors within the same dataset across different models rather than between different datasets. The formula used to calculate RMSE is as follows:

RMSE = sqrt( (1/N) * Σ_{i=1}^{N} (x_i − x̂_i)^2 ),  (1)

where N is the total number of observations, x_i is the actual value of the stock, and x̂_i is the value predicted by our model.
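Equation (1) can be computed directly; the sketch below is equivalent to scikit-learn's mean_squared_error with squared=False:

```python
# Direct implementation of Eq. (1).
import numpy as np

def rmse(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))
```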

6 Models

6.1 ARIMA

ARIMA is a widely used time series forecasting model that combines autoregressive (AR), differencing (I), and moving average (MA) components to capture linear relationships, stationarity, and dependencies within the data [9]. We have made use of rolling ARIMA to perform forecasting, meaning that we refitted the model at each iteration as new data became available. This allowed our model to continuously adapt to the most recent data, enhancing the accuracy and robustness of the forecast. In ARIMA modelling, the notation ARIMA (p, d, q) is commonly used, where [1]:

- 'p' represents the number of lag observations used in training the model (i.e. lag order).
- 'd' denotes the number of times differencing is applied (i.e. degree of differencing).
- 'q' indicates the size of the moving average window (i.e. order of moving average).

To determine the appropriate values for these parameters, we utilised the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) graphs [10]. The ACF graph provided insights into the correlation between the current observation and lagged observations at various time lags. Meanwhile, the PACF graphs helped us assess the correlation between the current observation and the residuals from previous observations, taking into account the effects of intermediate observations. By carefully examining these graphs, we were able to estimate the optimal values for the AR, MA, and differencing components of the ARIMA model for each dataset. They are listed in Table 2.

Table 2 ARIMA model parameters

Stock               | ARIMA model (p, d, q) | RMSE
Reliance Industries | (5, 2, 0)             | 41.62
Tata Steel LLC      | (2, 1, 2)             | 18.62
ICICI Bank          | (4, 1, 4)             | 11.28
Adani Enterprise    | (4, 2, 4)             | 83.27

Note: RMSE values in Table 2 are for the entire test set.
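A sketch of the rolling (refit-per-step) ARIMA procedure described above, using statsmodels; the order shown is the Reliance Industries setting from Table 2, and the exact refit schedule is an assumption consistent with the text:

```python
# Rolling one-step-ahead ARIMA forecast: the model is refitted at each
# iteration as the next observation becomes available.
from statsmodels.tsa.arima.model import ARIMA

def rolling_arima_forecast(train, test, order=(5, 2, 0)):
    history = list(train)
    predictions = []
    for actual in test:
        fitted = ARIMA(history, order=order).fit()  # refit on all data so far
        predictions.append(fitted.forecast(steps=1)[0])
        history.append(actual)  # extend history with the observed value
    return predictions
```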

6.2 Facebook Prophet

Prophet is an open-source library developed by Facebook and is generally used for univariate time series forecasting. Prophet [11] is a decomposable framework that reduces an elaborate problem, such as time series data prediction, into simpler ones by taking three factors into account: seasonality, holidays, and trend.

y(t) = g(t) + s(t) + h(t) + ε.  (2)

Here, g(t) represents the trend, s(t) represents seasonality, h(t) represents holidays, and ε represents the error term. The trend component involves two additional parameters: saturating growth and change points. Seasonality is another factor that Prophet considers; it uses a Fourier series to create a precise end model.



s(t) = Σ_{n=1}^{N} ( a_n cos(2πnt/P) + b_n sin(2πnt/P) ).  (3)

Here, s(t) denotes seasonality, and P denotes the time period, which might be daily, weekly, monthly, quarterly, or even annual. N is the frequency of change, and the parameters a_n and b_n depend on it. Prophet [12] is adept at handling missing data, changing trends, and outliers. Compared to previous time series forecasting techniques, Prophet generates forecasts that are faster and more precise.
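A minimal Prophet usage sketch consistent with Eqs. (2) and (3); Prophet expects a DataFrame with columns 'ds' (timestamp) and 'y' (value to forecast), and the 'Date'/'Close' column names below assume the Yahoo Finance format used in this study:

```python
# Fit Prophet on closing prices and forecast a fixed horizon.
from prophet import Prophet

def prophet_forecast(df, horizon_days=100):
    data = df.reset_index().rename(columns={"Date": "ds", "Close": "y"})
    model = Prophet()  # default trend, seasonality, and holiday settings
    model.fit(data)
    future = model.make_future_dataframe(periods=horizon_days)
    forecast = model.predict(future)
    # yhat is the point forecast; the bounds give the uncertainty interval
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```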

6.3 LSTM

Long Short-Term Memory (LSTM) is a variation of the Recurrent Neural Network (RNN) which is often used in time series forecasting due to its ability to take the temporal dependencies of the time series data into account. Before diving into LSTM, we will first go over the following to develop a better understanding:

(a) Layered-formatted neurons make up the core of Feedforward Neural Networks (FFNNs) [13]. Each neuron computes values based on randomly initialised weights and updates them using an optimisation algorithm, such as Gradient Descent or the Adam optimiser. FFNNs are loop-free and fully connected. Every FFNN has three layers of neurons: an input layer that receives input from users, a hidden layer that allows the network to learn complex patterns and relationships in data, and an output layer that produces the output based on the input from the last layer. In an FFNN, each layer of neurons feeds information to the layer above it.

(b) Recurrent Neural Networks (RNNs) [1] are special neural networks where the outputs are partially dependent on a series of outputs obtained in the previous stages. The hidden layers in an RNN work as memory units that hold this information and use it during computation. The drawback of RNNs is that they are only capable of learning a small number of previous stages, which makes them incapable of remembering long sequences of data. The LSTM model solves this issue by introducing a 'memory' line.

(c) Long Short-Term Memory (LSTM) [14] is an improvement on the RNN model. It is equipped with input, output, and forgetting gates to accommodate the effects of longer time intervals and delays while also solving the problems of vanishing and exploding gradients. The structure of an LSTM cell is given in Fig. 2.

In Fig. 2, h(t) and h(t−1) represent the outputs of the current and previous cell, x(t) represents the input of the current cell, and c(t) and c(t−1) represent the current and previous states of the neuron at time t. i(t) represents the input threshold, which determines the information gain with the sigmoid function, and o(t) represents the output



Fig. 2 LSTM cell

threshold, which determines the output neuron state using the sigmoid function and the tanh activation function. f(t) represents the forgetting threshold, which controls the information that is discarded with the help of the sigmoid function.
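A sketch of a stacked LSTM of the kind evaluated here, in Keras; the 60-day window and 50-unit layers are illustrative assumptions, as the paper does not list its exact hyperparameters, and inputs are assumed to be already Min–Max scaled:

```python
# Build windowed training data and a two-layer ("stacked") LSTM regressor.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(series, window=60):
    X, y = [], []
    for i in range(window, len(series)):
        X.append(series[i - window:i])  # past `window` closing prices
        y.append(series[i])             # next-day closing price
    return np.asarray(X)[..., np.newaxis], np.asarray(y)

def build_stacked_lstm(window=60):
    model = Sequential([
        LSTM(50, return_sequences=True, input_shape=(window, 1)),
        LSTM(50),   # second LSTM layer makes the network "stacked"
        Dense(1),   # regression output: predicted closing price
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```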

7 Proposed Methodology

For the purpose of this study, we perform univariate time series forecasting, following these steps:

Data Collection: The historical time series data relevant to the problem were gathered. Factors like data quality, missing values, outliers, and potential seasonality or trends in the data were considered.

Data Preprocessing: The data were prepared for modelling by performing various preprocessing steps. Missing values were handled by deciding on a strategy to fill or impute them, such as forward/backward filling, interpolation, or using statistical methods. Scaling and normalisation techniques, such as Min–Max scaling, were applied to bring the data to a common scale and improve model convergence and performance.

Train–Test Split: The data were split into training and testing sets. Typically, a larger portion was allocated for training, while a smaller portion was kept for evaluating the model's performance. The split maintains the temporal order of the data to simulate real-world forecasting scenarios.

Model Selection: The following models were chosen for the time series forecasting task. The autoregressive model (ARIMA) captures the dependency of the current observation on previous observations. The Long Short-Term Memory (LSTM) model is a type of RNN specifically designed to capture long-term dependencies in time series data. The Prophet model combines the flexibility of generalised additive models with the simplicity of traditional forecasting methods; it incorporates seasonality, trend changes, and holiday effects, making it effective for predicting time series data with intuitive and interpretable results.

Model Training: The selected model was trained using the training dataset. During training, the model learns to capture patterns, trends, and seasonality in the data. Hyperparameters (e.g. learning rate, batch size, number of layers) were adjusted through experimentation.

Model Evaluation: The trained model's performance on the testing dataset was evaluated using appropriate evaluation metrics for time series forecasting, such as Root Mean Squared Error (RMSE). The model's ability to generalise and make accurate predictions on unseen data was assessed.

Model Refinement: If the initial model's performance was not satisfactory, the model was refined by adjusting hyperparameters, trying different architectures, or employing regularisation techniques to improve the model's accuracy and generalisation capabilities. This step was iterated until the desired performance was achieved.

Table 3 RMSEs of ARIMA, Prophet, and LSTM models for the last 100 days of forecast

Stock               | ARIMA  | Prophet | LSTM  | Lowest RMSE
Reliance Industries | 32.38  | 286.16  | 14.81 | LSTM
Tata Steel LLC      | 11.29  | 34.63   | 4.86  | LSTM
ICICI Bank          | 9.73   | 33.91   | 9.36  | LSTM
Adani Enterprise    | 155.54 | 1035.87 | 81.2  | LSTM

8 Observations and Results The experimental findings have been summarised in Table 3, which provides a comprehensive overview of the performance of the three models across the selected set of four stocks (Figs. 3, 4, 5, and 6).

9 Result Analysis

Since the predictions of the ARIMA and Stacked LSTM models are considerably more comparable to each other than to those of Prophet, we have conducted a detailed performance evaluation of these two models. To gain more profound insights into the accuracy of these models for stock price prediction, we have conducted our analysis individually for each company. This comprehensive approach has allowed us to thoroughly assess the performance of the ARIMA and Stacked LSTM models across all four companies. Additionally, we have focused on the final 30 days of the forecasted



Fig. 3 a ARIMA and LSTM models’ last 100 days’ forecast on RELIANCE NSE stock. b Facebook Prophet model’s forecast on RELIANCE NSE stock

Fig. 4 a ARIMA and LSTM models’ last 100 days’ forecast on TATA STEEL NSE stock. b Facebook Prophet model’s forecast on TATA STEEL NSE stock

Fig. 5 a ARIMA and LSTM models’ last 100 days’ forecast on ICICI BANK NSE stock. b Facebook Prophet model’s forecast on ICICI bank NSE stock

period for each company, enabling us to make precise evaluations of these models’ effectiveness. A. Reliance Industries



Fig. 6 a ARIMA and LSTM models’ last 100 days’ forecast on ADANI NSE stock. b Facebook Prophet model’s forecast on ADANI NSE stock

Fig. 7 a ARIMA model’s last 30 days’ forecast on RELIANCE NSE stock. b LSTM model’s last 30 days’ forecast on RELIANCE NSE stock

ARIMA: The ARIMA model for Reliance Industries achieved an RMSE of 32.26, indicating an average deviation of approximately Rs 32.26 between the predicted and actual stock prices (Fig. 7).

Stacked LSTM: The LSTM model for Reliance Industries achieved a lower RMSE of 9.26, implying an average deviation of approximately Rs 9.26. Comparing the two models, both LSTM and ARIMA effectively captured the trends in Reliance's stock prices, as evident from the graphs. However, the LSTM model appears to be more accurate in mapping these trends. The significantly lower RMSE of 9.26 for the LSTM model suggests that it better captured the underlying patterns in Reliance's stock prices compared to ARIMA.

B. Tata Steel LLC:

ARIMA: The ARIMA model for Tata Steel achieved an RMSE of 8.1, indicating an average deviation of approximately Rs 8.1 between the predicted and actual stock prices (Fig. 8).

Stacked LSTM: The LSTM model for Tata Steel achieved a lower RMSE of 4.26, implying an average deviation of approximately Rs 4.26. Comparing the two models, both LSTM and ARIMA effectively captured the trends in Tata Steel's stock prices, as evident from the graphs. However, the LSTM



Fig. 8 a ARIMA model’s last 30 days’ forecast on TATA STEEL NSE stock. b LSTM model’s last 30 days’ forecast on TATA STEEL NSE stock

model demonstrated greater accuracy in mapping these trends. The significantly lower RMSE of 4.26 for the LSTM model suggests that it better captured the underlying patterns in Tata Steel's stock prices compared to ARIMA.

C. ICICI Bank:

ARIMA: The ARIMA model for ICICI Bank achieved an RMSE of 8.89, indicating an average deviation of approximately Rs 8.89 between the predicted and actual stock prices (Fig. 9).

Stacked LSTM: The LSTM model for ICICI Bank achieved a lower RMSE of 7.93, implying an average deviation of approximately Rs 7.93. Comparing the two models, both LSTM and ARIMA effectively captured the trends in ICICI Bank's stock prices, as evident from the graphs. However, the LSTM model demonstrated greater accuracy in mapping these trends. The lower RMSE of 7.93 for the LSTM model suggests that it better captured the underlying patterns in ICICI Bank's stock prices compared to ARIMA.

D. Adani Enterprises:

Fig. 9 a ARIMA model’s last 30 days’ forecast on ICICI bank NSE stock. b LSTM model’s last 30 days’ forecast on ICICI bank NSE stock



Fig. 10 a ARIMA model’s last 30 days’ forecast on ADANI NSE stock. b LSTM model’s last 30 days’ forecast on ADANI NSE stock

ARIMA: The ARIMA model for Adani Enterprises achieved an RMSE of 58.77, indicating an average deviation of approximately Rs 58.77 between the predicted and actual stock prices (Fig. 10).

Stacked LSTM: The LSTM model for Adani Enterprises achieved a slightly lower RMSE of 58.0, implying an average deviation of approximately Rs 58.0. Both the LSTM and ARIMA models achieved some success in capturing the trends in Adani Enterprises' stock prices, as seen in the graphs, but had high RMSE values, indicating notable deviations from the actual prices. Adani's stock is highly volatile, influenced by market reports and external factors, making accurate predictions challenging. Although the Stacked LSTM model performed the better of the two, accurately forecasting Adani Enterprises' stock prices still has room for improvement due to the inherent uncertainty and complexity of market dynamics.

10 Limitations

While the findings of this research provide valuable insights into the performance of LSTM, ARIMA, and Facebook Prophet for stock price prediction, there are a few limitations to consider:

1. Lack of External Factors: The models used in this study relied solely on historical stock price data as input. External factors, such as company-specific news, industry trends, or global economic events, were not incorporated into the analysis. These external factors can significantly influence stock prices and might enhance the accuracy of the predictions if considered.

2. Limited Generalisability: The study focused on a specific set of companies, namely Reliance Industries, Tata Steel LLC, ICICI Bank, and Adani Enterprise. The findings may not apply to other stocks or industries. The performance of the models could vary when applied to different datasets with diverse characteristics.

3. Limited Scope of Model Selection: While the research compared LSTM, ARIMA, and Facebook Prophet models, it is important to note that there are several other advanced forecasting algorithms available, such as gradient boosting machines and Long Short-Term Memory networks with attention mechanisms. Exploring a wider range of forecasting techniques could provide additional insights and potentially reveal alternative models that could yield different results.

Addressing these limitations and conducting further research would enhance the robustness and applicability of the findings, leading to a more comprehensive understanding of the strengths and weaknesses of different forecasting models in stock price prediction.

11 Conclusion The findings of this study highlight the superior performance of LSTM, a deep learning-based algorithm, in comparison to ARIMA and Facebook Prophet for stock price prediction across Reliance Industries, Tata Steel LLC, and ICICI Bank. These companies, known for their stability, exhibited low RMSE values when analysed using both ARIMA and Stacked LSTM models. However, when applied to the highly volatile stock of Adani Enterprise, the models yielded higher RMSE values. Despite this, the models were successful in capturing the general trend of the stock. It is worth noting that while ARIMA demonstrated good overall performance, LSTM consistently outperformed it in terms of accuracy. The limitations of Facebook Prophet in handling time series with little or no seasonality, such as stock prices, were also evident in this study. This research highlights the advantages of deep learning-based algorithms in analysing economic and financial data, providing valuable insights for finance and economics researchers and practitioners. It calls for further exploration of these techniques in different datasets containing varying features, expanding our understanding of the improvements that can be achieved through deep learning in various domains. In summary, this study contributes to the comparative performance analysis of ARIMA, Prophet, and LSTM models in stock price prediction. It supports the notion that deep learning-based algorithms, particularly LSTM, show promise in enhancing prediction accuracy. It also recognises the reliability of ARIMA for stock price prediction and acknowledges the limitations of Prophet for time series lacking strong seasonality.

12 Social Impact

The study conducted in this paper, although not achieving highly precise stock price prediction, has demonstrated the effectiveness of two models, ARIMA and Stacked LSTM, in accurately forecasting market trends. This is evident from the 30-day forecasts presented in Sect. 9. While predicting the exact stock price is considered an extremely challenging task, the ability to forecast market trends can provide valuable assistance to society in the following ways:

1. Risk Management: Accurate trend forecasting helps investors and institutions manage stock market risks, optimising investment strategies to minimise losses and maximise returns.

2. Market Timing: Understanding market trends enables effective investment timing, optimising buying and selling decisions to capitalise on opportunities and enhance investment performance.

3. Strategic Planning: Accurate trend forecasting informs businesses' strategic planning, aligning product development, marketing, and expansion strategies with market dynamics for competitive advantage and informed resource allocation.

4. Economic Analysis: Trend forecasting contributes to understanding the overall economy, providing insights into industry health, market sentiments, and potential economic shifts, and aiding policymakers in decision-making.

5. Algorithmic Trading Strategies: The findings of this study can benefit algorithmic trading developers, enhancing trading algorithm performance for more profitable automated strategies.

In conclusion, while precise stock price prediction may be challenging, the ability to forecast market trends, as demonstrated by the ARIMA and Stacked LSTM models in this study, offers significant benefits to society. From risk management and strategic planning to economic analysis, accurate trend forecasting supports informed decision-making in various domains, contributing to better financial outcomes and market understanding.

13 Future Scope

This research opens up several avenues for future investigation and expansion. Here, we outline some potential directions and areas of exploration that can contribute to the advancement of this field:

1. Incorporating External Factors: To enhance the predictive accuracy of the models, future research can consider integrating external factors such as company-specific news, industry trends, macroeconomic indicators, and market sentiment into the analysis. This can provide a more comprehensive understanding of the factors influencing stock prices and improve the models' ability to capture complex market dynamics.

2. Comparing with Other Advanced Forecasting Techniques: While this research focused on LSTM, ARIMA, and Facebook Prophet, there are numerous other advanced forecasting techniques available. Future studies could expand the model selection and compare the performance of additional algorithms, such as gradient boosting machines, support vector machines, or ensemble methods. This comparative analysis can shed light on the relative strengths and weaknesses of various forecasting approaches.

3. Real-Time Prediction and Adaptive Models: Another interesting avenue for future research is to evaluate the performance of the models in real-time prediction scenarios. This would involve updating the models with the latest available data and assessing their ability to adapt to changing market conditions. Developing adaptive models that can adjust their predictions dynamically based on new information can be valuable for investors and financial institutions.

4. Integration of Hybrid Models: Hybrid models that combine the strengths of different forecasting techniques can be explored. For example, integrating the strengths of LSTM and ARIMA in a hybrid model may provide improved forecasting accuracy. Investigating the effectiveness of such hybrid models can contribute to the development of more robust and accurate prediction systems.

By addressing these future directions, researchers can further advance the field of stock price prediction and deepen our understanding of the capabilities and limitations of different forecasting models.

References 1. Siami-Namini S, Tavakoli N, Siami Namin A (2018) A comparison of ARIMA and LSTM in forecasting time series. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA), pp 1394–1401 2. Pang X, Zhou Y, Wang P, Lin W, Chang V (2020) An innovative neural network approach for stock market prediction. J Supercomput 76(3):2098–2118. https://doi.org/10.1007/s11227-017-2228-y 3. Hiransha, Gopalakrishnan, Menon VK, Soman (2018) NSE stock market prediction using deep-learning models. Proc Comput Sci 132:1351–1362. https://doi.org/10.1016/j.procs.2018.05.050 4. Saiktishna C, Sumanth NS, Rao MM, Thangakumar J (2022) Historical analysis and time series forecasting of stock market using FB prophet. In: 2022 6th International conference on intelligent computing and control systems (ICICCS), pp 1846–1851 5. He K, Yang Q, Ji L, Pan J, Zou Y (2023) Financial time series forecasting with the deep learning ensemble model. Mathematics 11(4):1054. https://doi.org/10.3390/math11041054 6. Fang Z, Ma X, Pan H, Yang G, Arce GR (2023) Movement forecasting of financial time series based on adaptive LSTM-BN network. Expert Syst Appl 213:119207 7. Gajamannage K, Park Y, Jayathilake DI (2023) Real-time forecasting of time series in financial markets using sequentially trained dual-LSTMs. Expert Syst Appl 223:119879. https://doi.org/10.1016/j.eswa.2023.119879 8. Patil R (2021) Time series analysis and stock price forecasting using machine learning techniques 19. https://doi.org/10.1994/Rajat/AI 9. Jamil H (2022) Inflation forecasting using hybrid ARIMA-LSTM model. Laurentian University of Sudbury 10. Zhang R, Song H, Chen Q, Wang Y, Wang S, Li Y (2022) Comparison of ARIMA and LSTM for prediction of hemorrhagic fever at different time scales in China. PLoS ONE 17(1):e0262009. https://doi.org/10.1371/journal.pone.0262009 11. Lilly SS, Gupta N, Anirudh RRM, Divya D (2021) Time series model for stock market prediction utilising prophet. Turk J Comput Math Educ (TURCOMAT) 12(6):4529–4534. https://turcomat.org/index.php/turkbilmat/article/view/8439 12. Kaninde S, Mahajan M, Janghale A, Joshi B (2022) Stock price prediction using Facebook prophet. ITM Web Conf 44:03060. https://doi.org/10.1051/itmconf/20224403060



13. Staudemeyer RC, Morris ER (2019) Understanding LSTM—a tutorial into long short-term memory recurrent neural networks. arXiv [cs.NE]. http://arxiv.org/abs/1909.09586 14. Zhang J, Ye L, Lai Y (2023) Stock price prediction using CNN-BiLSTM-attention model. Mathematics 11(9):1985. https://doi.org/10.3390/math11091985

Analysis of Monkey Pox (MPox) Detection Using UNETs and VGG16 Weights V. Kakulapati

Abstract As the world struggles to recover from the extensive destruction caused by the advent of COVID-19, a new threat emerges: the MPox virus. MPox is neither as deadly nor as ubiquitous as COVID-19, but it still causes new instances of infection in patients every day. If another worldwide epidemic occurs for the same reason, it would not come as a shock to anybody. Image-based diagnostics may benefit greatly from the use of ML. For this reason, a comparable application may be modified to detect the MPox-related illness as it manifests on human skin, and the obtained picture can then be used to establish a diagnosis. However, there is no publicly accessible MPox dataset for use in machine learning models. As a result, creating a dataset with photos of people who have had MPox is an urgent matter. To this end, we continually gather fresh MPox images from MPox patients, evaluate the efficacy of the recommended model, based on UNETs with VGG16 weights, on very skewed data, and compare the results from our model to those from previous publications. The time from diagnosis to treatment is shortened because MPox lesions are easily seen. Because of this, there is a great need for fast, accurate, and reliable computer algorithms. Using the U-Net and the VGG16 CNN, the system presented here can automatically recognize and analyze MPox. Keywords Custom CNN · U-Net · VGG16 · CNN · Disease · Diagnosis · MPox virus · Machine learning (ML) · Performance · Images · Patients

V. Kakulapati (B) Sreenidhi Institute of Science and Technology, Yamnampet, Ghatkesar, Hyderabad, Telangana 501301, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_25

1 Introduction

The MPX infection causes a pathogenic sickness that has several diagnostic similarities to chickenpox, measles, and smallpox. Due to its rarity and similarities to other diseases, early diagnosis of monkeypox has proven challenging. In 1959, Denmark became the first country to report a monkeypox outbreak. There were three outbreaks of human MPX between October 1970 and May 1971, infecting a total of six people in Liberia, Nigeria, and Sierra Leone. Ten further cases of monkeypox have been reported in Nigeria since the first index case was found in 1971. Since the epidemic started in 2013, there have been reports of monkeypox in people in 15 countries, eleven of which are located in Africa. There have been incidences of MPX in countries as diverse as Singapore and Israel [1]. The virus, which has its roots in Central and West Africa, has now spread to several other regions and threatens to become a worldwide pandemic. There may also be a rash and lymph node swelling. The condition is self-limiting and treated primarily with symptomatic care; however, 3–5% of patients may die from medical complications. As there is currently no medicine available that specifically targets the MPox virus, antivirals and vaccinia immune globulin, both designed for the treatment of smallpox in older people, are used to control acute MPox infections [2, 3].

The virus that causes MPox may infect both humans and other animals. A rash that begins as blisters and eventually crusts over is one symptom. Fever and enlarged lymph nodes are also present. The time until symptoms appear after exposure might be anything between five and twenty-one days. In most cases, people have symptoms for two to four weeks. Mild symptoms are possible, and it is also possible to notice nothing at all. Not every epidemic follows the usual pattern of fever, aching muscles, swollen glands, and lesions appearing simultaneously. More severe symptoms may be experienced by those who are more vulnerable to the disease, such as infants, pregnant women, and those with impaired immune systems. Reported cases of MPox are increasing, although the disease is not very infectious in comparison to COVID-19. In 1990, there were only 50 confirmed cases of MPox in the whole of West and Central Africa [4, 5]. However, by the year 2020, a startlingly high number of 5000 cases had occurred. Before 2022, it was thought that MPox only existed in Africa; nevertheless, several countries outside of Africa, including in Europe and the USA, confirmed detecting MPox infections in their populations. As of December 21, 2022, 94 nations have reported MPox cases to the Centers for Disease Control and Prevention (CDC) [6]. As a result, widespread panic and dread are on the rise [7], which is often reflected in people's online comments. This investigation aims to extend a model that can more accurately identify and diagnose monkeypox using current data. UNETs and VGG16 weights are used to construct an effective model for diagnosing monkeypox.

Similar to smallpox, although less severe, MPox has comparable clinical characteristics [8]. Yet, the rashes and lesions produced by an MPox virus often mimic those of chickenpox and cowpox. Due to its similarities in appearance and symptoms to other poxviruses, early identification of MPox may be difficult for medical professionals. Moreover, since human cases of MPox were so uncommon before the current epidemic [9], there is a significant information vacuum among healthcare providers worldwide. Healthcare providers typically identify poxviruses through image assessment of skin lesions; however, the polymerase chain reaction (PCR) test is widely regarded as the most reliable method for identifying an MPox infection [10]. While fatalities from MPox infections are relatively rare (1–10%) [11], the disease may be effectively controlled in communities by isolating patients and tracking down their contacts as soon as possible.

By combining U-Net with structural contexts, numerous clinical feature extraction approaches are used, such as those used to classify and segment retinal veins. VGG16, a neural network, improves the strategy of merging several prediction performances to raise the efficiency of classification methods. VGG16 is a convolutional neural network (CNN) with 16 layers. A pre-trained instance of the network, trained on the ImageNet database of over a million photos, is available for download. The pre-trained network can classify images into 1,000 distinct object categories, such as a computer keyboard or a mouse. As a result, the network has learned rich feature representations for a wide variety of images. The network takes a 224 × 224 input picture. While U-Net is among the most widely used CNN architectures for image segmentation, many systems lack the time and resources necessary to implement it. To get around this issue, we combined U-Net with another design, VGG16, to lower the number of layers and parameters required. VGG16 was selected because its contracting path is very similar to that of U-Net and because it has a large variety of tuning options. We use pre-trained VGG16 weights, which are freely available.

The remainder of the paper is arranged as follows: Sect. 2 focuses on previous studies. Next, Sect. 3 provides a brief overview of the recommended model's structure, the methodology utilized in this study, and the investigational setup employed to assess the model's efficacy. Section 4 provides visual explanations of each evaluative metric. Section 5 presents the outcomes of the tests performed, together with an analysis of the resulting data and any conclusions that may be drawn from it. Section 6 contains concluding remarks, followed by future enhancement investigations.

2 Previous Works

Several pre-trained deep learning models were utilized in a feasibility study [12] to recognize MPox lesions as distinct from those caused by chickenpox and measles. The dataset was collected from freely available online resources, including news websites, and then augmented using a data mining approach to boost its size. Pre-trained deep learning models, such as Inception V3, ResNet50, and VGG16, are widely used. The study found that the approach successfully differentiated MPox lesions from measles and chickenpox lesions. Overall, the ResNet50 model performed the best, with an accuracy of 82.96%; VGG16 achieved 81.48%, and an ensemble of the three models achieved 79.26% accuracy. The researchers of [13] described the possibility of using AI to identify MPox lesions in digitized skin images. In 2022, this research presented the largest skin imaging collection to date collected from cases of MPox. Seven different DL models were



used for the study: ResNet50, DenseNet121, Inception-V3, SqueezeNet, MnasNet-A1, MobileNet-V2, and ShuffleNet-V2-X. With an accuracy rate of 85%, the research suggests that AI has considerable promise in diagnosing MPox from digitized skin pictures. Using a retrospective observational research design, [14] describes the clinical characteristics and treatment options for human MPox in the UK. Human MPox, the study's authors conclude, presents unusual difficulties even for the UK's well-endowed healthcare systems and their high-consequence infectious diseases (HCID) networks. A reworked version of the VGG16 technique has been proposed to diagnose MPox. The results of those trials were split between two studies, which relied on batch size, learning rate, and the number of epochs as variables. According to the findings, the improved model correctly diagnosed MPox patients in the two experiments with accuracies of 97 ± 1.8% and 88 ± 0.8%, respectively. Furthermore, a well-known explainable AI method called LIME (Local Interpretable Model-Agnostic Explanations) was used to interpret and explain the post-prediction and feature extraction outcomes. By studying these early signs of the virus's spread, LIME helps to get a better understanding of the virus itself. The results confirmed that the proposed models could pick up on trends and pinpoint the exact location of the outbreak. MPox skin lesion photos, together with those of chickenpox and measles, were assembled into a dataset called the 'MPox Skin Lesion Dataset' (MSLD) [15]. Variants of the U-Net [16, 17], a popular model for ultrasound picture segmentation, have been suggested for detecting and segmenting the brachial plexus. The VGG network [18] was proposed in the ImageNet Large-Scale Visual Recognition Competition (ILSVRC) in 2014; its best-performing convolutional neural network (CNN) configuration has 16 weight layers. VGG's model helps with object localization since it is based on a small filter size [19]. Networks using the VGG architecture have been used for these tasks [20]. Several types of convolutional architecture often include a skip connection, and attention is a method that, when combined with an encoder and a decoder, may boost the performance of the models. The ResNet50, AlexNet, ResNet18, and CNN models were used to test the efficacy of the transfer learning approach. The recommended network achieves a 91.57% accuracy rate with a sensitivity of 85.7%, a specificity of 85.7%, and a precision of 85.7%. This information may be used to create new PC-based testing methods for broad monitoring and early identification of MPox. Furthermore, it would allow those who fear that they have MPox to do basic testing in the comfort of their own homes, keeping them at a safer distance from the disease's potentially harmful effects in its early stages [21].


3 Methodology

Although computer vision has found many uses, it has seen relatively little use in health care. Investigation into health-related image identification has greatly benefited from recent developments in deep learning methodologies for image detection. UNETs are a specialized method for segmenting images. Much of what inspired this study is the extensive research into monkeypox prediction utilizing ML techniques like VGG16 and UNETs. We preprocessed a dataset of patients with MPox and of others with similar-looking conditions, and transformed it into masked pictures. In order to prepare for future illnesses that may threaten human life, the model was designed to be data-agnostic. After extensive training and testing, the models produced segmentation masks with high precision.

3.1 VGG16

VGG16 is a convolutional neural network (CNN) created by the Visual Geometry Group (VGG) at Oxford University that won the ImageNet [22] competition in 2014. The model has 13 convolutional layers, five max-pooling layers, and three dense layers. Since it includes 16 layers with trainable weight parameters, it is known as VGG16 [23].
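As a minimal sketch, the layer counts quoted above can be verified directly from the Keras implementation of VGG16 (the library call `tensorflow.keras.applications.VGG16` is standard; no claim is made that the paper used this exact code):

```python
# Minimal sketch: instantiate the VGG16 backbone with Keras and verify
# the layer counts described above (13 conv + 5 max-pooling + 3 dense).
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet", include_top=True)

conv = [l for l in model.layers if "conv" in l.name]
pool = [l for l in model.layers if "pool" in l.name]
dense = [l for l in model.layers if l.__class__.__name__ == "Dense"]
print(len(conv), len(pool), len(dense))  # expected: 13 5 3
```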

3.2 CNN

The structure has a U-shaped layout, divided into two halves: an encoder and a decoder. UNET takes an image as input and outputs an image of its own. It is well established for diagnosis across various kinds of pictures, recognizing aberrant traits. The encoder, like other convolutional stacks, works to reduce the overall size of the input picture. The decoder then increases the size again by fusing each decoder stage with the corresponding encoder layer, so that the output keeps the pixel resolution of the input [24].

3.3 Custom CNN

CNNs, or ConvNets, are a kind of deep learning network design that can learn directly from data without requiring human intervention to extract features. CNNs excel at recognizing objects, people, and scenes by analyzing pictures for recurring patterns. This section walks through the methods and choices used in constructing a bespoke network with TensorFlow and executing it in a training session.
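The paper does not specify the custom CNN's exact layer configuration, so the following is only a hypothetical illustration of the kind of small binary classifier (MPox vs. healthy) described here:

```python
# Hypothetical custom CNN of the kind described above; the exact layers
# are not given in the paper, so this architecture is illustrative only.
from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(128, 128, 3)):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation="relu")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # MPox probability
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```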


3.4 UNET

UNET is a CNN model with a U-shaped architecture consisting of two pieces, an encoder and a decoder. UNET takes an input picture and delivers an output image. It is established for the diagnosis of different images by identifying characteristics of abnormality. Like any stack of convolutional layers, the encoder reduces the image size, while the decoder increases it again by integrating each encoder layer with the corresponding decoder layer, so the output retains the pixel resolution of the input [25]. Image segmentation is completed over the whole image, and each pixel's characteristics are estimated. The neighbourhood window has a dimension of 3 × 3, with the centre pixel being the one of interest; all pixels are visited repeatedly to evaluate the characteristics. In many segmentation models, the contracting layer and expansion layer are the foundational layers. In this study, several up-sampling and convolution layers were added to the end of the VGG16 architecture to make it more like the U-Net. When finished, the model's architecture has the symmetry of the letter U. Consequently, VGG16 serves as the contracting path in the UNet-VGG16 design, while the expansion path is appended after it.
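One plausible way to realize this UNet-VGG16 design is sketched below: VGG16 (without its dense head) forms the contracting path, and up-sampling plus convolution blocks with skip connections from the matching encoder stages form the expansion path. The layer choices and filter sizes are assumptions, since the paper does not give them:

```python
# Sketch of the UNet-VGG16 idea described above; details are illustrative.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_unet_vgg16(input_shape=(224, 224, 3)):
    encoder = VGG16(weights="imagenet", include_top=False,
                    input_shape=input_shape)
    # Skip connections from the end of each encoder stage
    skips = [encoder.get_layer(n).output
             for n in ("block1_conv2", "block2_conv2",
                       "block3_conv3", "block4_conv3")]
    x = encoder.get_layer("block5_conv3").output
    # Expansion path: upsample, concatenate the matching skip, convolve
    for skip, filters in zip(reversed(skips), (512, 256, 128, 64)):
        x = layers.UpSampling2D()(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    mask = layers.Conv2D(1, 1, activation="sigmoid")(x)  # per-pixel lesion mask
    return models.Model(encoder.input, mask)
```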

4 Implementation Analysis

The dataset was collected from the web and is publicly available. Roughly 150 MPox images were gathered by scouring various online sources; though it is a small dataset, we obtained accurate results. To prepare the MPox dataset, all photos are read, resized, and pixel-normalized, and the dataset is divided into TRAIN and TEST splits: 70% of the dataset is used for training and the remaining 30% for testing the algorithms. To execute the VGG16 algorithm alongside the tailor-made CNN algorithm, VGG is fed 80% training photos and 20% test images to evaluate its efficacy (Fig. 1).

4.1 Data Preprocessing

Preprocessing is crucial for improving the quality of MPox photos and preparing them for feature extraction through image analysis, which may be carried out by either humans or machines. Preprocessing has several benefits, including higher signal-to-noise ratios, a clearer perception of an MPox picture, less clutter, and more accurate color reproduction.
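A minimal sketch of the read-resize-normalize-split pipeline described above follows; the directory layout, target image size, and the label-from-folder rule are assumptions, not the paper's actual setup:

```python
# Sketch of the preprocessing steps described above: read all photos,
# resize to a common shape, normalize pixel values, split into TRAIN/TEST.
import glob
import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split

def load_dataset(pattern="dataset/*/*.jpg", size=(224, 224)):
    images, labels = [], []
    for path in glob.glob(pattern):
        img = Image.open(path).convert("RGB").resize(size)
        images.append(np.asarray(img, dtype=np.float32) / 255.0)  # scale to [0, 1]
        labels.append(1 if "mpox" in path.lower() else 0)  # hypothetical folder rule
    return np.stack(images), np.array(labels)

X, y = load_dataset()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80:20 as used for VGG16 here
```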


Fig. 1 Monkey pox image dataset images

4.2 Extracting Features

Feature extraction reduces the amount of time spent analyzing data by isolating and quantifying the specific features that make up a given training sample. By analyzing the most crucial features of an image in a feature space, the extracted features provide input for the subsequent classifier. Eight separate textural aspects are at play in this picture analysis research, and these features are evaluated differently when used for classification and segmentation tasks. For image classification, the characteristics are approximated from the complete image; for segmentation, the process runs within the picture and attributes are predicted for each pixel. Here we use a 3 × 3 neighborhood window with the center pixel marked as the point of interest; for a complete evaluation of the features, all pixels are visited repeatedly. The overall workflow is:

(1) Upload MPox dataset: this module adds the dataset to the program.
(2) Preprocess the dataset: read the complete dataset, scale all photos to the same size, normalize the pixel values of the images, and divide the dataset into a training set and a testing set. Prediction accuracy is calculated by applying the 20% of test photos to the trained model.
(3) Execute the VGG16 algorithm: the 80% of processed photos are used as input to train a prediction model, which is then used to make predictions on test images.
(4) Execute the Custom CNN algorithm: the 80% of processed photos are fed into the Custom CNN algorithm to train the prediction model, which is then used to make predictions on test images.
(5) The Comparison Graph module is used to create a graph contrasting the VGG and Custom CNN methods.


(6) Predict Disease from Test Image: we can submit a test image and have the Custom CNN determine whether the image is healthy or contaminated with MPox.

4.3 Measures of Performance

Jaccard and Dice coefficients are used to evaluate the trustworthiness of the proposed ML method. Both are computed from the intersection and union of the ground-truth (GT) and predicted-segmentation (PS) masks. Jaccard's index ranges from 0 to 1:

$$\text{Jaccard Index} = \frac{GT \cap PS}{GT \cup PS} = \frac{TP}{TP + FN + FP},$$

and the Dice coefficient quantifies the degree of overlap between two masks, where one indicates complete overlap and zero indicates no overlap at all:

$$\text{Dice coefficient} = \frac{2(GT \cap PS)}{(GT \cap PS) + (GT \cup PS)} = \frac{2TP}{2TP + FN + FP},$$

$$\text{Dice loss} = 1 - \text{Dice coefficient}.$$

Dice loss is used with binary or categorical cross-entropy in various segmentation situations. The model achieves a Dice coefficient of 89%, which corresponds to a Jaccard index of 80% (Fig. 2). The MPox recognition model is built on the CNN and VGG16-UNET architecture. The MPox detector was trained and evaluated on the acquired dataset, with 80% used for training and 20% for testing (Figs. 3, 4, and 5). The UNET-VGG16 model was trained over several epochs, and an efficient machine learning procedure was applied.
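Both overlap measures can be computed directly from binary masks; a minimal NumPy sketch (toy masks, not the paper's data) is:

```python
# Minimal sketch of the overlap measures defined above, computed from
# binary ground-truth (gt) and predicted-segmentation (ps) masks.
import numpy as np

def jaccard_index(gt, ps):
    inter = np.logical_and(gt, ps).sum()
    union = np.logical_or(gt, ps).sum()
    return inter / union if union else 1.0

def dice_coefficient(gt, ps):
    inter = np.logical_and(gt, ps).sum()
    total = gt.sum() + ps.sum()
    return 2 * inter / total if total else 1.0

gt = np.array([[1, 1, 0], [0, 1, 0]])
ps = np.array([[1, 0, 0], [0, 1, 1]])
print(jaccard_index(gt, ps))     # 0.5   (2 overlapping pixels / 4 in the union)
print(dice_coefficient(gt, ps))  # ~0.667; Dice loss = 1 - Dice coefficient
```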

5 Discussion

The zoonotic illness MPox, which is caused by an Orthopoxvirus, has changed since it was originally identified in the Democratic Republic of the Congo in 1970. Ten African countries and four other countries have reported human cases of monkeypox. The median age at presentation has grown from 4 years old in the 1970s to 18 years old in the 2010s and 2020s, and there have been at least 10 times as many cases reported. Death rates in Central African clades, at 10.6% versus 3.6%, were almost three times as high as those in West African clades. The dynamic epidemiology of this reemerging illness can only be comprehended through the use of surveillance and detection methods.


Fig. 2 Comparison and segmentation of the MPox image

Deforestation may be a cause or may even act as a potentiator in the comeback of MPox, although declining immunity is the most widely held explanation. The Orthopoxvirus family includes the closely related MPox virus, the smallpox-causing variola virus, and the vaccinia virus used in the smallpox vaccine. While smallpox was still widespread, no instances of MPox were recorded. This might have happened for a few reasons: either the emphasis was on smallpox and the symptoms of the two illnesses are similar, or the absence of scientific proof of the causative agent led to the presumption of smallpox. Historically, the smallpox vaccine was known to provide around 85% protection against MPox.

Fig. 3 Detection of MPox confusion matrix

Fig. 4 Predicted masks of MPox images

Fig. 5 UNET-VGG16 model trained over several epochs

Investigations on the segmentation of the median nerve dataset showed that models constructed with the learning algorithm and/or the residual module outperformed their baseline counterparts. These results showed that the two augmentations can increase model performance by exploiting more learned information between the layers. Some of the original image's spatial information may be lost during the pooling process, but the attention mechanism may map it into a new space while keeping important information or attributes, and the residual module can then restore this information. As a result of combining U-Net with VGG, the proposed VGG16-UNet outperforms previous iterations of both models.

6 Conclusion

Using masked photos and a combination of the U-Net segmentation network (UNET) and the deep convolutional neural network VGG16, we can determine whether a patient has MPox. The UNET approach can deliver strong performance in a wide variety of biomedical segmentation applications. The first step is to use data augmentation to collect more training data, then use picture edge detection to pinpoint the area of interest in MPox photos. With neural networks such as UNET and VGG16, MPox outbreaks can be detected and monitored efficiently.

7 Future Enhancement

In future work, multimodal classification algorithms will use a more extensive dataset to enhance classification accuracy, and nature-inspired optimization algorithms will be applied for more precise performance.

References

1. Kakulapati V et al (2023) Prevalence of MPX (Monkeypox) by using machine learning approaches. Acta Sci Comput Sci 5(5):10–15
2. Gessain A, Nakoune E, Yazdanpanah Y (2022) Monkeypox. N Engl J Med 387:1783–1793
3. Mileto D, Riva A, Cutrera M, Moschese D, Mancon A, Meroni L, Giacomelli A, Bestetti G, Rizzardini G, Gismondo MR et al (2022) New challenges in human monkeypox outside Africa: a review and case report from Italy. Travel Med Infect Dis 49:102386
4. Doucleff M (2022) Scientists warned us about MPox in 1988. Here's why they were right
5. https://www.npr.org/sections/goatsandsoda/2022/05/27/1101751627/scientists-warned-us-about-MPox-in-1988-heres-why-they-were-right
6. WHO (2022) Multi-country MPox outbreak in non-endemic countries. https://www.who.int/emergencies/disease-outbreak-news/item/2022-DON385. Accessed on 29 May 2022
7. https://www.cdc.gov/poxvirus/MPox/symptoms.html
8. Bragazzi NL et al (2022) Attaching a stigma to the LGBTQI+ community should be avoided during the MPox epidemic. J Med Virol
9. Rizk JG, Lippi G, Henry BM, Forthal DN, Rizk Y (2022) Prevention and treatment of MPox. Drugs 1–7
10. Sklenovska N, Van Ranst M (2018) Emergence of MPox as the most important orthopoxvirus infection in humans. Front Public Health 6:241
11. Erez N, Achdout H, Milrot E, Schwartz Y, Wiener-Well Y, Paran N, Politi B, Tamir H, Israely T, Weiss S et al (2019) Diagnosis of imported MPox, Israel, 2018. Emerg Infect Dis 25(5):980
12. Gong Q, Wang C, Chuai X, Chiu S (2022) MPox virus: a reemergent threat to humans. Virologica Sinica
13. Nafisa Ali S, Ahmed T, Paul J, Jahan T, Sani S, Noor N, Hasan T. MPox skin lesion detection using deep learning models: a feasibility study. arXiv. Available online: https://arxiv.org/pdf/2207.03342.pdf
14. Islam T, Hussain M, Chowdhury F, Islam B (2022) Can artificial intelligence detect MPox from digital skin images? BioRxiv
15. Adler H et al (2022) Clinical features and management of human MPox: a retrospective observational study in the UK. Lancet Infect Dis 22:1153–1162
16. Ali SN et al (2022) MPox skin lesion detection using deep learning models: a feasibility study. arXiv:2207.03342
17. Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer
18. Kakade A, Dumbali J (2018) Identification of nerve in ultrasound images using u-net architecture. In: 2018 International conference on communication information and computing technology (ICCICT). Mumbai, India
19. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. Available online: https://arxiv.org/abs/1409.1556
20. Inan MSK et al. Deep integrated pipeline of segmentation leading to classification for automated detection of breast cancer from breast ultrasound images. Available online: https://arxiv.org/abs/2110.14013
21. Iglovikov V, Shvets A. Ternausnet: U-net with VGG11 encoder pre-trained on imagenet for image segmentation. Available online: https://arxiv.org/abs/1801.05746
22. Kakulapati V et al (2023) Monkeypox detection using transfer learning, ResNet50, AlexNet, ResNet18 and custom CNN model. Asian J Adv Res Rep 17(5):7–13. https://doi.org/10.9734/ajarr/2023/v17i5480
23. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255
24. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
25. Kakulapati V et al (2021) Analysis of tumor detection using UNETS and VGG16 weights. J Med Pharm Appl Sci 10(4). ISSN: 2320-7418

Role of Robotic Process Automation in Enhancing Customer Satisfaction in E-commerce Through E-mail Automation Shamini James, S. Karthik, Binu Thomas, and Nitish Pathak

Abstract In recent years, the use of Robotic Process Automation (RPA) in e-commerce has grown in popularity. RPA gives businesses the ability to automate routine, manual processes, increasing productivity, cutting down on response times, and improving customer satisfaction. RPA can be used in e-commerce to automate a variety of e-mail-related functions, including reading, processing, and handling client inquiries. RPA can also be used effectively for handling online payments and sending personalized immediate responses to customers. This paper is a case study that gives an overview of RPA technology and how it is used in real-time e-mail automation, particularly for managing customer payments and e-mail feedback. The paper also explains the implementation experiences of RPA systems in creating accounts in Moodle LMS and appropriate course enrollment in the LMS as per user requirements. The paper then introduces the systematic procedures for setting up RPA automation in an e-learning environment to improve efficiency and customer satisfaction. It also discusses the advantages of and general concerns about using RPA in customer support and e-mail automation in an e-commerce environment.

Keywords Robotic process automation · E-commerce · E-mail automation · Customer support

S. James · S. Karthik Kalasalingam Academy of Research and Education, Krishnankoil, Tamil Nadu, India B. Thomas (B) Marian College Kuttikkanam, Peermade, Idukki, Kerala, India e-mail: [email protected] N. Pathak Bhagwan Parshuram Institute of Technology (BPIT), GGSIPU, New Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_26


1 Introduction

Robotic Process Automation, or RPA, is a technology that enables businesses to automate manual and repetitive operations. In RPA, regular tasks like data entry, form filling, and process execution are carried out by software robots, or "bots," in place of humans. The bots may communicate with a variety of systems and applications, including enterprise resource planning (ERP) software, customer relationship management (CRM) systems, and other backend systems [1]. RPA bots are built to imitate human activities. Organizations gain greatly from RPA, which boosts productivity and efficiency while reducing costs and increasing compliance. Additionally, it helps free workers from menial and low-value duties so they can concentrate on more strategic and creative work [2]. RPA has grown quickly in popularity in recent years and is anticipated to become a crucial technology for businesses. E-mail is a common form of consumer communication for e-commerce companies regarding purchases, delivery, refunds, and other issues [3]. E-commerce companies must have a system in place for managing and responding to customer inquiries promptly and effectively if they want to offer effective e-mail customer service [4]. This often entails having a group of customer service professionals who are qualified to respond to a variety of questions from clients and address their problems [5, 6]. Robotic Process Automation (RPA) can also be used for Moodle administration, since it automates time-consuming, repetitive chores, streamlines administrative procedures, and improves overall effectiveness [7]. This paper describes the use of an RPA tool in managing customer service in the payment section of an e-learning site operated by Marian College Kuttikkanam (Autonomous), called the Marian Institute of Innovative Teaching Learning and Evaluation (MIITLE). It also discusses the implementation experience of RPA in Moodle LMS administration.

2 Review of Literature

Studies show that RPA may greatly improve the e-commerce industry in terms of effectiveness, accuracy, and cost savings [8]. The capacity of the technology to automate tedious and routine work has allowed employees to focus on more strategic and value-adding tasks [1]. Several e-commerce tasks, including order processing [1], customer assistance [9], and inventory management [10], can be carried out using RPA. RPA has also been found to work well with other technologies, including artificial intelligence and machine learning [5], allowing for even greater automation and efficiency gains. Studies have also examined how RPA will impact the workforce in e-commerce [11]. RPA can cause employment losses, according to some studies [12], but others [13, 14] have found that it can also create jobs and opportunities for reskilling.


The research suggests that RPA might considerably benefit the e-commerce industry overall, but it also highlights the challenges and limitations of using the technology. Studies on RPA are mostly concerned with figuring out how to apply it in a way that is advantageous to businesses and employees [6] and how to use it to enhance the customer experience in e-commerce. Recent studies have shown that e-mail automation can significantly improve consumer experiences and productivity in the e-commerce industry [4]. The capacity of the technology to automate tedious and routine tasks has allowed employees to focus on more strategic and value-adding tasks. The literature has also emphasized the difficulties and restrictions associated with implementing e-mail automation in the e-commerce sector. These include concerns about data security and privacy [15], employee resistance to change [11, 16], and the substantial up-front expenditure of installing e-mail automation solutions [17, 18]. In general, the literature indicates that e-mail automation has the potential to significantly help the e-commerce sector, but it also emphasizes the difficulties and constraints of putting the technology into practice [18]. In Moodle, repetitive and time-consuming administrative processes, including user management, course enrollment, data entry and migration, grading, and report preparation, can all be automated using RPA, according to the literature [19]. RPA increases administrative productivity by increasing efficiency, lowering errors, and improving the overall performance of the LMS [7].

3 The E-learning Environment

Marian College Kuttikkanam (Autonomous) has a training institute named the Marian Institute for Innovative Teaching Learning and Evaluation (MIITLE). It aims at bringing innovations into teaching, learning, and evaluation through faculty empowerment. During Covid-19, MIITLE started offering online courses on Moodle, Video Content Creation, ICT-Enabled Teaching, Google Classroom, etc. The courses were offered to teachers after accepting online payments; the payment gateway was integrated using the Razorpay payment portal. During the Covid pandemic, 2230 teachers from colleges, schools, and medical institutions joined the online courses. The courses were offered through a dedicated Moodle Learning Management System (LMS) server installed on the Amazon cloud platform, where participants were enrolled in the courses immediately after their payments were received through the payment gateway.


Table 1 Areas of RPA implementation in the project

Areas of implementation      Purpose of automation
Online payment               Ensuring successful transaction
Customer support             Communication about payment status
Moodle LMS administration    Creation of Moodle accounts
Course enrollment            Enrollment of participants to Moodle courses
E-mail automation            Intimating Moodle users about their login credentials

4 Need for Robotic Process Automation

The online courses were offered by the college during the Covid pandemic lockdown. During that time, it was extremely difficult to find support staff to coordinate marketing, receive payments from individuals, create Moodle accounts, enroll participants in their preferred courses, and send e-mail communications to participants. Participants expected immediate responses from the college after making their payments, and payment issues such as failures and multiple payments also had to be communicated to them. Even after a successful payment, the MIITLE office had to complete a series of routine activities before sending the e-mail confirmation to the participants. These routine activities are listed in Table 1. Usually, they were handled by the support staff at the college office, but owing to the lockdown the college had to rely on RPA technologies to automate these repetitive tasks. Given the lack of human resources during the Covid lockdown, it was taking almost two days to manage customer payments and send e-mails with Moodle LMS login credentials.

5 RPA Implementation

MIITLE decided to implement RPA for managing the user accounts of the e-learning platform to overcome the challenges caused by COVID. The first step was to identify the specific areas of routine operations to which Robotic Process Automation should be applied. After a detailed requirement analysis, it was decided to incorporate payment management, user feedback, Moodle account creation, and user notification in the RPA module. Responding to user inquiries was not incorporated under RPA because it was not technologically feasible. The UiPath business automation package was chosen as the RPA tool for development. The RPA automation plan focused mainly on e-mail automation, so that participants would immediately receive e-mail clarification and login credentials for their Moodle eLearning accounts. Different modules of the eLearning environment were considered for RPA automation.


5.1 Payment Management

The RPA module was designed to automate the process of capturing payment information from customers, such as credit card details or other forms of payment, and then processing the payment through the relevant payment gateway. It also reconciles payments received from customers against duplicate and failed payments, reducing manual effort and improving accuracy. The same RPA module segregates payments into successful, failed, and duplicate categories and prepares an Excel worksheet containing the e-mail and other contact details of the corresponding customers. The Excel Application Scope container activity of UiPath is used for reading Excel files and creating new ones. The Razorpay payment portal can prepare a daily payment report from its dashboard, and the UiPath Open Application activity is used to open the Razorpay portal and download the daily payment reports. Logging into the portal, locating the report generation tab, selecting the duration for report generation, and downloading the report in Excel format are all automated at this stage.
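The segregation step can be illustrated with a short pandas sketch; the report file name and the column names ("status", "email") are assumptions about the downloaded report's layout, not Razorpay's documented schema:

```python
# Sketch of the payment-reconciliation step: read the daily report and
# segregate successful, failed, and duplicate payments into Excel sheets.
import pandas as pd

report = pd.read_excel("razorpay_daily_report.xlsx")  # hypothetical file name

successful = report[report["status"] == "captured"]
failed = report[report["status"] == "failed"]
# Duplicate payments: more than one successful payment from the same e-mail
duplicates = successful[successful.duplicated(subset="email", keep="first")]

with pd.ExcelWriter("segregated_payments.xlsx") as writer:
    successful.to_excel(writer, sheet_name="successful", index=False)
    failed.to_excel(writer, sheet_name="failed", index=False)
    duplicates.to_excel(writer, sheet_name="duplicate", index=False)
```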

5.2 Moodle Account Creation

This RPA module automates the creation of Moodle accounts by automatically entering the information received from the payment gateway into the Moodle administration tab. The Open Application activity of UiPath is used to log into Moodle automatically and generate unique login credentials for each user. The module creates a Moodle account as soon as a payment is received, assigns the new participant a student role, and then automatically enrolls the participant in the appropriate course based on the choice made at the time of course registration.
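The paper performs these steps by driving the Moodle admin UI with UiPath. For comparison only, the same two steps map onto Moodle's standard web-service functions core_user_create_users and enrol_manual_enrol_users; the sketch below assumes a web-service token and the REST protocol are enabled on the server, and all concrete values (URL, token, user details, course id) are hypothetical:

```python
# Script-based alternative to the UI automation described above, using
# Moodle's built-in web-service functions over REST.
import requests

MOODLE = "https://lms.example.org/webservice/rest/server.php"  # hypothetical
TOKEN = "..."  # web-service token (placeholder)

def call(wsfunction, params):
    payload = {"wstoken": TOKEN, "wsfunction": wsfunction,
               "moodlewsrestformat": "json"}
    payload.update(params)
    return requests.post(MOODLE, data=payload).json()

# Create the account after a successful payment...
user = call("core_user_create_users", {
    "users[0][username]": "jdoe", "users[0][password]": "S3cret!pass",
    "users[0][firstname]": "Jane", "users[0][lastname]": "Doe",
    "users[0][email]": "jane@example.org",
})

# ...then enroll with the student role (roleid 5 is Moodle's default
# student role) into the chosen course.
call("enrol_manual_enrol_users", {
    "enrolments[0][roleid]": 5,
    "enrolments[0][userid]": user[0]["id"],
    "enrolments[0][courseid]": 12,  # hypothetical course id
})
```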

5.3 E-mail Automation

Customers usually expect immediate feedback after making the payment to join a course. Previously, these communications were sent manually after verifying the payments in the payment gateway application; verifying successful payments and intimating participants of their login credentials by e-mail was time-consuming. An RPA module was therefore developed using the Open Application and SMTP mail message activities of UiPath, with the Send SMTP Mail Message activity available in UiPath Studio used for sending the e-mails. The different components used in the RPA are explained in Table 2.
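For illustration, the credential e-mail sent by the workflow is sketched below with Python's standard smtplib (the paper itself uses UiPath's Send SMTP Mail Message activity); host, port, addresses, and credentials are placeholders:

```python
# Sketch of the credential e-mail the RPA flow sends, using smtplib.
import smtplib
from email.message import EmailMessage

def send_credentials(to_addr, username, password):
    msg = EmailMessage()
    msg["Subject"] = "Your MIITLE Moodle login credentials"
    msg["From"] = "miitle@example.org"   # placeholder sender
    msg["To"] = to_addr
    msg.set_content(
        f"Payment received. Login: {username}\nPassword: {password}")
    with smtplib.SMTP("smtp.example.org", 587) as smtp:  # placeholder host
        smtp.starttls()
        smtp.login("miitle@example.org", "app-password")  # placeholder creds
        smtp.send_message(msg)
```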


Table 2 UiPath components used in the automation

Area of automation               UiPath component used
Payment through Razorpay         Open Application UiPath activity
Extracting personal information  Excel Application Scope container UiPath activity
Moodle account creation          Open Application UiPath activity
Moodle course enrollment         Open Application UiPath activity
E-mail automation                SMTP Mail Message UiPath activity

6 Discussions

The implementation of Robotic Process Automation has many visible benefits. During the COVID lockdown, the MIITLE office could not run online courses without support staff, and even with clerical support staff it was taking more than a day to send the first response to the customer after receiving the payment. It was taking two days for MIITLE to create Moodle accounts, enroll participants into courses according to their choice, and send them their Moodle LMS login credentials. With the implementation of the RPA module to automate these tasks, it took less than ten minutes to send responses to the participants after getting the payment reports from the Razorpay payment gateway; the average time for the first response to customers was reduced to ten minutes. There were 1513 participants to whom the first confirmation e-mails were sent through the automation process. The details are depicted in Fig. 1. From the figure, it is clear that the time required for sending the first response varies from 2 to 10 min, and in most cases it is between 3 and 8 min; the variation occurs due to the delay in realizing the payment. The benefits of using RPA in the e-learning environment are explained in Table 3. The time for Moodle account creation, enrollment of the participant in the courses of their preference, and sending an automated e-mail with their login credentials was found to be between 1 and 6 h. This could be achieved in less time with RPA alone, but a manual verification of Moodle accounts and course enrollment is performed before sending the e-mail. The time taken for this process against the number of participants falling in each time slot is illustrated in Fig. 2.


Fig. 1 Customers and the time taken for the first response through RPA after payment


Table 3 Benefits of using RPA in customer support

Activity                           Duration before RPA implementation   Duration after RPA
First response after payment       One day                              10 min
Creation of Moodle account         Two days                             6 h
Course enrollment                  Two days                             6 h
Sending Moodle login credentials   Two days                             6 h

Fig. 2 Time for sending LMS login credentials and number of customers

From Fig. 2, it is clear that most of the participants received their LMS login credentials between 1.45 and 1.89 h; forty participants received their login details between 5.85 and 6.26 h. Based on this study, integrating RPA into Moodle administration frees the Moodle administrator from tedious activities and allows them to concentrate on more strategic and value-added initiatives. RPA has several advantages, but there are also a number of difficulties and considerations, which are covered in the literature. These include the necessity for strong data security measures, potential stakeholder resistance to automation, implementation-related technological challenges, and the continuing upkeep and supervision of RPA bots. Many e-commerce procedures can be automated with RPA, but human oversight is still required to make sure that the automation works and that any failures or exceptions are handled appropriately. Organizations implementing RPA must ensure that they have the appropriate personnel in place to supervise the RPA systems and handle any problems that may occur.

7 Conclusion

There are several advantages for both customers and businesses when Robotic Process Automation (RPA) is used in e-mail automation for customer care during online payments. RPA frees up customer service representatives' time, so they may work on more challenging and value-adding tasks, by automating repetitive


and manual processes like sending e-mails, gathering and analyzing data, and updating client information. RPA deployment also ensures accuracy and consistency in customer communications, improving customer loyalty and satisfaction. According to the study's findings, RPA is a highly efficient technique for streamlining and enhancing the customer assistance experience for online payments. Numerous advantages result from the incorporation of RPA in the project, including increased administrative effectiveness, improved data synchronization and integration, and the capacity to concentrate on strategic activities. Future studies should concentrate on assessing RPA's long-term effects on Moodle administration and on investigating new developments as the automation scales.

References

1. Lee J, Lee M, Kim K (2018) Impact of robotic process automation on business process outsourcing: a knowledge-based view. J Bus Res 91:428–436
2. Aguirre S, Rodriguez A (2017) Automation of a business process using robotic process automation (RPA): a case study. 2:65–71
3. Choi D, Hind R (2021) Candidate digital tasks selection methodology for automation with robotic process automation
4. Akshay PN, Kalagi N, Shetty D, Ramalingam HM (2020) E-mail client automation with RPA
5. Wang D, Chen S, Zhao X, Li X (2018) Understanding the impact of robotic process automation on business processes: a case study in the financial sector. Inform Syst Front 20(4):799–814
6. Yu KC, Lu HP, Chen JC (2021) The impact of robotic process automation on customer satisfaction: evidence from the banking industry. J Business Res 125:586–597
7. Sharma U, Gupta D (2021) E-mail ingestion using robotic process automation for online travel agency. In: 2021 9th International conference on reliability, infocom technologies and optimization (Trends and Future Directions) (ICRITO), IEEE, pp 1–5
8. Lacity MC, Willcocks LP (2017) Robotic process automation and risk mitigation: the role of internal audit. J Inf Technol 32(3):256–268
9. Menon VS, Soman R (2020) Robotic process automation (RPA) for financial reporting: a review of emerging practices and research opportunities. J Account Lit 45:23–40
10. Seidel S, Hirsch B, Treiblmaier H (2020) Towards a comprehensive understanding of the impact of robotic process automation on organizations. J Bus Res 108:365–379
11. Madakam S, Holmukhe RM, Jaiswal DK (2019) The future digital work force: robotic process automation (RPA). JISTEM-J Inform Syst Technol Manage 16
12. Bourgouin A, Leshob A, Renard L (2018) Towards a process analysis approach to adopt robotic process automation
13. Bhardwaj V, Rahul KV, Kumar M, Lamba V (2022) Analysis and prediction of stock market movements using machine learning. In: 2022 4th International conference on inventive research in computing applications (ICIRCA), pp 946–950
14. Hofmann P (2019) Robotic process automation
15. Issac Ruchi RM (2018) Delineated analysis of robotic process automation tools, pp 0–4
16. Mohamed SA, Mahmoud MA, Mahdi MN, Mostafa SA (2022) Improving efficiency and effectiveness of robotic process automation in human resource management. Sustainability 14(7):3920
17. Sobczak A (2022) Robotic process automation as a digital transformation tool for increasing organizational resilience in Polish enterprises. Sustain 14(3)
18. Hyun Y, Lee D, Chae U, Ko J (2021) Applied sciences: improvement of business productivity by applying robotic process automation


19. Munawar G (2021) Bot to monitor student activities on e-learning system based on robotic process automation (RPA). Sinkron: J dan penelitian teknik informatika 6(1):53–61
20. Bhardwaj V, Kukreja V, Sharma C, Kansal I, Popali R (2021) Reverse engineering: a method for analyzing malicious code behavior. In: 2021 International conference on advances in computing communication and control (ICAC3), pp 1–5
21. Athavale VA, Bansal A (2022) Problems with the implementation of blockchain technology for decentralized IoT authentication: a literature review. Blockchain Ind 4.0, pp 91–119
22. Van der Aalst WM, Bichler M, Heinzl A (2018) Robotic process automation. Bus Inf Syst Eng 60:269–272

Gene Family Classification Using Machine Learning: A Comparative Analysis Drishti Seth, KPA Dharmanshu Mahajan, Rohit Khanna, and Gunjan Chugh

Abstract Accurate classification of gene families is of utmost importance in comprehending the functional roles and evolutionary history of genes within a genome. The exponential growth of genomic data has heightened the urgency for efficient and effective methods to classify gene families from DNA sequences. In this research paper, we present a novel approach for classifying DNA sequences into seven gene families. Our approach is based on machine learning and uses k-mer counting as a feature engineering technique to predict the gene family of a given DNA sequence. We evaluated our approach on a large dataset of DNA sequences and achieved a high accuracy of 90.9% in classification performance. Our results demonstrate the potential of machine learning methods for advancing our understanding of DNA sequences and gene families and can provide valuable insights for biologists and geneticists. Keywords Bioinformatics · Gene family · DNA sequences · Classification · Machine learning · k-mer

D. Seth · KPA Dharmanshu Mahajan (B) · R. Khanna · G. Chugh Department of Artificial Intelligence and Machine Learning, Maharaja Agrasen Institute of Technology, Delhi, India e-mail: [email protected] D. Seth e-mail: [email protected] R. Khanna e-mail: [email protected] G. Chugh e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_27


1 Introduction

Bioinformatics is a discipline that uses computer science, statistics, mathematics, and engineering to analyze and understand biological data. Its origins can be traced back to Gregor Mendel's groundbreaking work on hereditary traits in 1865, and James Watson and Francis Crick's discovery of the structure of DNA in 1953 further solidified the foundations of the discipline. Since then, bioinformatics has been essential in the evaluation and interpretation of biological data. Bioinformatics aims to unravel the principles governing nucleic acid and protein sequences, with a specific emphasis on gene sequences. Chromosomes, which reside within the nucleus of every cell, serve as the structures that contain the cell's DNA. The DNA that makes up each chromosome is tightly coiled around proteins known as histones to support its shape. DNA carries heritable information in the form of nucleotides, which consist of three components: a nitrogenous base, a phosphate group, and a 5-carbon sugar. The nitrogenous base can be adenine (A), thymine (T), cytosine (C), or guanine (G), and the genetic information is encoded by the order in which these four nucleotides occur. Understanding the order of these nucleotides is essential for studying biological processes like gene expression and genetic diversity. Bioinformatics has emerged as a vital field that applies computational tools and algorithms to analyze and interpret complex biological data. It is a rapidly evolving discipline, with new algorithms being developed regularly. For instance, bioinformatics increasingly employs machine learning to generate predictive models capable of accurately classifying biological data, and the analysis of complex biological data, such as DNA sequences and protein structures, relies on deep learning, a subfield of machine learning.

1.1 Problem Statement

This project's primary objective is to develop a machine learning model that can accurately classify DNA sequences into various gene families. DNA molecules carry the genetic information required for the synthesis of proteins in living organisms. We can learn more about how genes work by determining which gene family a specific DNA sequence belongs to. The goal of this study is to build a robust machine learning model that can precisely classify DNA sequences into their appropriate gene families.


1.2 Machine Learning in Bioinformatics

Due to the exponential expansion of sequencing data and the limits of conventional methods based on sequence alignment and homology, gene family categorization is a difficult task in bioinformatics. One of the key difficulties is distinguishing between orthologous and paralogous genes. Paralogous genes arise from gene duplication events and may have diverged significantly from their ancestral sequence, while orthologous genes diverge through speciation and tend to retain similar sequences and functions. Machine learning algorithms aid in identifying sequence patterns indicative of orthology or paralogy, helping to differentiate the two gene types. Machine learning has become an essential tool in gene family classification, particularly for the analysis of large-scale genomic data. As machine learning algorithms continue to advance and more genomic data becomes available, machine learning is anticipated to play an increasingly crucial role in the analysis of biological data and the discovery of novel gene families.

1.3 Motivation

Gene family classification plays a significant role in precision medicine by providing insights into individual genetic profiles and enabling personalized healthcare. By accurately classifying gene families, it becomes possible to understand the variations and mutations within specific gene families, leading to the identification of disease-associated variants and the prediction of individual disease risks. This knowledge is critical for assessing an individual's susceptibility to certain diseases and enables proactive measures for disease prevention, early detection, and personalized screening programs. Moreover, accurate gene family classification plays a vital role in treatment selection and response prediction. Different gene families may influence an individual's response to specific therapies, and by considering the genetic profile of the patient, healthcare providers can predict treatment responses and select the most effective treatment options. Additionally, gene family classification facilitates the development of targeted therapies tailored to an individual's genetic profile. This study is organized into several sections to effectively present its objectives and findings. The dataset description section provides a detailed overview of the dataset used, including its source, size, and the characteristics of the gene sequences. The proposed architecture section outlines the methodology and approach employed, incorporating various machine learning algorithms such as support vector machines, random forests, and XGBoost. Feature extraction using k-mers is explained separately, detailing the process and the rationale behind its selection. The limitations of the research are discussed, highlighting potential biases and challenges faced during the study. A comparative analysis section presents a comprehensive evaluation of the performance and accuracy of the different machine learning algorithms used.


Results and discussion are presented, including the outcome of the classification experiments and the interpretation of the findings. The conclusion summarizes the main findings, emphasizes their significance, and proposes future directions for gene family classification research.

2 Literature Survey

The authors in [1] use a genetic algorithm in conjunction with deep learning to classify viral DNA sequences. The suggested approach uses a genetic algorithm for feature selection and a convolutional neural network (CNN) for feature extraction. The proposed methodology calls for preprocessing DNA sequences to extract useful features, followed by the use of several machine learning classification algorithms, such as support vector machines, decision trees, and random forests. The researchers in [2] focus on the challenging task of accurately classifying viral DNA sequences, which is crucial for understanding viral evolution, developing diagnostics, and designing targeted treatments. The proposed approach leverages the power of deep learning models, specifically convolutional neural networks (CNNs), to automatically extract relevant features from the DNA sequences. The authors of [3] explore the use of machine learning techniques in analyzing DNA subsequences and restriction sites, discussing the application of machine learning algorithms to automate and enhance the analysis of DNA sequences and restriction enzyme recognition sites. In [4], an AdaBoost algorithm based on support vector machines (SVMs) and its use in diverse domains are presented; the study suggests a novel method that combines the AdaBoost algorithm with SVMs to enhance classification performance. The authors of [5] provide a summary of the most significant developments in DNA sequencing technology between 2006 and 2016, examining the effect of numerous sequencing platforms, their advantages and disadvantages, and the fields in which they are used. The researchers in [6] discuss the problems and potential paths for DNA sequencing research; this research introduces a deep learning method for classifying DNA sequences. In [7], the authors explore algorithms including artificial neural networks, support vector machines, decision trees, and random forests to classify DNA sequences according to their functions, and also explore feature selection strategies that can be applied to retrieve pertinent data from DNA sequences. The experts in [8] suggest a technique that makes use of the AdaBoost algorithm to identify DNA-binding proteins. The method combines PseKNC, a feature extraction technique that captures sequence information, with the AdaBoost algorithm for classification; by utilizing these two techniques, the proposed method aims to enhance the accuracy of DNA-binding protein recognition.


The authors in [9] show that their approach achieves excellent accuracy and specificity while outperforming other widely used approaches for identifying DNA-binding proteins; understanding protein–DNA interactions and drug discovery may both benefit from the proposed approach. A method for storing bio-orthogonal data in l-DNA using a mirror-image polymerase is also described: the authors demonstrate that the mirror-image polymerase accurately synthesizes l-DNA and showcase the potential for scalable information storage using this approach. In [10], an unsupervised classifier (CDLGP) utilizes deep learning for gene prediction; the paper discusses the methodology behind CDLGP and its application in gene prediction, and presents experimental results demonstrating its effectiveness. In [11], the paper introduces a k-Nearest Neighbors (KNN) model-based approach for classification tasks. It discusses the methodology of using KNN for classification, including the selection of the K value and distance metrics, and provides insights into the strengths, limitations, and experimental evaluations of the KNN model-based approach. The authors in [12] explain the theoretical aspects of SVMs, discussing the optimization problem and the underlying mathematical concepts. They explore the generalization properties of SVMs and their ability to handle nonlinear classification tasks through kernel functions, and present practical considerations for implementing SVMs, such as the choice of kernel and the selection of hyperparameters. The experts in [13] conduct an empirical analysis of decision tree algorithms by applying them to various benchmark datasets. They compare the performance of different algorithms based on evaluation metrics such as accuracy, precision, recall, and F1-score; the analysis aims to provide insights into the strengths, weaknesses, and suitability of each algorithm for different classification tasks. The paper 'Random Forests and Decision Trees' [14] discusses these two machine learning algorithms: random forests combine multiple decision trees, offering robustness and generalization, while decision trees make predictions by recursively splitting data based on features. The paper covers the principles, advantages, construction, and performance comparisons of these algorithms. In [15], the study introduces XGBoost, an optimized implementation of gradient boosting machines, focusing on scalability, speed, and performance improvements; it discusses the algorithm's key features, techniques, and empirical results. The authors in [16] explore the AdaBoost algorithm, a popular ensemble learning technique, discussing the algorithm's principles and applications and providing research insights into its effectiveness.


3 Proposed Work

3.1 Architecture

In the proposed architecture, the DNA dataset undergoes preprocessing to prepare it for analysis; this involves cleaning the data and removing unwanted symbols and information. The model is then trained on the human data and tested on both the chimpanzee and dog datasets. Figure 1 depicts the architecture of the proposed methodology adopted in this study.

Fig. 1 Architectural view of the proposed methodology


3.2 Implementation of Machine Learning Algorithms

3.2.1 k-Nearest Neighbors Classifier

The k-Nearest Neighbors (kNN) [11] classifier assigns gene sequences to specific families based on their similarity to known labeled sequences, by calculating the distance between a new gene sequence and the labeled sequences in the training set.

3.2.2 Support Vector Machine (SVM) Classifier

The support vector machine (SVM) [12] classifier creates a hyperplane that effectively distinguishes various gene families by considering their feature vectors. By identifying the best decision boundary, SVM accurately assigns unknown gene sequences to their corresponding families.

3.2.3 Decision Tree Classifier

The decision tree classifier [13] constructs a hierarchical tree-like structure in which internal nodes represent feature tests and leaf nodes represent class labels (gene families). The algorithm recursively partitions the feature space based on the most informative features, leading to efficient classification.

3.2.4 Random Forest Classifier

The random forest classifier [14] is an ensemble learning approach that builds a group of decision trees during the training process. After constructing the forest, the classifier determines the gene family of an unknown sequence by selecting the class most commonly predicted by the individual trees.

3.2.5 XGBoost Classifier

The XGBoost classifier [15] is a robust machine learning algorithm commonly utilized in gene family classification. It is an optimized implementation of gradient boosting, which sequentially combines weak learners to form a robust predictive model.

3.2.6 AdaBoost Classifier

The AdaBoost classifier [16] can effectively handle imbalanced datasets, where certain gene families may be underrepresented. It focuses on improving the classification of challenging gene sequences by assigning higher weights to misclassified instances. Through iterative training, AdaBoost adapts and enhances its predictive performance, leading to accurate gene family classification results.

The above-mentioned classifiers are widely employed in gene family classification due to their simplicity, interpretability, and ability to handle different types of data, and they are often used as baseline models to establish a performance benchmark. Boosting classifiers are powerful ensemble methods that combine multiple weak learners (base classifiers) to create a strong classifier; in gene family classification, boosting algorithms can effectively handle complex relationships and capture subtle patterns in the data. This iterative process improves the overall performance of the classifier, and boosting classifiers are known for their high accuracy and ability to handle class imbalance. A sketch of how these classifiers can be compared on a common feature set is shown below.
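As a minimal comparison sketch, the six classifiers can be trained and scored on the same features. Here X_train, X_test, y_train, and y_test are assumed to come from the bag-of-k-mers pipeline and 80:20 split described in Sects. 4.1 and 4.2, and the hyperparameters are library defaults rather than the paper's settings:

```python
# Illustrative comparison loop for the six classifiers discussed above.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

classifiers = {
    "kNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision tree": DecisionTreeClassifier(),
    "Random forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                       # train on the human data
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {acc:.3f}")
```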

4 Implementation

A gene family is a group of genes that share a common origin and have similar functions or sequences. These genes are derived from a common ancestral gene through processes such as gene duplication and divergence over evolutionary time. Gene divergence is a fundamental mechanism that drives the evolution of gene families: as duplicated genes accumulate mutations, they can acquire distinct functions or develop specialized roles within an organism. Studying gene families provides insights into the evolutionary processes that have shaped the genomes of organisms. By comparing gene family composition and organization across different species, researchers can unravel evolutionary relationships and trace the origins of key biological innovations. The gene families mentioned below have significant biological and physiological significance:

(1) G protein-coupled receptors (GPCRs): essential membrane proteins that play an important role in cell signaling. They are involved in transmitting signals from various external stimuli (such as hormones, neurotransmitters, and light) into the cell.
(2) Tyrosine kinases: enzymes that add phosphate groups to specific tyrosine residues of target proteins. They play a vital role in cellular communication and signaling pathways, including growth, differentiation, and cell cycle control.
(3) Tyrosine phosphatases: enzymes that remove phosphate groups from tyrosine residues and act in balance with tyrosine kinases to regulate cellular signaling and control various cellular processes.

Table 1 Gene families

Gene family                   Number of samples   Class label
G protein-coupled receptors   531                 0
Tyrosine kinase               534                 1
Tyrosine phosphatase          349                 2
Synthetase                    672                 3
Synthase                      711                 4
Ion channel                   240                 5
Transcription factor          1343                6

(4) Synthetases: enzymes involved in the synthesis of various molecules; they catalyze the attachment of amino acids to their corresponding tRNA molecules during protein synthesis.
(5) Synthases: enzymes that catalyze the synthesis of complex molecules by joining smaller molecular components. For example, ATP synthase is an enzyme involved in the synthesis of ATP, the primary energy currency of cells. Synthases are vital for energy production and various biosynthetic pathways.
(6) Ion channels: membrane proteins that allow the selective passage of ions across cell membranes. They play critical roles in regulating the electrical properties of cells, nerve impulses, muscle contraction, and numerous physiological processes.
(7) Transcription factors: proteins that control the transcription of target genes by binding to specific DNA regions to influence gene expression.

Table 1 lists the gene families together with the number of samples and the class label assigned to each.

4.1 k-mer Counting

In bioinformatics and genomics, k-mer counting is a form of feature extraction. When working with DNA sequences, we must convert them into a format that can be analyzed effectively, and k-mer counting is a prominent technique for this. It involves breaking the DNA sequence down into smaller parts based on a value called 'k'. This approach is particularly useful when dealing with large sets of DNA sequences because it allows for quick and efficient analysis. A key characteristic of k-mer counting is that it helps us deal with the challenge of variable read lengths in DNA sequencing data: when we sequence DNA, the lengths of the fragments we obtain can vary, which makes it tricky to compare and analyze them accurately. By dividing the sequences into fixed-size segments called k-mers, we simplify the analysis process. Each k-mer represents a unique subsequence within the DNA sequence, so we can focus on specific patterns or areas of interest.


Fig. 2 Analysis of DNA sequence with k = 6

Fig. 3 Class distribution of human dataset

Another advantage of k-mer counting is that it makes it easier to use machine learning models in DNA sequence analysis. Many machine learning algorithms work best when the inputs have fixed lengths, and using fixed-size k-mers meets this requirement. By transforming the DNA sequences into fixed-length vectors of k-mers, we can more effectively apply machine learning techniques to uncover meaningful patterns or features in the data. In our study, we used k = 6, i.e., a word length of 6; such k-mers are also called hexamers. Figure 2 depicts how a DNA sequence is broken down into k-mers of size 6. The next step involves converting the list of hexamers for each gene into sentences composed of strings. This step combines all the sequences, which simplifies the conversion into a bag-of-words representation and allows the creation of independent features in the form of strings.
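A minimal sketch of this pipeline follows: slide a window of length k = 6 over each sequence, join the hexamers into a "sentence", and build a bag-of-words matrix with scikit-learn's CountVectorizer. The toy sequences and the use of default CountVectorizer settings are assumptions for illustration:

```python
# Sketch of the k-mer pipeline described above.
from sklearn.feature_extraction.text import CountVectorizer

def kmers(sequence, k=6):
    # All overlapping substrings of length k (hexamers when k = 6)
    return [sequence[i:i + k].lower() for i in range(len(sequence) - k + 1)]

sequences = ["ATGCATGCA", "GGGTTTAAACCC"]           # toy DNA sequences
sentences = [" ".join(kmers(s)) for s in sequences]  # hexamer "sentences"

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)              # bag-of-words matrix
print(vectorizer.get_feature_names_out())
```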

4.2 Dataset Description We used three distinct datasets: human, dog, and chimpanzee genomes from Kaggle. Each dataset consists of DNA sequences along with corresponding class labels. The datasets have varying class distributions, with some classes having low representation while others are relatively balanced, so utilizing the available data directly seemed like a viable option. If the issue of class imbalance persists, oversampling techniques can be employed: oversampling duplicates or synthesizes minority-class samples to create a more balanced distribution, helping machine learning models learn from a representative dataset. Another applicable technique is class weighting, which assigns different weights to instances of each class during training, typically inversely proportional to the class frequencies. Class weighting gives higher importance to the minority class, enabling the model to focus on correctly predicting instances from that class; this reduces the influence of class disparities and improves the model's performance. We divided the human genome dataset into training and test sets in an 80:20 ratio. The training set was used to train various classification algorithms, while the test set served as an independent evaluation of model performance. The chimpanzee and dog datasets were employed solely for testing. Our primary objective was to assess the generalizability of the machine learning model across divergent species: by training the model on human data, we aimed to determine its effectiveness when applied to the chimpanzee and dog datasets, which represent species with increasing divergence from humans. To evaluate the classification algorithms, we employed different classification metrics derived from the confusion matrix; these metrics provide a comprehensive assessment of model performance. Figures 3 and 4 show the class distribution of each dataset across the human, dog, and chimpanzee genomes. Table 2 provides an overview of the datasets used in this study, including the number of sequences in each dataset and their corresponding attributes.
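A hedged sketch of the split and class-weighting ideas above, using scikit-learn utilities on placeholder arrays (stratification is added here for illustration and is not stated in the paper):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Placeholder feature matrix and labels; in the pipeline these come from
# the k-mer count vectors and the seven gene-family class labels (0-6).
X = np.random.rand(105, 50)
y = np.repeat(np.arange(7), 15)

# 80:20 split, stratified so every gene family appears in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Weights inversely proportional to class frequency, as described above;
# many sklearn estimators accept these via their class_weight parameter.
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))
print(class_weight)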

Fig. 4 Class distribution of dog and chimpanzee datasets

Table 2 Description of each dataset

Items                  Chimpanzee   Dog   Human
Number of sequences    1682         820   4380
Number of attributes   2            2     2

4.3 Limitations and Challenges

Pseudogenes: The classification of pseudogenes (non-functional gene copies with high sequence similarity to functional genes) can be challenging. Distinguishing them from functional genes requires careful examination of features such as intact open reading frames or regulatory elements; incorporating such functional evidence is crucial for accurate classification.

Sequence Variations: Genetic polymorphisms, allelic differences, and sequencing errors introduce variations that can complicate gene family classification. Addressing this requires advanced algorithms and comprehensive analysis of multiple individuals or species; considering a broader range of genetic variation improves classification accuracy.

Identification of New Gene Families: Gene families are dynamic, constantly evolving and giving rise to new families. Staying updated with the latest genomic data, regularly updating classification pipelines, and integrating diverse genomic information are essential for identifying and classifying new gene families; a holistic approach spanning multiple data sources and analytical techniques facilitates the discovery of novel families.

Gene Family Size and Complexity: Gene families vary significantly in size and complexity; some consist of only a few closely related genes, while others are large and contain divergent members. Classifying complex families with extensive divergence requires specialized algorithms and scalable computational resources, which enable comprehensive analysis and accurate classification across family sizes and complexities.

5 Assessment Metrics Accuracy and F1-score are important evaluation metrics when using machine learning approaches to classify gene families. Accuracy assesses the overall correctness of the classification model by calculating the proportion of correctly classified instances among all instances. The F1-score is a more complete metric that combines recall and precision: it accounts for both false positives (placing a gene in the incorrect family) and false negatives (failing to find a gene that belongs to a certain family). When classifying gene families, it is frequently crucial to achieve both a high accuracy and a high F1-score simultaneously.
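For illustration, both metrics can be computed with scikit-learn as below; the averaging mode used for the multi-class F1-score is an assumption, since the paper does not specify it:

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy gene-family labels (0-6) and model predictions.
y_true = [0, 1, 2, 6, 6, 3]
y_pred = [0, 1, 2, 6, 5, 3]

# Accuracy: correctly classified instances over all instances.
acc = accuracy_score(y_true, y_pred)

# Weighted F1 balances per-class precision and recall, then averages by
# class support, one reasonable choice for imbalanced families.
f1 = f1_score(y_true, y_pred, average="weighted")

print(f"accuracy={acc:.3f}, F1={f1:.3f}")
print(confusion_matrix(y_true, y_pred))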


Achieving high accuracy in classifying gene families is important for us to understand the functions and evolutionary history of genes. When we can accurately classify genes into families, it helps us assign meaningful roles to them and predict their functions. Accurate classification also allows us to trace the evolutionary relationships of genes, identifying genes that have a shared ancestry or that have arisen from gene duplication events. By understanding these relationships, we can gain insights into how genes have evolved and how their functions have changed over time. Furthermore, accurate classification helps us unravel the intricate networks and interactions between genes, shedding light on complex biological processes.

6 Comparative Analysis In our research, we conducted a comprehensive comparison of various machine learning models, evaluating their accuracy and F1-score. The models considered in our study are those mentioned above: KNN, SVM, decision tree, random forest, XGBoost, and AdaBoost. Among these models, the random forest method achieved a remarkable accuracy of 90.9% and an F1-score of 91.06%. The results show that the proposed approach is highly effective in accurately categorizing DNA sequences. Since the random forest classifier performed best, achieving the highest accuracy and F1-score, we decided to further evaluate it on additional datasets, specifically DNA sequences from chimpanzees and dogs. By including a genetically close species (human and chimpanzee) and a more distant pair (human and dog), we aimed to assess the model's performance across different levels of genetic similarity. Figures 5 and 6 and Table 3 present the comparative analysis of all the algorithms we tested, providing visual representations of how the different models performed in terms of accuracy and F1-score. The data demonstrate the superior performance of the random forest model, reinforcing our decision to select it for further evaluation on the chimpanzee and dog datasets.
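A sketch of such a comparison loop, reusing the 80:20 split from the Sect. 4.2 example; all hyperparameters are library defaults, not the settings used in the study:

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Candidate models with illustrative default settings.
models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision tree": DecisionTreeClassifier(),
    "Random forest": RandomForestClassifier(n_estimators=100),
    "XGBoost": XGBClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)  # X_train/y_train from the split above
    preds = model.predict(X_test)
    print(name,
          f"acc={accuracy_score(y_test, preds):.3f}",
          f"f1={f1_score(y_test, preds, average='weighted'):.3f}")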

7 Result Analysis In our analysis, we placed specific emphasis on evaluating the performance of two classification algorithms: the random forest classifier and the XGBoost classifier. The confusion matrices below give a comprehensive picture of each model's predictions by comparing the actual labels with the predicted labels, and they serve as an invaluable summary of the classification outcomes. On the human genome test set, the random forest classifier achieved an impressive accuracy of 91.5%. When extending our analysis to the chimpanzee dataset, which exhibits genetic similarities to humans, both the random forest and XGBoost classifiers showcased comparable accuracies.


Fig. 5 Accuracy performance of all the algorithms

Fig. 6 F1-score performance of all the algorithms

Table 3 Accuracy and F1-score performance of all the algorithms

S. No   Classifier      Accuracy (%)   F1-score (%)
1       KNN             81.3           79.8
2       SVM             81.5           81.9
3       Decision tree   83.5           81.1
4       Random forest   90.9           91.06
5       XGBoost         89.1           89.3
6       AdaBoost        77.6           83.6


Fig. 7 Confusion matrix for human dataset random forest classifier

The random forest classifier attained an accuracy of 98.4% on the chimpanzee dataset. However, the divergent dog genome dataset presented a distinct challenge: here, the random forest classifier outperformed the XGBoost classifier, attaining an accuracy of 82%. Overall, the analysis highlights the varying performance of the random forest and XGBoost classifiers across the genome datasets. While both classifiers performed strongly on the human genome and comparably on the chimpanzee dataset, the random forest classifier proved more effective when confronted with the distinct characteristics of the dog genome. The accuracy values presented above were derived from the corresponding confusion matrices, depicted in Figs. 7, 8, 9, and 10. These matrices encapsulate the classification outcomes and serve as valuable references for evaluating the performance of the classifiers.
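A confusion matrix of this kind can be produced as below; rf, X_chimp, and y_chimp are placeholder names for a fitted random forest and the chimpanzee features and labels, not identifiers from the paper:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Compare true labels against the fitted model's predictions and plot the
# resulting matrix (placeholder names, assuming rf was trained above).
ConfusionMatrixDisplay.from_estimator(rf, X_chimp, y_chimp)
plt.title("Random forest on chimpanzee dataset")
plt.show()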

8 Conclusion and Future Scope This study undertook a comparative analysis of six distinct classification algorithms, namely KNN, SVM, decision tree, random forest, XGBoost, and AdaBoost. To convert DNA sequences into fixed-length vectors, the k-mer encoding technique was employed. Additionally, NLP's bag-of-words algorithm, implemented using a count vectorizer, facilitated the processing of DNA sequence strings. Among all the algorithms, random forest and XGBoost displayed the most impressive results, achieving accuracies of 90.9% and 89.1%, respectively. The project has ample room for future enhancements and expansions, offering an opportunity to delve into the relationship between classifier performance and variations in the 'k' value.


Fig. 8 Confusion matrix for human dataset using XGBoost classifier

Fig. 9 Confusion matrix for chimpanzee and dog datasets, respectively, using random forest classifier

Fig. 10 Confusion matrix for chimpanzee and dog datasets, respectively, using XGBoost classifier


By adjusting the 'k' value, which determines the length of the substring in gene sequences, we can gain insights into how different values impact the classifier's effectiveness. Investigating the effects of diverse 'k' values on the classifier's performance presents an enticing path for further research and exploration.
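A sketch of that future experiment, reusing the build_kmers helper above; raw_sequences and labels are placeholders for the loaded dataset:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Sweep the k-mer length and re-evaluate the best model at each setting.
for k in range(3, 9):
    sentences = [" ".join(build_kmers(seq, k)) for seq in raw_sequences]
    Xk = CountVectorizer().fit_transform(sentences)
    scores = cross_val_score(RandomForestClassifier(), Xk, labels, cv=5)
    print(f"k={k}: mean CV accuracy {scores.mean():.3f}")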


Dense Convolution Neural Network for Lung Cancer Classification and Staging of the Diseases Using NSCLC Images Ahmed J. Obaid, S. Suman Rajest, S. Silvia Priscila, T. Shynu, and Sajjad Ali Ettyem

Abstract Lung cancer is a life-threatening disease caused by abnormal development of cells in the lung and its surrounding tissue. Identification and classification of lung tumor growth through physical examination are extremely challenging due to complex boundaries and features with a high degree of intraclass variation and a low degree of interclass variation. Machine learning approaches have been implemented to classify the cancer on the basis of tumor representation and its features, but those models consume more computation time and produce reduced accuracy and efficiency. To manage those complications, a deep learning architecture has been introduced, as it is highly advantageous in characterizing lung lesion features accurately. In this article, a dense Convolution Neural Network architecture for lung cancer classification and staging of the disease on NSCLC images is proposed. Initially, a Wiener filter is employed as a preprocessing technique, as it improves the results of segmentation. Next, gradient vector flow-based segmentation is applied to the images to segment the coarse appearance and lesion boundary, and the segmented image is processed with the ABCD rule as a feature descriptor to extract lesion features such as lesion diameter, asymmetry, border and color. The extracted features are employed to train the densely connected multi-constrained Convolution Neural Network, which contains dense blocks with 128 layers and is capable of producing better accuracy with reduced processing time. Furthermore, the proposed model uses hyperparameter optimization to reduce the network complexity and enhance the computational efficiency. The implementation outcome of the current approach is assessed using MATLAB software on the NSCLC dataset. Performance analysis of the proposed model on three classes of the disease (large cell carcinoma, adenocarcinoma and squamous cell carcinoma) shows 98.75% accuracy, 98.46% specificity and 99% sensitivity, respectively, compared against conventional classifiers.

Keywords Lung cancer · Deep learning · Dense Convolution Neural Network · ABCD segmentation · Feature extraction

A. J. Obaid (B) Faculty of Computer Science and Mathematics, University of Kufa, Kufa, Iraq, e-mail: [email protected]
S. Suman Rajest · S. Silvia Priscila, Bharath Institute of Higher Education and Research, Chennai, Tamil Nadu, India, e-mail: [email protected]
T. Shynu, Department of Biomedical Engineering, Agni College of Technology, Chennai, Tamil Nadu, India
S. A. Ettyem, National University of Science and Technology, Thi-Qar, Iraq, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. A. Swaroop et al. (eds.), Proceedings of Data Analytics and Management, Lecture Notes in Networks and Systems 788, https://doi.org/10.1007/978-981-99-6553-3_28

1 Introduction Lung cancer is a complex, deadly illness that primarily arises from the uncertain accumulation of numerous molecular variations. Molecular variation of lung cells leads to tumors in the form of cancer with nonuniform representation and extremities [1]. Identification of lung cancer can be carried out using invasive techniques such as clinical screening and biopsy, and non-invasive techniques like image analysis with respect to dermoscopic and histopathological aspects. However, correct diagnosis of lung lesions is hard, cumbersome and complicated due to the heterogeneous appearance, nonuniform shapes and segments of lung lesions [2]. Manual recognition and classification of lung lesions are extremely laborious and difficult on features with a large degree of intraclass change and a low degree of interclass modification [3]. Machine learning-based algorithms such as K-Nearest Neighbor [4], Random Forest [5] and Artificial Neural Network [6] have been implemented to classify the cancer on the basis of tumor representation and its characteristics (structure, dimensions and edges) into tumor and non-tumor types. Machine learning models are not suitable for staging tumor features: those approaches require high computation time, which leads to reduced accuracy and efficiency. Furthermore, these approaches adapt poorly over time and are less resilient to tumor boundary changes across the multiple classes of tumor features of small lung cancer cells. To mitigate those limitations, a deep learning architecture is utilized, as it is highly beneficial in categorizing the features of the lesions efficiently and accurately [7]. In this paper, a novel dense Convolution Neural Network for lung cancer classification and staging of the disease on NSCLC images is proposed. Initially, image noise removal, segmentation and feature extraction are employed to remove noise, segment the coarse appearance and lesion boundary, and extract lesion features such as lesion diameter, lesion asymmetry and lesion borders. Further, the extracted features are fed to the dense Convolution Neural Network, which contains dense blocks to classify the features into three disease classes: adenocarcinoma, large cell carcinoma and squamous cell carcinoma [8, 21, 22].


Finally, the proposed model uses hyperparameter optimization to reduce the network complexity and enhance the computational efficiency. The remainder of the article is organized as follows: Sect. 2 presents the problem statement and literature review for lung cancer classification. Section 3 describes the proposed dense Convolution Neural Network architecture for classifying lesion features into disease types and stages. Implementation analysis of the proposed methodology on the disease dataset is given in Sect. 4, along with experimental analysis on numerous measures such as accuracy, recall and precision derived from the confusion matrix. Finally, Sect. 5 concludes the work with remarkable suggestions.

2 Related Work In this section, several traditional approaches that automate lung lesion identification and classification through the analysis of medical images with machine learning models are reviewed, as follows.

2.1 Lung Lesion Classification Using Artificial Neural Network The Artificial Neural Network is effective in detecting and classifying lung lesions. Categorization is carried out by preprocessing the lung images with a maximum gradient intensity algorithm [9]. Next, the preprocessed image is segmented using the Otsu threshold model to separate the lung lesion. A gray-level co-occurrence matrix [10] is applied to extract multiple features from the segmented images, and the computed features are used to train the neural network, which classifies them into tumor and non-tumor classes. The finding for this architecture is that it is capable of classifying lung disease into various classes with a performance accuracy of 96% and reduced processing time compared with other machine learning classifiers.

2.2 K-Nearest Neighbor Classification Model for Lung Lesion Classification The K-Nearest Neighbor classification model is employed to identify lung lesions and classify them as normal or benign. Classification is carried out after preprocessing, feature extraction and segmentation of the images using region growing and a local binary pattern mechanism [11]. These processes generate the lesion boundaries and effective features for classification.


KNN classification in [12] uses the obtained features to classify tumor and non-tumor cases with high scalability and reliability. The finding for this architecture is that it is capable of classifying lung disease into various classes with a performance accuracy of 87%, but it leads to over-fitting and under-fitting issues.

3 Current Approach In this section, a novel deep learning approach, the dense Convolution Neural Network architecture, is designed for the lung lesion illness. The approach is established to detect and classify the severity of disease tumors into adenocarcinoma, large cell carcinoma and squamous cell carcinoma, classified and staged with respect to the lesion features.

3.1 Image Preprocessing Lung images may contain artifacts such as noise, which can be reduced using the image contrast enhancement technique termed CLAHE (contrast-limited adaptive histogram equalization). It is employed to smooth the image by eliminating artifacts without altering the necessary characteristics of the lesion image. CLAHE incorporates the histogram equalization operation along with bilinear interpolation, which enhances the contrast, and noise is further reduced with adaptive median filtering.
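A minimal sketch of this preprocessing step using OpenCV ("lung.png" and the parameter values are illustrative placeholders; the authors' MATLAB implementation and settings are not reproduced here):

import cv2

# Load an NSCLC slice in grayscale (placeholder path).
img = cv2.imread("lung.png", cv2.IMREAD_GRAYSCALE)

# clipLimit caps histogram amplification (limiting noise boost);
# tileGridSize sets the local regions whose equalized histograms are
# blended with bilinear interpolation.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)

# The section mentions adaptive median filtering; OpenCV's plain median
# blur is used here as a simple stand-in.
denoised = cv2.medianBlur(enhanced, 3)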

3.2 Image Segmentation—Gradient Vector Flow (GVF) The preprocessed NSCLC image is segmented using Gradient Vector Flow [14] to detect the lesion boundaries in the image. The gradient vector allows large-scale feature similarity of the tumor part when processing with respect to the lesion boundary. The object boundary of the image lesion is parameterized as

X(s) = (x(s), y(s)), where s ∈ [0, 1].

The image contour is initialized using heuristic criteria that capture the coarse appearance and lesion boundary, and it evolves according to the differential equation

∂X(s, t)/∂t = F_int(X(s, t)) + V_int(X(s, t)),


where F_int is an internal force that maintains the shape continuity and smoothness of the contour and V_int is the gradient vector flow. The vector flow contains the lesion boundaries, which are taken forward for feature extraction.
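Classic GVF snakes are not bundled with common Python imaging libraries; purely as an illustrative stand-in, scikit-image's standard active-contour (snake) solver below evolves an initial circular contour toward a lesion boundary on a synthetic image (all coordinates and weights are assumptions):

import numpy as np
from skimage.draw import disk
from skimage.filters import gaussian
from skimage.segmentation import active_contour

# Synthetic "lesion": a bright disk standing in for the preprocessed image.
img = np.zeros((200, 200))
rr, cc = disk((100, 110), 40)
img[rr, cc] = 1.0

# Initial circular contour around the presumed lesion, in (row, col) order.
s = np.linspace(0, 2 * np.pi, 200)
init = np.column_stack([100 + 60 * np.sin(s), 110 + 60 * np.cos(s)])

# Evolve the snake toward the lesion boundary; alpha/beta control the
# internal (continuity/smoothness) forces, analogous to F_int above.
snake = active_contour(gaussian(img, sigma=3), init,
                       alpha=0.015, beta=10, gamma=0.001)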

3.3 Feature Extraction—ABCD Rule The segmented image is processed using the ABCD rule [15] as a feature descriptor to extract lesion features such as lesion diameter, asymmetry, border and color. Variational features are segmented into normal and abnormal tumor features using the ABCD segmentation conditions, which process the feature vector to identify the asymmetry, border and color characteristics of the features [10] and segment the tumor regions accurately.

• Asymmetry Asymmetry is important in lesion segment analysis. The asymmetry of the lesion is computed using the asymmetry index and the lengthening index. The asymmetry index of a lesion image segment is computed as

AI = (ΔA / A) × 100,

where A is the total surface of the image and ΔA is the surface difference among the tumor surfaces of the image.

• Border Irregularity Border irregularity quantifies the irregularity of the border of the lesion segments. There are several measures of irregularity, termed the compact index, fractal index, edge variation and pigment variation.

• Compact Index It measures the barrier to noise along the boundary and is computed as

CI = P_L² / (4π A_L),

where P_L is the lesion perimeter and A_L is the lesion area.
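A hedged sketch of these two measures computed from a binary lesion mask; reflecting about the mask's vertical midline is one simple realization of ΔA, not necessarily the paper's:

import numpy as np
from skimage.measure import label, regionprops

# Toy binary lesion mask (a filled rectangle); in the pipeline this would
# come from the GVF segmentation step.
mask = np.zeros((64, 64), dtype=bool)
mask[20:45, 18:40] = True

props = regionprops(label(mask.astype(int)))[0]
area = props.area            # A_L: lesion area in pixels
perimeter = props.perimeter  # P_L: lesion perimeter in pixels

# Compact index CI = P_L^2 / (4*pi*A_L); ~1 for a circle, larger values
# indicate a more irregular border.
ci = perimeter ** 2 / (4 * np.pi * area)

# Asymmetry index AI = (dA / A) * 100, realized here by reflecting the
# mask and measuring the non-overlapping area.
flipped = mask[:, ::-1]
ai = np.logical_xor(mask, flipped).sum() / mask.sum() * 100
print(f"CI={ci:.2f}, AI={ai:.1f}%")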

3.4 Dense Convolution Neural Network In this work, the extracted features from the ABCD feature descriptor are fed to the DenseNet architecture, which processes the feature vector to produce the disease type and stage: large cell carcinoma, adenocarcinoma or squamous cell carcinoma. An AlexNet-style mechanism [16] is employed to generate combinations of feature maps at minimum resolution, as it is a mixture of convolution layers, pooling layers, activation layers, a loss layer and a fully connected layer serving as the classification layer.

• Convolution Layer In this layer, a kernel size of 3×3 is utilized to process the features from the ABCD descriptors and construct the feature map from the feature vectors. Figure 1 represents the feature map generation in the convolution layer.

Fig. 1 Feature map of the lung lesion feature vectors

• Max Pooling Layer In this layer, the feature map is down-sampled by half by computing the relationships among the lesion features, and a pooling index is created for the features to control over-fitting. The max pooling layer extracts features that are high-level representations of the feature vector constructed by the ABCD rule. The feature map is represented as F_m ∈ R^(C×H×W).

• Activation Layer The proposed approach employs rectified linear units (ReLU) as the activation function, as it improves the training stage to minimize errors and introduces nonlinearity among the max-pooled feature vectors. The ReLU activation function is given by

F(x) = x if x > 0, and 0 otherwise, i.e., F(x) = max(0, x).
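The paper's 128-layer dense blocks are implemented in MATLAB; purely as an illustrative sketch of the dense-connectivity pattern and the ReLU activation above, a single PyTorch dense layer might look like this:

import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    # One DenseNet-style layer: BN -> ReLU -> 3x3 conv, with the output
    # concatenated onto the input (the dense-connectivity pattern).
    def __init__(self, in_channels, growth_rate=32):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)  # F(x) = max(0, x)
        self.conv = nn.Conv2d(in_channels, growth_rate,
                              kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv(self.relu(self.bn(x)))
        return torch.cat([x, out], dim=1)  # dense connectivity

# Toy usage: two stacked dense layers on a 1-channel feature map.
x = torch.randn(1, 1, 64, 64)
layer1 = DenseLayer(1)
layer2 = DenseLayer(1 + 32)
y = layer2(layer1(x))
print(y.shape)  # torch.Size([1, 65, 64, 64])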