Meta Heuristic Techniques in Software Engineering and Its Applications: METASOFT 2022 (Artificial Intelligence-Enhanced Software and Systems Engineering, 1) 3031117123, 9783031117121


English Pages [368]

Table of contents :
Preface
Contents
Performance Analysis of Heuristic Optimization Algorithms for Transportation Problem
1 Introduction
2 Mathematical Model of TP
3 Optimization Algorithms
3.1 PSO
3.2 APSO
3.3 NQPSO
4 Steps of Optimization
5 Performance Evaluation of the Algorithms
5.1 Convergence
5.2 Optimization Value
5.3 Accuracy
6 Conclusion
References
Source Code Features Based Branch Coverage Prediction Using Ensemble Technique
1 Introduction
2 Related Works
3 Proposed Approach
4 Implementation and Result Analysis
4.1 Result Analysis
4.2 Feature Analysis
5 Comparison with Related Work
6 Conclusions and Future Work
References
Implicit Methods of Multi-factor Authentication
1 Introduction
2 Related Work
3 Problem Statement
4 Implicit Methods of Multifactor Authentication
4.1 Mouse Events as Passcode
4.2 Image as Passcode
4.3 Patterned OTP
5 Implementation
5.1 Mouse Events as Passcode
5.2 Image as Passcode
5.3 Patterned Based OTP
6 Experimental Results
6.1 Mouse Events as Passcode
6.2 Image as Passcode
6.3 Pattern Based OTP
7 Analysis
7.1 Usability
7.2 Security and Privacy
8 Conclusion
References
Comparative Analysis of Different Classifiers Using Machine Learning Algorithm for Diabetes Mellitus
1 Introduction
2 Related Work
2.1 Control Mechanism Process for Diabetes Mellitus
2.2 Focus on Digital System
2.3 Medication
2.4 Diabetes Services
2.5 Psychological Issues and Tensity
3 Methodology
3.1 Data Gathering
3.2 Data Prediction
3.3 Implementation of Classifier for Evaluation of Accuracy
4 Proposed Architecture
5 Logistic Regression
6 K-Nearest Neighbors Algorithms (K-NN)
7 CART (Classification and Regression Tree)
8 Conclusion
References
Survey on Machine Learning Techniques for Software Reliability Accuracy Prediction
1 Introduction
2 Research Methodology and Contribution
3 Reliability Measurement Methods and Process
4 Related Work
5 Motivations and Objectives
5.1 Motivation
5.2 Objective
6 Methodology
6.1 Axes of Quality Evaluation
6.2 Technical Catalogue
7 Evaluation Criteria
8 Experiments and Results
8.1 Defect Prediction
9 Conclusion
References
Classification of Pest in Tomato Plants Using CNN
1 Introduction
2 Materials and Methods
2.1 Sample Collection
2.2 Hardware Setup
2.3 Software Setup
2.4 Image Processing
2.5 Design and Setup of Data and CNN Architecture
3 Results and Discussion
3.1 Comparison with Existing Works
3.2 Limitations
4 Conclusions
References
Deep Neural Network Approach for Identifying Good Answers in Community Platforms
1 Introduction
2 Ranking and Tagging of Quality Answers in CQA
2.1 Tagging Answers in CQA Using Ensemble Deep Learning Model
3 Results and Discussion
4 Conclusion and Future Direction
References
Time Series Analysis of SAR-Cov-2 Virus in India Using Facebook’s Prophet
1 Introduction
2 Methodology
2.1 Data Collection
2.2 Testing
2.3 Data Visualization
2.4 Comparison
2.5 Facebook’s Prophet
3 Conclusion and Future Work
References
Model-Based Smoke Testing Approach of Service Oriented Architecture (SOA)
1 Introduction
2 Related Works on ERP
3 Overview of SOA Testing and Smoke Testing
3.1 The Importance of SOA
3.2 Challenges in SOA Testing
3.3 Smoke Testing
4 Case Study: Enterprise Resource Planning (ERP)
5 Conclusion
References
Role of Hybrid Evolutionary Approaches for Feature Selection in Classification: A Review
1 Introduction
2 Background
2.1 Searching Techniques
2.2 Criteria for Evaluation
2.3 Number of Objectives
3 Systematic Literature Review
3.1 Thawkar et al. (2021) [30]
3.2 Hussain et al. (2021) [32]
3.3 Wajih et al. (2021) [33]
3.4 Bindu et al. (2020) [3]
3.5 Bhattacharyya et al. (2020) [6]
3.6 Alweshah et al. (2020) [7]
3.7 Meera et al. (2020) [8]
3.8 Neggaz et al. (2020) [9]
3.9 Hans et al. (2020) [29]
3.10 Khamees et al. (2020) [37]
4 Analysis
5 Conclusion
References
Evaluation of Deep Learning Models for Detecting Breast Cancer Using Mammograms
1 Introduction
2 Literature Review
3 Proposed Methods
4 Simulation Environment
5 Experimental Results
6 Conclusion
References
Evaluation of Crop Yield Prediction Using Arsenal and Ensemble Machine Learning Algorithms
1 Introduction
2 Literature Review
3 Methodology
3.1 Basic Terminologies
4 Module Description
4.1 Module 1 - Data Acquisition
4.2 Module 2 - Exploratory Data Analysis
4.3 Module 3 - Preprocessing
4.4 Module 4 - Model Evaluation
5 GUI Application
5.1 Application Functionality
6 Results and Discussion
7 Conclusion and Future Scope
References
Notification Based Multichannel MAC (NM-MAC) Protocol for Wireless Body Area Network
1 Introduction
2 Related Work
3 Proposed Work
3.1 Data Type Organisation
3.2 Notification Based Multichannel MAC Protocol
4 Simulation Results
5 Conclusion and Future Scope
References
A Multi Brain Tumor Classification Using a Deep Reinforcement Learning Model
1 Introduction
1.1 Brain Tumor
1.2 Glioma Tumor
1.3 Meningioma Tumor
1.4 Pituitary Tumors
1.5 Dataset
2 Related Work
3 Proposed Model
3.1 Convolution Neural Networks
3.2 Reinforcement Learning
3.3 Deep Q-Learning
3.4 Confusion Matrix
4 Experimental Results
5 Conclusion
References
A Brief Analysis on Security in Healthcare Data Using Blockchain
1 Introduction
2 Overview of Blockchain
2.1 Properties of Blockchain
2.2 Types of Blockchain
2.3 Blockchain in Healthcare
3 Related Work
4 Discussion
5 Conclusion and Future Scope
References
A Review on Test Case Selection, Prioritization and Minimization in Regression Testing
1 Introduction
2 Basic Concepts
2.1 Regression Testing
2.2 Effectiveness of Prioritized Test
3 Related Work
4 Conclusion and Future Scope
References
Artificial Intelligence Advancement in Pandemic Era
1 Introduction
2 Artificial Intelligence
2.1 Artificial Intelligence Role in the Treatment of COVID-19
2.2 Artificial Intelligence as a Catalyst for the Exchange of Information
2.3 Role of Artificial Intelligence as an Observer and Predictor in the Pandemic Evolution
2.4 Healthcare Personnel Assistance with Artificial Intelligence
3 AI and ML to Combat Covid-19
3.1 Developing Novel COVID-19 Antibody Sequences for the use in Experimental Testing Using the Machine Learning
3.2 CT Scan Checks with AI for COVID-19
3.3 AI as an Aid in the Transmission of COVID-19
4 Conclusion
References
Predictive Technique for Identification of Diabetes Using Machine Learning
1 Introduction
2 Literature Review
3 Prediction of Healthcare Data Using AI and ML
4 Results
5 Conclusion
References
Prognosis of Prostate Cancer Using Machine Learning
1 Introduction
1.1 Correlation Between Machine Learning and Healthcare Databases
2 Literature Review
3 Outline of Research
4 Result
4.1 The Yearly Incidence Rate and the Mortality Rate of Prostate Cancer.
4.2 The Age-Based Incidence Rate and Mortality Rate of Prostate Cancer
5 Conclusion
References
Sign Language Detection Using Tensorflow Object Detection
1 Introduction
2 Literature Review
3 Methodology
3.1 Data Collection
3.2 Labelling Images
3.3 TF Record Generation
3.4 Creating Model Configuration File
3.5 Training Model
4 Result and Discussion
5 Conclusion
References
Automated Test Case Prioritization Using Machine Learning
1 Introduction
2 System Overview
3 Literature Survey and Problem Definition
4 Proposed Solution
5 Result and Future Scope
6 Conclusion
References
A New Approach to Solve Linear Fuzzy Stochastic Differential Equation
1 Introduction
2 Basic Preliminaries
2.1 Stochastic Differential Equation [7]
2.2 Linear Stochastic Differential Equation [7]
2.3 Brownian Motion (BM) [7]
2.4 Fuzzy Brownian Motion (FBM) [9]
2.5 Fuzzy Ito Formula (FIF) [9]
2.6 Stochastic Process (SP) [7]
2.7 Fuzzy Stochastic Process (FSP) [9]
2.8 Adapted Process [7]
2.9 Fuzzy Ito Product [9]
3 Linear Fuzzy Stochastic Differential Equation (LFSDE)
4 Conclusion
References
An Improved Software Reliability Prediction Model by Using Feature Selection and Extreme Learning Machine
1 Introduction
2 Related Work
3 Methodology
3.1 Collection of Data
3.2 Preparation of Data
3.3 Feature Selection
3.4 Training the ELM Classifier
3.5 Testing and Calculation of Accuracy
4 Results and Analysis
5 Conclusion
References
Signal Processing Approaches for Encoded Protein Sequences in Gynecological Cancer Hotspot Prediction: A Review
1 Introduction
2 Materials and Methods
2.1 Hotspot Prediction in Proteins
2.2 Hotspot Prediction in Cancer Cells
3 Conclusions
References
DepNet: Deep Neural Network Based Model for Estimating the Crowd Count
1 Introduction
2 Related Works
3 Proposed Methodology
3.1 Density-Estimation Based Approach
3.2 Neural Net Architecture
3.3 Data Augmentation
3.4 Training Details
3.5 Result and Evaluation
4 Conclusion
References
Dynamic Stability Enhancement of Power System by Sailfish Algorithm Tuned Fractional SSSC Control Action
1 Introduction
2 Power System with SSSC
3 Proposed Controller
4 Objective Function
5 Sailfish Algorithm (SFA)
6 Result and Discussion
7 Conclusion
References
Application of Machine Learning Model Based Techniques for Prediction of Heart Diseases
1 Introduction
2 Related Work
3 Proposed Work and Methodology
4 Problem-Solving Method
5 Results Analysis and Discussion
6 Conclusion and Future Work
References
Software Effort and Duration Estimation Using SVM and Logistic Regression
1 Introduction
2 Objective
3 Challenges
4 Proposed System
5 Literature Survey
6 System Design
6.1 Data Set
6.2 Architecture
6.3 Machine Learning Algorithms
6.4 Performance Metrics
7 Implementation
7.1 Data Preprocessing
7.2 COCOMO Model for Effort and Duration
7.3 Classifier SVC
7.4 Logistic Regression
8 Results and Discussion
9 Conclusion
References
A Framework for Ranking Cloud Services Based on an Integrated BWM-Entropy-TOPSIS Method
1 Introduction
2 Related Works
3 Proposed Cloud Service Selection Framework
4 Cloud Service Selection Methodology
5 A Case Study with Experiment
5.1 Experiments
6 Concluding Remarks
References
An Efficient and Delay-Aware Path Construction Approach Using Mobile Sink in Wireless Sensor Network
1 Introduction
2 Related Work
3 Preliminaries
3.1 Network Model
3.2 Energy Model
3.3 Problem Description
4 Proposed Algorithm
4.1 Generation of Virtual Path
4.2 Network Division and RP Generation
5 Simulation Results
5.1 Simulation Setup
5.2 Results and Discussion
6 Conclusion
References
Application of Different Control Techniques of Multi-area Power Systems
1 Introduction
2 Materials and Methods
2.1 System Examined
2.2 Controller Structure and Objective Function
2.3 Firefly Technique
3 Results and Discussions
4 Conclusion
References
Analysis of an Ensemble Model for Network Intrusion Detection
1 Introduction
2 Literature Survey
3 Implementation
3.1 KDD Dataset
3.2 Data Load and Pre-processing
3.3 Exploratory Data Analysis
3.4 Standardization of Numerical Attributes
3.5 Encoding Categorical Attributes
3.6 Data Sampling
3.7 Feature Selection
3.8 Data Partition
3.9 Train Models
3.10 Evaluate Models
4 Results and Discussions
5 Conclusion
6 Future Scope
References
D2D Resource Allocation for Joint Power Control in Heterogeneous Cellular Networks
1 Introduction
2 System Model and Problem Statement
2.1 System Model
2.2 Problem Statement
3 Resource Allocation Algorithm for Joint Power Control
3.1 Power Control
3.2 Power Control
3.3 Search and Exchange Algorithm
4 Simulation and Results
4.1 Search and Exchange Algorithm
4.2 Results
5 Conclusion
References
Prediction of Covid-19 Cases in Kerala Based on Meteorological Parameters Using BiLSTM Technique
1 Introduction
2 Literature Review
3 Methodology
3.1 Data for Model Development
3.2 Modeling Framework
3.3 LSTM – Long Short Term Memory
3.4 Bi LSTM - Bidirectional Long Short Term Memory
3.5 Evaluation of Bi-lSTM
4 Results
5 Conclusion
References
Monitoring COVID-Related Face Mask Protocol Using ResNet DNN
1 Introduction
2 ResNet-50 Based Method
3 Methodology
4 Experimental Results
5 Conclusion
References
Author Index


Artificial Intelligence-Enhanced Software and Systems Engineering 1

Mihir Narayan Mohanty Swagatam Das Mitrabinda Ray Bichitrananda Patra Editors

Meta Heuristic Techniques in Software Engineering and Its Applications METASOFT 2022

Artificial Intelligence-Enhanced Software and Systems Engineering
Volume 1

Series Editors
Maria Virvou, Department of Informatics, University of Piraeus, Piraeus, Greece
George A. Tsihrintzis, Department of Informatics, University of Piraeus, Piraeus, Greece
Nikolaos G. Bourbakis, College of Engineering and Computer Science, Wright State University, Dayton, OH, USA
Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK

The book series AI-SSE publishes new developments and advances on all aspects of Artificial Intelligence-enhanced Software and Systems Engineering, quickly and with high quality. The series provides concise coverage of particular topics from both the vantage point of a newcomer and that of a highly specialized researcher in these scientific disciplines, which results in significant cross-fertilization and research dissemination. To maximize dissemination of research results and knowledge in these disciplines, the series will publish edited books, monographs, handbooks, textbooks and conference proceedings. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output.

More information about this series at https://link.springer.com/bookseries/16891


Editors
Mihir Narayan Mohanty, Department of ECE, ITER, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, India
Mitrabinda Ray, Department of CSE, ITER, Siksha ‘O’ Anusandhan, Bhubaneswar, India
Swagatam Das, Indian Statistical Institute, Kolkata, West Bengal, India
Bichitrananda Patra, Department of CA, ITER, Siksha ‘O’ Anusandhan, Bhubaneswar, India

ISSN 2731-6025
ISSN 2731-6033 (electronic)
Artificial Intelligence-Enhanced Software and Systems Engineering
ISBN 978-3-031-11712-1
ISBN 978-3-031-11713-8 (eBook)
https://doi.org/10.1007/978-3-031-11713-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

In the twenty-first century, the extensive use of software technologies has increased efficiency and productivity in the workplace. As customer expectations rise day by day, products become more complex and chaotic. As a result, real-life problems in various fields, which are complex and difficult to solve, need to be optimized. There is a need to overcome the trade-off between exact methods, which may guarantee an optimal solution at the cost of more computing time, and greedy methods, which require less computing time but provide a low-quality or unsatisfactory solution. In this situation, there is a need to adopt a heuristic, a technique that represents some form of shortcut and is used to solve an optimization problem quickly. The main objective of this technique is to find a “good enough” solution to a problem for which the exact solution is too hard or too time-consuming to obtain. The solution is obtained approximately, by combining constructive methods with local and population-based search strategies. From computer science and engineering to economics and management, optimization is a core component of problem solving. Metaheuristic algorithms have attracted a great deal of attention in artificial intelligence, software engineering, data mining, planning and scheduling, logistics and supply chains, etc. The international conference entitled “Metaheuristics in Software Engineering and Its Application” (METASOFT-2022) was organized by the Department of Computer Science and Engineering, Institute of Technical Education and Research, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India, on 10–12 March 2022. The conference focuses on numerous advanced concepts in modern metaheuristic techniques that help global leaders make fast decisions by providing quality solutions to important problems in business, engineering, economics and science.
More than 120 articles related to the scope of the conference were received online. Out of these submissions, the editors chose only 34 high-quality articles after a thorough and rigorous peer-review process. In the peer-review process, several highly knowledgeable researchers and professors with expertise in single or multiple domains assisted the editors in making unbiased decisions on the acceptance of the selected articles. Moreover, valuable suggestions from the advisory, programme and technical committees also helped the editors in smoothing the peer-review process. The complete review process is based on several criteria, such as major contribution, technicality, clarity and originality of the latest findings. The whole process, from initial submission to the acceptance notification to authors, is done electronically. The conference “METASOFT-2022” focuses on a detailed review of metaheuristic algorithms, their applications, the related issues and recent trends, providing researchers with the skills needed to apply these algorithms to optimization problems in their research work. The key concepts and techniques of widely used search-based techniques and their applications are discussed in the fields of networking, software engineering, electrical and mechanical engineering, etc. The discussion of the foundations of optimization and algorithms gives beginners an idea of how to apply common approaches to optimization problems. The conference presents common metaheuristic algorithms in detail, including genetic algorithms, simulated annealing, ant algorithms, bee algorithms, particle swarm optimization, firefly algorithms and harmony search, and also discusses various modifications used for multi-objective optimization. These algorithms are applied by researchers in software testing not only for test case generation and optimization but also for test sequence generation. Applying metaheuristic algorithms to test sequence generation at the design phase is a new research area. The accepted manuscripts (original research and survey articles) have been organized to emphasize the cutting-edge technologies applied in the electrical, electronics and computer science domains. We appreciate the authors’ contributions and value their choice of “METASOFT” for disseminating their research findings.
We are also grateful for the help received from each individual reviewer and the programme committee members regarding the peer-review process. We are highly thankful to the management of SOA (Deemed to be University) and each faculty member of the Department of Computer Science and Engineering, ITER, for their constant support and motivation in making the conference successful. The editors would also like to thank the Springer Editorial Members for their constant help and for publishing the proceedings in the “Artificial Intelligence-enhanced Software and Systems Engineering” series.

Contents

Performance Analysis of Heuristic Optimization Algorithms for Transportation Problem
S. K. Behera, Ayeshkant Mallick, and Nilima R. Das

Source Code Features Based Branch Coverage Prediction Using Ensemble Technique
Swadhin Kumar Barisal, Pushkar Kishore, and Gayatri Nayak

Implicit Methods of Multi-factor Authentication
Chippada Monisha, Koli Pavan Kumar, Pasili Ajay, Pushpendra Kumar Chandra, and Satish Kumar Negi

Comparative Analysis of Different Classifiers Using Machine Learning Algorithm for Diabetes Mellitus
Santosh Kumar Sharma, Ankur Priyadarshi, Srikanta Kumar Mohapatra, Jitesh Pradhan, and Prakash Kumar Sarangi

Survey on Machine Learning Techniques for Software Reliability Accuracy Prediction
Suneel Kumar Rath, Madhusmita Sahu, Shom Prasad Das, and Jitesh Pradhan

Classification of Pest in Tomato Plants Using CNN
K. N. S. Dharmasastha, K. Sharmila Banu, G. Kalaichevlan, B. Lincy, and B. K. Tripathy

Deep Neural Network Approach for Identifying Good Answers in Community Platforms
Julius Femi Godslove and Ajit Kumar Nayak

Time Series Analysis of SAR-Cov-2 Virus in India Using Facebook’s Prophet
Sushree Gayatri Priyadarsini Prusty and Sashikanta Prusty

Model-Based Smoke Testing Approach of Service Oriented Architecture (SOA)
Pragya Jha, Madhusmita Sahu, and Sukant Kishoro Bisoy

Role of Hybrid Evolutionary Approaches for Feature Selection in Classification: A Review
Jayashree Piri, Puspanjali Mohapatra, Raghunath Dey, and Niranjan Panda

Evaluation of Deep Learning Models for Detecting Breast Cancer Using Mammograms
Subasish Mohapatra, Sarmistha Muduly, Subhadarshini Mohanty, and Santosh Kumar Moharana

Evaluation of Crop Yield Prediction Using Arsenal and Ensemble Machine Learning Algorithms
Nikitha Pitla and Kayal Padmanandam

Notification Based Multichannel MAC (NM-MAC) Protocol for Wireless Body Area Network
Manish Chandra Roy, Tusarkanta Samal, and Anita Sahoo

A Multi Brain Tumor Classification Using a Deep Reinforcement Learning Model
B. Anil Kumar and N. Lakshmidevi

A Brief Analysis on Security in Healthcare Data Using Blockchain
Satyajit Mohapatra, Pranati Mishra, and Ranjan Kumar Dash

A Review on Test Case Selection, Prioritization and Minimization in Regression Testing
Swarnalipsa Parida, Dharashree Rath, and Deepti Bala Mishra

Artificial Intelligence Advancement in Pandemic Era
Ritu Chauhan, Harleen Kaur, and Bhavya Alankar

Predictive Technique for Identification of Diabetes Using Machine Learning
Ritu Chauhan, Harleen Kaur, and Bhavya Alankar

Prognosis of Prostate Cancer Using Machine Learning
Ritu Chauhan, Neeraj Kumar, Harleen Kaur, and Bhavya Alankar

Sign Language Detection Using Tensorflow Object Detection
Harleen Kaur, Arisha Mirza, Bhavya Alankar, and Ritu Chauhan

Automated Test Case Prioritization Using Machine Learning
Ayusee Swain, Kaliprasanna Swain, S. K. Swain, S. R. Samal, and G. Palai

A New Approach to Solve Linear Fuzzy Stochastic Differential Equation
S. Panda, J. K. Dash, and G. B. Panda

An Improved Software Reliability Prediction Model by Using Feature Selection and Extreme Learning Machine
Suneel Kumar Rath, Madhusmita Sahu, Shom Prasad Das, and Jitesh Pradhan

Signal Processing Approaches for Encoded Protein Sequences in Gynecological Cancer Hotspot Prediction: A Review
Lopamudra Das, Sony Nanda, Bhagyalaxmi Nayak, and Sarita Nanda

DepNet: Deep Neural Network Based Model for Estimating the Crowd Count
Amit Baghel, Pushpendra Kumar Chandra, and Satish Kumar Negi

Dynamic Stability Enhancement of Power System by Sailfish Algorithm Tuned Fractional SSSC Control Action
Sankalpa Bohidar, Samarjeet Satapathy, Narayan Nahak, and Ranjan Kumar Mallick

Application of Machine Learning Model Based Techniques for Prediction of Heart Diseases
Nibedan Panda, Prithviraj Mohanty, G. Nageswara Rao, and Sai Tulsibabu

Software Effort and Duration Estimation Using SVM and Logistic Regression
Sasanko Sekhar Gantayat and V. Aditya

A Framework for Ranking Cloud Services Based on an Integrated BWM-Entropy-TOPSIS Method
Soumya Snigdha Mohapatra and Rakesh Ranjan Kumar

An Efficient and Delay-Aware Path Construction Approach Using Mobile Sink in Wireless Sensor Network
Piyush Nawnath Raut and Abhinav Tomar

Application of Different Control Techniques of Multi-area Power Systems
Smrutiranjan Nayak, Sanjeeb Kumar Kar, and Subhransu Sekhar Dash

Analysis of an Ensemble Model for Network Intrusion Detection
H. S. Gururaja and M. Seetha

D2D Resource Allocation for Joint Power Control in Heterogeneous Cellular Networks
Hyungi Jeong and Wanying Guo

Prediction of Covid-19 Cases in Kerala Based on Meteorological Parameters Using BiLSTM Technique
Jerome Francis, Brinda Dasgupta, G. K. Abraham, and Mahuya Deb

Monitoring COVID-Related Face Mask Protocol Using ResNet DNN
Atlanta Choudhury and Kandarpa Kumar Sarma

Author Index

Performance Analysis of Heuristic Optimization Algorithms for Transportation Problem

S. K. Behera 1, Ayeshkant Mallick 2, and Nilima R. Das 3 (corresponding author)

1 Department of Mathematics, Government Women’s College, Sambalpur, Odisha, India
2 Department of CSE, Trident Academy of Technology, Bhubaneswar, Odisha, India
3 Department of CA, Siksha O Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India

Abstract. Particle swarm optimization (PSO) is a commonly used population-based stochastic optimization method. It is an optimization process that imitates the collective action and reaction of a biological population. Because the results produced by PSO are satisfactory and effective, it is widely used for solving a variety of problems. It is a fast and economical method that employs a small number of parameters to fine-tune the result. However, PSO can easily be trapped in local optima when solving complicated problems. A number of PSO variants have been designed to increase the convergence speed without becoming trapped in local optima. QPSO (Quantum-behaved PSO), WQPSO (Weighted Quantum-behaved PSO), APSO (Adaptive PSO), NQPSO (Quantum-behaved particle swarm optimization with neighbourhood search for numerical optimization), and IQPSOS (Improved QPSO-Simplex method) are some examples of these variants. In this article, PSO and some of these variants are employed to find the optimal solution for transportation problems. The performance of the algorithms is compared based on their convergence characteristics and the accuracy achieved in the results. The comparative evaluation shows that NQPSO outperforms the other algorithms discussed and can be used to obtain solutions for transportation problems.

Keywords: PSO · APSO · NQPSO · Transportation problem
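To make the abstract's description of PSO concrete, the following is a minimal sketch of a textbook global-best PSO with an inertia weight. It is not the authors' implementation: the function name `pso_minimize`, the parameter values (`inertia`, `c1`, `c2`, swarm size, iteration count) and the sphere test function are all illustrative assumptions.

```python
import random


def pso_minimize(objective, dim, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Textbook global-best PSO; all default values are illustrative assumptions."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random start positions inside the box, zero start velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull toward own best + pull toward swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Keep the particle inside the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple convex test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_val = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
print(best_x, best_val)
```

Only three coefficients steer the search here, which is the sense in which PSO uses few parameters. On a multimodal objective the same pull toward the swarm best can hold every particle near a local optimum, which is the weakness the variants listed above aim to reduce.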

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 1–9, 2022. https://doi.org/10.1007/978-3-031-11713-8_1

1 Introduction

Transportation problems (TP) are at the centre of attention in combinatorial optimisation. They are elementary problems of network flow optimization, and a variety of real-life problems can be represented as a TP. The objective of the TP is to minimize the cost incurred by transporting a certain product from a set of sources (manufacturing units) to a set of destinations. The TP is a fundamental linear programming problem in which several sources hold identical products and several destinations require these products [1, 2].

A lot of work has been done recently to solve the transportation problem. The authors in [3] proposed a blocking method to obtain an optimal solution for the TP. Another method, called the blocking zero method, provides a feasible solution that minimizes the maximum value of the unit costs [4]. Solutions to the TP were also provided by ant colony optimization [5] and hybrid ant colony optimization [6]. The methods used in these articles employ computational intelligence based techniques to solve the TP with the aim of minimizing the cost of transportation. The authors in [7] derived techniques using MATLAB that reduce the computation needed to obtain the initial feasible solution for the TP. The authors in [8] put emphasis on experimental design and measures of algorithm performance. The authors in [9, 10] used a novel heuristic for finding a solution to the TP. A method named the Bilqis Chastine Erma method (BCE) was suggested to obtain the first basic feasible solution of the TP [11]. Another research work designed an algorithm called the Zack Algorithm to find an initial solution to the TP [12]. The cost generated by this algorithm is smaller than that produced by the North West Corner and the Least Cost methods.

For the fixed charge transportation problem (an NP-hard problem) the computational time increases exponentially with the size of the problem. Meta-heuristic algorithms such as the Genetic Algorithm (GA), simulated annealing (SA) and the Keshtel algorithm (KA) have therefore been used to solve the fixed charge TP [13]. Mixed-integer programming was used to solve the TP in [14]; because it cannot be used for large, real-world problems, the authors also implemented ant colony optimization to solve large and complex forest TPs with side constraints. Another research work employed heuristic optimization methods such as GA, harmony search (HS), and OptQuest to solve a TP of an urban arterial network [15]. The linear programming problem is generally solved using popular algorithms such as the simplex method and the branch and bound method.
However, when the complexity of the problem increases or the available resources are restricted, these traditional methods are not always promising. Where such exact methods are not feasible and a near-optimal solution will do, a heuristic approach can be followed, which also reduces the time needed to calculate the solution. It is generally possible to design problem-specific heuristics to find a solution for a complex problem. Such heuristics take advantage of the distinct characteristics of the problem or work on historical information acquired from past experiments. Their quality relies heavily on the amount of prior learning and testing that went into the design of the algorithm. Optimization procedures based on general heuristics have become significantly more common in recent years. These methods apply across a wide range of problems and often perform better than problem-specific heuristics, particularly in terms of solution quality and computation time. It has become a trend to apply meta-heuristic methods to transportation network problems. The most popular meta-heuristic optimization methods include the Genetic Algorithm, Ant Colony Optimization and Particle Swarm Optimization.

In this research, Particle Swarm Optimization (PSO) and some of its variants have been used. These are minimization methods that search for a solution in an N-dimensional space. In this space, a set of candidate particles is first defined and an initial velocity is allocated to each of them. A communication channel between the particles is also defined. The particles then move through the space, and their movement is guided by the evaluation of a merit criterion. Over time, each particle accelerates towards the particles that have high merit and belong to the same communication group. This kind of optimization is effective for a wide range of continuous optimization problems and saves time in finding solutions.

The structure of this article is as follows. Section 2 describes the traditional transportation problem. Section 3 describes the behaviour of some variants of the PSO algorithm. Section 4 provides a brief description of the optimization procedure employed in this work. In Sect. 5 the performance of the algorithms is compared on the basis of simulation results. Section 6 presents the conclusion.

2 Mathematical Model of TP

The transportation problem tries to find an optimal distribution arrangement for a particular product. It includes some sources, each holding some quantity of the product, and some destinations requiring this product. There is a transportation cost between each pair of source and destination. For simplicity the unit transportation cost may be considered constant. The aim is to find the distribution that minimizes the total cost of transportation. Figure 1 describes a typical problem with 3 sources and 2 destinations. A source is a place where the product is available and from which transportation begins. A destination has some demand for the product, so the required quantity of the product should reach it. $c_{xy}$ indicates the cost to transport one unit from $Src_x$ (source) to $Dst_y$ (destination). The quantity of the product to be transported from each source to each destination has to be decided so that the total cost is minimized.

The objective of the TP can be defined as Eq. 1 when there are $p$ sources and $q$ destinations. The transportation cost of a single unit of the product from $x$ to $y$ is $c_{xy}$, where $x$ belongs to the set $\{1, 2, \ldots, p\}$ and $y$ belongs to $\{1, 2, \ldots, q\}$. Let $a_x$ denote the amount of product available at source $x$ and $b_y$ denote the quantity of the product required by destination $y$. $Q_{xy}$ is the quantity of goods to be transported from $x$ to $y$. Mathematically the transportation problem can be presented as:

$$\text{Minimize} \quad \sum_{x=1}^{p} \sum_{y=1}^{q} Q_{xy}\, c_{xy} \qquad (1)$$

Subject to

$$\sum_{y=1}^{q} Q_{xy} \le a_x \quad \text{for } x = 1, 2, \ldots, p$$

$$\sum_{x=1}^{p} Q_{xy} \ge b_y \quad \text{for } y = 1, 2, \ldots, q$$

$$Q_{xy} \ge 0 \quad \forall x \text{ and } \forall y$$

For the balanced TP the model is defined as:

$$\text{Minimize} \quad \sum_{x=1}^{p} \sum_{y=1}^{q} Q_{xy}\, c_{xy} \qquad (2)$$

Subject to

$$\sum_{y=1}^{q} Q_{xy} = a_x \quad \text{for } x = 1, 2, \ldots, p$$

$$\sum_{x=1}^{p} Q_{xy} = b_y \quad \text{for } y = 1, 2, \ldots, q$$

$$Q_{xy} \ge 0 \quad \forall x \text{ and } \forall y$$
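As an illustration of the model above, the following minimal Python sketch evaluates the objective of Eq. 1 and checks the constraints for a given shipment plan. The function names and the 3-source, 2-destination instance (costs, supplies, demands and the plan itself) are hypothetical values chosen for the example; they do not come from the paper.

```python
from typing import List

Matrix = List[List[float]]


def total_cost(quantity: Matrix, cost: Matrix) -> float:
    """Objective of Eq. 1: sum over all x, y of Q_xy * c_xy."""
    return sum(
        q * c
        for q_row, c_row in zip(quantity, cost)
        for q, c in zip(q_row, c_row)
    )


def is_feasible(quantity: Matrix, supply: List[float],
                demand: List[float], tol: float = 1e-9) -> bool:
    """Check the constraints of the general model (supply <=, demand >=, Q >= 0)."""
    p, q = len(supply), len(demand)
    non_negative = all(v >= -tol for row in quantity for v in row)
    supply_ok = all(sum(quantity[x]) <= supply[x] + tol for x in range(p))
    demand_ok = all(
        sum(quantity[x][y] for x in range(p)) >= demand[y] - tol
        for y in range(q)
    )
    return non_negative and supply_ok and demand_ok


# Hypothetical instance with 3 sources and 2 destinations (illustrative values only).
cost = [[4.0, 6.0], [5.0, 3.0], [8.0, 2.0]]    # c_xy
supply = [30.0, 20.0, 10.0]                     # a_x
demand = [35.0, 25.0]                           # b_y
plan = [[30.0, 0.0], [5.0, 15.0], [0.0, 10.0]]  # Q_xy

plan_is_feasible = is_feasible(plan, supply, demand)
plan_cost = total_cost(plan, cost)
```

A cost function of this kind is what a swarm-based optimizer would evaluate for each candidate plan; how infeasible candidates are handled (repair, penalty, or rejection) is a separate design choice that this sketch does not fix.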

3 Optimization Algorithms

In order to find the minimum cost of transportation (Eq. 1/Eq. 2) between the sources and destinations, PSO and some of its variants have been used in this work. The optimization algorithms used here are PSO, APSO and NQPSO. A brief introduction to the algorithms is presented below.

3.1 PSO

PSO is a population-based optimization algorithm [16]. In PSO the population is called a swarm and every individual of the population is called a particle. The search process is guided by each particle’s previous best position and by the global best position found by the whole population so far.

3.2 APSO

There are two major phases [17] in this algorithm. In the first phase the fitness of the individuals is evaluated. After fitness evaluation, the state of the search process is identified as exploration, exploitation, convergence or jumping out, and parameters such as the inertia weight and the acceleration coefficients are updated accordingly. In the second phase, which is applied only if the state is recognized as the ‘convergence state’, an elitist learning technique is used to help the search move away from local optima and discover a solution better than the present global best. If the new solution is superior to the previously calculated one, the particles follow the leader and converge to the new region.

3.3 NQPSO

NQPSO is a quantum-based approach [18] involving both a local and a global neighbourhood search strategy. In the local neighbourhood search (LNS) strategy, the local neighbourhood of the current particle is explored for a better solution, which can also yield more exact solutions. In the global neighbourhood search (GNS) strategy, the global neighbourhood of the current particle is explored, which enhances the global search and avoids premature convergence. Opposition-based learning (OBL) is used to calculate the initial population.

The OBL method helps to generate better-quality preliminary solutions, which can accelerate convergence. In OBL, $n$ random positions are first generated for $n$ particles, denoted as vector $x$. $x_i$ represents the position of the $i$th particle and $x_{ij}$ is the individual component of vector $x_i$, where $j$ represents the dimension. Then another set of $n$ particles with positions opposite to those of the original particles is generated using the formula:

$$\bar{x}_{ij} = x_{ij}^{\min} + x_{ij}^{\max} - x_{ij}$$

where $\bar{x}_{ij}$ is the opposite value of $x_{ij}$, and $x_{ij}^{\min}$ and $x_{ij}^{\max}$ are the minimum and maximum values of $x_{ij}$. Finally the $n$ fittest individuals are chosen from the two sets.
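The OBL initialization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the minimum and maximum values of each component are taken to be the lower and upper search bounds of that dimension, fitness is minimized, and the function name `obl_initial_population`, the bounds and the `sphere` test function are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Position = List[float]


def obl_initial_population(
    n: int,
    lower: Sequence[float],
    upper: Sequence[float],
    fitness: Callable[[Position], float],
    rng: random.Random,
) -> List[Position]:
    """Return the n fittest of n random positions and their n opposites."""
    # n random positions inside the per-dimension bounds.
    randoms = [
        [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        for _ in range(n)
    ]
    # Opposite of x_ij is (minimum + maximum - x_ij), here using the bounds.
    opposites = [
        [lo + hi - v for v, lo, hi in zip(pos, lower, upper)]
        for pos in randoms
    ]
    # Keep the n best candidates (smaller fitness is better in this sketch).
    return sorted(randoms + opposites, key=fitness)[:n]


def sphere(pos: Position) -> float:
    """Toy objective used only for the demonstration."""
    return sum(v * v for v in pos)


lower_bounds = [-10.0, -10.0]
upper_bounds = [5.0, 5.0]
population = obl_initial_population(
    5, lower_bounds, upper_bounds, sphere, random.Random(0)
)
```

Because every opposite point stays inside the same bounds, the selected population is always valid for the search space, and it can never be worse than the best of the purely random candidates.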

4 Steps of Optimization

The basic flow of the optimization method proposed here is described through the following algorithm.

Step 1: Input the number of sources and destinations present in the problem and the cost of transporting a unit of product between each pair of source and destination.
Step 2: Initialize the position and velocity matrices. The components of the position matrix should be within a predefined (user-specified) range.
Step 3: Apply the steps of PSO/APSO/NQPSO to update the particles’ velocities.
Step 4: Allow the particles to update their positions.
Step 5: Go to Step 3 and repeat Steps 3, 4 and 5 until the stopping criteria are fulfilled.
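Steps 2 to 5 can be illustrated with a minimal sketch of the canonical PSO loop. It is not the authors' implementation: the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the zero initial velocities, the clamping of positions to the bounds, the fixed iteration count used as stopping criterion and the `sphere` objective are all assumptions made for the example.

```python
import random
from typing import Callable, List, Tuple


def pso_minimize(
    objective: Callable[[List[float]], float],
    dim: int,
    lower: float,
    upper: float,
    n_particles: int = 10,
    n_iterations: int = 200,
    w: float = 0.7,
    c1: float = 1.5,
    c2: float = 1.5,
    seed: int = 0,
) -> Tuple[List[float], float, List[float]]:
    """Canonical PSO; returns best position, best value and the best-value history."""
    rng = random.Random(seed)
    # Step 2: initialize positions within the user-specified range, velocities at zero.
    pos = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    history = [gbest_val]

    # Step 5: repeat until the stopping criterion (a fixed iteration count here).
    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Step 3: velocity update from personal best and global best.
                vel[i][d] = (
                    w * vel[i][d]
                    + c1 * r1 * (pbest[i][d] - pos[i][d])
                    + c2 * r2 * (gbest[d] - pos[i][d])
                )
                # Step 4: position update, clamped to the allowed range.
                pos[i][d] = min(upper, max(lower, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


def sphere(position: List[float]) -> float:
    """Toy objective used only for the demonstration."""
    return sum(v * v for v in position)


best_position, best_value, history = pso_minimize(sphere, dim=2, lower=-5.0, upper=5.0)
```

For a transportation problem, `objective` would be the total cost of a candidate plan, with the supply and demand constraints enforced by whatever repair or penalty scheme is chosen; APSO and NQPSO change how Step 3 is carried out rather than the overall loop.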

Table 1. Simulation results

| Statistical measurements | PSO | APSO | NQPSO |
|---|---|---|---|
| Mean | 7176.72 | 7497.75 | 4749.94 |
| Min | 2811.06 | 6426 | 1267.5 |
| Max | 11997.22 | 8392.8 | 7845.5 |
| Std. deviation | 2707.04 | 1827.7 | 662.9 |
| Std. error | 31.95 | 26.519 | 7.65 |

5 Performance Evaluation of the Algorithms

The optimization algorithms discussed here are evaluated on the basis of their convergence speed and the optimization value they calculate. The outcome of the simulation has been used to compare their performance. Simulations were carried out repeatedly, changing the size of the population and the number of iterations.

5.1 Convergence

Figure 2 shows the function values generated by the above algorithms with ten particles in the population over one thousand iterations. The figure shows that NQPSO performs better than the others.

5.2 Optimization Value

Multiple executions were done with different population sizes and numbers of iterations. Figure 3 shows the cost values calculated by the optimization algorithms for a population of size ten and Fig. 4 shows the cost values calculated for a population of size twenty. For simplicity the values are ordered from lowest to highest. The figures show that NQPSO generates lower cost values than the other algorithms.

5.3 Accuracy

Table 1 shows the mean fitness, standard deviation, standard error, minimum cost and maximum cost values generated by the algorithms. The results were obtained from fifty independent executions of the algorithms. Each execution involved a different population size and a different number of iterations. The numerical data in the table show that the mean fitness, the minimum value, the maximum value, the standard deviation and the standard error of the results calculated by NQPSO are the lowest among all the algorithms. This indicates that NQPSO outperforms the other algorithms and hence can be used for a transportation problem.
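The statistical measurements of the kind listed in Table 1 can be computed from the cost values of repeated runs as in the sketch below. It assumes the usual definitions (sample standard deviation, and standard error equal to the standard deviation divided by the square root of the number of runs); the paper does not state which definitions were used, and the `run_costs` values are hypothetical.

```python
import math
import statistics
from typing import Dict, List


def summarize(costs: List[float]) -> Dict[str, float]:
    """Mean, min, max, sample standard deviation and standard error of run costs."""
    n = len(costs)
    std_dev = statistics.stdev(costs)  # sample standard deviation (n - 1 denominator)
    return {
        "mean": statistics.mean(costs),
        "min": min(costs),
        "max": max(costs),
        "std_dev": std_dev,
        "std_error": std_dev / math.sqrt(n),
    }


# Hypothetical final costs of eight independent runs (illustrative values only).
run_costs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = summarize(run_costs)
```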

Fig. 1. A typical TP with 3 sources and 2 destinations. Src, Dst and cxy represent the source, the destination and the cost to transport a unit of commodity from Srcx to Dsty, respectively. (Diagram: sources Src1, Src2 and Src3 are connected to destinations Dst1 and Dst2 by edges labelled c11, c12, c21, c22, c31 and c32.)

Fig. 2. Convergence of the algorithms with population size 10

Fig. 3. Cost calculated for different individuals for a population of size 10

Fig. 4. Cost calculated for different individuals for a population of size 20 (axes: individuals in a population, cost)

6 Conclusion

To solve the transportation problem some variants of PSO have been used in this article. The simulation results confirm that the quantum-based approach NQPSO outperforms the other variants. These results encourage further experimentation in this field in order to explore the potential of this optimization algorithm in solving other combinatorial problems.

References

1. Dantzig, G.B.: Linear Programming and Extensions. Princeton University Press, Princeton (1963)
2. Dantzig, G.B., Ramser, J.H.: The truck dispatching problem. Manag. Sci. 6, 80–91 (1959)
3. Sharma, G., Abbas, S., Gupta, V.: Solving transportation problem with the various method of linear programming problem. Asian J. Curr. Eng. Maths 1(3), 81–83 (2012)
4. Seshan, C.R., Achary, K.K.: On the bottleneck linear programming problem. Eur. J. Oper. Res. 9(4), 347–352 (1982)
5. Poorzahedy, H., Abulghasemi, F.: Application of ant system to network design problem. Transportation 32(3), 251–273 (2005)
6. Hadadi, F., Shirmohammadi, H.: A meta-heuristic model for optimizing goods transportation costs in road networks based on particle swarm optimization (2017)
7. Amaliah, B., Fatichah, C., Suryani, E.: A new heuristic method of finding the initial basic feasible solution to solve the transportation problem. J. King Saud Univ. Comput. Inf. Sci. (2020). https://doi.org/10.1016/j.jksuci.2020.07.007
8. Khan, A.R., Syed, S.A., Uddin, Md.S.: Development of a new heuristic for improvement of initial basic feasible solution of a balanced transportation problem. Jahangirnagar Univ. J. Math. Math. Sci. 28, 105–112 (2013)
9. Khan, A.R., Vilcu, A., Uddin, M., Instrate, C.: The performance evaluation of various techniques for transportation problem. Buletinul Institutului Politechnic IASI 62(1–2), 19–30 (2016)
10. Ronald, L.R., Uzsoy, R.: Experimental evaluation of heuristic optimisation algorithm. A Tutorial. J. Heuristics 7, 261–304 (2001)
11. Amaliah, B., Fatichah, C., Suryani, E.: A new heuristic method of finding the initial basic feasible solution to solve the transportation problem. J. King Saud Univ. Comput. Inf. Sci. (2020)
12. ZakkaUgih, R.: Zack algorithm: a heuristic approach to solve transportation problem. In: Proceedings of the International Conference on Industrial Engineering and Operations Management (2019)
13. Komeil, Y., Afshari, A.J., Keshteli, M.H.: Solving the fixed charge transportation problem by new heuristic approach. J. Optim. Ind. Eng. 12(1), 41–52 (2019)
14. Contreras, M.A., Chung, W., Jones, G.: Applying ant colony optimization metaheuristic to solve forest transportation planning problems with side constraints. Can. J. For. Res. 38(11), 2896–2910 (2008)
15. Amison, A., James, S., Park, B., Yun, I.: Comparative evaluation of heuristic optimization methods in urban arterial network optimization. In: 2009 12th International IEEE Conference on Intelligent Transportation Systems (2009)
16. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of IEEE International Conference on Neural Networks, vol. IV, pp. 1942–1948 (1995)
17. Zhan, Z.H., Zhang, J., Li, Y., Chung, H.S.-H.: Adaptive particle swarm optimization. IEEE Trans. Syst. Man Cybern. 39(6), 1362–1381 (2009)
18. Fu, X., Liu, W., Zhang, B., Deng, H.: Quantum behaved particle swarm optimization with neighbourhood search for numerical optimization. Math. Probl. Eng. (2013)

Source Code Features Based Branch Coverage Prediction Using Ensemble Technique

Swadhin Kumar Barisal1, Pushkar Kishore2(B), and Gayatri Nayak1

1 Department of Computer Science and Engineering, Siksha ‘O’ Anusandhan Deemed to be University, Bhubaneswar, Odisha, India
{swadhinbarisal,gayatrinayak}@soa.ac.in
2 Department of Computer Science and Engineering, National Institute of Technology, Rourkela, Odisha, India
[email protected]

Abstract. Branch coverage prediction plays a critical role in achieving effective performance for modern applications. However, traditional test solutions are often inadequate to meet test objectives. If testers know the branch coverage predicted for a specific tool, they can test a subset of classes instead of the complete set. Earlier test information about the tools can help in making appropriate decisions about branch coverage tool selection. This paper examines the possibility of using source code metrics for branch coverage prediction. We considered different features extracted from 3105 Java classes. We considered machine learning techniques such as “random forest” (RF), “support vector regression” (SVR), and “linear regression” (LR). We also investigated the performance of our ensemble model. The results show that the ensemble model achieved an average “mean absolute error” (MAE) of 0.12 and 0.19 on testing with EVOSUITE and RANDOOP, respectively.

Keywords: JIOS coverage prediction · Source-code metrics · Machine learning

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 10–19, 2022. https://doi.org/10.1007/978-3-031-11713-8_2

1 Introduction

Software testing is a critical task that takes 60% of the development cost [1, 2]. There are many challenges in ensuring maximum branch coverage. If the coverage achieved by test case generation tools is known a priori, developers can make informed decisions. To achieve maximal branch coverage, testers want to prioritize test cases [3, 19]. They would like to avoid running test cases for classes that do not achieve high branch coverage. Therefore, the testers’ goal is to achieve maximum coverage with minimal time spent generating test cases. We propose an ensemble technique to predict the branch coverage achieved by EVOSUITE [4] and RANDOOP [5]. We also investigated metrics that quantify a class’s complexity; 79 metrics are considered for testing. We considered source code metrics for the following reasons: (a) they are obtained statically, without running the code, and (b) they are easy to compute.

We used the “Chidamber and Kemerer” (CK) metrics [6] and the “Halstead metrics” [7]. We experimented with three individual techniques, namely “linear regression” (LR), “random forest” (RF), and “support vector regression” (SVR), and found that the ensemble technique performed best compared to all of them. We achieved an average “mean absolute error” (MAE) of 0.12 for EVOSUITE and 0.19 for RANDOOP. The contributions of our work are given below:

1. We investigated code-quality metrics such as CK, Halstead metrics, Package, and Java keyword.
2. We tested the branch coverage prediction using four ML techniques: SVR, RF, LR, and Ensemble.
3. To validate the proposed model, we ran EVOSUITE and RANDOOP over 3105 Java classes.

The paper is organized as follows: Sect. 2 briefly describes the related literature, Sect. 3 presents the proposed approach, and Sect. 4 explains the experimental setup and results. Section 5 gives a comparative study with related works. Finally, Sect. 6 gives conclusions and future directions.

2 Related Works

Ferrer et al. [8] suggested a metric for predicting branch coverage. They used a Markov model to estimate the number of test cases required for achieving a specific branch coverage. They concluded that traditional metrics are ineffective in estimating the number of test cases required for achieving a specific branch coverage. Phogat et al. [9] evaluated a set of metrics that can be used to test classes of an object-oriented system. Shaheen et al. [10] analyzed the cost of testing for achieving branch coverage. They tested 25 applications and observed that depth of inheritance tree (DIT) is too abstract to predict the testing cost. Khanna et al. [11] investigated object-oriented design metrics along with prioritization. They used the analytic hierarchy process (AHP) and found that CK and the number of class hierarchy (NOH) metrics have the highest priority. Bousquet et al. [12] assessed testability using a diametric approach and excluding metrics. They checked whether good testing practices were implemented in the final code or not. McMinn et al. [13] designed a testability transformer that evaluates all nested conditions in a single go. Scalabrino et al. [14] designed an optimal coverage search-based means to test using the whole test suite. They observed that an iterative single-target approach was more efficient than the whole suite approach for achieving higher coverage. Panichella et al. [15] introduced a highly scalable multi-objective GA, also known as the “many-objective sorting algorithm” (MOSA). Upon testing on 64 Java classes, they found that the many-objective algorithm is more effective than the whole suite approach. Overall, they improved 66% of the subjects while the search technique achieved improvement in 62% of the subjects. This algorithm is also implemented in EVOSUITE.


3 Proposed Approach

This section presents a detailed explanation of our proposed approach. Figure 1 represents the proposed model architecture.

Fig. 1. Proposed model architecture

At first, we processed the code to extract the independent variables. We considered 79 independent factors, which capture the code complexity of the classes under test. A set of features was selected from JDEPEND¹, such as total classes, dependent packages, coupling ratio, etc. We used some of the widely adopted CK metrics such as “weight method class” (WMC), DIT, “number of children” (NOC), “coupling between objects” (CBO), “response for a class” (RFC), “lack of cohesion methods” (LCOM), etc. A tool named CK² is utilized to compute the above metrics directly from the source code. We also included the counts of Java keywords such as synchronized, import, switch, instanceof, etc. Apart from the previous ones, we used Halstead metrics³.

We calculated the branch coverage using two automatic tools, EVOSUITE and RANDOOP. These tools are executed for four different budgets: default, 3 min, 5 min, and 7 min. Each tool is run five times on the classes under test, so 10 different test suites are obtained per budget. The process is repeated for all four budgets. Therefore, each class is tested 40 times in total (20 times by each tool). The dependent variable (average coverage) of a class $i$, $y_{b,t,i}$, is then obtained by averaging the coverage achieved by tool $t$ with search budget $b$. Table 1 presents the mean branch coverage achieved for different budgets with RANDOOP and EVOSUITE.
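The averaging that produces the dependent variable can be sketched as follows. The data layout (a dictionary keyed by budget, tool and class), the class name `Foo` and the coverage values are hypothetical and serve only to illustrate the computation of the average coverage over the five runs.

```python
from statistics import mean
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (budget b, tool t, class i)

# Hypothetical branch coverage of five runs per (budget, tool, class) combination.
runs: Dict[Key, List[float]] = {
    ("default", "EVOSUITE", "Foo"): [0.50, 0.55, 0.45, 0.50, 0.50],
    ("default", "RANDOOP", "Foo"): [0.40, 0.40, 0.35, 0.45, 0.40],
}


def average_coverage(coverage_runs: Dict[Key, List[float]]) -> Dict[Key, float]:
    """Average the coverage of the repeated runs for every (b, t, i) combination."""
    return {key: mean(values) for key, values in coverage_runs.items()}


y = average_coverage(runs)
```

Each entry of `y` would then serve as the regression target for the corresponding class, tool and budget.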

1 https://github.com/clarkware/jdepend.
2 https://www.bmc.com/blogs/mean-squared-error-r2-and-variance-in-regression-analysis/.
3 https://github.com/aametwally/Halstead-Complexity-Measures.


Table 1. Mean branch coverage using two tools

| Tools | Default | 3 min | 5 min | 7 min |
|---|---|---|---|---|
| RANDOOP | 90% | 92% | 93% | 94% |
| EVOSUITE | 50% | 50% | 50% | 45% |

The mean branch coverage of RANDOOP improves consistently as the budget increases. In the case of EVOSUITE, however, the coverage does not increase even when the budget is increased. Four ML models, namely SVR, LR, RF and ensemble [16], are considered for coverage prediction.

4 Implementation and Result Analysis

This section describes the experimental arrangement and results. The whole experiment was carried out on a 16-core Ubuntu machine with 64 GB of RAM. Table 2 describes the seven projects.

Table 2. Parameter description of the tested projects

| Parameters | Guava | Cassandra | Dagger | Ivy | Math | Lang | Time |
|---|---|---|---|---|---|---|---|
| LOC | 78525 | 220573 | 848 | 50430 | 94410 | 27552 | 28771 |
| JAVA files | 538 | 1474 | 43 | 464 | 927 | 153 | 166 |
| # classes employed | 449 | 1278 | 14 | 410 | 668 | 124 | 142 |

Apache Cassandra is a distributed database. Apache Ivy is used for managing dependencies during builds; Google Guava consists of core libraries. Google Dagger is used as a dependency injector. Joda-Time is a date and time library. Commons-Math is a mathematics library, and Commons-Lang provides utility classes for the core Java API. We used MAE for evaluating the performance of the ML models. Equation 1 defines MAE as given below:

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n} \qquad (1)$$

where $y_i$ indicates the predicted value and $x_i$ the observed value for class $i$, and $n$ represents the number of classes. In addition to MAE, a few more performance indicators are considered: the R² score⁴, “mean squared error” (MSE), “mean squared log error” (MSLE), and “median absolute error” (MedianAE).

4.1 Result Analysis

We tried cross-validation and hyper-parameter tuning to get the best ML model. For LR, the solver is “sag” and the penalty is “L1”. In the case of SVR, the penalty is 0.1.


Similarly, RF has “n_estimators” = 100, “max_depth” = 8, and “min_sample_leaf” = 3. Among the four ML techniques applied, the ensemble model provides the highest prediction accuracy. Table 3 presents the results of LR on five performance parameters and four budgets. We observed an MSE of 0.15 for EVOSUITE compared to 0.2 for RANDOOP. For both EVOSUITE and RANDOOP, the MSE does not change when the budget changes. Overall, the R² score is weak, as its value is below 0.5. Table 4 presents the results of SVR on five performance parameters and four budgets.

Table 3. Results for LR using two tools and four budgets

| Parameters | Def-EVO | Def-RAN | 3 min-EVO | 3 min-RAN | 5 min-EVO | 5 min-RAN | 7 min-EVO | 7 min-RAN |
|---|---|---|---|---|---|---|---|---|
| MAE | 0.25 | 0.3 | 0.25 | 0.3 | 0.25 | 0.3 | 0.25 | 0.3 |
| MSE | 0.15 | 0.2 | 0.15 | 0.2 | 0.15 | 0.2 | 0.15 | 0.2 |
| MSLE | 0.12 | 0.13 | 0.12 | 0.13 | 0.12 | 0.12 | 0.12 | 0.13 |
| MEDIANAE | 0.21 | 0.3 | 0.21 | 0.3 | 0.21 | 0.3 | 0.21 | 0.3 |
| R² | 0.42 | 0.3 | 0.42 | 0.3 | 0.42 | 0.3 | 0.42 | 0.3 |
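The five performance indicators reported in the result tables can be computed from observed and predicted coverage values as in the following sketch. It uses the common textbook definitions (for example, MSLE based on log(1 + value) and R² as one minus the ratio of residual to total sum of squares); the paper does not give these formulas apart from MAE, so they are assumptions, and the sample values are hypothetical.

```python
import math
from statistics import mean, median
from typing import Dict, List


def regression_metrics(observed: List[float], predicted: List[float]) -> Dict[str, float]:
    """MAE, MSE, MSLE, MedianAE and R2 for paired observed/predicted values."""
    errors = [p - o for o, p in zip(observed, predicted)]
    abs_errors = [abs(e) for e in errors]
    log_errors = [math.log1p(p) - math.log1p(o) for o, p in zip(observed, predicted)]
    obs_mean = mean(observed)
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)   # total sum of squares
    return {
        "MAE": mean(abs_errors),
        "MSE": mean([e * e for e in errors]),
        "MSLE": mean([e * e for e in log_errors]),
        "MedianAE": median(abs_errors),
        "R2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical observed and predicted branch coverage for four classes.
observed_coverage = [0.2, 0.4, 0.6, 0.8]
predicted_coverage = [0.3, 0.4, 0.5, 0.9]
metrics = regression_metrics(observed_coverage, predicted_coverage)
```

Lower values are better for the four error measures, whereas a higher R² indicates a better fit.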

From Table 4, we observed that the MSE for EVOSUITE decreases as the budget increases. The MSE for RANDOOP also decreases whenever the search budget is increased. R² is still weak, as its value is lower than 0.5. Compared to the results obtained using LR, SVR is better by 5%. Table 5 presents the results of RF on five performance parameters and four budgets.

Table 4. Results for SVR using two tools and four budgets

| Parameters | Def-EVO | Def-RAN | 3 min-EVO | 3 min-RAN | 5 min-EVO | 5 min-RAN | 7 min-EVO | 7 min-RAN |
|---|---|---|---|---|---|---|---|---|
| MAE | 0.2 | 0.25 | 0.21 | 0.26 | 0.19 | 0.24 | 0.18 | 0.23 |
| MSE | 0.1 | 0.15 | 0.11 | 0.16 | 0.09 | 0.14 | 0.08 | 0.13 |
| MSLE | 0.07 | 0.08 | 0.08 | 0.09 | 0.06 | 0.07 | 0.05 | 0.06 |
| MEDIANAE | 0.16 | 0.25 | 0.17 | 0.26 | 0.15 | 0.24 | 0.14 | 0.23 |
| R² | 0.47 | 0.35 | 0.38 | 0.26 | 0.24 | 0.36 | 0.48 | 0.38 |

In Table 5, the MAE for EVOSUITE remains constant whenever the search budget is increased. Similarly, the MAE remains constant for all search budgets in the case of RANDOOP. The best R² is obtained using EVOSUITE with the default search budget. R² falls to its lowest level when the search budget is highest. Of the three techniques (RF, SVR, and LR), RF has the least MAE and enhances performance by 10%. Finally, we take the ensemble of the results obtained using RF, LR, and SVR. The ensemble prediction is calculated as the average of the results obtained using the three models. Table 6 presents the results of the ensemble on five performance parameters and four budgets.
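The averaging ensemble described above can be sketched as follows. The three prediction lists are hypothetical stand-ins for the outputs of already trained RF, SVR and LR models; the sketch only illustrates the element-wise averaging step and assumes equal weights for the three models.

```python
from typing import List, Sequence


def ensemble_average(*model_predictions: Sequence[float]) -> List[float]:
    """Average the predictions of several models for each class."""
    lengths = {len(p) for p in model_predictions}
    if len(lengths) != 1:
        raise ValueError("all models must predict the same number of classes")
    return [sum(values) / len(values) for values in zip(*model_predictions)]


# Hypothetical predicted branch coverage of two classes from three models.
rf_predictions = [0.60, 0.30]
svr_predictions = [0.50, 0.40]
lr_predictions = [0.70, 0.20]

ensemble_predictions = ensemble_average(rf_predictions, svr_predictions, lr_predictions)
```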


Table 5. Results for RF using two tools and four budgets

| Parameters | Def-EVO | Def-RAN | 3 min-EVO | 3 min-RAN | 5 min-EVO | 5 min-RAN | 7 min-EVO | 7 min-RAN |
|---|---|---|---|---|---|---|---|---|
| MAE | 0.15 | 0.22 | 0.15 | 0.22 | 0.15 | 0.22 | 0.15 | 0.22 |
| MSE | 0.04 | 0.07 | 0.05 | 0.07 | 0.05 | 0.07 | 0.05 | 0.08 |
| MSLE | 0.02 | 0.03 | 0.02 | 0.03 | 0.02 | 0.03 | 0.02 | 0.04 |
| MEDIANAE | 0.11 | 0.2 | 0.11 | 0.2 | 0.11 | 0.19 | 0.1 | 0.2 |
| R² | 0.52 | 0.36 | 0.48 | 0.35 | 0.45 | 0.36 | 0.41 | 0.32 |

From Table 6, we observed a highest MAE of 0.13 across all search budgets for EVOSUITE. R² decreased for both EVOSUITE and RANDOOP when we increased the search budget from default to 7 min. Since EVOSUITE utilizes a GA to explore possible test cases, the highest R² is observed for the default search budget. When the search budget of EVOSUITE is increased, the R² score stays around 0.5 but drops to 0.43 for the highest search budget. The MAEs obtained for RANDOOP show almost no change, which is due to the random generation of inputs by RANDOOP. In spite of increasing the search budget, R² does not improve because of the randomization among the inputs. Therefore, we concluded that EVOSUITE is much better than RANDOOP due to its use of evolutionary algorithms. The MAE for default EVOSUITE is highest for LR and lowest for the ensemble. A similar trend is observed for the other search budgets. In the case of RANDOOP, the highest MAE is obtained using LR with the default search budget, while the ensemble technique with the default search budget has the least MAE. Overall, it can be concluded that the ensemble is the best model amongst LR, SVR, RF, and ensemble.

Table 6. Results for ensemble using two tools and four budgets

| Parameters | Def-EVO | Def-RAN | 3 min-EVO | 3 min-RAN | 5 min-EVO | 5 min-RAN | 7 min-EVO | 7 min-RAN |
|---|---|---|---|---|---|---|---|---|
| MAE | 0.13 | 0.19 | 0.13 | 0.2 | 0.13 | 0.2 | 0.12 | 0.2 |
| MSE | 0.02 | 0.05 | 0.03 | 0.05 | 0.03 | 0.05 | 0.03 | 0.06 |
| MSLE | 0.01 | 0.02 | 0.01 | 0.02 | 0.01 | 0.02 | 0.01 | 0.03 |
| MEDIANAE | 0.09 | 0.18 | 0.09 | 0.18 | 0.09 | 0.17 | 0.08 | 0.18 |
| R² | 0.55 | 0.38 | 0.5 | 0.36 | 0.5 | 0.36 | 0.43 | 0.34 |

Table 7 shows the comparison between our proposed ensemble and the RF model [9]. In Table 7, the (+, −, 0) values represent the difference between the existing model and our proposed model. In the case of default EVOSUITE, the MAE decreases by 0.02 and R² increases by 0.02. For the default search budget of RANDOOP, the MAE is reduced by 0.03 and R² increases by 0.01. There is no enhancement over the compared state-of-the-art in the case of 7 min EVOSUITE or RANDOOP. Overall, our proposed model reduces MAE, MSE, MSLE, and MEDIANAE for most of the budgets. R² has improved in seven combinations out of eight.

Table 7. Comparison between our proposed model and RF [9]

| Parameters difference | Def-EVO | Def-RAN | 3 min-EVO | 3 min-RAN | 5 min-EVO | 5 min-RAN | 7 min-EVO | 7 min-RAN |
|---|---|---|---|---|---|---|---|---|
| MAE | −0.02 | −0.03 | −0.02 | −0.02 | −0.02 | −0.02 | −0.03 | −0.03 |
| MSE | −0.02 | −0.02 | 0.01 | −0.02 | −0.02 | −0.02 | −0.01 | −0.01 |
| MSLE | 0 | −0.01 | −0.01 | −0.01 | −0.01 | −0.01 | 0 | 0 |
| MEDIANAE | −0.02 | −0.01 | −0.02 | −0.01 | −0.01 | −0.01 | −0.01 | −0.01 |
| R² | +0.02 | +0.01 | +0.02 | +0.01 | 0.05 | 0 | +0.01 | +0.01 |

4.2 Feature Analysis

Figures 2 and 3 show bar plots of the ten most important features for the EVOSUITE and RANDOOP tools, respectively, according to their “mean decrease in accuracy” (MDA). MDA varies between 0 and 1 and represents the loss in the model’s accuracy when the selected feature is omitted. Of the top 10 features for EVOSUITE, three belong to CK and five belong to the reserved keywords. Therefore, the reserved keywords and CK metrics largely decide the performance of the model. Of the top 10 features for RANDOOP, four belong to CK and four belong to the reserved keywords. Therefore, the reserved keywords and CK metrics also largely decide the performance in the case of RANDOOP. Among the Halstead metrics, we found that one out of six, the calculated program length, is ranked among the top 10.
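A permutation-style importance score of this kind can be sketched as follows. This is an illustration under several assumptions rather than the procedure used in the paper: the loss of accuracy is measured as the increase in MAE, a fixed cyclic shift of the feature column stands in for random shuffling so that the example is deterministic, and `toy_model` together with its data is hypothetical.

```python
from typing import Callable, List

Row = List[float]


def mae(model: Callable[[Row], float], rows: List[Row], targets: List[float]) -> float:
    """Mean absolute error of the model on the given rows."""
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(
    model: Callable[[Row], float], rows: List[Row], targets: List[float]
) -> List[float]:
    """Increase in MAE when each feature column is decoupled from the targets."""
    baseline = mae(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        # Deterministic stand-in for a random shuffle: rotate the column by one.
        shifted = column[1:] + column[:1]
        perturbed = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, shifted)]
        importances.append(mae(model, perturbed, targets) - baseline)
    return importances


def toy_model(row: Row) -> float:
    """Hypothetical fitted model: depends on feature 0 only, ignores feature 1."""
    return 0.1 * row[0]


feature_rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
coverage_targets = [toy_model(r) for r in feature_rows]
scores = permutation_importance(toy_model, feature_rows, coverage_targets)
```

Here the first feature receives a positive score and the second a score of zero, mirroring how features that the model relies on rank higher in plots such as Figs. 2 and 3.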

Fig. 2. EVOSUITE top 10 features according to mean decrease in accuracy


Fig. 3. RANDOOP top 10 features according to mean decrease in accuracy

5 Comparison with Related Work
Fraser et al. [4] designed EVOSUITE to generate test cases for Java code. EVOSUITE creates and optimizes whole test suites to satisfy coverage criteria. EVOSUITE generates likely test cases by adding small and effective assertions that capture the existing behavior. Shamshiri et al. [17] investigated the efficacy of test cases generated by three tools (Randoop, EvoSuite, and Agitar) in finding faults. They detected 55.7% of the faults using the generated test suites. Fraser and Arcuri [8] proposed whole-suite (WS) generation, which uses genetic algorithms (GAs) to create test cases. WS is the default approach used by EVOSUITE. Grano et al. [18] investigated branch coverage prediction using source code metrics. They considered larger datasets for training the ML models and two test-case generation tools, EVOSUITE and RANDOOP.

6 Conclusions and Future Work
If the coverage is known a priori, informed decisions can be taken. With this a priori knowledge, the budget can be allocated adequately to achieve higher branch coverage in the least time. This paper predicted the branch coverage of test-case generation tools using machine learning techniques and source-code features. The experimentation was done with four different search budgets. Since the algorithms are non-deterministic, prediction becomes a tedious task. We measured the complexity of classes using selected metrics. After applying Random Forest, Support Vector Regression, Linear Regression, and the Ensemble model, we observed that the ensemble model performed best, with reduced MAE. Our future efforts will be directed towards increasing the training data and detecting more features that will improve the precision of the model.

References
1. Bertolino, A.: Software testing research: achievements, challenges, dreams. In: Future of Software Engineering (FOSE 2007), pp. 85–103. IEEE (2007)
2. Yoo, S., Harman, M.: Regression testing minimization, selection and prioritization: a survey. Softw. Test. Verif. Reliab. 22(2), 67–120 (2012)
3. Barisal, S.K., Chauhan, S.P.S., Dutta, A., Godboley, S., Sahoo, B., Mohapatra, D.P.: BOOMPizer: minimization and prioritization of CONCOLIC based boosted MC/DC test cases. J. King Saud Univ. Comput. Inf. Sci. (2022)
4. Fraser, G., Arcuri, A.: Evosuite: automatic test suite generation for object-oriented software. In: Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pp. 416–419 (2011)
5. Pacheco, C., Ernst, M.D.: Randoop: feedback-directed random testing for Java. In: Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications Companion, pp. 815–816 (2007)
6. Chidamber, S.R., Kemerer, C.F.: A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20(6), 476–493 (1994)
7. Halstead, M.H.: Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., Amsterdam (1977)
8. Ferrer, J., Chicano, F., Alba, E.: Estimating software testing complexity. Inf. Softw. Technol. 55(12), 2125–2139 (2013)
9. Phogat, M., Kumar, D., Murthal, D.: Testability of software system. IJCEM Int. J. Comput. Eng. Manag. 14(10) (2011)
10. Shaheen, M.R., Du Bousquet, L.: Is depth of inheritance tree a good cost prediction for branch coverage testing? In: First International Conference on Advances in System Testing and Validation Lifecycle, pp. 42–47 (2009)
11. Khanna, P.: Testability of object-oriented systems: an ahp-based approach for prioritization of metrics. In: International Conference on Contemporary Computing and Informatics (IC3I), pp. 273–281 (2014)
12. Du Bousquet, L.: A new approach for software testability. In: Bottaci, L., Fraser, G. (eds.) TAIC PART 2010. LNCS, vol. 6303, pp. 207–210. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15585-7_23
13. McMinn, P., Binkley, D., Harman, M.: Empirical evaluation of a nesting testability transformation for evolutionary testing. ACM Trans. Softw. Eng. Methodol. (TOSEM) 18(3), 1–27 (2009)
14. Scalabrino, S., Grano, G., Di Nucci, D., Oliveto, R., De Lucia, A.: Search-based testing of procedural programs: iterative single-target or multi-target approach? In: Sarro, F., Deb, K. (eds.) SSBSE 2016. LNCS, vol. 9962, pp. 64–79. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47106-8_5
15. Panichella, A., Kifetew, F.M., Tonella, P.: Reformulating branch coverage as a many-objective optimization problem. In: 2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST), pp. 1–10. IEEE (2015)
16. Sammut, C., Webb, G.I.: Encyclopedia of Machine Learning. Springer, New York (2011). https://doi.org/10.1007/978-0-387-30164-8
17. Shamshiri, S., Just, R., Rojas, J.M., Fraser, G., McMinn, P., Arcuri, A.: Do automatically generated unit tests find real faults? An empirical study of effectiveness and challenges (T). In: 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 201–211 (2015)
18. Grano, G., Titov, T.V., Panichella, S., Gall, H.C.: Branch coverage prediction in automated testing. J. Softw. Evolut. Process 31(9), 21–58 (2019)
19. Barisal, S.K., Dutta, A., Godboley, S., Sahoo, B., Mohapatra, D.P.: MC/DC guided test sequence prioritization using firefly algorithm. Evol. Intell. 14(1), 105–118 (2019)

Implicit Methods of Multi-factor Authentication
Chippada Monisha(B), Koli Pavan Kumar, Pasili Ajay, Pushpendra Kumar Chandra, and Satish Kumar Negi
Department of Computer Science, Guru Ghasidas University, Bilaspur, Chattisgarh, India
[email protected]

Abstract. Authentication is a critical technique that ensures the necessary security goals of confidentiality and integrity. Moreover, proper authentication is the first line of defense when it comes to securing any resource. Although standard login/password methods are simple to create, they have been the target of various cyberattacks. Token and biometric authentication systems were proposed as an alternative. However, they have not improved security enough to warrant the expense. As a result, a graphical login/password scheme was established as an alternative to the login/password system. However, it suffered from shoulder-surfing and screen dump attacks. Multifactor Authentication has therefore been introduced to provide another layer of security and make such threats less likely. In this paper, we propose new methods of authentication, namely Mouse Event as Passcode, Image as Passcode and Patterned OTP, that are immune to frequent attacks and add an extra layer of protection before authenticating the user and granting access to resources.
Keywords: MFA · Image as passcode · Mouse event as passcode · Patterned OTP

1 Introduction
With the rapid growth of technology, everything is moving online, from education to banking. This technology seamlessly connects everything from anywhere around the globe. In such a connected world, data is also online, which raises the question of data security. On hearing of data security, authentication is the word that comes to mind. Authentication is the method of granting permission to access a device or an application. It is an important method because it assures security and integrity. Furthermore, effective authentication is the first line of defense for protecting any resource or data. The traditional method of username and password [5] is no longer a sufficient method of authentication to verify the user or authorize access to a resource or data. Given the recent increase in cyber-attacks, a single factor of authentication is no longer adequate to safeguard sensitive data or applications. Single-factor authentication, viz. username and password, may be user-friendly and add simplicity to the application. On the other hand, this methodology does not provide an adequate level of protection to data, and improper access to the data is more likely.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 20–31, 2022. https://doi.org/10.1007/978-3-031-11713-8_3


A password (or PIN) is required to authenticate a valid user. Because sharing a password can compromise the account immediately, it is the weakest level of authentication. If the user sets a weak password, the account is easily compromised through cyber-attacks such as dictionary attacks, rainbow table attacks, etc. Thus, the user has to set a complex password when using this type of authentication. Even with a complex password, the application still has only single-factor authentication, which is less secure. Furthermore, given the growth of technology and concerns about data privacy, single-factor authentication is not reliable enough to provide adequate protection against security threats. Multi-factor authentication (MFA) [2] has therefore been proposed to add an extra layer of security to the data and to minimize the security threats posed by unauthorised users. MFA means providing two or more verification factors in order to get access to a resource such as an application, an online account, or a VPN. A strong identity and access management (IAM) policy using MFA is needed. MFA requires one or more extra verification criteria in addition to a username and password, which reduces the risk of a successful cyber-attack. MFA works by requiring additional verification information (factors) [3]. Most MFA methodologies are based on one of three types of additional information:
• Things you know (knowledge), such as a password or PIN
• Things you have (possession), such as a badge, smartphone or token
• Things you are (inherence), such as a biometric like fingerprints or voice recognition
MFA walks the user through a step-by-step process that authenticates the user with more than one factor of authentication. This increases the complexity of authenticating the user, as more than one factor is involved (Fig. 1).

Fig. 1. Factors considered for multifactor authentication [2, 3]

As shown in the above figure, today’s MFA includes more than two factors to authenticate the user, by asking the user for additional information. These methods should be user-friendly while increasing security. Some of the MFA methods include [3]:
• Password/PIN protection
• Voice biometric
• Facial/iris recognition
• Geographical location
• Behaviour recognition [6]
Apart from the particular challenges faced by each adoption of MFA, it is apparently the best method to authenticate the user. MFA is being adopted at an accelerating pace across many applications, and concern about data security and privacy is growing rapidly. More MFA methods will evolve and, keeping all the challenges in mind, MFA is being modified and made more user-friendly.

2 Related Work
Monther and Saoud discussed the design and implementation of a multi-factor authentication system that utilizes the layered security concept, and evaluated its simplicity and performance against different types of attacks. They also evaluated the system mathematically to gauge its immunity against brute-force attacks. The results stated that the probability of a successful brute-force attack is less than 6.72E−25 for the first and second stages combined if only 8 items are selected out of 36, which is the total number of items, or 2.7E−17 in the specific case of their implementation. A paper named “Multi-Factor Authentication in Cyber Physical System: A State of Art Survey” discussed the evolution from single-factor authentication to multifactor authentication. It also stated five high-level categories of features of user authentication in the gadget-free world, including security, privacy, and usability aspects, and pointed out some of the major drawbacks of SFA that have been overcome using MFA. The authors of [4] proposed a method that uses a perceptual hash function, which is based on computing similar hash values for similar images. Whether two images are perceptually distinct can be determined by using an appropriate distance or similarity function to compare two perceptual hash values. Images can be identified, authenticated, or verified using perceptual image hash algorithms. In their article, they also present a hash method that is resistant to non-malicious alteration while being sensitive to malicious tampering. A paper by Krishna Nand Pandey, Md. Masoom, Supriya Kumari, and Preeti Dhiman proposes that OTP verification is an effective way to authorize the user, as it is a one-time login authentication method. After a certain number of unsuccessful attempts, the account is blocked and the user is informed.

3 Problem Statement
Standard MFA techniques [2, 3] include a username and password followed by an OTP [4], a captcha, or a biometric approach. Despite the rise in the number of attack tools capable of bypassing MFA, these attacks are still incredibly rare and have not been automated at scale. It is also the user’s responsibility to create a strong password in order to defeat common attacks such as brute force, dictionary attacks, and so on. Furthermore, MFA [2, 3] adds a layer of authentication for the user and protects against the majority of common cyberattacks. However, the combination of elements/factors required to establish an MFA is equally critical.


4 Implicit Methods of Multifactor Authentication
The aim of the proposed methods is to make authentication less vulnerable to security threats and to provide a user-friendly, effective way of authenticating a user. Three methods are proposed, each of which can be added to MFA as one of the factors used to identify the user. Each proposed method consists of several algorithms, and each algorithm is responsible for one type of process. All the required validation processes are taken into consideration by the proposed methods. The following are the proposed methods:
4.1 Mouse Events as Passcode
Mouse events can also be used to create a unique password combination [8]. There are different types of mouse events, such as left button click, right button click, double-click, etc. To handle these events, we need to write a callback function for each type of mouse click event that occurs while the window or frame opened by OpenCV (Python) is active. The callback function implements the functionality desired for a particular mouse click event. In summary, this method converts the user’s mouse events (single clicks, double clicks, scroll ups and downs) into a form of passcode (password). This passcode is used to verify the user whenever the user tries to log in. This process is carried out while signing up to a website/application. The following algorithms are used in the proposed method:
Algorithm for Parameters Acquisition

1. Open a window to recognize your mouse events.
2. Read all the mouse events performed by the user (inside the window).
3. Save the mouse events or close the window to autosave the mouse events.

Algorithm for Conversion

1. Consider each mouse event, e.g., “Left-click, right-click, 2 scroll ups and double middle-click”.
2. Assign the equivalent keyword for the above mouse events according to the table (Table 1).
3. Add these keywords, which represent the user’s mouse events, to an array.
4. Repeat until every mouse event has its keyword.
5. Finally, we get the array of keywords (the user’s mouse events), i.e., [11, 13, 1, 1, 22].


Algorithm for Hashing
Now that we have our array of keywords, we have to convert it into a hash code.
1. Read each element from the array.
2. Add the element to a string.
3. Use any hashing algorithm to convert the string to a hash code (Fig. 2).

Fig. 2. Mouse events as passcode (window inputs: left/right click, left/right double click, scroll ups and downs; stages: array, string, hash code)

Table 1. Keywords with their equivalent mouse events

| Mouse events | Left-click | Right-click | Left double-click | Right double-click | Middle click | Middle double-click | Scroll-up | Scroll-down |
| Keyword | 11 | 13 | 21 | 23 | 12 | 22 | 1 | −1 |
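The conversion and hashing steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the keyword values come from Table 1, while the function names, the comma separator and the choice of SHA-256 are assumptions, since the text only asks for the elements to be added to a string and hashed with some algorithm.

```python
import hashlib

# Keyword for each mouse event, as listed in Table 1.
EVENT_KEYWORDS = {
    "left-click": 11,
    "right-click": 13,
    "left double-click": 21,
    "right double-click": 23,
    "middle click": 12,
    "middle double-click": 22,
    "scroll-up": 1,
    "scroll-down": -1,
}


def events_to_keywords(events):
    """Map each recorded mouse event to its keyword (KeyError if unknown)."""
    return [EVENT_KEYWORDS[event.strip().lower()] for event in events]


def keywords_to_hash(keywords):
    """Join the keywords into a string and hash it.

    The comma separator is an assumption: without one, [1, 1] and [11]
    would produce the same string. SHA-256 is likewise an assumed choice.
    """
    joined = ",".join(str(keyword) for keyword in keywords)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# The example from the conversion algorithm.
events = ["Left-click", "Right-click", "Scroll-up", "Scroll-up",
          "Middle double-click"]
keywords = events_to_keywords(events)
passcode_hash = keywords_to_hash(keywords)
```

For stored credentials, a salted key-derivation function such as `hashlib.pbkdf2_hmac` would usually be preferred over a bare hash, but that goes beyond what this section specifies.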

4.2 Image as Passcode
Using an image to create a password relies on image hashing [8] (also called perceptual hashing), the process of constructing a hash value based on the visual contents of an image. Image hashing is used for content-based image retrieval (CBIR), near-duplicate detection, and reverse image search engines [7]. In summary, this method uses an image as a digital fingerprint to identify the user. This process is carried out while signing up to a website/application. The following algorithms are used in the proposed method:
Algorithm for Parameters Acquisition
1. Upload an image.
2. Read the image.

Algorithm for Conversion
1. Convert the image to greyscale.
2. Resize the uploaded image to 9 × 8 form.
3. Use the difference algorithm to create a unique hash code for the image.
4. Store the hash code in a string.
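The difference algorithm in step 3 can be sketched as below. It assumes the greyscale conversion and the 9 × 8 resize have already been done, for example with an image library, so the input is simply 8 rows of 9 brightness values. The function name and the sample data are hypothetical.

```python
def difference_hash(pixels):
    """64-bit difference hash of a greyscale image given as 8 rows of 9 values."""
    if len(pixels) != 8 or any(len(row) != 9 for row in pixels):
        raise ValueError("expected 8 rows of 9 greyscale values")
    bits = 0
    for row in pixels:
        for x in range(8):
            # Bit is 1 when the left pixel is brighter than its right neighbour.
            bits = (bits << 1) | (1 if row[x] > row[x + 1] else 0)
    # 64 bits are written as 16 hexadecimal characters.
    return format(bits, "016x")


# Hypothetical inputs: every row gets darker, or brighter, from left to right.
darkening = [[90, 80, 70, 60, 50, 40, 30, 20, 10] for _ in range(8)]
brightening = [list(reversed(row)) for row in darkening]

hash_dark = difference_hash(darkening)
hash_bright = difference_hash(brightening)
```

Nine columns give eight left-to-right comparisons per row, and eight rows give the 64 bits of the hash.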


Algorithm for Hashing
Now that we have our array of keywords [1], we have to convert it into a hash code.
1. Read each element from the array.
2. Add the element to a string.
3. Use any hashing algorithm to convert the string to a hash code (Fig. 3).

Fig. 3. Image as passcode

4.3 Patterned OTP
OTP stands for One-time Password [4, 9]. An OTP is a generated password that is valid only once. It is an automatically produced numeric or alphanumeric string of characters that validates the client for a single transaction or login session. In OTP-based validation strategies, the client’s OTP application and the verification server depend on shared secrets. In summary, this method uses a pattern to verify the OTP. This process is carried out while signing up to a website/application, instead of typing the password. The following algorithms are used in the proposed method:
Algorithm for Parameters Acquisition
1. Create a square matrix.
2. Choose the length of the OTP.
3. Generate the OTP with random numbers from 1 to n (n = order of square matrix) using the following methodology: OTP = (random digit) + (digit from the previous row or next row of the existing preceding digit).

Algorithm for Conversion
1. Get the OTP through mail or SMS.
2. Read the OTP.


Algorithm for Hashing
1. Connect the dots according to the OTP, e.g., “11-9-12-23”.
2. Verify the pattern.
3. If the pattern matches the OTP, authorize the access.
4. Else, ask the user to retry.
5. If the pattern is unmatched 3 times, generate a new OTP and send it to the user (Fig. 4).

Fig. 4. Patterned OTP
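The generation rule and the retry limit described above can be sketched as follows. Several points are assumptions rather than statements from the text: the cells are taken to be numbered 1 to n² row by row on an n × n grid (the OTPs in Table 4 reach 25, which suggests a 5 × 5 grid), and "previous row or next row" is read as each cell lying in a row adjacent to the one before it. All names are hypothetical.

```python
import secrets

MAX_ATTEMPTS = 3  # a new OTP is issued after three unmatched patterns


def row_of(cell, n):
    """Zero-based row of a cell when cells 1..n*n are numbered row by row."""
    return (cell - 1) // n


def generate_otp(n=5, length=5):
    """Random OTP whose successive cells lie in adjacent rows (needs n >= 2)."""
    cells = [1 + secrets.randbelow(n * n)]
    while len(cells) < length:
        row = row_of(cells[-1], n)
        candidates = [r for r in (row - 1, row + 1) if 0 <= r < n]
        next_row = secrets.choice(candidates)
        cells.append(next_row * n + secrets.randbelow(n) + 1)
    return cells


def follows_row_rule(cells, n=5):
    """True when every cell is one row above or below the preceding cell."""
    return all(abs(row_of(a, n) - row_of(b, n)) == 1
               for a, b in zip(cells, cells[1:]))


class OtpSession:
    """Holds the current OTP and counts unmatched patterns."""

    def __init__(self, n=5, length=5):
        self.n, self.length = n, length
        self.otp = generate_otp(n, length)
        self.failures = 0

    def verify(self, traced_cells):
        """Compare the traced dots with the OTP; reissue after three misses."""
        if list(traced_cells) == self.otp:
            return True
        self.failures += 1
        if self.failures >= MAX_ATTEMPTS:
            self.otp = generate_otp(self.n, self.length)
            self.failures = 0
        return False
```

Under this reading, all five OTPs listed in Table 4 satisfy `follows_row_rule` on a 5 × 5 grid. The sketch does not prevent a cell from being visited twice, which the text leaves unspecified.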

5 Implementation
The proposed methods of multi-factor authentication (MFA) are implemented as follows.
5.1 Mouse Events as Passcode
Step 1: When the user logs in or signs up, a window opens.
Step 2: Inside the window, all of the user’s clicks are detected, including single and double clicks and scroll ups and downs.
Step 3: When the user submits the clicks and scrolls as a password, they are converted into an array.
Step 4: The array is then converted into string format.
Step 5: The string is hashed before being stored in the database.
Step 6: In this way, the user can create a strong password with just a few clicks and scrolls instead of remembering a long 8- or 16-character alphanumeric password.
5.2 Image as Passcode
Step 1: When the user logs in or signs up, they provide an image as a digital fingerprint/pass image.
Step 2: To generate a hash code for that image, it goes through the following steps.
2.1 The image is converted to grayscale, discarding any color information (which makes it faster to examine).
2.2 After conversion to grayscale, the image is resized to 9 × 8 pixels (ignoring the aspect ratio).
2.3 The image is then processed by the difference hash algorithm, which computes the difference between adjacent pixels to create a 64-bit hash.
2.4 In the final step, for each row of pixels P, we apply the following test: the output is 1 if P[x] > P[x + 1], else 0.
2.5 In this case, we are testing whether the left pixel is brighter than the right pixel. If the left pixel is brighter, we set the output value to one. Otherwise, if the left pixel is darker, we set the output value to zero.
Step 3: After the hash code is generated, it is converted into string format.
Step 4: The string is hashed before being stored in the database.
Step 5: In this way, the image is used to generate a hash code and to authorize the user.
5.3 Patterned Based OTP
Step 1: After the user authenticates with username and password, an email containing the OTP is sent to the user’s address.
Step 2: The user’s browser then directs them to the OTP page, where the OTP has to be entered.
Step 3: Here the user does not type the OTP. Instead, they match the pattern of the OTP by connecting the dots, similar to Android lock patterns.

6 Experimental Results
The experimental results of the proposed methods of multi-factor authentication are as follows.
6.1 Mouse Events as Passcode
Table 2 shows the arrays formed from the mouse events performed by the user. Each generated array is hashed before being stored in the database.
6.2 Image as Passcode
Table 3 demonstrates the image hashing. Image hashing uses the difference algorithm to generate an image hash code. The hash code is then hashed with any hashing algorithm before being stored in the database.

Table 2. Final arrays formed by the mouse events

| S. no | Mouse events performed by the user (for generating password) | Final array of the mouse events |
| 1 | Left click, right click, scroll up | 11, 13, 1 |
| 2 | Scroll up, scroll down, middle click, double right click | −1, 1, 12, 23 |
| 3 | Double left clicks, middle click, scroll up, scroll down, double middle clicks | 21, 12, 1, −1, 22 |
| 4 | Scroll up, scroll up, scroll down, scroll down, double middle clicks, scroll up, scroll down | 1, 1, −1, −1, 12, 1, −1 |
| 5 | Double left clicks, double right click, double middle click | 21, 22, 23 |

Table 3. Hash code generated using the images

| S. no | Image used by the user | Hashcode |
| 1 | (image 1) | 808eef470e0c1c3c |
| 2 | (image 2) | 80aeef470e0c1c3c |
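To see how close the two hash codes in Table 3 are, they can be compared bit by bit. This is a small illustrative sketch: the function name is hypothetical, and the exact-match acceptance rule is an assumption based on the statement that an altered image is not authenticated.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hexadecimal hash codes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


original = "808eef470e0c1c3c"  # image 1 in Table 3
cropped = "80aeef470e0c1c3c"   # image 2 in Table 3

distance = hamming_distance(original, cropped)
accepted = original == cropped  # exact match assumed for authentication
```

The two codes differ in a single bit out of 64, yet an exact-match comparison still rejects the second image.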

Table 3 lists the hash codes generated from the images provided by the user for authentication. On careful observation, the images provided by the user are similar, but the hash codes generated from them are different. This is because the images differ slightly from each other: image 1 is the original and image 2 is a duplicate of it that has been cropped a little, which cannot easily be identified by the naked eye. In conclusion, even a small alteration in the image changes the hash code, and as a result the user is not authenticated.
6.3 Pattern Based OTP
Table 4 demonstrates the patterns that are drawn using the computer-generated OTPs. Instead of being typed like a password, the OTPs are traced as patterns, similar to the pattern lock on smartphones. The patterned OTP is then used to authenticate the user.

Table 4. Patterns formed by the generated OTPs

| S. no | Generated OTP | Patterned OTP |
| 1 | 21-17-22-19-15 | 5 × 5 dot grid (traced pattern not reproduced) |
| 2 | 1-10-14-17-25 | 5 × 5 dot grid (traced pattern not reproduced) |
| 3 | 25-19-15-8-2 | 5 × 5 dot grid (traced pattern not reproduced) |
| 4 | 20-23-17-15-10 | 5 × 5 dot grid (traced pattern not reproduced) |
| 5 | 1-7-11-17-21 | 5 × 5 dot grid (traced pattern not reproduced) |


7 Analysis
The main operational challenges of MFA are the following.
7.1 Usability
Usability is defined as the degree to which something is fit to be used. For MFA [3], the main usability challenges emerging in the authentication process can be characterized from many perspectives, which include task efficiency, task effectiveness and user preferences. In the case of Mouse Events as Passcode, time efficiency (the time to register and to authenticate with the system) is comparatively better, as it takes less time than textual passwords. Task effectiveness (the number of login attempts needed to authenticate with the system) is equally good, as the user’s account is blocked after 2 or 3 attempts. In the case of Image as Passcode, this method also has high task effectiveness and time efficiency, as it takes less time than textual methods. On the other hand, Patterned OTP [4] may be less usable than regular textual OTPs, but it is more secure than them.
7.2 Security and Privacy
MFA is the most used and trusted way to authenticate the user, using many layers of security [3]. Any MFA framework is a digital system composed of critical components, such as sensors, data, storage, processing devices, and communication channels. All of those are typically vulnerable to a variety of attacks at entirely different levels, ranging from replay attempts to adversary attacks. Security is thus a necessary tool to enable and maintain privacy. In the case of Mouse Events as Passcode, the user’s mouse events are only read inside the window. When the user performs a mouse event, it is not seen or displayed on the screen, which makes the method immune to shoulder-surfing. Apart from that, compared with a normal textual password, mouse events are easier for the user to remember, as it takes only a few mouse events to create a strong password rather than remembering and typing 8 or 16 alphanumeric characters. Finally, the code created from the user’s sequence of mouse events is hashed and stored in the database, which makes it more secure against attacks too. The more mouse events, the stronger the security. Coming to Image as Passcode, the image used by the user as a digital fingerprint should be unique. Even a small alteration in the image (cropping/expanding) produces a different hash code, and hence the user is not authenticated. So, for another regular user, it will be difficult to distinguish between the original and the fake image. Apart from that, two different images are very unlikely to produce the same hash code.


Finally, consider Patterned OTP [4]. The regular textual manner of entering the OTP is vulnerable to some common attacks such as brute force, and it takes merely a few seconds to crack the OTP: even if it has more than 7 characters, it is still only a combination of digits. If we replace the regular textual OTP with Patterned OTP, we obtain a combination of dots that has to be traced instead of entering the OTP. After a few wrong attempts, a new OTP is sent. This is a secure method of entering the OTP and is less vulnerable to some common attacks such as keyloggers, as it is difficult for a computer to trace the dots; only human intelligence can identify and trace the dots according to the OTP obtained.

8 Conclusion
This research has suggested new implicit methods for authentication. Mouse Event as Passcode authenticates the user by using mouse events to create a strong passcode instead of an 8- or 16-character alphanumeric password. Image as Passcode has also been used to authenticate the user. This method can also be called a digital fingerprint, because the user has to use the same image to authenticate themselves; any alteration (cropping or expanding) may change the pixels, and hence the user cannot be authenticated. Patterned OTP is the last method proposed in this paper: instead of typing the OTP within a certain period of time, we trace the OTP to avoid certain attacks such as brute force. Hence the proposed methods are secure and immune to regular cyberattacks such as dictionary attacks or brute force attacks, compared with the traditional way of authenticating the user with username and password. We conclude that the more layers of security there are, the better our data is protected. This report also shows that there is not only one way of authenticating a user or protecting the data, and many more ways of authentication are yet to emerge.

References
1. Bhattacharjee, S., Kutter, M.: Compressing tolerant image authentication. In: Proceedings of the IEEC-ICEIP 19981, September (1998)
2. Jansone, A., Lauris, K., Saudinis, I.: Multi factor authentication as a necessary solution in the fight with information technology security threats. In: Environment. Technology. Resources. Proceedings of the International Scientific and Practical Conference, p. 114 (2015)
3. Ometov, A., et al.: Multi-factor authentication: a survey (2018)
4. Huang, Y., Huang, Z., Zhao, H., Lai, X.: A new one-time password method (2018)
5. Gadekar, Mr.A.R., Shendekar, Ms.P.S.: Implicit password authentication system. Int. J. Sci. Eng. Res. 4(6), 77–81 (2013)
6. https://jumpcloud.com/blog/mfa-effectiveness
7. https://www.pyimagesearch.com/2017/11/27/image-hashing-opencv-python
8. https://www.geeksforgeeks.org/handle-mouse-events-in-python-opencv/
9. Hussain, A.: E-authentication system with QR code & OTP. Int. J. Trend Sci. Res. Dev. (IJTSRD) 4(3). ISSN: 2456-6470

Comparative Analysis of Different Classifiers Using Machine Learning Algorithm for Diabetes Mellitus
Santosh Kumar Sharma1, Ankur Priyadarshi1(B), Srikanta Kumar Mohapatra2, Jitesh Pradhan3, and Prakash Kumar Sarangi4
1 C.V. Raman Global University, Bhubaneshwar, Khordha, India
[email protected]
2 Chitkara University Institute of Engineering and Technology, Chitkara University, Chandigarh, Punjab, India
3 Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, India
4 School of Computer Science, Lovely Professional University, Phagwara, Punjab, India

Abstract. Due to hazardous situations, diabetes is common in both youngsters and elders, and it causes both psychological and physical illness. Diabetes increases the sugar level in the human body very rapidly. It is very important to identify the disease at its initial symptoms through regular check-ups for diagnosis and visits to a physician at regular intervals. It is a very critical disease that raises the death rate in the world. As per the International Diabetes Federation (IDF), one in 11 people in the world has been diagnosed with diabetes, and one in 4 among people in middle or lower-middle economy countries. People with diabetes are affected severely. So, it is vital to discuss how to detect and predict it at the initial stage so that the death rate can be minimized. In this paper, we use different machine learning classifiers, namely the Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Logistic Regression (LR) algorithms, for accuracy prediction. In this methodology, we use the PIMA Indian dataset for accuracy prediction and verification.
Keywords: Diabetes mellitus · KNN · SVM · LR · Classifier

1 Introduction

Diabetes is a chronic disease that is becoming rapidly more common in day-to-day life and can affect all age groups. It is a complicated disease that can be caused by the absence of beta cells in the pancreas, by removal of the pancreas, or by an imbalance in the sugar level. When diabetes is detected in women during pregnancy, the resulting metabolic disorder is referred to here as type-III diabetes. When the beta cells of the pancreas cannot produce enough insulin in the human body, the hormone has to be obtained from external sources. Diabetes is also associated with fats, carbohydrates, proteins, and the secretion of insulin [1].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 32–42, 2022. https://doi.org/10.1007/978-3-031-11713-8_4


It is a lifelong disease caused by a sugar imbalance that persists throughout a person's life. Diabetes that occurs in old age (type-II) is more harmful than the other types of diabetes. More than 90% of people with diabetes worldwide have adult-onset diabetes [2]. It is predicted that the number of diabetes patients will be around 600 million in the coming 15 years [3]. The disease becomes especially harmful when the glucose level in the blood rises and affects the blood vessels of the heart and kidneys. At present, most affected people have type 2 diabetes, in which insulin is present in the body but is not properly balanced. In another type of diabetes the glucose level increases and no external insulin, such as medicine or other therapies, has been taken to balance it. Such disease could affect 90% of people in India in the next 10 to 20 years. According to a statistical report, the death rate due to diabetes was 3.1% in 2016 and was expected to rise to 18% in 2021 [4]. Using the PIMA Indian dataset, different machine learning classifiers are compared in this study, with the aim of improving quality of life by increasing the proportion of people without diabetes. Machine Learning (ML) algorithms are programs that learn hidden patterns from data, predict outputs, and improve their performance. Different algorithms can be applied to a classification problem so that their accuracy can be compared. In this paper, the Logistic Regression and K-Nearest Neighbors (KNN) algorithms are applied to the pre-processed data to determine their accuracy.

2 Related Work

A system can be implemented from a proposed design to predict diabetes at a particular age, and its output can be highly acceptable [5–7]. Several researchers have used different Machine Learning algorithms in different articles. Such algorithms are reported to give better accuracy than models built on other classifier algorithms [8]. Prediction models for diabetes have been designed with the KNN and Logistic Regression algorithms, and with other algorithms such as the Artificial Neural Network (ANN). In deep learning, Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models have also been used for diabetes prediction, which leads to better prediction outcomes [9]. Back-propagation ANN and CNN have also been used with the Boltzmann method to predict different types of diabetes mellitus. The experiments show that deep learning methods can help physicians to predict diabetes in its early stages. A Recurrent Deep Neural Network (RDNN) was also applied to the PIMA Indian dataset to predict diabetes [10, 11]. Nnamoko et al. [12] presented an ensemble supervised learning approach for predicting diabetes onset; their results are compared with similar methods in the literature that used the same dataset. Joshi et al. [13] used several supervised ML methods to predict diabetes, with the aim of providing a practical approach to early detection of the disease. According to the World Health Organization (WHO), diabetes is among the seven leading causes of death. Several researchers have focused on prediction, usually with regression models, which can classify and estimate different types of diabetes and both low and high diabetes ratios [14].


2.1 Control Mechanism Process for Diabetes Mellitus

The most important factor to consider is self-control and self-management during the pandemic. Health care outcomes improve mainly for educated people who know how to control and manage the disease in an emergency [15]. Nowadays, several digital devices are available for regular checking, such as the digital glucometer, an accurate and convenient blood sugar monitoring device. Such devices are used to monitor people with diabetes (PWD) from childhood to old age and across sociodemographic groups [16].

2.2 Focus on Digital System

To control and manage hyperglycemia, the appropriate diabetes teams need to be engaged for the people concerned. In any sensitive condition, insulin must be prescribed to PWD who have hyperglycemia [17]. Nowadays, PWD use digital, "virtual" health care systems, which reduce manual record keeping by letting patients and diabetes providers communicate over telecom systems. In this communication process, patients follow several constraints to manage diabetes, reporting their self-measured glucose and insulin levels properly so that the virtual care team can act on them, and some organizations can continuously measure the insulin level at the right time. A virtual system can thus support proper diagnosis according to its management rules.
2.3 Medication

During the pandemic, a diabetes patient should focus on eating hygienic food regularly, practising yoga, and keeping to a timetable for daily tasks. PWD need the proper medicine at the right time, based on their glucose level; when a patient is in a serious condition, this applies in particular to hospitalized patients. Further research can be carried out on diagnosis and testing. Vitamin C deficiency in relation to blood glucose can be a major factor for PWD, who should follow anti-diabetic treatment with proper health care.

2.4 Diabetes Services

Rules on diabetes care may differ from region to region. These themes have been analysed by various consultants, who found different positive results that may be generalizable to the situation [18]. For PWD, however, there is little evidence.


2.5 Psychological Issues and Tensity

In the current pandemic situation, the impact on PWD includes physical problems [19]. Today the general population is also more prone to mental health problems [20]. A person with diabetes can be distressed, and adherence to treatment can worsen, as has been observed during and after disasters [21]. In recent times, diabetes has played an important role in raising anxiety during this difficult period. Peer-to-peer support has been suggested to reduce anxiety, but the current scenario is not well suited to it [23]. The pandemic brings additional stress and strain, and diabetes care is a challenge that several countries have addressed through guidelines on how to control such situations [24, 25]. Such protocols are followed by the Chinese Geriatric Endocrine organization, which concentrates on how to manage medication and briefly suggests a step-by-step approach based on digital diabetes services delivered via smartphones. The focus is on patients living at home, on controlling how the virus can be transmitted within their surroundings, and on protocols that let medical professionals manage care through digital applications. Insulin is very important for patients under treatment for hyperglycemia. The Government of England announced guidelines on social distancing, proper sanitation, and self-isolation. When a large number of infections was detected in an area, the government sealed the area and declared it a containment zone. Only seriously ill patients were allowed to go to hospital; other goods could be bought through online shopping services, and the rest of the population, including PWD attending healthcare centres, had to follow the government guidelines [26]. During the pandemic, social media played an important role in making people aware of the situation.

A number of NGOs also provided a helping hand by publicizing the government guidelines and by supporting needs such as oxygen. Testing is another issue during the pandemic; it mainly concerns regular checking of the glucose level of each patient. In the U.K., physical and online care for chronic diseases went down to 50% [27–29]. Strategies have been suggested for emergencies. They advise people to stay at home if symptoms are present, and to seek hospital admission only in critical cases such as breathing problems caused by lack of oxygen. For PWD, the availability of medicine, which is down to 10%, is another important factor.

3 Methodology

To predict diabetes mellitus using the Logistic Regression (LR) classifier, we have implemented three criteria:

• Data Gathering
• Data Preparation
• Implementation of a Classifier for Evaluation of Accuracy


3.1 Data Gathering

The PIMA Indian dataset is accessed from an ML library. It is a collection of 768 instances with 8 features, which are described in Table 1. The dataset consists of 500 instances of the non-diabetic class and 268 instances of the diabetes mellitus class [30].

Table 1. Features for prediction

Number  Name
1       Pregnancy
2       Plasma glucose concentration
3       Blood pressure
4       Skin fold thickness
5       Insulin serum for two hours
6       BMI
7       Pedigree function
8       Age

3.2 Data Prediction

The complete dataset is then checked for missing values. The process is verified with cross-validation techniques.

3.3 Implementation of Classifier for Evaluation of Accuracy

Our dataset is divided 60/40 into training and testing sets to obtain the best accuracy. A further 10% of the training data is held out as a test set for optimization, which gives the best performance of the system. Each input parameter is multiplied by a weight and added to a bias; the weights are selected at random and adjusted iteratively to minimize the error. LR is one of the most powerful classifiers for several diabetes-related problems. Earlier studies have applied several classifiers and picked out different classification algorithms, but such systems do not always give good results, which matters for decisions about medication. This article therefore focuses on obtaining better performance from ML algorithms, namely LR and K-NN, compared through various metrics, and finally on identifying the best diabetes-predicting algorithm. Rajesh et al. [31] used such algorithms for disease detection, applying several classifiers in an ML model to the PIMA Indian data repository.
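The 60/40 split with a further 10% held out from the training portion can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_dataset`, the fixed seed, and the use of row indices as a stand-in for the 768 PIMA records are all assumptions made for the example.

```python
import random

def split_dataset(rows, test_fraction=0.40, validation_fraction=0.10, seed=42):
    """Shuffle rows, then return (train, validation, test) lists.

    test_fraction is taken from the whole dataset; validation_fraction is
    taken from the rows that remain after the test rows are removed.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n_test = int(round(len(shuffled) * test_fraction))
    test = shuffled[:n_test]
    remaining = shuffled[n_test:]

    n_val = int(round(len(remaining) * validation_fraction))
    validation = remaining[:n_val]
    train = remaining[n_val:]
    return train, validation, test

# Hypothetical stand-in for the 768 PIMA records: only row indices are used.
rows = list(range(768))
train, validation, test = split_dataset(rows)
print(len(train), len(validation), len(test))  # 415 46 307
```

Shuffling before slicing keeps the class mix of each part close to that of the whole dataset on average; a stratified split would guarantee it, at the cost of a little more code.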


4 Proposed Architecture

See Fig. 1.

[Figure 1: block diagram with the blocks PIMA Dataset, Pre-Processing, Comparative Study, Result Analysis, Performance Classification, Process Data, Parameter Tuning, and Final Result]

Fig. 1. Proposed architecture diagram for prediction

5 Logistic Regression

Diabetes is spreading severely and affects the rural and urban populations of low-earning countries more than those of developed countries. Diabetes is an important cause of visual impairment, acute renal failure, cardiac arrest, cerebrovascular accident, and lower limb amputation [32]. It is reported that about 84.1 million Americans aged 18 years or older have prediabetes [33]. The disorder occurs when the beta cells of the pancreas are unable to produce sufficient insulin. When diabetes is detected in a woman during pregnancy, it is called gestational diabetes (Type-3) [34].

The logistic model is used to estimate the probability of a certain class or event. It is used in prediction algorithms and is based on the concept of probability. Predicting from the PIMA Indian dataset whether a person has diabetes or not is an analysis of this kind, because the outcome is either true or false, that is, 1 or 0. If we use logistic regression on a data point, the expected output value (Y) lies between 0 and 1. With linear regression, by contrast, the expected output value can exceed 1, which does not give a good result here. As Fig. 2 illustrates, LR resolves the output to 1 or 0 by means of a threshold value: all points above the threshold are treated as positive (1), and all points below the threshold are treated as negative (0). This is how LR performs the classification.

[Figure 2: plot of persons (Y axis) against glucose level (X axis), with a threshold separating Diabetic (1) from Non-Diabetic (0)]

Fig. 2. Diabetic vs. non-diabetic prediction

We focus on the PIMA dataset, taking its eight attributes and trying to determine whether a person is affected or not. To use LR, we first split the dataset into independent and dependent variables. The independent variables, placed on the X axis, include Glucose, Pregnant, Insulin, BMI, Age, Bp, and Pedigree. The dependent variable Y is the level that indicates whether the person is affected (1) or not affected (0). We then divide the data into training and testing sets and fit the LR model to the training data.
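The mechanics described in this section (a weighted sum plus a bias, squashed into the range 0 to 1 and then cut at a threshold) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation used for the reported results: the function names, the batch gradient descent fit, the learning rate, and the six made-up data points with a single scaled glucose-like feature are all hypothetical.

```python
import math

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of class 1: sigmoid of the weighted sum plus bias."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def fit(X, y, learning_rate=0.5, epochs=3000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_proba(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def classify(weights, bias, features, threshold=0.5):
    """Return 1 (diabetic) at or above the threshold, else 0 (non-diabetic)."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Made-up toy data: one scaled glucose-like feature; NOT taken from PIMA.
X = [[0.2], [0.3], [0.4], [0.6], [0.7], [0.9]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = fit(X, y)
print(classify(weights, bias, [0.25]), classify(weights, bias, [0.85]))
```

Because the sigmoid output always lies strictly between 0 and 1, it can be read as a probability, which is the property that a plain linear model lacks.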

6 K-Nearest Neighbors Algorithms (K-NN)

K-NN works on feature similarity, and it can be used for classification. Here K-NN is used to classify whether a person is diabetic or not from different symptoms together with a proper clinical diagnosis. It is one of the simplest supervised ML algorithms and is mostly used for classification, where a data point is classified according to how its neighbors are classified. K-NN stores all available cases and classifies new cases based on a similarity measure. Choosing the right value of K is a process called parameter tuning, and it is important for better accuracy. A common choice is K = sqrt(n), where n is the total number of data points. Odd values of K are selected to avoid ties between the two classes. K-NN considers two outcomes: diabetic, or true (1), and non-diabetic, or false (0). A person is considered diabetic on the basis of symptoms such as weight loss and an imbalance in the glucose level, that is, whether the sugar level lies within the normal range or not. According to the Euclidean distance formula, the distance between two points with coordinates (x, y) and (a, b) is given by

$$\mathrm{dist}(d) = \sqrt{(x - a)^2 + (y - b)^2}$$

[Figure 3: scatter plot of glucose (Y axis) against insulin (X axis) showing Class X, Class Y, and a test sample]

Fig. 3. Diabetic (class Y vs. non-diabetic class X)

In Fig. 3, Class X represents non-diabetic people and Class Y represents diabetic people in this classifier model. To classify a new test sample, we calculate the Euclidean distance from the unknown data point in Fig. 3 to all the other points and compare its distance to the members of both classes. We have a dataset of 768 women who are either diabetic or not. We train the model and then calculate its performance.
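The procedure described in this section (measure the Euclidean distance from the test sample to every stored point, take the K nearest, and let them vote) can be sketched as below. It is a minimal illustration with assumed names (`euclidean`, `knn_classify`, `odd_k_near_sqrt`) and six made-up (insulin, glucose) points; it is not the code behind the reported results.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    distances = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def odd_k_near_sqrt(n):
    """Rule-of-thumb K: about sqrt(n), bumped to the next odd number if even."""
    k = max(1, int(math.sqrt(n)))
    return k if k % 2 == 1 else k + 1

# Made-up (insulin, glucose) points: 0 = non-diabetic (Class X), 1 = diabetic (Class Y).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 7.0), (6.5, 8.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_classify(points, labels, (1.8, 1.2)))  # 0
print(knn_classify(points, labels, (6.8, 7.2)))  # 1
print(odd_k_near_sqrt(768))                      # 27
```

In practice the features are usually scaled to a common range first, since a feature with large numeric values would otherwise dominate the distance.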

7 CART (Classification and Regression Tree)

This model is applied to classification and regression problems; it is a variant of the decision tree approach. It is based on the Gini index, an impurity measure that quantifies how well each feature separates the classes. The Gini index used in CART is

$$\mathrm{Gini} = 1 - \sum_{i=1}^{C} (P_i)^2$$

where C is the number of classes and P_i is the proportion of samples that belong to class i.
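As a small worked illustration of the Gini formula above, the sketch below computes the impurity of a set of labels and the weighted impurity of a threshold split on one numeric feature. The function names and the six made-up glucose readings are assumptions for the example only; a full CART implementation would apply the same calculation recursively to every feature.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(values, labels, threshold):
    """Size-weighted Gini impurity of the two groups made by value <= threshold."""
    left = [label for v, label in zip(values, labels) if v <= threshold]
    right = [label for v, label in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Candidate threshold (an observed value) with the lowest weighted Gini."""
    candidates = sorted(set(values))
    return min(candidates, key=lambda t: weighted_gini(values, labels, t))

# Made-up glucose readings and outcomes; NOT taken from the PIMA dataset.
glucose = [85, 90, 110, 140, 150, 165]
outcome = [0, 0, 0, 1, 1, 1]

print(gini(outcome))                     # 0.5
print(best_threshold(glucose, outcome))  # 110
```

A Gini value of 0 means a group contains a single class, and 0.5 is the maximum for two equally mixed classes, which is why the split with the lowest weighted value is preferred.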

The steps of the CART algorithm for picking a decision node are:

1. Compute the Gini index of each feature.
2. Calculate the weighted average of the Gini index over the values of that feature.
3. Select the attribute with the minimum Gini index.
4. Repeat steps 1, 2, and 3 until the tree has been generated.

CART is based on decision trees. With the PIMA dataset, a few attributes such as Glucose, BMI, and Insulin are used as inputs, and the output can take only one of two values. It is therefore a binary classification task whose answer is either "No" or "Yes", that is, whether a person is affected by diabetes or not. The decision tree builds a model from the training data and then generalizes to new cases. The outcome of the CART classification is in 0 or 1 format, i.e., the person is either affected by diabetes or not [35–38] (Tables 2 and 3).

Table 2. Confusion matrix for PIMA Indian dataset

          A     B
A (−ve)   500   0
B (+ve)   268   0


Table 3. Result analysis

Classification  Precision  Recall  F-measure  Accuracy (%)
LR              0.788      0.789   0.788      78.85
KNN             0.804      0.794   0.798      79.42
CART            0.82       0.83    0.81       83.16
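The four measures reported in Table 3 are all derived from the counts in a binary confusion matrix. The sketch below shows the standard definitions; the function name and the counts passed to it are hypothetical and are not the counts behind Table 3, and the function assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, f_measure, accuracy) from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives, and
    true negatives. Assumes tp + fp > 0 and tp + fn > 0.
    """
    precision = tp / (tp + fp)          # share of predicted positives that are right
    recall = tp / (tp + fn)             # share of actual positives that are found
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy

# Hypothetical counts chosen only to illustrate the arithmetic.
precision, recall, f_measure, accuracy = classification_metrics(tp=80, fp=20, fn=30, tn=170)
print(round(precision, 3), round(recall, 3), round(f_measure, 3), round(accuracy, 3))
```

On an imbalanced dataset such as PIMA (500 negative versus 268 positive instances), precision and recall are more informative than accuracy alone, because a classifier that always predicts the majority class would still reach about 65% accuracy.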

8 Conclusion

Diabetes is a vital clinical issue, and the number of affected persons is growing steadily. Although many online treatments are available, they are not sufficient for chronic patients, and they do not collect the datasets needed to analyse the disease properly. It is a serious condition for PWD, affecting them physically and through mental stress, and the lack of hospital treatment combined with the limited benefit of online advice increases the chance of poor health. The main challenge is to identify diabetes at a preliminary stage. In this article we have tried to sketch a predictive system for diabetes. CART is an important algorithm that is easier to use in Python than other ML classifiers. It allows us to create a model very fast and does not require much computing power. Such algorithms are careful not to overfit the data, which improves the model's performance, as with random forest algorithms. In the ML classifier comparison, all models predict on the PIMA dataset with high accuracy. For pre-processing we use 7 features (Glucose, Insulin, BMI, Pedigree, Pregnancy, Age, BP) and a single result attribute (Outcome). We use the LR, K-NN, and CART algorithms to predict diabetes and evaluate their efficiency and accuracy. We compare the algorithms using several measures: precision, recall, F-measure, and accuracy. All models give good results, above 70%, but CART, with 83% accuracy, performed better than the other models. In future work we plan to implement other ML classifiers to obtain better results.

References

1. Guerra, S., Gastaldelli, A.: The role of the liver in the modulation of glucose and insulin in non-alcoholic fatty liver disease and type 2 diabetes. Curr. Opin. Pharmacol. 55, 165–174 (2020)
2. Panwar, M., Acharyya, A., Shafik, R.A., Biswas, D.: K-nearest neighbor-based methodology for accurate diagnosis of diabetes mellitus. In: 2016 Sixth International Symposium on Embedded Computing and System Design (ISED), pp. 132–136. IEEE, December 2016
3. Wu, H., Yang, S., Huang, Z., He, J., Wang, X.: Type 2 diabetes mellitus prediction model based on data mining. Inform. Med. Unlocked 10, 100–107 (2018)
4. https://www.who.int/news-room/fact-sheets/detail/diabetes
5. Orabi, K.M., Kamal, Y.M., Rabah, T.M.: Early predictive system for diabetes mellitus disease. In: Perner, P. (ed.) ICDM 2016. LNCS, vol. 9728. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41561-1_31
6. Mujumdar, A., Vaidehi, V.: Diabetes prediction using machine learning algorithms. Procedia Comput. Sci. 165, 292–299 (2019)
7. Sayadi, M., Zibaeenezhad, M., Taghi Ayatollahi, S.M.: Simple prediction of type 2 diabetes mellitus via decision tree modeling. Int. Cardiovasc. Res. J. 11(2), 71–76 (2017)
8. Enagi, A.I., Sani, A.M., Bawa, M.: A mathematical study of diabetes and its complications (2017)
9. Swapna, G., Vijayakumar, R., Soman, K.P.: Diabetes detection using deep learning algorithms. ICT Express 4(4), 243–246 (2018)
10. Ramesh, S., Caytiles, R.D., Iyengar, N.C.S.: A deep learning approach to identify diabetes. Adv. Sci. Technol. Lett. 145, 44–49 (2017)
11. Zou, Q., Qu, K., Luo, Y., Yin, D., Ju, Y., Tang, H.: Predicting diabetes mellitus with machine learning techniques. Front. Genet. 9, 515 (2018)
12. Nnamoko, N., Hussain, A., England, D.: Predicting diabetes onset: an ensemble supervised learning approach. In: IEEE Congress on Evolutionary Computation (CEC) (2018)
13. Joshi, T.N., Chawan, P.M.: Diabetes prediction using machine learning techniques. Int. J. Eng. Res. Appl. (Part -II) 8(1), 09–13 (2018)
14. U. M. L. Repository. https://archive.ics.uci.edu/ml/index.php
15. Grenard, J.L., Munjas, B.A., Adams, J.L., et al.: Depression and medication adherence in the treatment of chronic diseases in the United States: a meta-analysis. J. Gen. Intern. Med. 26, 1175–1182 (2011)
16. Viana, L.V., Gomes, M.B., Zajdenverg, L., Pavin, E.J., Azevedo, M.J.: Brazilian type 1 diabetes study group. Interventions to improve patients’ compliance with therapies aimed at lowering glycated educational interventions. Trials 17, 94 (2016)

42

S. K. Sharma et al.

17. Gupta, R., Ghosh, A., Singh, A.K., Misra, A.: Clinical considerations for patients with diabetes in times of COVID-19 epidemic. Diabetes Metab. Syndr. 14, 211–212 (2020)
18. World Health Organization. Mental Health and psychosocial considerations during the COVID-19 outbreak, 18 March 2020. https://www.who.int/docs/default-source/coronaviruse/mental-health-considerations.pdf. Accessed 18 Apr 2020
19. Shelvin, M., McBride, O., Murphy, J., et al.: Anxiety, depression, traumatic stress, and COVID-19 related anxiety in the UK general population during the COVID-19 pandemic, 18 April 2020 [preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/hb6nq
20. The All-Party Parliamentary Group for Diabetes (APPG Diabetes). Diabetes and Mental Health
21. https://www.diabetes.org.uk/resources-s3/2018-08/Diabetes%20and%20Mental%20Health%20%28PDF%2C%205.7MB%29.pdf. Accessed 18 Apr 2020
22. Krousel-Wood, M.A., Islam, T., Muntner, P., et al.: Medication adherence in older clinic patients with hypertension after Hurricane Katrina: implications for clinical practice and disaster management. Am. J. Med. Sci. 336, 99–104 (2008)
23. Khan, Y., Albache, N., Almasri, I., Gabbay, R.A.: The management of diabetes in conflict settings: focus on the Syrian crisis. Diabetes Spectr. 32, 264–269 (2019)
24. Chew, B.H., Vos, R.C., Metzendorf, M.I., Scholten, R.J., Rutten, G.E.: Psychological interventions for diabetes-related distress in adults with type 2 diabetes mellitus. Cochrane Database Syst. Rev. 9, CD011469 (2017)
25. Wondafrash, D.Z., Desalegn, T.Z., Yimer, E.M., Tsige, A.G., Adamu, B.A., Zewdie, K.A.: Potential effect of hydroxychloroquine in diabetes mellitus: a systematic review on preclinical and clinical trial studies. J. Diabetes Res. 2020, 5214751 (2020)
26. NHS London Clinical Networks. Management of diabetes in emergency department during coronavirus pandemic. https://www.england.nhs.uk/london/wp-content/uploads/sites/8/2020/04/Covid-19-Management-of-diabetes-in-emergency-department-crib-sheet-updated150420.pdf. Accessed 18 Apr 2020
27. Linong, J., Guangwei, L., Qiuhong, G., et al.: Guidance on diabetes management in elderly during COVID-19 pandemic. Chin. J. Diabetes 28, 1–6 (2020)
28. Linong, J., Jiajun, Z., Zhiguang, Z., et al.: Recommendation on insulin treatment in diabetes patients affected with COVID-19. Chin. J. Diabetes 28, 1–5 (2020)
29. Association of British Clinical Diabetologists. COVID-19 (Coronavirus) information for health-care professionals (2020). https://abcd.care/coronavirus. Accessed 24 Apr 2020
30. IQVIA. Monitoring the impact of COVID-19 on the pharmaceutical market. https://www.iqvia.com/-/media/iqvia/pdfs/files/iqvia-covid-19-market-tracking-us.pdf?_51587334105503. Accessed 19 Apr 2020
31. Zemedikun, D.T., Gray, L.J., Khunti, K., Davies, M.J., Dhalwani, N.N.: Patterns of multimorbidity in middle-aged and older adults: an analysis of the UK biobank data. Mayo Clin. Proc. 93, 857–866 (2018)
32. https://www.kaggle.com/uciml/pima-indians-diabetes-database
33. Rajesh, K., Sangeetha, V.: Application of data mining methods and techniques for diabetes diagnosis. Int. J. Eng. Innov. Technol. (IJEIT) 2(3), 224–229 (2012)
34. World Health Organization (2021). https://www.who.int/news-room/fact-sheets/detail/diabetes
35. National Institute of Diabetes and Kidney Diseases (2021). https://www.niddk.nih.gov/health-information/diabetes
36. Mujumdar, A., Vaidehi, V.: Diabetes prediction using machine learning algorithms. In: International Conference on Recent Trends in Advanced Computing, ICRTAC (2019)
37. https://www.sciencedirect.com/science/article/pii/S2405959521000205
38. Koundal, D., Gupta, S., Singh, S.: Computer aided thyroid nodule detection system using medical ultrasound images. Biomed. Signal Proc. Control 40, 117–130 (2018)

Survey on Machine Learning Techniques for Software Reliability Accuracy Prediction

Suneel Kumar Rath 1 (B), Madhusmita Sahu 1, Shom Prasad Das 2, and Jitesh Pradhan 3

1 Department of Computer Science Engineering, C. V. RAMAN Global University, Bhubaneswar, Odisha, India
[email protected], [email protected]
2 Department of Computer Science Engineering, Birla Global University, Bhubaneswar, Odisha, India
3 Department of Computer Science Engineering, Indian Institute of Technology, Dhanbad, Jharkhand, India

Abstract. The field of software measurement is still in its early stages. There is little in the way of a quantitative mechanism for expressing software reliability. Existing approaches are not universal and have several flaws. A variety of ways could be used to improve software reliability, but time must be balanced and financial constraints must also be addressed. The software process model affects the quality of the entire system; the longer a flaw in the process goes unrecognized, the worse it gets and the more difficult it is to resolve. In this study, the reliability and maintainability of various machine learning techniques are evaluated. Researchers have identified several software reliability parameter models in recent times, including ones based on stochastic differential equations, nonhomogeneous Poisson processes, and the Bayes process. Although these models can accurately predict the number of software faults in specific testing settings, no single model can predict the number of software faults in all situations. To overcome this limitation, traditional statistical procedures are being replaced with intelligent reliability computing techniques, which has been accompanied by a large increase in software dependability expectations in recent years. To explore this topic, we briefly review a range of soft computing techniques presented by various researchers.

Keywords: Software reliability model · Software quality · Testing · Machine learning

1 Introduction

Owing to considerations of corporate profitability, user safety, and environmental preservation, software quality has become pivotal to delivering dependable software. Reliability is a critical consideration [3] in the software development life cycle [2], because unreliable software is more likely to have flaws or problems that can cause system failure. Particle Swarm Optimization (PSO), Modified Genetic Swarm Optimization (MGSO), Ant Colony Optimization (ACO), and learning approaches are some of the optimization techniques that have been used in software reliability. The reliability of software can be represented by several non-functional features or traits. Although there are numerous such elements, dependability and maintainability are the most essential factors in determining software quality [4]. A good piece of software should be dependable and error-free. Software dependability [2] is a measure of the likelihood of, or trust in, the product's capacity to perform, and it also applies when human employees and intelligent machines of varying competence and intelligence participate in a shared endeavour characterised by reciprocal learning. We look into common scenarios involving human and machine learning, as well as human and machine competency in industrial systems, to find an answer that will work in its intended context. Component-based software engineering encourages users to reuse existing software to develop high-quality products while saving time, memory, and money. Basic soft computing approaches such as Support Vector Machines (SVM), Neural Networks (NN), Genetic Algorithms (GA), Fuzzy Logic, Ant Colony Optimization (ACO), and Particle Swarm Optimization (PSO) are examined in this study.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 43–55, 2022. https://doi.org/10.1007/978-3-031-11713-8_5

2 Research Methodology and Contribution

Singh et al. [5] created a model based on decision tables, which are the simplest hypothesis spaces for inference, are easy to understand, and use a supervised learning approach. Some datasets contain continuous features, while others contain discrete (value-based) features. The datasets were validated with incremental cross-validation, from leave-one-out to tenfold, independent of the number of folds selected; tenfold cross-validation was used to estimate the accuracy of each candidate node. Because of the considerable uncertainty of the estimate, cross-validation was repeated until the standard deviation of the mean fell below 1%.

Kumar et al. [7] presented many steps of computational model research for evaluating the usefulness of applications in a given scenario. Their work focuses on back-propagation algorithms and network design difficulties. An artificial neural network is modelled on the brain and is made up of many nodes that resemble neurons. Each node has a function that determines the node's output for a particular set of inputs through local parameters. The neurons may be organized in a single layer or in multiple layers. Software reliability growth models fall into two principal categories, parametric and non-parametric. The majority of parametric models are based on the non-homogeneous Poisson process.

Karunanithi et al. [8] observed that different models have varying predictive capacities in different test phases, and that no single model can be relied on for accurate predictions in all cases. Their connectionist approach examined various network regimes and data representation methods. Five well-known software reliability growth models were applied to real data sets and combined to improve predictive accuracy. The work predicts software reliability using neural networks and the findings acquired through them.
Feed-forward networks and recurrent networks, together with alternative training regimes, can be used to estimate software reliability. Hu et al. [9] proposed software reliability models for software fault detection data. The neural connectionist approach yields a robust model with less restrictive assumptions. Elman's recurrent neural network

Survey on Machine Learning Techniques for Software Reliability Accuracy Prediction


is being investigated for its ability to predict software faults. The approaches compared are the non-homogeneous Poisson process (NHPP) and connectionist models. With no uniformly best model, specific predictive models have emerged for different scenarios. Multiple neural network models provide prediction errors that are higher than, or at least equivalent to, those of NHPP growth models. Hoi et al. [10] suggested a model that establishes a new quantitative metric for systematically evaluating the quality and output of the software process, as well as the defect management process, in four real-world industrial software projects. The machine learning approach in that work employs supervised classification algorithms to classify each task execution as normal or abnormal. The data are pre-processed by converting raw data into sequences, building a sequence classifier, labelling the data sets, and testing the classifier on raw data stored in databases. Pattanayak et al. [11] argued that early detection of fault-prone software modules in the development process is pivotal to conserving effort. For software quality prediction, several machine learning techniques such as neural networks, fuzzy logic, and Bayesian models are applied. Neural networks, Bayesian networks, genetic algorithms, fuzzy logic, tree-based algorithms, decision tree algorithms, and revised CRT algorithms are examples of machine learning approaches. Almost all solutions protect designated observation points. Costa et al. [12] suggested a model that may be applied to a variety of conventional and non-parametric models, using genetic programming and boosting to improve performance. This technique outperforms classical genetic programming, which necessitates ten times the number of executions. Product efficiency depends on its ability to be reliable.
Modelling software reliability is a less expensive way of evaluating it. Classical genetic programming models concentrate on time, whereas the boosting technique employs ten boosting rounds and combines models. Lo [13, 14] proposed an early-phase testing strategy for reliable applications that provides a more effective way to handle failure data. Various types of input data are used to predict model outputs. Support Vector Machine (SVM) models combined with genetic algorithms were used. SVM is a mathematically grounded method for problems involving nonlinear analysis and time series. Genetic algorithms are used to select the SVM parameters for a particular prediction model. Lee et al. [15] suggested three multi-step-ahead prediction strategies for component-based software reliability models and compared their prediction performance on failure count data and time-between-failures data. The recursive strategy gives more accurate estimates for time-between-failures data. The results also indicate the relevance of data-driven approaches in long-term prediction. Finding a robust multi-step prediction approach is therefore advantageous.
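The repeated tenfold cross-validation mentioned earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [5]: the majority-class stand-in model, the use of the standard error of the run means as the stopping rule, and all function names are hypothetical.

```python
import random
import statistics


def k_fold_indices(n_samples, k, rng):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_class_accuracy(train_labels, test_labels):
    """Hypothetical stand-in model: always predict the most common training label."""
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


def repeated_cross_validation(labels, k=10, tolerance=0.01, max_repeats=100, seed=0):
    """Repeat k-fold cross-validation until the standard error of the mean
    accuracy drops below `tolerance` (assumed stopping rule) or max_repeats is hit.
    Returns (mean accuracy over runs, number of runs)."""
    rng = random.Random(seed)
    run_means = []
    for _ in range(max_repeats):
        fold_scores = []
        for test_fold in k_fold_indices(len(labels), k, rng):
            held_out = set(test_fold)
            train = [labels[i] for i in range(len(labels)) if i not in held_out]
            test = [labels[i] for i in test_fold]
            fold_scores.append(majority_class_accuracy(train, test))
        run_means.append(statistics.mean(fold_scores))
        if len(run_means) >= 2:
            std_error = statistics.stdev(run_means) / len(run_means) ** 0.5
            if std_error < tolerance:
                break
    return statistics.mean(run_means), len(run_means)
```

Any classifier could replace the stand-in model, provided it is trained only on the nine training folds and scored on the held-out fold.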

3 Reliability Measurement Methods and Process

Reliability metrics are used to quantify the dependability of a software product. The choice of metric is determined by the type of system to which it applies as well as by the needs of the application domain. The current techniques


S. K. Rath et al.

of software reliability estimation can be divided into four categories. The first is product metrics: the measurements applied to artefacts such as system design documents and requirement specification documents. These measurements help decide whether the product is satisfactory by monitoring maintainability, usability, portability, and reliability. They are taken from the source code itself. Software size is used as a predictor of complexity, development effort, and reliability. Lines of Code (LOC) is a straightforward way to assess software size. LOC is based on the idea that program length can be used to predict program attributes such as effort and ease of maintenance. Test coverage metrics estimate faults and reliability by executing tests on software products, on the assumption that reliability is a function of the portion of the software that has been successfully verified or tested. The ability to describe complexity is critical since it is directly tied to software reliability. By converting the code into a graphical representation, complexity-oriented metrics can determine the complexity of a program's control structure. Quality metrics are used at different stages of product development to assess the product's quality. Defect Removal Efficiency (DRE) is a vital quality metric: it measures quality as the result of the quality assurance and control activities performed throughout the development process. The second category is project management metrics, which characterise the attributes and execution of a project. If the software engineer manages the project well, better results can be produced. The ability to complete projects on schedule and within the expected quality targets is linked to the development process. When developers adopt ineffective methods, the cost rises.
A stronger development process, risk management process, and configuration management process can all help to increase reliability. The third category is process metrics, which measure the software development process and its environment. They can show whether a process is running well, since they report on measurements such as process duration and modification time. The purpose of a process metric is to get the process right the first time. The process directly affects the product's quality. Accordingly, process metrics can be used to improve, monitor, and estimate software reliability and quality. They describe the effectiveness and quality of the processes that produce the software product. The fourth category is fault and failure metrics. A defect is an imperfection in a program that arises when the software engineer commits an error and that makes the program fail when run under specific conditions. These metrics are used to determine whether execution is failure-free. Figure 1 shows the flow of the software reliability engineering process. First, the software's reliability objectives are listed. Next, an operational profile is created. Software testing and failure data collection are carried out with these inputs. The current software reliability is estimated using an appropriate tool or an SRGM. If the reliability objective is met, the software is deployed in the real-time environment; if not, testing continues. Finally, the procedure stops once reliability has been validated [37].
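The test-until-objective loop of the reliability engineering process can be sketched with a simple SRGM. The example assumes the Goel-Okumoto NHPP model, whose mean value function is m(t) = a(1 - exp(-bt)); this choice of model, the parameter values, and the function names are illustrative assumptions and are not taken from [37]. In practice a and b would be estimated from the collected failure data.

```python
import math


def expected_failures(t, a, b):
    """Goel-Okumoto mean value function m(t) = a * (1 - exp(-b * t)).
    a: expected total number of failures, b: failure detection rate (assumed known)."""
    return a * (1.0 - math.exp(-b * t))


def reliability(x, t, a, b):
    """Probability of no failure in (t, t + x] after testing up to time t:
    R(x | t) = exp(-(m(t + x) - m(t)))."""
    return math.exp(-(expected_failures(t + x, a, b) - expected_failures(t, a, b)))


def release_time(a, b, mission_time, objective, step=1.0, max_time=10_000.0):
    """Keep testing in fixed time increments until the reliability objective is met.
    Returns the testing time at which the software could be deployed."""
    t = 0.0
    while t < max_time:
        if reliability(mission_time, t, a, b) >= objective:
            return t  # objective met: deploy
        t += step  # objective not met: continue testing
    raise RuntimeError("reliability objective not reached within max_time")
```

For example, `release_time(a=100, b=0.05, mission_time=1.0, objective=0.9)` returns the first testing time, in whole steps, at which the one-unit mission reliability reaches 0.9 under these hypothetical parameters.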


Fig. 1. Flow of the software reliability process


4 Related Work

Machine learning draws on both natural and artificial ideas. Soft computing techniques are useful in a variety of fields; machine learning and artificial intelligence are used in engineering applications such as communication networks, mobile robotics, power electronics, and aircraft. They are suggested because they provide the most effective real-time mapping. SRS documents incorporate responsibility allocation, testing strategies in which the client checks the product's operation in a real-world setting, test data for two-point validation, and client involvement (see, e.g., [16, 17]). It is quite difficult to represent software failure as a mathematical model that depends on input factors [19]. In factor analysis, according to the Kaiser criterion, the eigenvalues (characteristic roots) are the deciding criterion for factors: an eigenvalue is retained as a factor if it is greater than one, and discarded otherwise [20]. Exploratory factor analysis is a method that allows any factor or variable to be linked to any other and is not based on any prior assumption. Software reliability models are critical to senior managers' decision-making when assessing software reliability growth. Non-homogeneous Poisson process (NHPP) models [11, 21] are a reliable and successful means of predicting, controlling, and analysing software reliability. Software Reliability Growth Models (SRGMs) have been used in debugging settings to examine organisations' software faults and failures over the past decade of research progress in fault prediction.
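The Kaiser criterion described above can be illustrated with a small sketch. To stay self-contained it handles only the two-variable case, where the eigenvalues of a symmetric 2x2 correlation matrix have a closed form; the function names and the correlation value of 0.8 are hypothetical choices for illustration.

```python
import math


def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return [mean + radius, mean - radius]


def kaiser_retained(eigenvalues):
    """Return the eigenvalues kept as factors under the Kaiser criterion (> 1)."""
    return [value for value in eigenvalues if value > 1.0]


# Hypothetical correlation matrix of two software metrics with correlation 0.8:
# [[1.0, 0.8], [0.8, 1.0]]
eigs = eigenvalues_2x2_symmetric(1.0, 0.8, 1.0)
factors = kaiser_retained(eigs)
```

With two strongly correlated metrics, only one eigenvalue exceeds one, so a single factor is retained. For larger matrices a numerical eigenvalue routine would replace the closed-form helper.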

5 Motivations and Objectives

5.1 Motivation

It is vital to achieve high levels of reliability when building software for critical applications. In today's software industry, this is a particularly difficult problem. Because software is now expected to be self-adaptive, it is frequently more complex. Achieving reliability for a complex system requires solving a multi-objective problem across multiple domains. Over the last forty years, several SRGMs have been proposed to improve reliability prediction. Although a variety of reliability models are now available, none of them performs well across all projects.

5.2 Objective

Computational intelligence techniques outperform statistical methods in prediction, and so can be used to forecast software faults more accurately [22]. In this article we survey and assess reliability predictions. Industry publishes reports on best practices and case studies. Most of them are generic, such as [23, 24], and cover only a few aspects of quality assurance or testing. Reference [25] gave simple questions for evaluating testing activity. These questions provide valuable insight into which aspects, such as monitoring input features, should be considered. Our recommendations, which address these issues, provide more precise guidance, including an in-depth examination of certain topics.


6 Methodology

When the first version of the guidelines was created, the consortium comprised 39 experts from academia and industry and three organisations. The participants include experts in system safety, software engineering, quality assurance, and machine learning. Factory automation, entertainment, electronics, communications, and IT solutions are some of the application domains they represent. To develop the recommendations, the consortium held two types of discussions. The first focused on quality assurance concerns in distinct application domains. The goal was to draw specific insights rather than generic ones, since general insights might be too abstract for the demands of the different fields. The QA4AI Consortium is a non-profit organisation based in Japan that meets to discuss quality assurance for AI based on machine learning. Its objectives are to encourage the adoption of AI systems based on machine learning by reducing the risks associated with AI/ML and by raising public understanding of their quality, including its limits. The guideline is organised as follows, corresponding to the two types of discussion. Only a few reliability modelling tools are used in real-time systems. For data, the guideline covers the following considerations: intellectual property rights and the specificity of the data generator; quantity and cost; significance and requirements; population-sample relationships; outliers and missing values; bias and contamination; ownership; and the validity and independence of validation data.

6.1 Axes of Quality Evaluation

There are two approaches to software development: deductive and inductive. In the first case, that of traditional software, engineers have extensive development experience.
Process assessment, measurement, reviews, and testing are typical examples of quality assurance expertise. Since ML-based systems are produced automatically and are nonlinear and complex, engineers have little experience with them. As a result, traditional methods of process evaluation, measurement, and review are ineffective. The usefulness of FEET (frequent, entire, and exhaustive testing) has not waned. The guidelines identify five axes of quality evaluation for AI-based systems: Data Integrity, System Quality, Model Robustness, Customer Expectation, and Process Agility. Data integrity describes the high level of certainty that samples of inputs and outputs provide.

6.2 Technical Catalogue

Most technical guidelines generalise and describe successful industrial practices and techniques, at least in leading companies. Techniques for ensuring the quality of machine learning models or systems, on the other hand, are still at an early stage of research and development. As a result, the catalogue gathers trends from recent research papers in the software engineering community. It also includes a list of common machine learning concepts, including precision/recall, over/underfitting, and cross-validation, which are mostly used for performance evaluation. The current trends covered in the first version of the guideline are as follows: pseudo-oracle usage,


e.g., [26]; metamorphic testing, e.g., [27, 28]; search for adversarial examples and robustness evaluation, e.g., [27, 29]; local explanation of each AI output, e.g., [29, 30]; and overall explanation of the trained model, e.g., [24]. Adversarial machine learning requires secure coding.
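Metamorphic testing, one of the trends listed above, checks relations between outputs on related inputs instead of comparing against a known correct answer. The sketch below applies two such relations to a toy 1-nearest-neighbour classifier. The classifier, the two relations, and the data are assumptions chosen for illustration and are not taken from [27, 28]; the relations hold only when no query is equidistant from differently labelled points.

```python
import random


def nearest_neighbor_predict(train_points, train_labels, query):
    """Toy 1-nearest-neighbour classifier using squared Euclidean distance."""
    def squared_distance(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    best = min(range(len(train_points)),
               key=lambda i: squared_distance(train_points[i], query))
    return train_labels[best]


def check_permutation_relation(points, labels, queries, seed=0):
    """Relation 1: shuffling the training set must not change any prediction."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)
    shuffled_points = [points[i] for i in order]
    shuffled_labels = [labels[i] for i in order]
    return all(
        nearest_neighbor_predict(points, labels, q)
        == nearest_neighbor_predict(shuffled_points, shuffled_labels, q)
        for q in queries
    )


def check_scaling_relation(points, labels, queries, factor=2.0):
    """Relation 2: scaling every feature by the same positive factor
    must not change any prediction."""
    scaled_points = [[factor * v for v in p] for p in points]
    return all(
        nearest_neighbor_predict(points, labels, q)
        == nearest_neighbor_predict(scaled_points, labels, [factor * v for v in q])
        for q in queries
    )


# Hypothetical module metrics and labels, used only to exercise the relations.
points = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
labels = ["clean", "clean", "faulty", "faulty"]
queries = [[0.4, 0.2], [5.4, 4.0], [2.0, 2.0]]
```

A violated relation signals an implementation bug even though the true label of each query is unknown, which is what makes the technique useful when no test oracle exists.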

7 Evaluation Criteria

When comparing the performance of different software reliability prediction models, recall and precision are the most widely used metrics. Precision is the ratio of correctly predicted fault-prone modules to all modules predicted as fault-prone; it gives the percentage of positive identifications that are correct. It is defined as

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

where TP is the number of true positives and FP the number of false positives. Recall is the number of correctly predicted fault-prone modules divided by the total number of modules that are actually fault-prone; it gives the proportion of actual positives that were correctly detected. It is defined as

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

where FN is the number of false negatives and TP the number of true positives. The F1 measure combines recall and precision. It is defined as

$$F_1 = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \tag{3}$$

Accuracy is another significant factor in assessing whether a defect prediction is effective. It is the number of correctly predicted instances divided by the total number of instances in the test set:

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \tag{4}$$

where TN is the number of true negatives.

Table 1. Prediction of software quality using machine learning.

| Classifier prediction | Observed: True | Observed: False |
|---|---|---|
| +ve | True +ve (TP) | False +ve (FP) |
| −ve | False −ve (FN) | True −ve (TN) |
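Eqs. (1)-(4) can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name, the convention of reporting 0.0 when a denominator is zero, and the example counts are assumptions made for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts,
    following Eqs. (1)-(4). Ratios with a zero denominator are reported as 0.0
    (an assumed convention)."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * recall * precision, recall + precision)
    accuracy = ratio(tp + tn, tp + fn + tn + fp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 40 faulty modules found, 10 false alarms,
# 20 faulty modules missed, 30 clean modules correctly passed.
example = classification_metrics(tp=40, fp=10, fn=20, tn=30)

# A classifier that never predicts "faulty" on an imbalanced set
# has high accuracy but zero precision, recall and F1.
never_faulty = classification_metrics(tp=0, fp=0, fn=5, tn=95)
```

The second call shows why accuracy alone can mislead on imbalanced fault data.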


8 Experiments and Results

8.1 Defect Prediction

Here we used different machine learning approaches to predict defects.

8.1.1 Data Collection

The PROMISE data repository [35] provided the datasets for fault prediction. Poi, Tomcat, Forrest, Velocity, Ant, and Workflow are among the open-source projects featured in our data. Although many versions were available, we chose the most recent version of each project. Table 2 contains more information about the datasets. Several superfluous OO metrics were found in the PROMISE datasets and were removed. Only some of the metrics listed above were retained and used to train the classifiers since, after a thorough review of the literature, they were deemed the most effective for fault prediction (Table 2).

Table 2. Defect analysis datasets

| Project | Version | Total objects | Faulty (%) (approx.) |
|---|---|---|---|
| Poi | 3.1 | 442 | 64 |
| Ant | 1.8 | 745 | 24 |
| Forrest | 0.9 | 32 | 18 |
| Tomcat | 6.2 | 856 | 11 |
| Velocity | 1.7 | 229 | 34 |

We extensively examined the dataset for any missing or unknown values. To examine the classifiers' capacity to discern whether or not a given instance is faulty, we labelled all samples as faulty or not faulty and used this label as the class attribute. The proportion of faulty instances was not evenly distributed across the data sets evaluated: Poi, for example, had 64% faulty instances, while Tomcat had only 11%. This was done on purpose in order to give the classifiers a diverse and balanced set of data to work with. Because the proportion of defects varied, the whole data set was analysed and validated using tenfold cross-validation during the training phase.

8.1.2 Experimentation

For experiments with decision trees, we picked J48, Weka's implementation of C4.5, and the Random Forest (RF) ensemble method, both of which are available in Weka [33]. We did not apply any further filtering in these approaches because the faulty instances were labelled manually on the basis of the number of defects they contained. As a result, the J48 classifier takes only nominal classes into account and does not require any additional filters. Lack


of Cohesion in Methods (LCOM) was found at the root of J48's pruned tree, indicating that it was one of the most important decision factors, followed by coupling between methods (CBM), response for class (RFC), and depth of inheritance tree (DIT). One of the most basic models is Naive Bayes, which makes the naive assumption that all attributes in the training set are independent of one another, which is not necessarily the case. We used the Simple Estimator in Weka [33] to build Bayesian networks. The important estimators are weighted methods per class (WMC), DIT, and number of children (NOC), followed by LCOM, RFC, inheritance coupling (IC), and coupling between objects (CBO). We used the PART classifier from Weka [33] to test rule-based classifiers. IBk, Weka's implementation of k-nearest neighbours [33], was used to test the nearest-neighbour class of classifiers. If k is set too low, the decision is susceptible to noise; if k is set too high, a larger region of the instance space must be covered, which leads to incorrect classifications. Performance is measured with the area under the ROC curve (AUC). The ROC curve is two-dimensional [36], with the true positive rate on the y-axis and the false positive rate on the x-axis. The defect detection evaluation results are shown in Table 3. We found that non-defective cases had good precision and recall, with most classifiers predicting them correctly 90% of the time. When it came to recognising defective instances, however, the majority of the classifiers performed badly. This explains why we chose AUC over accuracy as our evaluation statistic. For defective cases, the TP, accuracy, and recall were all zero, notably in the case of SVM.

Table 3. Detection of defects evaluation results

| Methodology | Accuracy | Recall | AUC |
|---|---|---|---|
| ANN | 0.728 | 0.631 | 0.741 |
| Naïve Bayes | 0.615 | 0.552 | 0.712 |
| SVM | 0.395 | 0.498 | 0.498 |
| KNN | 0.806 | 0.622 | 0.791 |
| J48 | 0.754 | 0.625 | 0.721 |
| Bayesian Network | 0.778 | 0.697 | 0.848 |
| Random Forest | 0.741 | 0.653 | 0.802 |
| PART | 0.777 | 0.618 | 0.761 |

According to AUC, the Support Vector Machine was the least successful model for defect prediction, with an AUC of 0.498. J48 (0.721) outperformed Naive Bayes (0.712) by a small margin. With an AUC of 0.74, ANN did reasonably well. Random Forest also achieved an AUC of 0.80. The Bayesian network, with an AUC of 0.848, outperformed all the other classifiers in terms of overall performance. As seen in the Bayesian network, DIT, WMC, and NOC are critical factors that lead to software system faults.
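The AUC values discussed here can be interpreted as the probability that a randomly chosen defective module receives a higher score than a randomly chosen non-defective one. The sketch below computes AUC in that rank-based form; it is an illustrative implementation, not Weka's, and the function name and example scores are hypothetical.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve in its rank (Mann-Whitney) form: the fraction
    of positive/negative pairs in which the positive example scores higher,
    counting ties as one half. Labels are 1 (defective) or 0 (non-defective)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: one defective module is ranked below a clean one.
partial = auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# A model that gives every module the same score cannot rank at all.
uninformative = auc_from_scores([1, 0, 0, 0], [0.2, 0.2, 0.2, 0.2])
```

A constant-score model lands at the chance level of 0.5 regardless of class balance, which is consistent with reading an AUC near 0.5 as a classifier that does not separate defective from non-defective modules.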


Table 4. Error rate vs. accuracy

| Methodology | Accuracy | Mean absolute error | Root mean squared error |
|---|---|---|---|
| ANN | 67% | 0.282 | 0.410 |
| SVM | 64% | 0.316 | 0.412 |
| PART | 67% | 0.238 | 0.425 |
| Bayes. Network | 63% | 0.341 | 0.276 |
| KNN | 65% | 0.257 | 0.401 |
| J48 | 69% | 0.247 | 0.426 |
| Naïve Bayes | 62% | 0.341 | 0.254 |
| Random Forest | 68% | 0.261 | 0.373 |

Almost all of the classifiers performed similarly in terms of accuracy in the maintainability prediction task, with accuracy ranging from 62% to 69%. J48 was the most accurate at 69%, followed by Random Forest at 68% and by the Artificial Neural Network and PART at 67% each. Table 4 lists each classifier's accuracy, mean absolute error, and root mean squared error. When it came to predicting high-maintenance instances, the majority of classifiers had low accuracy. The least accurate methods were the Bayesian network and naïve Bayes.
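The two error measures reported alongside accuracy can be computed as shown below. This is a minimal sketch that assumes the errors are measured between actual class values (0 or 1) and predicted class probabilities; the source does not state how they were obtained, and the example values are hypothetical.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all instances."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference over all instances."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical actual classes and predicted probabilities of the positive class.
actual = [1, 0, 1, 0]
predicted = [0.9, 0.2, 0.6, 0.5]
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

Because squaring weights large errors more heavily, the root mean squared error of a set of predictions is never smaller than its mean absolute error, which gives a quick sanity check on reported values.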

9 Conclusion

Various SRGMs have long been used to estimate software reliability measures such as the level of software reliability, the software failure rate, and the number of residual faults. Software dependability is a crucial criterion for assessing software quality. It can be greatly improved only by removing or considerably reducing software defects and failures. Software dependability is stochastic as well as dynamic: it is a probabilistic measure that treats software failures as random events. We experimented with a range of common classifiers using data from open-source projects in the PROMISE data repository. AUC was chosen as the evaluation statistic because it accounts for variation in class distribution. Based on our findings, Random Forest appears to be an excellent classifier in terms of AUC for predicting defects and maintainability. In the future, these methodologies could be used to develop a new model that incorporates factors such as dependency, complexity, component interaction, reusability, and failure rate to forecast software reliability.

References

1. Brooks, F.P.: The Mythical Man Month: Essays on Software Engineering. Addison Wesley, Reading (1998)
2. Musa, J.D.: A theory of software reliability and its application. IEEE Trans. Softw. Eng. SE-1, 312–327 (1971)
3. Ong, L.F., Isa, M.A., Jawaw, D.N.A., Halim, S.A.: Improving software reliability growth model selection ranking using particle swarm optimization. J. Theor. Appl. Inf. Technol. 95(1) (2017)
4. AL-Saati, D., Akram, N., Abd-AlKareem, M.: The use of cuckoo search in estimating the parameters of software reliability growth models. (IJCSIS) Int. J. Comput. Sci. Inf. Secur. 11(6) (2013). arXiv preprint arXiv:1307.6023
5. Singh, Y., Kumar, P.: Application of feed forward neural networks for software reliability prediction. ACM SIGSOFT Softw. Eng. Notes 35(5), 1–6 (2010)
6. Rana, R., Staron, M.: Machine learning approach for quality assessment and prediction in large software organizations. https://doi.org/10.1109/ICSESS.2015.7339243
7. Singh, Y., Kumar, P.: Prediction of software reliability using feed forward neural networks. https://doi.org/10.1109/CISE.2010.5677251
8. Karunanithi, N., Whitley, D., Malaiya, Y.K.: Prediction of software reliability using connectionist models. IEEE Trans. Softw. Eng. 18(7) (1992). https://doi.org/10.1109/32.148475
9. Hu, Q.p., Dai, Y.S., Xie, M., Ng, S.h.: Early software reliability prediction with extended ANN model. https://doi.org/10.1109/COMPSAC.2006.130
10. Khoshgoftaar, T.M., Seliya, N.: Tree based software quality estimation models for fault prediction. https://doi.org/10.1109/METRIC.2002.1011339
11. Pattnaik, S., Pattanayak, B.K.: A survey on machine learning techniques used for software quality prediction. https://doi.org/10.1504/IJRIS.2016.080058
12. Costa, E.O., Pozo, A.T.R., Vergilio, S.R.: A genetic programming approach for software reliability modeling. https://doi.org/10.1109/TR.2010.2040759
13. Lo, J.H.: Predicting software reliability with support vector machines. In: 2010 Second International Conference on Computer Research and Development (2010). https://doi.org/10.1109/ICCRD.2010.144
14. Osakada, K., Yang, G.B., Nakamura, T., Mori, K.: Expert system for cold-forging process based on FEM simulation. CIRP Ann. 39(1), 249–252 (1990)
15. Park, J., Lee, N., Baik, J.: On the long term predictive capability of data driven software reliability model: an empirical evaluation. https://doi.org/10.1109/ISSRE.2014.28
16. Kumar, L., Misra, S., Rath, S.K.: An empirical analysis of the effectiveness of software metrics and fault prediction model for identifying faulty classes. Comput. Stand. Interfaces 53, 1–32 (2017)
17. Machida, F., Xiang, J., Tadano, K., Maeno, Y.: Lifetime extension of software execution subject to aging. IEEE Trans. Reliab. 66(1), 123–134 (2017)
18. Rumbaugh, J., Jacobson, I., Booch, G.: Unified Modeling Language Reference Manual. Pearson Higher Education, Hoboken (2004)
19. Treleaven, P., Yingsaeree, C.: Algorithmic trading. IEEE Comput. Soc. 44(11), 61–69 (2011)
20. Herzig, K., Nagappan, N.: Empirically detecting false test alarms using association rules. In: Proceedings of the 37th International Conference on Software Engineering, Florence, Italy, 16–24 May 2015
21. Malhotra, R., Negi, A.: Reliability modeling using particle swarm optimization. Int. J. Syst. Assur. Eng. Manag. 4(3), 275–283 (2013). https://doi.org/10.1007/s13198-012-0139-0. The Society for Reliability Engineering, Quality and Operations Management (SREQOM), India and The Division of Operation and Maintenance, Lulea University of Technology, Sweden
22. Amershi, S., et al.: Software engineering for machine learning: a case study. In: The 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE SEIP 2019), pp. 291–300, May 2019
23. Zinkevich, M.: Rules for reliable machine learning: best practices for ML engineering. In: NIPS 2016 Workshop on Reliable Machine Learning in the Wild, December 2017
24. Breck, E., Cai, S., Nielsen, E., Salib, M., Sculley, D.: What's your ML test score? A rubric for ML production systems. In: NIPS 2016 Workshop on Reliable Machine Learning in the Wild, December 2017
25. Pei, K., Cao, Y., Yang, J., Jana, S.: DeepXplore: automated white box testing of deep learning systems. In: The 26th Symposium on Operating Systems Principles (SOSP 2017), pp. 1–18, October 2017
26. Tian, Y., Pei, K., Jana, S., Ray, B.: DeepTest: automated testing of deep neural network driven autonomous cars. In: The 40th International Conference on Software Engineering (ICSE 2018), pp. 303–314, May 2018
27. Dwarakanath, A., et al.: Identifying implementation bugs in machine learning based image classifiers using metamorphic testing. In: The 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2018), pp. 118–120, July 2018
28. Huang, X., Kwiatkowska, M., Wang, S., Wu, M.: Safety verification of deep neural networks. In: Majumdar, R., Kunčak, V. (eds.) CAV 2017. LNCS, vol. 10426, pp. 3–29. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63387-9_1
29. Ma, L., et al.: DeepGauge: multi granularity testing criteria for deep learning systems. In: The 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), pp. 120–131 (2018)
30. Ribeiro, M.T., Singh, S., Guestrin, C.: Why should I trust you? Explaining the predictions of any classifier. In: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016), pp. 1135–1144, August 2016
31. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29, 131–163 (1997)
32. McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. In: AAAI 98 Workshop on Learning for Text Categorization (1998)
33. Hall, M., et al.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
34. Cover, T.M., Hart, P.E.: Nearest neighbour pattern classification. IEEE Trans. Inf. Theory IT-13(1), 21–27 (1967)
35. Boetticher, G., Menzies, T., Ostrand, T.J.: Promise repository of empirical software engineering data (2007). http://promisedata.org/repository
36. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog. 30(7), 1145–1159 (1997)
37. Lyu, M.R.: Software reliability engineering: a roadmap. In: Future of Software Engineering, pp. 153–170. IEEE Computer Society, May 2007

Classification of Pest in Tomato Plants Using CNN

K. N. S. Dharmasastha1, K. Sharmila Banu1, G. Kalaichevlan2, B. Lincy2, and B. K. Tripathy3(B)

1 School of Computer Science and Engineering, VIT, Vellore, TN 632014, India
2 School of Agricultural Innovations and Advanced Learning, VIT, Vellore, TN 632014, India
{gkalaichelvan,lincy.b}@vit.ac.in
3 School of Information Technology and Engineering, VIT, Vellore, TN 632014, India

Abstract. Pests are very harmful to crops, and their number is increasing day by day. Several pesticides have been developed over the years to control them. However, to apply pesticides, it is essential to identify their category and utility. As part of this process, we carry out a study comparing different methods of managing tomato pests in a field environment. A CNN model, delivered through a mobile application, is developed to identify pests on tomato plants. Image pre-processing consisted of data cleaning and image augmentation. A training accuracy of 0.9985 and a test accuracy of 0.9891 are attained. The application gives farmers insight into the best pest management techniques, derived from the comparison.

Keywords: Tomato · Pests · Multi-class classification · Convolution neural networks · Agriculture · Transfer learning · Tomato pests

1 Introduction

Tomatoes are one of the most significant food crops, with a large impact on health [1]. They are available all year round, are consumed in large amounts, and offer considerable health benefits. They are cultivated around the world: smallholders value them as a cash crop, and they are equally important commercially for medium-scale farmers. In terms of importance, the tomato ranks just after the potato as the most preferred crop [2]. The top tomato-producing states of India include Tamil Nadu, Telangana, Uttar Pradesh, Haryana, Bihar, Maharashtra, Chhattisgarh, West Bengal, Odisha, Karnataka, Madhya Pradesh and Andhra Pradesh. These states account for a high percentage of total tomato production. India produces approximately 22337.27 kilotons of tomatoes, and approximately 260.4 thousand hectares of land across different states are used to cultivate the crop [3]. The main pests affecting tomatoes are fruit borers, pinworms, aphids and whiteflies [4]. Agricultural practices have been developing over centuries, from the Neolithic revolution around 10,000 BC to the Green Revolution of the 1950s and 1960s [5].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 56–64, 2022. https://doi.org/10.1007/978-3-031-11713-8_6

Classification of Pest in Tomato Plants Using CNN


This development was based on agricultural engineering, which dealt with improving the mechanical efficiency of tools. Neural networks are being used for advancements in many fields [6, 7]. In agriculture, the applications of neural networks include species identification, disease and weed detection, selective breeding, water and soil management, and many more. To identify the pests in tomato plants, we use Convolution Neural Networks (CNNs), as they are the networks best suited to dealing with images and videos [8]. A well-built CNN model is better at distinguishing between the damage patterns caused by different pests, which can potentially be overlooked by a human. This classification is a prelude to finding and suggesting various options for dealing with the identified pest on the crop. The schools of thought on farming tomatoes can be narrowed down to two methods: Conventional and Organic. Integrated Pest Management (IPM) is a separate decision-making process, described as an ecosystem-based strategy that concentrates on long-term prevention of pests or uses a combination of techniques to prevent the damage they cause. IPM can be implemented in both organic and conventional farming but has shown significantly more success in conventional farming [9, 10], partly due to the wide availability of cheap chemical pesticides. In this paper, we focus on comparing Organic vs. a combination of IPM and Conventional methods.

2 Materials and Methods
2.1 Sample Collection
Infested tomato samples required for image processing were collected from various sources. Details on the location, date and description of the samples collected are provided in Table 1.

Table 1. Visits for collecting samples
| S. No | Location | Date | Description |
|---|---|---|---|
| 1 | Jamunamarathur | 5th December 2019 | Two different tomato farms were identified with severe infestation of pests. The samples were carefully selected based on the requirement |
| 2 | Vegetable Market, Katpadi | 16th January 2020 | Tomatoes infested with pests and exhibiting damage symptoms were collected from the local vendors |
| 3 | Vegetable Market, Katpadi | 31st January 2020 | Tomatoes infested with pests and exhibiting damage symptoms were collected from the local vendors |


K. N. S. Dharmasastha et al.

2.2 Hardware Setup
• 16 GB (2 x 8 GB) 2666 MHz DDR4 UDIMM Non-ECC
• 2.5” 512 GB SATA Class 20 Solid State Drive
• 3.5” 4 TB 5400 rpm SATA Hard Disk Drive
• NVIDIA Quadro P400, 8 GB GPU
• Intel i7 7th gen CPU

2.3 Software Setup
• cuDNN v7.6.5 + CUDA Toolkit 10.0
• Latest versions of tensorflow-gpu and Keras in Python
• Code written for execution in Jupyter Notebook in an Anaconda environment

2.4 Image Processing
Damaged tomatoes collected from farmers’ fields and local markets were brought to the laboratory. The samples were documented for dataset preparation using a Canon EOS 650D camera, whose specifications are as follows:
• 575 g, 133 x 100 x 79 mm
• 18 MP APS-C CMOS sensor
• Optical (pentamirror) viewfinder
• 1920 x 1080 video resolution
• 3” fully articulated screen
• ISO 100–12800 (expands to 25600)

The collected samples were photographed using an 18–55 mm lens, as it can focus on small objects at close range. The setup for photographing the collected samples included a well-lit room with a black A4 sheet as the background for the objects (Fig. 1). We collected images of affected tomatoes only, as the pests are relevant only to such tomatoes and our study is concentrated on them. The images were taken in the same environment and with the same camera after every visit. After the required number of images had been taken, the images were cropped evenly as required (Fig. 2). Later, these images were augmented using multiple methods (Fig. 3) and used for training and testing.

Fig. 1. Before cropping

Fig. 2. After cropping

Fig. 3. Group of samples of tomatoes from multiple classes

Fig. 4. Sample size of one for each batch

The individual sizes were different for each sample, but all the images had fixed dimensions when fed to the deep learning model (Fig. 4).

2.5 Design and Setup of Data and CNN Architecture
The data setup is as follows:
• Anthracnose Images = 307
• Fruit Borer Images = 301
• Pinworm Images = 322
These images were split in the ratio of 70:30 for training and testing.

Deep Neural Networks (DNNs) are extensions of ordinary neural networks and are capable of processing natural data in raw form by generating their own feature vectors. DNNs follow the representation learning approach: the representation is generated automatically and is followed by detection or classification [11]. Applications of CNNs include image processing [12], audio signal identification [13], image retrieval [14], sign language techniques [15], sentiment analysis [16] and single image super resolution [17]. Convolutional Neural Networks (CNNs) can be described as networks that use the convolution operator instead of general matrix multiplication in one or more of their layers. For a detailed description of CNNs from an elementary point of view, one can refer to [18].
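The 70:30 split of the three classes listed above can be sketched as follows. This is a minimal illustration, not the authors' code: the paper does not say whether the split was done per class, so the per-class (stratified) split, the random seed and the file names are all assumptions made for the example.

```python
import random

# Class sizes as reported in the data setup above.
CLASS_COUNTS = {"anthracnose": 307, "fruit_borer": 301, "pinworm": 322}

def split_class(items, train_fraction=0.7, seed=0):
    """Shuffle one class's items and split them into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def stratified_split(class_counts, train_fraction=0.7, seed=0):
    """Split each class separately so both subsets keep the class balance."""
    train, test = [], []
    for label, count in class_counts.items():
        # Hypothetical file names stand in for the real images.
        files = [f"{label}_{i:03d}.jpg" for i in range(count)]
        tr, te = split_class(files, train_fraction, seed)
        train += [(name, label) for name in tr]
        test += [(name, label) for name in te]
    return train, test

train_set, test_set = stratified_split(CLASS_COUNTS)
print(len(train_set), len(test_set))
```

Splitting each class separately keeps the three classes in roughly the same proportion in both subsets, which matters when the classes are of similar but not identical size.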


A CNN (Convolution Neural Network) is a class of deep neural network most commonly applied to visual imagery [19]. We used CNNs because of their convolutional layers, whose purpose is to extract features from the original image that are then used to make predictions. Charmine Llorca et al. [10] used TensorFlow’s Inception V3 architecture for detecting pests and diseases in tomato plants. In contrast, we conducted a comparative analysis of the capabilities of different CNN architectures. We trained fourteen architectures, each with different hyperparameters. The two best models among them are ResNet101V2 and Inception V3. The specifics of these two architectures are tabulated below (Table 2).
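To make the feature-extraction role of a convolutional layer concrete, the sketch below applies a single small kernel to a toy image in plain Python. The image, the kernel and the function name `conv2d_valid` are illustrative assumptions; real layers learn many kernels and operate on multi-channel tensors.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    As in most deep learning libraries, the kernel is not flipped, so this is
    strictly a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output

# Toy 4x4 "image": dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Illustrative kernel that responds to a dark-to-bright vertical edge.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, edge_kernel)
print(feature_map)
```

The feature map is non-zero only in the column where the edge sits, which is the sense in which a kernel "extracts" a local pattern such as the boundary of a bite mark.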

Table 2. Convolution neural networks’ architecture setups
| | ResNet101V2 | Inception V3 |
|---|---|---|
| Data split (Train:Test) | 70:30 | 70:30 |
| Pre-trained weights | None | None |
| Pooling | Average | Average |
| Loss | Categorical cross-entropy | Categorical cross-entropy |
| Learning rate | 1e-3 | 1e-3 |
| Optimizer | SGD | SGD |
| Momentum | 0.5 | 0.5 |
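The optimizer settings in Table 2 (SGD, learning rate 1e-3, momentum 0.5) correspond to the classical momentum update, sketched here in plain Python. The quadratic objective and the function names are assumptions chosen only to show the rule; they are not the network or loss used in the paper.

```python
def sgd_momentum_step(weights, grads, velocity, lr=1e-3, momentum=0.5):
    """One SGD-with-momentum update: v <- m*v - lr*g, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy objective f(w) = w0^2 + w1^2 with gradient 2*w; purely illustrative.
def grad(weights):
    return [2.0 * w for w in weights]

weights = [1.0, -2.0]
velocity = [0.0, 0.0]
first_weights, first_velocity = sgd_momentum_step(weights, grad(weights), velocity)

for _ in range(5000):
    weights, velocity = sgd_momentum_step(weights, grad(weights), velocity)
print(weights)
```

The velocity term carries part of the previous step forward, so with momentum 0.5 the effective step size on a smooth slope is roughly twice the nominal learning rate.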

Fig. 5. Basic setup of the model

Among the Convolution Neural Networks we trained, ResNet101V2 had the best results. We used transfer learning in the sense that we imported the ResNet101V2 architecture but not its pre-trained weights, so that the entire model is trained on our dataset (Fig. 5). The second-best Convolution Neural Network that we obtained was Inception V3. As with ResNet101V2, we used the same approach for this model too.


3 Results and Discussion

Table 3. Model performance of ResNet101V2
| ResNet101V2 | Accuracy | Loss |
|---|---|---|
| Training | 0.9985 | 0.0015 |
| Testing | 0.9891 | 0.0324 |

The model’s training and testing accuracy and loss are plotted in Fig. 6 and Fig. 7:

Fig. 6. Accuracy

Fig. 7. Loss

Table 4. Model performance of Inception V3
| Inception V3 | Accuracy | Loss |
|---|---|---|
| Training | 0.9922 | 0.0011 |
| Testing | 0.9927 | 0.0127 |

The model’s training and testing accuracy and loss are plotted in Fig. 8 and Fig. 9:

Fig. 8. Accuracy

Fig. 9. Loss

The data obtained after training and testing were plotted for the ResNet101V2 model (Figs. 6 and 7) and for the Inception V3 model (Figs. 8 and 9) to check their performance. These graphs represent each model’s training and testing accuracy and loss. Overfitting or underfitting of a model needs to be eliminated by adjusting hyperparameters such as the complexity of the model, the optimizer and the learning rate. Overfitting occurred when the model memorized the training data instead of learning the necessary features, which caused bad performance on the test data. Underfitting occurred when the model did not learn the features at all, so that the accuracies on both training and test data barely changed. Reduced loss signifies a well-trained model: the training and testing losses converge at around 0.03 in Fig. 7 and 0.01 in Fig. 9. The curves’ steady continuance at this low value is preferred. It shows that the model is addressing the features needed to classify the damaged tomatoes into their respective classes. Likewise, the training and testing accuracies converge as close to one as possible [8]. The values in Table 3 and Table 4 suggest that the models have learnt the features well. High and similar accuracies must be maintained for both training and testing; this implies that the model is not overfitting. These high accuracy values, accompanied by the low training and testing loss values, indicate that the chosen hyperparameters are appropriate [8].

3.1 Comparison with Existing Works
Some similar works have been reported in [21, 22]. The study in [21] is of a general flavor: it focuses on pest management in practice as an integrated model and provides directions for its successful application in order to protect crops. The study in [22] is closer to our work; it focuses on the detection of tomato diseases through various CNN-based methods and also uses object detection models. Our focus in this work is on predicting and classifying the pests of tomato plants.

3.2 Limitations
As of now, the models can only predict and classify among the three classes defined for training. To identify more classes, the models need to be retrained, ideally with the same hyperparameters; doing so will help in identifying new pests’ bites. The models can also only classify an input picture as a whole, as no bounding boxes were used during training to highlight the pests’ bites on the tomato fruit. To achieve this, a new model needs to be created and trained specifically with bounding boxes.


4 Conclusions
For the architectures we trained and experimented with to arrive at the models at hand, we had to train the entire architecture rather than only the last few layers, as is typical in transfer learning. Among the optimizers, SGD with a learning rate of 1e-3 and momentum of 0.5 gave consistently good results as we progressed through model development. Using average pooling instead of max pooling produced better results, due to the need to pay attention to smaller details (the bite marks). The modified ResNet101V2 and Inception V3 architectures with the hyperparameters mentioned above produced the best results for us.
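The difference between average and max pooling mentioned in the conclusions can be illustrated on two toy feature maps. This sketch assumes global pooling over a single-channel map, and the activation values are made up; whether this effect explains the authors' results is their claim, not something the example demonstrates.

```python
def global_average_pool(feature_map):
    """Mean of all activations in the map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

def global_max_pool(feature_map):
    """Largest activation in the map."""
    return max(v for row in feature_map for v in row)

# Two toy feature maps with the same strongest activation (0.9):
# one isolated mark versus one strong mark surrounded by weaker ones.
one_mark = [[0.0, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.0]]
many_marks = [[0.4, 0.0, 0.4], [0.0, 0.9, 0.0], [0.4, 0.0, 0.4]]

print(global_max_pool(one_mark), global_max_pool(many_marks))
print(global_average_pool(one_mark), global_average_pool(many_marks))
```

Max pooling reports the same value for both maps, while average pooling also reflects the weaker responses, which is one way smaller details can influence the pooled feature.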

References
1. Willcox, J.K., Catignani, G.L., Lazarus, S.: Tomatoes and cardiovascular health. Crit. Rev. Food Sci. Nutr. 43(1), 1–18 (2003)
2. Bergougnoux, V.: The history of tomato: from domestication to biopharming. Biotechnol. Adv. 32(1), 170–189 (2014)
3. Report on Tomato, Department of Agriculture & Farmers Welfare, Government of India (2018)
4. Jones, J.B., Zitter, T.A., Momol, M.T.: Compendium of Tomato Diseases and Pests. Internet Bookwatch (2014)
5. Simmons, A.H.: The Neolithic Revolution in the Near East: Transforming the Human Landscape, Goodreads (2007)
6. Dasgupta, D.: Advances in artificial immune systems. IEEE Computational Intelligence Magazine 1(4), 40–49 (2006)
7. Burt, J.R., et al.: Deep learning beyond cats and dogs: recent advances in diagnosing breast cancer with deep neural networks. Br. J. Radiol. 91(1089), 20170545 (2018)
8. Zheng, Q., Yang, M., Yang, J., Zhang, Q., Zhang, X.: Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process. IEEE Access 6(1), 15844–15869 (2018)
9. Hassan, A.S.M.R., Bakshi, K.: Pest management, productivity and environment: a comparative study of IPM and conventional farmers of Northern Districts of Bangladesh. Pak. J. Soc. Sci. 3(8), 1007–1014 (2005)
10. Llorca, C., Yares, M.E., Maderazo, C.: Image-based pest and disease recognition of tomato plants using a convolutional neural network. In: Proceedings of International Conference Technological Challenges for Better World (2018)
11. Bhattacharyya, S., Snasel, V., Hassanien, A.E., Saha, S., Tripathy, B.K.: Deep Learning: Research and Applications (Vol. 7). Walter de Gruyter GmbH & Co KG (2020)
12. Garg, N., Nikhitha, P., Tripathy, B.K.: Image retrieval using latent feature learning by deep architecture. In: 2014 IEEE International Conference on Computational Intelligence and Computing Research, pp. 1–4 (2014)
13. Bose, A., Tripathy, B.K.: Deep learning for audio signal classification. In: Deep Learning Research and Applications, De Gruyter Publications, pp. 105–136 (2020)
14. Singhania, U., Tripathy, B.K.: Text-based image retrieval using deep learning. In: Encyclopedia of Information Science and Technology, Fifth Edition, IGI Global, USA, pp. 87–97 (2020)
15. Prakash, V., Tripathy, B.K.: Recent advancements in automatic sign language recognition (SLR). In: Computational Intelligence for Human Action Recognition, CRC Press, pp. 1–24 (2020)
16. Sai Surya, K.Y., Rani, T., Tripathy, B.K.: Social distance monitoring and face mask detection using deep learning. In: Proceedings of the ICCIDM (2021)
17. Adate, A., Tripathy, B.K.: Understanding single image super resolution techniques with generative adversarial networks. In: Advances in Intelligent Systems and Computing, Springer, Singapore, vol. 816, pp. 833–840 (2019)
18. Maheshwari, K., Shaha, A., Arya, D., Rajasekaran, R., Tripathy, B.K.: Convolutional neural networks: a bottom-up approach. In: Deep Learning Research and Applications, vol. 7, pp. 21–50 (2020)
19. Bhandare, A., Bhide, M., Gokhale, P., Chandavarkar, R.: Applications of convolutional neural networks. Int. J. Computer Science and Inf. Technol. 7(5), 2206–2215 (2016)
20. Siddique, S., Hamid, M., Tariq, A., Kazi, A.G.: Organic farming: the return to nature. In: Ahmad, P., Wani, M.R., Azooz, M.M., Phan Tran, L.-S. (eds.) Improvement of Crops in the Era of Climatic Changes, pp. 249–281. Springer, New York (2014). https://doi.org/10.1007/978-1-4614-8824-8_10
21. Way, M.J., Van Emden, H.F.: Integrated pest management in practice - pathways towards successful application. Crop Prot. 19(2), 81–103 (2000)
22. Wang, Q., Feng, Q., Sun, M., Jianhua, Q., Xue, J.: Identification of Tomato Disease Types and Detection of Infected Areas Based on Deep Convolutional Neural Networks and Object Detection Techniques, Hindawi Computational Intelligence and Neuroscience, Volume 2019, Article ID 9142753, 15 https://doi.org/10.1155/2019/9142753

Deep Neural Network Approach for Identifying Good Answers in Community Platforms

Julius Femi Godslove1(B) and Ajit Kumar Nayak2

1 Department of Computer Science and Engineering, Siksha ‘O’ Anusandhan University, J-15, Khandagiri Marg, Dharam Vihar, Jagamara, Bhubaneswar, Odisha 751030, India
[email protected]
2 Department of Computer Science and Information Technology, J-15, Siksha ‘O’ Anusandhan University, Khandagiri Marg, Dharam Vihar, Jagamara, Bhubaneswar, Odisha 751030, India
[email protected]

Abstract. Community Question Answering (CQA) forums are a great source of information and also act as a knowledge repository. Users can post questions, or answers to questions, on such platforms. Although the benefits of Community Question Answering are numerous, misleading or false information can harm users, which is a cause for great concern. This paper presents a model for determining good answers on community platforms and compares techniques for quality answer selection and ranking.
Keywords: Question quality · Community question answering · Deep learning · CNN-Bi-LSTM-CRF · NLP · Convolutional Neural Network (CNN) · Conditional Random Field (CRF) · Bidirectional Long Short Term Memory (Bi-LSTM) · Deep neural network

1 Introduction
Many community question answering platforms, such as Yahoo Answers, Quora and Stack Overflow, offer a medium for knowledge and information exchange, with an underlying infrastructure powered by Question Answering (QA) systems. These systems let users ask and answer questions in natural language. A combination of Information Retrieval (IR), Information Extraction (IE) and Natural Language Processing (NLP) techniques is fundamental to Question Answering. Question Answering systems can be classified as either open-domain or closed-domain. Some Community Question Answering forums operate as closed-domain systems, in which discussions, questions and answers stay within the confines of a specific area of expertise, while others work as open-domain frameworks that allow questions and answers across multiple disciplines. Regardless of the domain, users want comprehensive, well-formed questions and precise answers [1]. Thus, the most important objective of all QA systems is to retrieve answers to questions rather than to extract complete documents or match the passages with the best similarity, as most current IR systems do.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 65–71, 2022. https://doi.org/10.1007/978-3-031-11713-8_7

Community question answering enables a wide variety of question and answer activities in natural human language and provides a pool of resources needed to build an automated Question Answering system for either open or closed domains. An example of the numerous questions asked by users is “How do I get a good model figure?” on the wikiHow platform. There are also users ready to share personal experiences by answering questions. Identifying the best-quality answers that suit a question involves tagging and cross-referencing to ascertain their quality, which aids automatic validation of whether an answer is good or not and thus leads to a collection of high-quality question-answer pairs [2]. However, one of the challenges of community question answering forums is that the information being shared may not always be factual. Several factors explain the existence of incorrect answers on Community Question Answering platforms; these include, but are not limited to, misinterpretation, weakly formed questions, poor understanding of the subject matter, ignorance, and rudeness in the responder’s tone. The problem is worsened by the fact that the majority of Community Question Answering systems lack smooth, systematic control with an operational quality-check mechanism. Besides, truth is highly time-sensitive in today’s changing world: yesterday’s reality can be false today.

This paper focuses on (1) automatic identification of answer quality using ensemble attention-based CNN-BiLSTM-CRF deep learning techniques in the CQA context [1, 2], and (2) exploration of various methods and models for classifying answers to questions in Community Question Answering as “Good,” “Potential,” or “Bad” [3]. This emphasizes the need for good answers that address a question and give users timely, reliable insight. Our motivation and contribution in this paper are to present a model for identifying a good answer to a question among several responses to the same question. This gives users of CQA systems quick access to good answers to similar questions that may already have been answered and reduces the time spent going through unnecessary responses. Furthermore, moderators can distinguish users who contribute relevant answers tagged “Good” from those who use the platform for ulterior motives [2].

2 Ranking and Tagging of Quality Answers in CQA
2.1 Tagging Answers in CQA Using Ensemble Deep Learning Model
The focus is on context-based features and the dependency among labels, which play an essential role in predicting quality answers on a CQA platform. When a rational answerer decides to answer a question, they may face the following scenarios:
• Offer an answer if the question is new (and yet to be answered).
• Offer a good answer if it is already answered, but existing responses are found to be wrong.
• If a satisfactory answer has been given, a user may provide another unique perspective to the solution.
• Some other actions.


The example below is an excerpt from a CQA thread in which the “Good” answers A4 and A5 appear alongside “Bad” and “Potential” comments. Both appear to be good answers, with noticeably helpful information. The scenarios above show that answer quality can vary as a result of soft constraints among the tags (in that “Bad” answers may follow a “Good” answer). Answer quality can be affected both by the contextual dependency among the answer contents and by the interactions between the tags denoting answer quality.

Q. Where can I buy a Samsung phone in Bhubaneswar?
A1. Mobile phones are usually sold without a sim. (Bad)
A2. There is a lot of phone stand in Delhi. (Bad)
A3. Consider searching on google. (Potential)
A4. You can order a Samsung phone from an online store and they can deliver it to you in Bhubaneswar. (Good)
A5. Go to Samsung mobile shop at Kandagiri in Bhubaneswar they sell various models of Samsung phones. (Good)
Example of a Community Question Answering thread.

For the effective modelling of the contextual information, two neural network-based models are proposed with distinct combinations of CNN, LSTM and CRF on top of a GloVe embedding layer. As illustrated in Fig. 1, Architecture-1 (ARC-1) is an ensemble stack of the networks mentioned above; it can be viewed as a mix of a Recurrent Convolutional Neural Network (RCNN) [4] and a Long Short-Term Memory network with a Conditional Random Field (LSTM-CRF) [6]. The LSTM is applied to the sequence of encoded question-answer matching pairs. In ARC-1, the CRF at the final layer learns transition probabilities over the tags. As indicated in [7], including the backward LSTM and the CRF gives a significant boost. The difference between [8] and our work is that we adopt the LSTM-CRF at the level of comments (in effect, at sentence level). We implemented tagging of the sequence aided by CNN sentence modelling [9].

The second model, Architecture-2 (ARC-2), is more straightforward and adds an attention mechanism. In ARC-2, the question and its answers are connected linearly as a sequence, and each is encoded with a CNN. An attention-based LSTM is then applied to the encoded sequence. The model learns to attend to the degree to which the context of a question influences the prediction for the current answer. As in ARC-1, a CRF layer is added. With the help of a simple attention function, ARC-2 trains faster than ARC-1 and has a smaller parameter space. Experiments were carried out on the SemEval-2015 dataset. With the addition of Bidirectional Long Short-Term Memory (Bi-LSTM) [5] plus CRF, ARC-1 obtained a macro F1 score of 58.96% for tagging quality answers, a significant 2.82% improvement over recent neural network methods. ARC-2 performs better on the “Good” and “Bad” classes, with an encouraging overall F1 of 58.29%. The study shows that:
(1) According to the test results, encoding label dependency helps to label quality answers;
(2) It is believed to be one of the first works to use GloVe embeddings with CNN-Bi-LSTM-CRF for sentence-level sequence labelling.
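Since the results above are reported as macro F1 over the three tags, a short sketch of how that metric is computed may help. The gold and predicted tags below are made up for a five-answer thread and are not taken from SemEval; the function name `macro_f1` is likewise an assumption for the example.

```python
def macro_f1(y_true, y_pred, labels):
    """Average the per-class F1 scores, giving every class equal weight."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

labels = ["Good", "Potential", "Bad"]
# Made-up gold and predicted tags for the five answers of one thread.
gold = ["Bad", "Bad", "Potential", "Good", "Good"]
predicted = ["Bad", "Good", "Bad", "Good", "Good"]
score = macro_f1(gold, predicted, labels)
print(round(score, 4))
```

Because every class counts equally, missing the rare “Potential” tag entirely pulls the macro score down sharply even when most answers are tagged correctly.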

[Figure: block diagram with question (Q) and answer (A) inputs, CNN encoders, HIDDEN layers, LSTM/BiLSTM layers, a CRF layer and output tags Y]

Fig. 1. Architecture-1 (ARC-1) of CNN-LSTM-CRF model

Figure 1 shows the pipeline as a stack of networks: CNN, Bi-LSTM and CRF. In the CNN phase, every QA pair, encoded with 100-dimensional GloVe embeddings, is converted to a fixed-length vector by two CNNs working side by side, followed by a fully connected hidden layer. The Bi-LSTM phase comes after the CNN layer; it learns the correlations within the sequence encoded by the preceding layer. A fully connected layer with a softmax activation function for multi-class classification is then added to generate the predicted tags. Finally, the last layer of ARC-1 computes the cost of the entire network via a CRF layer over the generated tags, using the transition probabilities. As observed in [3], a forward linear-chain CRF was incorporated.
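The role of the CRF layer over the generated tags can be illustrated with Viterbi decoding, the standard way to pick the best tag sequence from per-answer scores plus tag-to-tag transition scores. This is a minimal sketch with made-up scores, not the trained ARC-1 model: the emission and transition values, and the function name `viterbi`, are assumptions for the example.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: list (one per answer) of {tag: score} dicts.
    transitions: {(previous_tag, tag): score}, as a CRF layer would learn.
    """
    # best[t][tag] = best score of any path ending in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for scores in emissions[1:]:
        current, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, tag)])
            current[tag] = best[-1][prev] + transitions[(prev, tag)] + scores[tag]
            pointers[tag] = prev
        best.append(current)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

tags = ["Good", "Potential", "Bad"]
# Made-up per-answer scores; the middle answer is ambiguous on its own.
emissions = [
    {"Good": 2.0, "Potential": 0.0, "Bad": 0.0},
    {"Good": 0.9, "Potential": 0.0, "Bad": 1.0},
    {"Good": 2.0, "Potential": 0.0, "Bad": 0.0},
]
no_transitions = {(p, t): 0.0 for p in tags for t in tags}
# Illustrative transition score: a "Good" answer tends to follow a "Good" one.
with_transitions = dict(no_transitions)
with_transitions[("Good", "Good")] = 0.5

independent_path = viterbi(emissions, no_transitions, tags)
crf_path = viterbi(emissions, with_transitions, tags)
print(independent_path)
print(crf_path)
```

With all transition scores at zero each answer is tagged on its own evidence; once the transition score is added, the ambiguous middle answer is pulled toward the tag of its neighbours, which is the label dependency the CRF layer is meant to capture.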

[Figure: block diagram with question (Q) and answer (A) inputs, CNN encoders, BiLSTM states h1 … hN, attention weights αi1 … αiN combined into si, an attention-LSTM, HIDDEN layers, a CRF layer and output tags Y]

Fig. 2. Architecture-2 (ARC-2) with the added attention layer; αij represents the attention weight of the ith unit on the jth encoded component
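The attention weights in Fig. 2 can be sketched as a softmax over similarity scores between the current answer and each encoded sentence in the thread. The chapter does not specify the similarity function, so the dot product used here is an assumption, as are the toy two-dimensional vectors and the function names.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(current, context):
    """Weight each context vector by its similarity to the current answer."""
    weights = softmax([dot(current, h) for h in context])
    size = len(current)
    summary = [sum(w * h[k] for w, h in zip(weights, context)) for k in range(size)]
    return weights, summary

# Toy encodings of the question and two other answers (hypothetical values).
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
current_answer = [1.0, 0.0]
weights, summary = attend(current_answer, context)
print(weights)
print(summary)
```

Context sentences that resemble the current answer receive larger weights, so the summary vector passed on to the next layer is dominated by the most relevant parts of the thread.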


Encoding the sequence of QA matching pairs introduces additional parameters, which can substantially increase training time. A more economical approach is to learn directly from the composition of the question and answer sequence. Figure 2 represents such a model. The first layer of the ARC-2 model is a CNN layer that encodes each question or answer using a single CNN. After that, an attention-based Bi-LSTM at the LSTM layer learns the context information throughout the sequence. An attention technique can mitigate the bias problem of RNNs (GRU/LSTM) by computing a weighted distribution over the encoded components at every time step [10]. The distribution reflects the correlation between the current answer and the surrounding context. As in ARC-1, a CRF layer was added as the final layer for transition learning. To simplify the network, a simple attention method was implemented: the attention vector is determined by the similarities between the current answer and the context sentences (i.e., the question along with the other answers). Moreover, there are a few variants of the proposed architectures. By removing the backward LSTM or the CRF, the contribution of each module to the final result was validated. By adding a CRF together with the previously predicted label, it was tested whether modelling the label sequence softly is preferable to more complex encoding.

Method
We treated this task as a multi-class and multi-label classification problem. For each comment, several features were extracted from both the question and the comments, and we trained a classifier to tag comments as “Good”, “Bad” or “Potential” in relation to the question thread. A single question may have several responses, with more than one tagged as “Good”, “Bad” or “Potential” as the case may be, and the classifier provides a score for each answer to the question. We also combined the predictions with the corresponding Google ranking for related queries.

Preprocessing and Feature Extraction
Before any feature extraction, we defined a pre-processing function that cleans the data by removing symbols, emoji, URLs, punctuation and special characters. We then tokenized the text by matching runs of consecutive alphabetic characters, including underscores, and lowercased the result. We used cosine semantic similarity vectors and metadata feature groups based on GloVe 100d embeddings, trained on varied unannotated data sources, which proved valuable in this experiment.
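The preprocessing and similarity steps described above can be sketched as follows. The regular expressions, the averaging of word vectors and the tiny two-dimensional embedding table are assumptions for illustration; the chapter uses 100-dimensional GloVe vectors and does not give its exact cleaning rules.

```python
import math
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
TOKEN_PATTERN = re.compile(r"[A-Za-z_]+")

def preprocess(text):
    """Drop URLs, keep runs of letters/underscores, and lowercase them."""
    without_urls = URL_PATTERN.sub(" ", text)
    return [token.lower() for token in TOKEN_PATTERN.findall(without_urls)]

def average_vector(tokens, embeddings, size):
    """Average the vectors of the tokens found in `embeddings`."""
    found = [embeddings[t] for t in tokens if t in embeddings]
    if not found:
        return [0.0] * size
    return [sum(vec[k] for vec in found) / len(found) for k in range(size)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

# Hypothetical 2-d vectors standing in for 100-d GloVe embeddings.
toy_embeddings = {"samsung": [1.0, 0.0], "phone": [0.8, 0.2], "sim": [0.0, 1.0]}

question = preprocess("Where can I buy a Samsung phone? https://example.com")
answer = preprocess("Mobile phones are usually sold without a SIM!!!")
similarity = cosine_similarity(
    average_vector(question, toy_embeddings, 2),
    average_vector(answer, toy_embeddings, 2),
)
print(question)
print(similarity)
```

Matching only letters and underscores removes punctuation, digits and emoji in one pass, at the cost of also discarding numbers that might matter in some answers.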

3 Results and Discussion
Table 1 presents the results of our experiments with different models for comparison. The experiments were run on the base SemEval annotated dataset [2]. The records indicate a higher accuracy score for the ARC1 + ARC2 model on multi-class classification compared with the results we obtained from CNN, BiLSTM and TFBert individually on the same dataset.

Table 1. Comparison of experimental results.
| Task | Techniques | Dataset | Accuracy |
|---|---|---|---|
| Multi-class classification | CNN | SemEval | 50.46 |
| Multi-class classification | BiLSTM | SemEval | 53.80 |
| Multi-class classification | ARC1 + ARC2 (CNN + Bi-LSTM + CRF) | SemEval | 72.62 |
| Multi-class classification | TFBert model (Finetuned) | SemEval | 42.60 |
| Binary classification | ARC1 + ARC2 (CNN + Bi-LSTM + CRF) | SemEval | 82.24 |
| Binary classification | Semantic fine-tuned word embedding (Todor Mihaylov et al., 2016) Subtask A | SemEval | 73.39 |
| Binary classification | (Joty et al., 2016) | - | 80.5 |

It is observed from [3] that CRF and Bi-LSTM complement one another and contribute significantly to deciding the quality tag. The CRF also plays an essential role here, in that the previous answer comment is captured by the Bi-LSTM (that is, a wrong answer likely precedes a negative statement), which was noticed to improve the ARC-1 model by 1%. The introduction of CRF brought an increase of 2% over the baseline. Furthermore, most cases show that label dependency relies more on contextual information. ARC1 + ARC2 performed better than the technique used in [10] on binary classification using the same dataset. Our method also reduced the overhead of using large vector sizes in [10], which needed several parameter configurations. An accuracy of 72.29 (about 19 points absolute above the majority baseline) is notable, as is a MAP value of 86.54 (an improvement of 23 points over the previous baseline). This indicates that the model is usable in real-life applications or community forums to filter out negative or bad comments posted as answers to questions. Although the dataset is the first of its kind for tackling such a problem and may not cover a range of other domains and demographic information, the success of this experiment shows that it is possible to explore a vast corpus that cuts across multiple streams.
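The MAP value quoted above is a ranking metric; a minimal sketch of mean average precision over question threads is given below. The relevance lists are made up, and official SemEval scoring may differ in details (for example, truncating each ranking at a fixed depth), so this is an illustration of the idea rather than the scorer behind the reported figure.

```python
def average_precision(ranked_relevance):
    """Average of precision@k over the ranks k that hold a relevant answer."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(threads):
    """Mean of the per-thread average precision values."""
    return sum(average_precision(t) for t in threads) / len(threads)

# Each list marks, in ranked order, whether an answer is "Good" (made-up data).
threads = [
    [True, False, True],  # good answers at ranks 1 and 3
    [False, True],        # good answer at rank 2
]
map_score = mean_average_precision(threads)
print(round(map_score, 4))
```

A ranking that places every good answer ahead of the others scores 1.0, so the metric rewards pushing good answers toward the top of each thread.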

4 Conclusion and Future Direction
This study has explored different approaches and models for quality answer selection in CQA forums. We would like to include a veracity check, alongside an expert recommendation system, to ensure that answers tagged or ranked as good are factual. Such solutions will boost the credibility of CQA platforms and offer users a genuine information model for assigning a degree to which an answer is factual. This approach will also caution users about whether to trust a particular response or encourage them to double-check. The combination of ARC1 + ARC2 can be extended in the future to cover a broader community base and include an answerer ranking reward system for every verified answer that is factually true. It can also be enhanced to implement a CQA expert recommendation system, so that the system can refer users to a verified domain expert when they need more in-depth, factual and trustworthy clarification.

References
1. Karadzhov, G., Nakov, P., Màrquez, L., Barrón-Cedeño, A., Koychev, I.: Fully automated fact-checking using external sources. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing, pp. 344–353 (2017)
2. Lin, Z., Feng, M., Dos Santos, C.N., Yu, M., Xiang, B., Zhou, B., Bengio, Y.: A structured self-attentive sentence embedding. In: 5th International Conference on Learning Representations (ICLR) (2017)
3. Zhou, C., Sun, C., Liu, Z., Lau, F.: A-C-LSTM Neural Network for Text Classification (2015)
4. Nakov, P., Hoogeveen, D., Màrquez, L., Moschitti, A., Mubarak, H., Baldwin, T., Verspoor, K.: SemEval-2017 Task 3: Community Question Answering (2019)
5. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF Models for Sequence Tagging (2015)
6. Xiaoqiang, Z., Hu, B., Chen, Q., Tang, B., Wang, X.: Answer Sequence Learning with Neural Networks for Answer Selection in Community Question Answering. 2 (2015). https://doi.org/10.3115/v1/P15-2117
7. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (2014). https://doi.org/10.3115/v1/D14-1181
8. Bahdanau, D., Serdyuk, D., Brakel, P., Ke, N., Chorowski, J., Courville, A., Bengio, Y.: Task Loss Estimation for Sequence Prediction (2015)
9. Kaushal, A., White, L., Innes, M., Kumar, R.: WordTokenizers.jl: Basic tools for tokenizing natural language in Julia. Journal of Open Source Software 5(46), 1956 (2020). https://doi.org/10.21105/joss.01956
10. Todor M., Preslav N.: Semantics at SemEval-2016 Task 3: Ranking Relevant Answers in Community Question Answering Using Semantic Similarity Based on Fine-tuned Word Embeddings

Time Series Analysis of SAR-Cov-2 Virus in India Using Facebook’s Prophet
Sushree Gayatri Priyadarsini Prusty1,2 and Sashikanta Prusty2(B)
1 Sudhananda Group of Institutions, Nachhipur, Balianta, Bhubaneswar, India
2 Raajdhani Engineering College, Mancheswar, Bhubaneswar, India


Abstract. SARS-CoV-2, a novel coronavirus (CoV), has spread across the globe since the end of 2019. The COVID-19 pandemic has been going on for well over a year, causing illness that ranges from mild to moderate to severe. This work surveys state-wise data for India from the first day of the outbreak: how many people were confirmed positive, active, cured, discharged, and vaccinated. On that basis, Facebook’s Prophet library is proposed to forecast the time series data. Prophet is a statistical method for non-linear trends with yearly, weekly, and daily periodicity, as well as holiday effects. The model predicts the number of confirmed cases and the discharge rates for the coming days. The methodology may help physicians make better decisions when anticipating new COVID-19 cases.
Keywords: Covid-19 · Prophet · Data analysis · Vaccination performance

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 72–81, 2022. https://doi.org/10.1007/978-3-031-11713-8_8

1 Introduction
According to official figures as of August 17, 2021, India had recorded 32.2 million COVID-19 infections, the world’s second-highest number of confirmed cases (after the United States), and 432,079 deaths, the third-highest number of COVID-19 deaths (after the United States and Brazil) [1, 2]. The disease caused by the novel coronavirus identified in Wuhan, China, has been named coronavirus disease 2019 (COVID-19), with CO standing for corona, VI for virus, and D for disease. The virus spreads through direct contact with an infected person’s respiratory droplets (produced by coughing and sneezing) and through contact with contaminated surfaces. Symptoms include fever, coughing, and shortness of breath; in more serious cases, the infection can result in pneumonia or other respiratory problems. The disease can be lethal in rare cases. Because the symptoms of COVID-19 resemble those of the flu (influenza) or the common cold, which are far more prevalent, testing is required to determine whether a person is infected.

On 30 January 2020, the first cases of COVID-19 in India were recorded in three locations in Kerala, where three Indian medical students had returned from Wuhan [3, 4]. Lakshadweep became the final region of India to report its first case on January 19, 2021, over a year after the country’s first case was recorded. A Punjabi preacher who had recently visited Italy and Germany became a superspreader after joining a Sikh celebration in Anandpur Sahib, India, from March 10–12 [5]. Festivals (Holi and the Haridwar Kumbh Mela) have been linked to at least 1,700 positive cases between 10 and 14 April 2021, including cases among Hindu seers. By April 12, 2021, India had overtaken Brazil as the country with the world’s second-highest number of COVID-19 cases. India had also crossed 2.5 million active cases by late April, with an average of 300,000 new cases and 2,000 deaths every day. The second wave, which reached the country in March 2021 and was considerably worse than the first in some locations, caused shortages of vaccines, hospital beds, oxygen cylinders, and other treatments [6]. It put pressure on the healthcare system and caused a scarcity of liquid medical oxygen, prompting the announcement of a large number of new oxygen plants.

Figure 1 shows each step that a citizen follows to determine whether he/she has been affected by COVID-19 and, if so, to recover from it. If the test is positive, the person must be hospitalized, receives the vaccine (both first and second doses), and is discharged only after a confirmed negative COVID-19 report.

[Flowchart nodes: Sample collection; Covid-19 Testing (−ve / +ve); Admitted; Treatment; Cured; Confirmed −ve; Confirmed +ve; Discharge]

Fig. 1. Process of Covid-19 measurement and evaluation

On April 30, 2021, India became the first country to record more than 400,000 new cases in 24 h [7]. India began its vaccination program on January 16, 2021, with the AstraZeneca vaccine (Covishield) and the indigenous Covaxin [8]. Sputnik V and the Moderna vaccine were later approved for emergency use [9]. The country had administered about 550 million vaccine doses as of August 17, 2021 [8–10]. Early in the outbreak, transmission increased over the course of March 2020, when numerous people with a travel history to affected countries, as well as their contacts, tested positive. On March 12, 2020, a 76-year-old man who had previously traveled to Saudi Arabia became India’s first COVID-19 fatality [11]. However, predicting the number of COVID-19 patients in the coming days is difficult, because the seasonal data are quite noisy and models are hard to tune. Therefore, this article proposes Facebook’s Prophet library, which handles time series data easily. It is quick and powerful, and it gives researchers a better model for predicting cases before they occur.

2 Methodology
Understanding time-based patterns, such as how many individuals are admitted, cured, and discharged in a given day, week, or month, is crucial for any health-care organisation. This is why time series forecasting is considered one of the most important tools for data analysis. Figure 2 outlines the methodology used here: how the model produces predictions from data that vary with yearly, weekly, and daily seasonality plus holidays.

Fig. 2. Process design for prophet prediction

2.1 Data Collection
The standard operating procedure for Coronavirus Disease-19 (COVID-19) is intended to train doctors, nurses, paramedics, and lab technicians in correct sample collection, labeling, and transportation protocols [12]. In this work, the Covid-19 dataset was collected from the Kaggle repository. It contains information on the confirmed positive cases, cured patients, and deaths of Covid-19 patients in India.

Nasopharyngeal Swab. The swab is inserted along the nasal septum, parallel to the floor of the nasal passage, until resistance is felt. The swab is then placed in the VTM (viral transport medium) for further processing.


Oropharyngeal Swab. In this case, the mouth must be wide open while the swab is taken three times and held in place for 10 s.

2.2 Testing
COVID-19 diagnostic testing is performed to determine whether or not a person is infected with SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2). In general, there are two types of Covid-19 test:

Polymerase Chain Reaction (PCR) Test. A fluid specimen is obtained by inserting a long nasal swab (nasopharyngeal swab) into an individual’s nostril and collecting fluid from the back of the nose, or by using a shorter nasal swab to obtain a sample. When conducted correctly, this test gives a more accurate and quicker result than others [13, 14].

Antigen Test. Some antigen tests can yield results in minutes by using a long nasal swab to collect a fluid sample. When the directions are strictly followed, a positive antigen test result is deemed accurate; nonetheless, false-negative findings remain possible [15, 16].

2.3 Data Visualization

Fig. 3. (a) Confirmed and (b) Cured Covid-19 positive cases in India

The large number of confirmed positive cases caused by the novel coronavirus makes complete recovery a great challenge for the government of India. As of 7th June 2021, Maharashtra led in confirmed positive cases (6, 10, 49170), as shown in Fig. 3(a). Of these, 5,861,720 were cured and 123,136 died of the virus. Figure 3(a) shows the number of confirmed Covid-19 positive patients in the top ten states of India, and Fig. 3(b) shows how many of these confirmed patients had been cured as of 7th June 2021. Figure 4 lists the total number of active Covid-19 cases across India as of August 2021 (464,357 cases).

| State | Active cases |
|---|---|
| Maharashtra | 120061 |
| Kerala | 101097 |
| Karnataka | 42019 |
| Tamil Nadu | 34926 |
| Andhra Pradesh | 33964 |
| Odisha | 26347 |
| Assam | 23590 |
| West Bengal | 17950 |
| Telangana | 11704 |
| Manipur | 5974 |
| Chhattisgarh | 5220 |
| Meghalaya | 4354 |
| Tripura | 3962 |
| Jammu and Kashmir | 3774 |
| Mizoram | 3730 |
| Arunachal Pradesh | 3118 |
| Gujarat | 2333 |
| Uttar Pradesh | 2181 |
| Punjab | 2118 |
| Goa | 1934 |
| Puducherry | 1871 |
| Sikkim | 1869 |
| Uttarakhand | 1555 |
| Himachal Pradesh | 1357 |
| Bihar | 1305 |
| Nagaland | 1192 |
| Haryana | 1113 |
| Rajasthan | 1092 |
| Delhi | 912 |
| Jharkhand | 609 |
| Madhya Pradesh | 466 |
| Lakshadweep | 278 |
| Ladakh | 216 |
| Chandigarh | 116 |
| Dadra and Nagar Haveli and Daman and Diu | 35 |
| Andaman and Nicobar Islands | 15 |

Fig. 4. Total number of Covid-19 active cases in India
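As a quick consistency check, the per-state figures of Fig. 4 can be summed and compared with the national total of 464,357 active cases quoted in the text. The short Python sketch below does this; the dictionary name and the helper variables are illustrative choices, not part of the original study.

```python
# Active cases per state / union territory, copied from Fig. 4.
ACTIVE_CASES = {
    "Maharashtra": 120061, "Kerala": 101097, "Karnataka": 42019,
    "Tamil Nadu": 34926, "Andhra Pradesh": 33964, "Odisha": 26347,
    "Assam": 23590, "West Bengal": 17950, "Telangana": 11704,
    "Manipur": 5974, "Chhattisgarh": 5220, "Meghalaya": 4354,
    "Tripura": 3962, "Jammu and Kashmir": 3774, "Mizoram": 3730,
    "Arunachal Pradesh": 3118, "Gujarat": 2333, "Uttar Pradesh": 2181,
    "Punjab": 2118, "Goa": 1934, "Puducherry": 1871, "Sikkim": 1869,
    "Uttarakhand": 1555, "Himachal Pradesh": 1357, "Bihar": 1305,
    "Nagaland": 1192, "Haryana": 1113, "Rajasthan": 1092, "Delhi": 912,
    "Jharkhand": 609, "Madhya Pradesh": 466, "Lakshadweep": 278,
    "Ladakh": 216, "Chandigarh": 116,
    "Dadra and Nagar Haveli and Daman and Diu": 35,
    "Andaman and Nicobar Islands": 15,
}

# National total of active cases.
total = sum(ACTIVE_CASES.values())
print(total)  # 464357

# The three states with the most active cases and their combined share.
top3 = sorted(ACTIVE_CASES, key=ACTIVE_CASES.get, reverse=True)[:3]
top3_share = sum(ACTIVE_CASES[state] for state in top3) / total
print(top3, f"{top3_share:.1%}")
```

The sum reproduces the total stated in the text, which also confirms that the state names and counts in Fig. 4 are paired in the order shown.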


2.4 Comparison
During the pandemic in India, some patients are still hospitalized, some have been cured and discharged, and some have died of Covid-19. At the beginning of the pandemic in India (January 2020), people were completely unaware of the disease, so it spread easily from person to person, city to city, and state to state. Figure 5 shows a comparative analysis of active, cured, and death cases in the top 3 affected states in India. The comparison indicates that the death rate, which remains above the saturation point, is still a serious concern for the government of India.

Fig. 5. Comparison of all active, cured, and death cases in top 3 states

The monthly discharge and death rates from the start of the pandemic to date are displayed in Fig. 6. At the end of the first wave, in 2020, the death rate increased rapidly because of a lack of awareness and the unavailability of vaccines at that time. At the beginning of 2021, India started the vaccination process in most of the affected cities, as shown in Fig. 6(d). For forecasting, the dates of this time series are stored in a column called ‘ds’ and the series values in a column called ‘y’, as described in the next section.
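To make the ‘ds’/‘y’ layout concrete, the following minimal sketch reshapes a run of daily counts into rows with exactly those two fields. The function name `to_ds_y_rows` and the sample counts are hypothetical and serve only as illustration; they are not taken from the Kaggle dataset.

```python
from datetime import date, timedelta


def to_ds_y_rows(start, values):
    """Pair consecutive daily values with ISO dates under the keys 'ds' and 'y'."""
    return [
        {"ds": (start + timedelta(days=offset)).isoformat(), "y": float(value)}
        for offset, value in enumerate(values)
    ]


# Hypothetical daily counts for three consecutive days (illustrative only).
rows = to_ds_y_rows(date(2021, 8, 9), [1000, 1040, 1075])
for row in rows:
    print(row)
```

Passing such a list of rows to `pandas.DataFrame(rows)` would give a data frame with a ‘ds’ column and a ‘y’ column, which is the input shape Prophet expects.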

Fig. 6. Discharge and death rates of both the year 2020 and 2021 (panels a–d)

2.5 Facebook’s Prophet
Although the Covid-19 pandemic started in India in January 2020, curing the disease completely remains a challenge. As of August 23, 2021, India had administered approximately 588 million doses of the currently licensed vaccines, counting both first and second doses [17]. Time series prediction can be difficult because there are so many distinct methods to choose from, each with its own set of hyper-parameters. Therefore, this article proposes Facebook’s Prophet library, a free, open-source, and simple-to-use tool for forecasting time series data [18]. It is designed to identify a suitable set of hyperparameters so that the model produces accurate predictions for data with a periodic pattern [19]. Prophet expects its input as a Pandas data frame. The model forecasts the confirmed positive cases and the discharge rates in millions, as shown in Figs. 8 and 9 respectively, for the next seven days, i.e. from 12th August 2021 to 18th August 2021.

Prophet, developed by Facebook, constructs a model by finding the best smooth curve, which is represented as a sum of components:

$$y(t) = g(t) + s(t) + h(t) + \varepsilon \tag{1}$$

where:
g(t) = overall growth trend
s(t) = yearly seasonality, weekly seasonality
h(t) = holiday effect
ε = error term not captured by the other components

In this demonstration, the data are evaluated by using Facebook Prophet to break the series into these components. This is because the method is unable to model some of the training data points. The prediction after training the model is shown in Fig. 7:
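The additive structure of Eq. (1) can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions: the linear trend, the single sine wave, the holiday on day 3, and all parameter values are invented for illustration and are not how Prophet fits its components internally.

```python
import math


def g(t, slope=2.0, offset=100.0):
    """Growth trend g(t): a straight line (a deliberate simplification)."""
    return offset + slope * t


def s(t, amplitude=5.0, period=7.0):
    """Seasonality s(t): one sine wave with a weekly period."""
    return amplitude * math.sin(2 * math.pi * t / period)


def h(t, holidays=frozenset({3}), effect=-10.0):
    """Holiday effect h(t): a fixed shift on the listed (hypothetical) days."""
    return effect if t in holidays else 0.0


def y(t, noise=0.0):
    """Eq. (1): the observation is the sum of the three components plus noise."""
    return g(t) + s(t) + h(t) + noise


# One week of noise-free values; day 3 shows the holiday dip.
for day in range(7):
    print(day, round(y(day), 2))
```

Because the components are simply added, each one can be inspected or replaced independently, which is what makes the decomposition easy to interpret.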

Fig. 7. Prediction after model training

Fig. 8. Predict for seven days on confirmed cases


Fig. 9. Predict for seven days on deaths cases
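Seven-day forecasts such as those in Figs. 8 and 9 could be produced along the following lines. This is a hedged sketch rather than the authors' code: it assumes the third-party `pandas` and `prophet` packages are installed, that the last observed day is 11 August 2021 (inferred from the forecast window starting on 12 August), and that the input rows carry ‘ds’ and ‘y’ fields. The helper names `forecast_dates` and `forecast_next_days` are hypothetical.

```python
from datetime import date, timedelta


def forecast_dates(last_observed, days=7):
    """ISO dates of the forecast horizon that follows the last observed day."""
    return [
        (last_observed + timedelta(days=step)).isoformat()
        for step in range(1, days + 1)
    ]


def forecast_next_days(rows, days=7):
    """Fit Prophet on rows with 'ds'/'y' keys and return the next `days` predictions.

    pandas and prophet are imported lazily, so the date helper above
    still works when they are not installed.
    """
    import pandas as pd
    from prophet import Prophet

    df = pd.DataFrame(rows)                             # columns: ds, y
    model = Prophet()                                   # default trend and seasonality
    model.fit(df)
    future = model.make_future_dataframe(periods=days)  # history plus `days` daily rows
    forecast = model.predict(future)
    # yhat is the point forecast; yhat_lower and yhat_upper bound the uncertainty interval.
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(days)


# Assumed last observation: 11 August 2021.
print(forecast_dates(date(2021, 8, 11)))
```

Calling `forecast_next_days(rows)` once for the confirmed-case series and once for the second series would yield the two seven-day tables behind the figures.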

3 Conclusion and Future Work
During the second wave of the Covid-19 pandemic, the virus affected more than half of the citizens of India. Although the government of India had already started its vaccination program, the spread remains uncontrolled. A state-level cluster analysis can classify states according to the strain exerted on the healthcare system, in terms of the number of confirmed cases, population size, etc., and the health centers specialized in COVID-19 treatment. In mid-August 2021, Kerala faced a large number of positive cases, and how to overcome this situation is a major concern for the government. This article has shown that Facebook’s Prophet library can readily forecast time series or periodic data, as shown in Fig. 7. The model is designed to identify a suitable set of hyperparameters for making an accurate prediction. Here it predicts the likely positive cases and discharge rates for the next seven days. Healthcare specialists also predict that a coming third wave of the Covid-19 virus may affect children under 18, which is a major concern. The government of India has therefore planned to work on preparing vaccines for children under 18. This model could probably be used to predict Covid-19 positive cases in advance.


References
1. Scroll Staff: Coronavirus: India records 25,166 new cases in 24 hours – lowest in 154 days. Scroll (2021)
2. Dong, E., Du, H., Gardner, L.: An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases 20(5), 533–534 (2020). https://doi.org/10.1016/S1473-3099(20)30120-1. ISSN 1473-3099. PMC 7159018. PMID 32087114
3. Andrews, M.A., Areekal, B., Rajesh, K.R., Krishnan, J., Suryakala, R., Krishnan, B., Muraly, C.P., Santhosh, P.V.: First confirmed case of COVID-19 infection in India: A case report. Indian Journal of Medical Res. 151(5), 490–492 (2021). https://doi.org/10.4103/ijmr.IJMR_2131_20. PMC 7530459. PMID 32611918
4. Narasimhan, T.E.: India’s first coronavirus case: Kerala student in Wuhan tested positive. Business Standard India (2020)
5. Wallen, J.: 40,000 Indians quarantined after ‘super spreader’ ignores government advice. The Telegraph (2021)
6. Michael, S.: India’s shocking surge in Covid cases follows baffling decline. The Guardian (2021)
7. Coronavirus | India becomes first country in the world to report over 4 lakh new cases on 30 April 2021. The Hindu. Special Correspondent. 30 April 2021. ISSN 0971-751X (2021)
8. IndiaFightsCorona COVID-19. MyGov.in. 16 March 2020 (2021)
9. Livemint: Cipla gets nod to import Moderna’s vaccine for emergency use in India: Report (2021)
10. Cumulative Covid vaccine doses administered in India cross 55 crore: Govt. Business Standard India. Press Trust of India. 16 August 2021 (2021)
11. India’s first coronavirus death is confirmed in Karnataka. Hindustan Times. 12 March 2020 (2020)
12. Shrestha, L.B., Pokharel, K.: Standard operating procedure for specimen collection, packaging and transport for diagnosis of SARS-COV-2. JNMA. Journal of the Nepal Medical Association 58(228), 627–629 (2020). https://doi.org/10.31729/jnma.5260
13. Torres, I., Poujois, S., Albert, E., Colomina, J., Navarro, D.: Evaluation of a rapid antigen test (Panbio™ COVID-19 Ag rapid test device) for SARS-CoV-2 detection in asymptomatic close contacts of COVID-19 patients. Clin. Microbiol. Infect. 27(4), 636-e1 (2021)
14. Mair, M.D., et al.: A systematic review and meta-analysis comparing the diagnostic accuracy of initial RT-PCR and CT scan in suspected COVID-19 patients. Br. J. Radiol. 94(1119), 20201039 (2021)
15. Agullo, V., et al.: Evaluation of the rapid antigen test Panbio COVID-19 in saliva and nasal swabs in a population-based point-of-care study. J. Infect. 82(5), 186–230 (2021)
16. Amer, R.M., et al.: Diagnostic performance of rapid antigen test for COVID-19 and the effect of viral load, sampling time, subject’s clinical and laboratory parameters on test accuracy. J. Infect. Public Health 14(10), 1446–1453 (2021)
17. Vaccination state wise. Ministry of Health and Family Welfare. (The data on this site changes daily) (2021)
18. Vera, A., Banerjee, S.: The bayesian prophet: a low-regret framework for online decision making. Manage. Sci. 67(3), 1368–1391 (2021)
19. Khayyat, M., Laabidi, K., Almalki, N., Al-Zahrani, M.: Time series Facebook Prophet model and python for COVID-19 outbreak prediction. Computers, Materials, & Continua, 3781–3793 (2021)

Model-Based Smoke Testing Approach of Service Oriented Architecture (SOA)
Pragya Jha(B), Madhusmita Sahu, and Sukant Kishoro Bisoy
Department of Computer Science and Engineering, C V Raman Global University, Bhubaneswar, Odisha 752054, India
{msahu,sukantabisoyi}@cgu-odisha.ac.in

Abstract. This work focuses on using unified modeling language (UML) use case diagrams to generate test cases in the context of enterprise resource planning (ERP). After comparing existing approaches and their issues, we found that model-based smoke testing of service-oriented architecture (SOA)-based applications is well suited to finding bugs in a minimum amount of time. We explain this using a use-case diagram. This paper presents a UML-based smoke testing method that uses a use-case diagram to generate test cases, with ERP as a case study.
Keywords: Unified modeling language (UML) · Service-oriented architecture (SOA) · Smoke testing · Enterprise resource planning (ERP)

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 82–91, 2022. https://doi.org/10.1007/978-3-031-11713-8_9

1 Introduction
For technology to advance in ways that benefit society, it is critical that sustainable digital systems are continuously developed. Such systems support economic growth and the global development of society, which has become all the more important in today’s information age. There is huge investment in infrastructure and innovation to ensure that these systems do the work they were originally designed to do [1]. Software testing is one method that companies apply to ensure that a product is not defective and to verify that the actual software product meets the user’s requirements. Organizations are, at their core, sophisticated networks of different system platforms, applications, and enterprise systems, which brings its own set of architectural challenges. Communication between these interacting platforms and applications is therefore of utmost importance. To ensure that this communication takes place seamlessly, it is necessary to increase one-to-one integration or to implement middleware solutions, for example enterprise application integration. A well-known approach used by companies is SOA, which provides increased agility, better efficiency, and more flexibility. It also allows better utilization of the resources available to the organization and thus helps the organization navigate market changes more efficiently. The implementation of SOA bridges the gap between industry and software. Most organizations do not treat SOA as a catalyst for driving organizational change; instead, they take a more technical approach to SOA. This enables them to achieve increased agility, better efficiency, and IT-business alignment [2]. However, testing service-centric applications built on SOA is difficult because of its dynamic and adaptive nature. Validating and verifying the operations of SOA is very challenging, as it becomes a very complex problem [3]. Smoke testing makes it easier to find a bug in the least amount of time.

Organization and Contribution
The rest of this paper is organized as follows. Section 2 describes related research on ERP implementations and the challenges they face. Section 3 provides a brief description of SOA testing and smoke testing. Section 4 presents our approach to testing ERP implementations. Our paper focuses on how to overcome failures when n orders are placed at the same time; for this, we take a model-based approach to generate test cases. We use use-case diagrams to make the test cases clearer and to deliver bug-free software at an earlier stage of testing. We conclude in Sect. 5.

2 Related Works on ERP
As stated by Chen [4], about 40% of ERP projects achieve only partial implementation, and 20% are scrapped as total failures. A related case study is “A case study on Hershey’s ERP implementation failure: The importance of Testing and Scheduling” by Jonath Gross [5]. This case study describes how system and business-process issues paralyzed the operational system, leading to a 19% drop in profit and an 8% decline in stock price. It also describes how effective ERP system testing and the scheduling of important projects can help a company mitigate its exposure to the risk of failure and related damages. Since this study, a lot of research has been done to show why testing phases are considered safety nets for any organization and should never be compromised [6–12]. A few studies are summarized in Table 1.

Table 1. Summary of study on ERP software

| References | Description |
|---|---|
| P. Gerrard [6] | Gives an overview of possible tools for testing in packaged ERP suites. The main aim is to realize an approach to testing ERP suites that caters to the users’ needs and risks |
| R. Kenge and Z. Khan [7] | Studies the ERP system implementation processes and the recent trends in ERP software. It also finds vulnerabilities in the implementation of ERP software and provides solutions to fix them, illustrating the need for testing during the implementation of ERP software |
| S. Matende and P. Ogao [8] | Studies the implementation of ERP software from the user’s point of view. It highlights the importance of user participation in the implementation and success of ERP software |
| S. Bagchi, S. Kanungo, and S. Dasgupta [9] | Evaluates how users participate and are involved in ERP systems. Uses three different use cases to explain the statistical results obtained |
| B. Ozorhon and E. Cinar [11] | Explores the critical success factors for the implementation of ERP software in the construction sector of a developing country such as Turkey |

3 Overview of SOA Testing and Smoke Testing
In this section, we briefly describe service-oriented architecture (SOA) and smoke testing. SOA is an architectural design that supports service orientation. Consequently, SOA is generally applied in software designs where application components provide services to the other components of the system. These services are exchanged between components through a communication protocol over a network, such as SOAP (Simple Object Access Protocol)/HTTP or JSON/HTTP, to send requests to read or change data. SOA testing is the testing of this architectural design. It generally focuses on three system layers: the service layer, the process layer, and the consumer layer. The service layer consists of the services that a system exposes from its business functions. The process layer consists of processes, i.e. collections of services that are part of a single functionality. These processes might be a tool to read data from a database or a part of the user interface. The consumer layer consists of the user interface. Based on these layers, SOA testing is categorized into three levels: service level, interface level, and end-to-end level. An overlay of the SOA is shown in Fig. 1.


Fig. 1. An overlay of SOA
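The three layers described above can be sketched in a few lines of Python. This is an illustrative toy, not an implementation from the paper: the network transport is left out and services simply exchange JSON strings, and every name here (`inventory_service`, `order_service`, `place_order_process`, `order_button_clicked`, the `STOCK` table) is hypothetical.

```python
import json

# Hypothetical in-memory inventory shared by the services.
STOCK = {"WIDGET": 5}


# Service layer: each function is one exposed service (JSON in, JSON out).
def inventory_service(request_json):
    """Read service: report how many units of an item are in stock."""
    item = json.loads(request_json)["item"]
    return json.dumps({"item": item, "available": STOCK.get(item, 0)})


def order_service(request_json):
    """Write service: reserve stock for an order if enough is available."""
    request = json.loads(request_json)
    item, quantity = request["item"], request["quantity"]
    if STOCK.get(item, 0) < quantity:
        return json.dumps({"status": "rejected", "reason": "insufficient stock"})
    STOCK[item] -= quantity
    return json.dumps({"status": "accepted", "item": item, "quantity": quantity})


# Process layer: a collection of services that forms one functionality.
def place_order_process(item, quantity):
    stock = json.loads(inventory_service(json.dumps({"item": item})))
    if stock["available"] < quantity:
        return {"status": "rejected", "reason": "insufficient stock"}
    reply = order_service(json.dumps({"item": item, "quantity": quantity}))
    return json.loads(reply)


# Consumer layer: what a user interface would call.
def order_button_clicked(item, quantity):
    result = place_order_process(item, quantity)
    return "Order placed" if result["status"] == "accepted" else "Order failed"


print(order_button_clicked("WIDGET", 2))
```

Under this sketch, a service-level test would call `inventory_service` or `order_service` directly, an interface-level test would target `place_order_process`, and an end-to-end test would drive `order_button_clicked`.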

SOA is a paradigm for organizing and utilizing distributed capabilities that may be under the control of different ownership domains and implemented using various techniques. To meet the growing demands of industry, IT departments are free to combine business services from multiple applications to provide end-to-end support for business processes. This is possible because SOA implementations are usually loosely coupled, so IT departments can easily update or change one application without impacting others. Most SOA implementations are based on web-application services, and many critical implementations are driven by the need for complex processes that traditional development methods cannot deliver. SOA-based applications need testing to maintain reliability, risk tolerance, and quality [13].

3.1 The Importance of SOA
SOA testing has many aspects, but the bottom line is agility and flexibility. What makes SOA an enticing IT paradigm is also the reason why a different testing method is required for SOA implementations. The most specific difference between the architectures used before and after SOA is shown in Fig. 2. SOA testing must cover interfaces and services, because SOA supports service discovery that may span various systems and platforms, in addition to performance and security-related dimensions. A service-oriented architecture allows components written in different languages to connect to each other. SOA provides loose coupling between consumers and providers [14] and makes the dependencies between consumers and service providers explicit. A general comparison between SOA and microservices with respect to various metrics is given in Table 2.

Fig. 2. Architectural difference before and after SOA

Table 2. Comparison between SOA and microservices

| Metrics | Micro-Services | SOA |
|---|---|---|
| Deployment of services | The services are deployed individually | All the services are deployed at once, i.e., the deployment is monolithic |
| Managing teams | One team manages all individual services | Different teams manage user interface, integration, and services |
| User interface | It is one aspect of the complete service | The user interface acts as a portal for all the required services |
| Scope of architecture | It is considered to be one complete project | It caters to the whole enterprise |
| Flexibility | The deployment is fast and parallelizable | The processes catering to business are prioritized |
| Integration mechanism | The integration mechanism is simple and primitive | The integration mechanism is complex and smart |
| Integration technology | The technology is heterogeneous | The complete integration technology can be considered to be one vector |
| Integrates cloud | No | Yes |
| Management | The complete management is distributed and one team caters to one individual service | The management is centralized and all the teams come under one umbrella for one project |
| Storing data | The data is stored per unit | The data is shared among various services |
| Fit | It is fit for infrastructure with medium-sized resources | It is fit for enterprises/companies with a large infrastructure |


3.2 Challenges in SOA Testing
One of the most difficult challenges is managing service metadata. An SOA-based interface can include many services that exchange messages to perform multiple tasks, and a single interface or application can generate n messages. When organizations have to deal with services that belong to different domains and must be delivered to different organizations, this becomes very complex. Because the main aim of SOA is to deliver agility to the business, testing must focus on making issues in the architecture clearly visible [15]. Service interaction across different domains is among the most challenging aspects, including the availability of requirements, the services and their security, and the cost of the services. The availability of the applications therefore becomes essential throughout integration testing and end-to-end testing of the business process.

3.3 Smoke Testing
Smoke testing is done at the time of building software and refers to testing the basic functionality of the build. Smoke tests are a subset of software tests that should be run on every build, and the quality assurance team uses their results to decide whether or not to proceed with further tests. Smoke testing can be performed by three methods: the manual method, the automation method, and the hybrid method. The steps followed in smoke testing are as follows:
1. Identify smoke test cases: This is an essential step in performing smoke tests. It is necessary to identify the minimum number of test cases that cover the critical functionality of the product so that they can be executed quickly.
2. Create smoke test: The identified smoke tests are used to create test cases. The test cases are developed manually, and test scripts can also be created to automate them.
3. Run smoke test: Once the smoke tests are created, they can be run on the build and the results can be analyzed.
4. Analyze smoke test: After the smoke tests are performed, the results should be analyzed to determine whether the build is a pass or a failure.
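The four steps above can be sketched as a small automated smoke suite. The example below is a minimal illustration under stated assumptions: `ErpBuild` is a hypothetical in-memory stand-in for a deployed ERP build, and the three checks are invented examples of critical functionality, not the test cases used in the paper.

```python
class ErpBuild:
    """Hypothetical stand-in for an ERP build; a real suite would call the deployed system."""

    def __init__(self):
        self.orders = []

    def login(self, user, password):
        return bool(user) and bool(password)

    def create_order(self, item, quantity):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self.orders.append((item, quantity))
        return len(self.orders)  # order id


# Steps 1 and 2: identify the critical functions and create one small test for each.
def smoke_login(build):
    assert build.login("tester", "secret")


def smoke_create_order(build):
    assert build.create_order("WIDGET", 1) == 1


def smoke_order_is_recorded(build):
    build.create_order("WIDGET", 1)
    assert len(build.orders) == 1


SMOKE_TESTS = [smoke_login, smoke_create_order, smoke_order_is_recorded]


# Step 3: run every smoke test against a fresh build instance.
def run_smoke_tests(build_factory, tests=SMOKE_TESTS):
    results = {}
    for test in tests:
        try:
            test(build_factory())
            results[test.__name__] = "pass"
        except Exception as exc:  # a failed assertion or a crash both fail the test
            results[test.__name__] = f"fail: {exc!r}"
    return results


# Step 4: analyze the results; the build is accepted only if every test passed.
def build_is_accepted(results):
    return all(outcome == "pass" for outcome in results.values())


results = run_smoke_tests(ErpBuild)
print(results)
print("proceed with further testing" if build_is_accepted(results) else "reject build")
```

Keeping the suite this small is the point of the technique: it runs in seconds on every build, and a single failure is enough to stop deeper testing.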

4 Case Study: Enterprise Resource Planning (ERP)
In this section we discuss a case study, “Enterprise Resource Planning (ERP)”, that clarifies our work and helps to describe the system from an external point of view. We try to represent the viewpoint of a tester and the actual goals the tester wants to achieve with the system. In Fig. 3, the ovals represent the functionality of the testing framework, and the stick figure represents the actor of the framework, who mediates between the user and the system. We use the ‘Enterprise Resource Planning’ test case to apply our model-based test and smoke testing. The different use cases represent the different services provided by different service providers. We explain the terms in Fig. 3 below.


– Construct ERP Model: To verify or validate an ERP test case, the test model is checked before any action is run. A model may be extended or reused from an existing test model. On successful construction, control passes directly to the ERP model.
– Create ERP Model: A model is composed of components, so constructing the test model includes a use case for constructing a model component. A model component is used during test-state verification to check the information returned by an ERP model.
– Cache ERP Model: Whenever a model component is created or logged, it is cached by the test framework to facilitate out-of-process communication with the ERP client, i.e. when a driver runs in a different process from its test cases, which run within the context of the ERP client.
– Retrieve ERP Model: A model component is retrieved to populate the cache in an out-of-process test scenario, so that a model element can be accessed easily at state-verification time.
– Execute information check model: This pattern supports two aspects of verification:
  – information exposed by ERP foundation classes
  – information that is applicable only to the model.
  Following this principle, a test suite can specifically verify one component and its associated information components.

Fig. 3. ERP use case diagram


– Access ERP design model: Data that is applicable throughout a model design part is accessed by the test and compared with the expected state for verification.
– Test case assembly abstract model: In Fig. 4 the rectangular border denotes the external interfaces against which integration is checked. Figure 4 provides the static context of the ERP test case framework and highlights the concepts that must be supported to satisfy the earlier stage.
• Test Suite: This is the abstraction of an element that defines the contract for implementing a data verification test. Every test suite contains a group of tests that verify an ERP definition by accessing the corresponding ERP data. It is responsible for constructing the ERP model part. All related test suites are packaged into one test assembly, which sets up the proper context for executing the tests. The test suite concept is analyzed further in the section on the ERP test suite abstract model.

Fig. 4. ERP depicting test assembly context

• Test Case: A test case verifies one step of the model. It depends on the expected state of the system and verifies the results throughout execution. The test case concept is analyzed further in the section on the ERP test case model.
• Test case phase: Each test case is carried out in three phases: pre-execution, execution, and post-execution. Each phase defines what is to be done and how the result is observed for the system under test (SUT). The ERP system is given a controlled set of inputs. While Fig. 3 describes the concepts that the test framework must support, the next part covers the behavior that the test case model must support.
• Test case identification: This testing enforces single responsibility, consistency, loose coupling, flexibility, and reusability, the key characteristics of maintainable software that is resilient to underlying changes in the ERP client used to drive the tests.

We show here a few snapshots of the demo test cases we created and tested on the free testing suite (one-time free use) provided by Oracle NetSuite (Fig. 5).

Fig. 5. ERP demo suite template snapshot
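The model cache and the three test case phases described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the framework used in the case study: the names `ModelCache` and `SmokeTestCase` and the toy "invoice" model are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ModelCache:
    """Hypothetical cache holding constructed ERP model parts by name."""
    _parts: Dict[str, Any] = field(default_factory=dict)

    def store(self, name: str, part: Any) -> None:
        self._parts[name] = part

    def retrieve(self, name: str) -> Any:
        # Raises KeyError if the model part was never constructed.
        return self._parts[name]


@dataclass
class SmokeTestCase:
    """A test case split into pre-execution, execution, and post-execution."""
    name: str
    build_model: Callable[[], Any]   # pre-execution: construct the model part
    action: Callable[[Any], Any]     # execution: drive the system under test
    expected: Any                    # post-execution: expected state
    log: List[str] = field(default_factory=list)

    def run(self, cache: ModelCache) -> bool:
        # Pre-execution: construct the model part and cache it.
        cache.store(self.name, self.build_model())
        self.log.append("pre-execution")
        # Execution: apply the action to the cached model part.
        actual = self.action(cache.retrieve(self.name))
        self.log.append("execution")
        # Post-execution: compare the observed state with the expected state.
        self.log.append("post-execution")
        return actual == self.expected


cache = ModelCache()
case = SmokeTestCase(
    name="invoice",
    build_model=lambda: {"quantity": 3, "unit_price": 20},
    action=lambda model: model["quantity"] * model["unit_price"],
    expected=60,
)
passed = case.run(cache)
```

Keeping the cache separate from the test case mirrors the Cache and Retrieve use cases: a test running in another process would only need the cache interface, not the code that built the model.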

5 Conclusion

In this paper, we focused on UML use case diagrams to generate test cases in the context of the Enterprise Resource Planning case study. After comparing several approaches and their issues, we found that model-based smoke testing of SOA-based applications is well suited to finding bugs in a minimum amount of time. In the future, we will focus on a more effective method for selecting test cases that reduces redundancy in test case selection for smoke testing. We also plan to add different types of UML diagrams to the smoke test for better results.


References
1. Hustad, E., Olsen, D.H.: Creating a sustainable digital infrastructure: the role of service-oriented architecture. Procedia Computer Science 181, 597–604 (2021)
2. Baskerville, R.L., Cavallari, M., Hjort-Madsen, K., Pries-Heje, J., Sorrentino, M., Virili, F.: The strategic value of SOA: a comparative case study in the banking sector. Int. J. Inf. Technol. Manage. 9(1), 30–53 (2010)
3. Mohanty, R.K., Pattanayak, B.K., Puthal, B., Mohapatra, D.P.: A roadmap to regression testing of service-oriented architecture (SOA) based applications. J. Theoretical Applied Inf. Technol. 36(1) (2012)
4. Chen, I.J.: Planning for ERP systems: analysis and future trend. Business Process Management Journal (2001)
5. Gross, J.: A case study on Hershey's ERP implementation failure: the importance of testing and scheduling. Pemeco Consulting (2011)
6. Gerrard, P.: Test methods and tools for ERP implementations. In: Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION (TAICPART-MUTATION 2007), pp. 40–46. IEEE (2007)
7. Kenge, R., Khan, Z.: A research study on the ERP system implementation and current trends in ERP. Shanlax Int. J. Management 8(2), 34–39 (2020)
8. Matende, S., Ogao, P.: Enterprise resource planning (ERP) system implementation: a case for user participation. Procedia Technol. 9, 518–526 (2013)
9. Bagchi, S., Kanungo, S., Dasgupta, S.: Modeling use of enterprise resource planning systems: a path analytic study. Eur. J. Inf. Syst. 12(2), 142–158 (2003)
10. Lyytinen, K., Newman, M.: A tale of two coalitions: marginalising the users while successfully implementing an enterprise resource planning system. Inf. Syst. J. 25(2), 71–101 (2015)
11. Ozorhon, B., Cinar, E.: Critical success factors of enterprise resource planning implementation in construction: case of Turkey. J. Manag. Eng. 31(6), 04015014 (2015)
12. Das, S., Dayal, M.: Exploring determinants of cloud-based enterprise resource planning (ERP) selection and adoption: a qualitative study in the Indian education sector. J. Inf. Technol. Case Appl. Res. 18(1), 11–36 (2016)
13. Nickull, D., Reitman, L., Ward, J., Wilber, J.: Service oriented architecture (SOA) and specialized messaging patterns. Tech. rep., Adobe Systems Incorporated (2007)
14. Mahmood, Z.: Service oriented architecture: potential benefits and challenges. In: Proceedings of the 11th WSEAS International Conference on COMPUTERS, pp. 497–501 (2007)
15. Chen, J.Y., Wang, Y.J., Xiao, Y.: SOA-based service recovery framework. In: 2008 The Ninth International Conference on Web-Age Information Management, pp. 629–635. IEEE (2008)

Role of Hybrid Evolutionary Approaches for Feature Selection in Classification: A Review

Jayashree Piri1(B), Puspanjali Mohapatra1, Raghunath Dey1, and Niranjan Panda2

1 Computer Science and Engineering, IIIT, Bhubaneswar, India
{c118001,puspanjali,c118003}@iiit-bh.ac.in
2 Computer Science and Engineering, Siksha 'O' Anusandhan Deemed To Be University, Bhubaneswar, India
[email protected]

Abstract. Feature Selection (FS) is a critical pre-processing phase in machine learning (ML) that identifies the best features with the highest classification accuracy, and it has a significant effect on the efficiency of the subsequent learning models. Researchers are focusing on multiple evolutionary algorithms and attempting to design modern hybrid algorithms to solve FS problems because of their dominance over conventional optimization methods. Thus, several studies have been conducted on hybrid evolutionary algorithms for FS, such as the Artificial Bee Colony algorithm (ABC) with Genetic Algorithm (GA), Mayfly Algorithm (MA) with Harmony Search (HS), etc. The contribution of this study is to present a detailed literature search on the hybridization of various evolutionary algorithms to solve the FS problem, as well as to critically evaluate the suggested hybrid techniques. This paper covers reviews of some related studies on hybrid algorithms published from 2009 up to 2021. This review paper serves to give a complete analysis of the evolutionary algorithms used in hybridization, classifiers used, datasets used, application, fitness function used, means of hybridization, and the performance of the new hybridized algorithm in comparison to the existing algorithms. Moreover, emerging problems and concerns are discussed in order to identify widely researched realms for further investigation.

Keywords: Feature selection · Evolutionary algorithm · Hybridization · Machine learning

1 Introduction

FS is the process of automatically or manually choosing the smallest set of features that can represent a dataset as well as the original features do. Irrelevant features in the dataset force the model to learn from insignificant information, resulting in a low recognition rate and a substantial decrease in performance. To address these problems, FS is used for dimensionality reduction while also improving the consistency of the feature vector by deleting insignificant and redundant features. Filter, wrapper, and embedded techniques are the three general categories of FS approach [39].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 92–103, 2022. https://doi.org/10.1007/978-3-031-11713-8_10


Filter techniques are commonly applied as a pre-processing stage. This approach starts with the full set of features and chooses the best subset with the help of statistical measures such as the Wilcoxon-Mann-Whitney test, the Pearson correlation, chi-square, linear discriminant analysis, and ANOVA. Nowadays many researchers have turned to wrapper methods, which use classifiers as the evaluation mechanism and thus involve the training algorithm [41]. Wrapper strategies train the learning algorithm on a subset of features and use the performance of the trained model to score that subset. This makes these methods computationally more complex and more expensive than filter methods. Conventional wrapper approaches include recursive feature elimination, backward feature elimination, forward feature selection, and so on [39]. Evolutionary wrapper methods are more popular when the search space is very large.

Exploration and exploitation are two opposing requirements to be balanced when developing meta-heuristics. In exploration the algorithm searches for new solutions in new regions, while exploitation means refining already existing solutions so that their fitness improves. Each nature-inspired methodology has its own strengths and weaknesses, and no individual optimization algorithm can find the optimal solution for every kind of function. Designing modern meta-heuristics with high precision for real-world applications has therefore become a challenge for scientists. As a result, the hybridization of evolutionary methods has attracted many researchers working on FS problems. The aim of hybridization is to identify compatible alternatives in order to ensure optimal output of optimization methods, which is accomplished by combining and coordinating the exploration and exploitation processes.

Papers from Web of Science, Scopus, and Google Scholar were compiled for this study. We offer a detailed survey of hybrid evolutionary methods for the FS problem and address open questions and challenges for further research. This review is intended to draw the attention of researchers working on a variety of evolutionary computing (EC) frameworks, allowing them to explore promising ways to address new FS issues.
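The difference between the filter and wrapper categories described in this introduction can be made concrete with a small sketch. It is an illustration only, assuming a toy dataset, absolute Pearson correlation as the filter score, and leave-one-out 1-NN accuracy as the wrapper score; none of these choices comes from the surveyed papers.

```python
import math
from itertools import combinations

# Toy data (hypothetical): 3 features per sample, binary label.
# Feature 0 tracks the label, feature 1 is noise, feature 2 is constant.
X = [
    [0.1, 0.9, 1.0],
    [0.2, 0.1, 1.0],
    [0.3, 0.5, 1.0],
    [0.8, 0.8, 1.0],
    [0.9, 0.2, 1.0],
    [1.0, 0.4, 1.0],
]
y = [0, 0, 0, 1, 1, 1]


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return 0.0 if va == 0 or vb == 0 else cov / math.sqrt(va * vb)


def filter_scores(X, y):
    """Filter approach: score each feature on its own, no classifier involved."""
    return [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]


def loo_accuracy(X, y, subset):
    """Wrapper approach: leave-one-out accuracy of 1-NN on the chosen features."""
    hits = 0
    for i, xi in enumerate(X):
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((xi[j] - X[k][j]) ** 2 for j in subset),
        )
        hits += y[nearest] == y[i]
    return hits / len(X)


scores = filter_scores(X, y)
best_by_filter = max(range(len(scores)), key=scores.__getitem__)

# Wrapper search over all non-empty subsets (feasible only for tiny feature counts).
subsets = [s for r in (1, 2, 3) for s in combinations(range(3), r)]
best_by_wrapper = max(subsets, key=lambda s: (loo_accuracy(X, y, s), -len(s)))
```

The wrapper loop calls the classifier once per candidate subset, and the number of subsets doubles with every added feature, which is why evolutionary search replaces enumeration on realistic datasets.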

2 Background

The aim of FS in ML is to create useful models of the studied phenomena by finding the right set of features [38]. The key factors of the FS task, shown in Fig. 1, are the searching technique, the evaluation criteria, and the number of objective functions.

2.1 Searching Techniques

Searching or optimization techniques are required to obtain an optimal feature subset for the FS problem. Well-known search techniques include exhaustive search, heuristic search, and evolutionary computation (EC). EC techniques have been employed in recent years as an effective way to solve the FS problem. These methods include the Genetic Algorithm (GA) [38], Binary Harris Hawks Optimization (BHHO) [39], chimp optimization [41], and Binary Ant Lion Optimization (BALO) [40].


Fig. 1. Key factors of FS

2.2 Criteria for Evaluation

For wrapper feature selection approaches, the classification performance obtained with the chosen attributes is taken as the evaluation criterion. Many standard classifiers, such as Decision Tree (DT), Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbor (KNN), Artificial Neural Network (ANN), and Linear Discriminant Analysis (LDA), have been employed as wrappers for FS tasks [1]. For filter techniques, measures from various fields have been introduced, including information theory, correlation measures, distance measures, and consistency measures.

2.3 Number of Objectives

A single-objective (SO) approach aggregates the feature count and the accuracy of the classifier into one function. The multi-objective (MO) strategy, in contrast, seeks the trade-off solutions on the Pareto front. In the SO case, one solution is judged better than another by comparing values of the objective function, while in multi-objective problems the dominance concept is used to identify the best solutions [2].

3 Systematic Literature Review

This study covers reviews of 35 papers on hybrid algorithms published from 2009 up to 2021. These are ABC-GA [3], CPSO-DE [4], HS-RTEA [5], MA-HS [6], MBA-SA [7], PSO-GE [8], SSA-SCA [9], GWO-PSO [10], SOA-TEO [11], SHO-SA [12], MFO-DE [13], GWO-CSA [14], PSO-BA [15], BPSO-FLA [16], SCA-DE [17], WOA-SA [18], ABC-PSO [19], ACO-PSO [20], MA-KHA [21], GA-SA [22], GA-ACO [23], TS-BPSO [24], SCA-BPSO [25], GA-PSO [26], ALO-GWO [27], ALO-DE [28], ALO-SCA [29], TLBO-SSA [30], HHO-CS [31], SCA-HHO [32], GWO-HHO [33], ACO-ABC [34], DE-ABC [35], ACO-CS [36], and SCA-CS [37]. However, the details of only 10 papers from 2020 and 2021 are described below.

3.1 Thawkar et al. (2021) [30]

AIM: This paper introduces a hybrid of TLBO and SSA that selects attributes, with an ANN model as the fitness evaluator.
EXPERIMENTAL EVALUATION: The hybrid method was evaluated on 651 breast cancer screening cases, and the test results show that TLBO-SSA outperforms the base TLBO algorithm. TLBO-SSA also achieves better results than GA.
ASSESSMENT METRICS: Sensitivity (SE), Specificity (SP), Classification Accuracy (CA), F-Score (FSc), Kappa coefficient, FPR, FNR.
SEARCH METHOD: Teaching–learning-based optimization (TLBO) - Salp swarm algorithm (SSA).
FITNESS FUNCTION: Artificial Neural Network.
MEANS OF HYBRIDIZATION: The population update is achieved via the TLBO technique or the SSA method during the teaching and learning stage.
CLASSIFIER USED: Adaptive Neuro-Fuzzy Inference System (ANFIS).
DATASET USED: Digital Database for Screening Mammography (DDSM) dataset, Breast Cancer Wisconsin (WBC) Diagnostic dataset.
APPLICATION: To solve the feature selection and classification problems in digital mammography.

3.2 Hussain et al. (2021) [32]

AIM: A hybrid optimization approach is proposed in this article, which incorporates SCA into HHO. The SCA integration aims to address the inefficient exploration of HHO and also improves exploitation through dynamic adjustment of candidate solutions to prevent solution stagnation in HHO.
EXPERIMENTAL EVALUATION: The proposed SCHHO is assessed on the CEC'17 numerical optimization test suite and on 16 low- and high-dimensional datasets with a total of over 15,000 features, and compared with the original SCA, HHO, and other existing optimizers. SCHHO reduces the feature dimension by up to 87% and achieves accuracy of up to 92% while increasing the convergence rate.
ASSESSMENT METRICS: avg CA, Mean Fitness (MF), avg number of selected features (NSF), selection ratio (SR), avg running time (RT), standard deviation (STD).
SEARCH METHOD: SCA-HHO.
FITNESS FUNCTION: $f_i = w_1 \times \epsilon_i + w_2 \times (d_i / D)$, with $w_1 = 0.99$ and $w_2 = 1 - w_1$. Here, $\epsilon_i$: error produced by KNN, $d_i$: number of features chosen, $D$: total number of features.
MEANS OF HYBRIDIZATION: SCA and HHO are combined so that exploration is performed by SCA and exploitation by HHO, with several significant modifications.


CLASSIFIER USED: KNN.
DATASET USED: Exactly, Exactly2, Lymphography, SpectEW, CongressEW, IonosphereEW, Vote, WineEW, BreastEW, Brain Tumors1, 11 Tumors, Leukemia2, SRBCT, DLBCL, Prostate Tumors, and 14 Tumors.
APPLICATION: To improve the FS task, specifically for high-dimensional data.

3.3 Wajih et al. (2021) [33]

AIM: A discrete hybrid of GWO and HHO, called HBGWOHHO, is proposed in this article for solving the FS task.
EXPERIMENTAL EVALUATION: 18 UCI datasets were used to verify the accuracy of the presented method. In terms of accuracy, number of selected features, and processing time, the presented approach outperforms GWO.
ASSESSMENT METRICS: avg CA, MF, best fitness (BF), worst fitness (WF), mean NSF, avg RT.
SEARCH METHOD: GWO-HHO.
FITNESS FUNCTION: $\text{fitness} = \alpha \times ER + (1 - \alpha) \times (|S_f| / |T_f|)$. Here, $\alpha$ lies between 0 and 1, $ER$: error rate, $|S_f|$: number of chosen attributes, and $|T_f|$: total number of attributes.
MEANS OF HYBRIDIZATION: Exploration is carried out by HHO while exploitation is done by GWO. The resulting HBGWOHHO is more likely to escape local optima while improving the accuracy of the solution.
CLASSIFIER USED: KNN.
DATASET USED: Breastcancer, BreastEW, CongressEW, Exactly, Exactly2, HeartEW, IonosphereEW, KrvskpEW, Lymphography, M-of-N, PenglungEW, SonarEW, SpectEW, Tic-tac-toe, WineEW, and Zoo.
APPLICATION: For the FS task.

3.4 Bindu et al. (2020) [3]

AIM: This paper investigates the possibility of enhancing the Artificial Bee Colony algorithm (ABC) by hybridizing it with GA.
EXPERIMENTAL EVALUATION: Experimental studies on different datasets show that the presented hybrid solution outperforms the plain ABC strategy.
ASSESSMENT METRICS: CA, NSF.
SEARCH METHOD: Artificial Bee Colony algorithm (ABC) - GA.
FITNESS FUNCTION: Classification accuracy.
MEANS OF HYBRIDIZATION: After a complete round of ABC and GA run individually, a collection of good food sources and a fresh population of highly fit solutions are built. The proposed solution involves sharing the results of both algorithms: both methods perform their next round with the shared population as input.
CLASSIFIER USED: Random Forest.
DATASET USED: UCI (Zoo, Wine, Heart-C, Glass, Tic-tac-toe).
APPLICATION: For FS.
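The fitness functions quoted in Sects. 3.2 and 3.3 share one form: a weighted sum of the wrapper classifier's error and the fraction of features kept. The sketch below shows that form with a small pure-Python k-NN as the wrapper. It is an illustration under stated assumptions, not the implementation of any surveyed paper: the hold-out split, `k = 3`, the toy data, and the worst-case score for an empty subset are all choices made here.

```python
from collections import Counter
from typing import Sequence


def knn_error(train_X, train_y, test_X, test_y, mask: Sequence[int], k: int = 3) -> float:
    """Error rate of k-NN on the test split using only features where mask[j] == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    wrong = 0
    for x, label in zip(test_X, test_y):
        # Sort training points by squared Euclidean distance on the selected columns.
        order = sorted(
            range(len(train_X)),
            key=lambda i: sum((x[j] - train_X[i][j]) ** 2 for j in cols),
        )
        votes = Counter(train_y[i] for i in order[:k])
        wrong += votes.most_common(1)[0][0] != label
    return wrong / len(test_X)


def fitness(mask, train_X, train_y, test_X, test_y, alpha: float = 0.99) -> float:
    """alpha * error + (1 - alpha) * (selected / total); lower is better."""
    if not any(mask):
        return 1.0  # assumption: an empty subset gets the worst possible score
    error = knn_error(train_X, train_y, test_X, test_y, mask)
    return alpha * error + (1 - alpha) * sum(mask) / len(mask)


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
train_X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 1.2], [1.1, 8.8], [0.9, 5.1]]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [[0.05, 9.1], [1.05, 0.9]]
test_y = [0, 1]

only_first = fitness([1, 0], train_X, train_y, test_X, test_y)
both = fitness([1, 1], train_X, train_y, test_X, test_y)
```

With `alpha` close to 1 the error term dominates, so the feature-count term mainly acts as a tie-breaker between subsets of similar accuracy.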


3.5 Bhattacharyya et al. (2020) [6]

AIM: This study introduces an FS approach named MA-HS, which is built on the Mayfly Algorithm and Harmony Search.
EXPERIMENTAL EVALUATION: The suggested MA-HS technique was tested on 18 UCI datasets and compared with 12 other state-of-the-art meta-heuristic FS approaches, as well as on 3 high-dimensional microarray datasets. The results show that, compared with the others, MA-HS reaches high classification accuracy with a lower number of attributes.
ASSESSMENT METRICS: CA, NSF.
SEARCH METHOD: Mayfly Algorithm (MA) - Harmony Search (HS).
FITNESS FUNCTION: $\text{fitness} = y \times \lambda + (1 - y) \times (|f| / |F|)$, where $|f|$: number of features in the feature subset, $|F|$: total number of features in the dataset, $\lambda$: error rate, and $y$ lies between 0 and 1.
MEANS OF HYBRIDIZATION: MA and HS are hybridized following the pipeline model.
CLASSIFIER USED: KNN.
DATASET USED: UCI (Zoo, Breastcancer, BreastEW, CongressEW, Exactly, Ionosphere, M-of-n, PenglungEW, SonarEW, Vote, WineEW, Exactly2, HeartEW, Tic-tac-toe, WaveformEW, KrvskpEW, LymphographyEW, SpectEW), Microarray (Leukaemia2, DLBCL, SRBCT).
APPLICATION: Feature selection for cancer classification.

3.6 Alweshah et al. (2020) [7]

AIM: The purpose of this work is to enhance the FS procedure by applying the mine blast algorithm (MBA) to refine FS in the exploration phase, and then hybridizing MBA with simulated annealing (SA) to improve the solutions found by MBA.
EXPERIMENTAL EVALUATION: The proposed method was evaluated on 18 UCI datasets, and the final findings show that MBA-SA outperforms the other 5 strategies listed in the article.
ASSESSMENT METRICS: CA, precision (PR), Recall (RE), FSc.
SEARCH METHOD: MBA-SA.
FITNESS FUNCTION: $\text{fitness} = \alpha\,\gamma_R(D) + \beta \times (|R| / |N|)$, where $\gamma_R(D)$: classification error rate, $|R|$: number of selected features, $|N|$: total number of features in the original dataset, $\alpha \in [0, 1]$, and $\beta = 1 - \alpha$.
MEANS OF HYBRIDIZATION: SA is inserted into the MBA to select the optimal solution that is close to the randomly chosen solution and the best-known one. SA is treated here as an operator of the MBA that improves the exploitation capacity of the MBA process.
CLASSIFIER USED: KNN.
DATASET USED: UCI (Zoo, Breastcancer, BreastEW, CongressEW, Exactly, Ionosphere, M-of-n, PenglungEW, SonarEW, Vote, WineEW, Exactly2, HeartEW, Tic-tac-toe, WaveformEW, KrvskpEW, LymphographyEW, SpectEW, Credit, Derm, Derm2, LED, Lung, Mushroom, WQ).
APPLICATION: For boosting FS.
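Section 3.6 uses SA as a local refinement step on solutions found by a global search. A generic sketch of that idea on a binary feature mask follows. It is not the MBA-SA procedure from [7]: the single-bit-flip neighbourhood, the geometric cooling schedule, all parameter values, and the toy cost function are assumptions made for illustration.

```python
import math
import random
from typing import Callable, List


def simulated_annealing(
    start: List[int],
    cost: Callable[[List[int]], float],
    t0: float = 1.0,
    cooling: float = 0.95,
    steps: int = 200,
    seed: int = 0,
) -> List[int]:
    """Refine a binary feature mask by single-bit flips; returns the best mask seen."""
    rng = random.Random(seed)
    current, current_cost = start[:], cost(start)
    best, best_cost = current[:], current_cost
    temperature = t0
    for _ in range(steps):
        neighbour = current[:]
        neighbour[rng.randrange(len(neighbour))] ^= 1  # flip one feature in or out
        neighbour_cost = cost(neighbour)
        delta = neighbour_cost - current_cost
        # Always accept improvements; accept worse moves with probability exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = neighbour, neighbour_cost
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
        temperature *= cooling
    return best


# Hypothetical stand-in for a wrapper fitness: distance to a hidden "ideal" mask.
target = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_cost(mask: List[int]) -> float:
    return float(sum(a != b for a, b in zip(mask, target)))


start_mask = [0, 0, 0, 0, 1, 1, 1, 1]  # e.g. the best mask found by a global search
refined = simulated_annealing(start_mask, toy_cost)
```

Because the function tracks the best mask seen, the refined solution is never worse than the starting one, which is what makes SA safe to bolt onto another optimizer as an exploitation step.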


3.7 Meera et al. (2020) [8]

AIM: In this article, a hybrid PSO-Grammatical Evolution (GE) method is proposed to improve performance, minimize query processing time, and reduce the processing load of PSO.
EXPERIMENTAL EVALUATION: The test results demonstrate that the hybrid PSO-GE approach is more efficient than current approaches.
ASSESSMENT METRICS: PR, RE, CA, FSc, time complexity.
SEARCH METHOD: PSO - GE.
FITNESS FUNCTION: Classification accuracy.
MEANS OF HYBRIDIZATION: GE and PSO are hybridized in pipeline mode.
CLASSIFIER USED: KNN, Naive Bayes classifiers.
DATASET USED: The Product Opinion Dataset from Amazon.
APPLICATION: Efficient feature selection in big data.

3.8 Neggaz et al. (2020) [9]

AIM: This study presents a new variant of the Salp Swarm Algorithm (SSA), known as ISSAFD, for FS. Using sinusoidal mathematical functions inspired by the Sine Cosine algorithm, ISSAFD updates the positions of the followers (F) in SSA.
EXPERIMENTAL EVALUATION: The experimental findings show that the proposed algorithm works better than other FS approaches on measures including sensitivity, precision, accuracy, and the number of features chosen.
ASSESSMENT METRICS: avg of CA, SE, SP, fitness, NSF, RT, STD.
SEARCH METHOD: SSA - Sine Cosine algorithm (SCA).
FITNESS FUNCTION: $Fit_i = \lambda \times \gamma_i + \mu \times (|BX_i| / Dim)$, where $\lambda \in [0, 1]$ is a balancing factor, $\mu = 1 - \lambda$, $\gamma_i$: the test error rate, and $|BX_i|$: length of the chosen attribute vector.
MEANS OF HYBRIDIZATION: SSA is used to update the leader population and SCA is used to update the followers' population.
CLASSIFIER USED: KNN.
DATASET USED: Exactly, Exactly2, HeartEW, Lymphography, M-of-n, PenglungEW, SonarEW, SpectEW, CongressEW, IonosphereEW, KrvskpEW, Vote, WaveformEW, WineEW, Zoo, BreastEW, Brain Tumors 2, 9 Tumors, Leukemia 3, Prostate Tumors.
APPLICATION: For improving FS.

3.9 Hans et al. (2020) [29]

AIM: SCALO, a hybridization of SCA with ALO, is proposed in this article. To eliminate irrelevant features and improve classification accuracy, the proposed algorithm is mapped to a discrete representation using transfer functions.


EXPERIMENTAL EVALUATION: The performance of the binary SCALO variants is compared with several recent evolutionary techniques on diverse criteria. The experimental findings demonstrate that the suggested binary SCALO variants perform better on different evaluation criteria for the FS task.
ASSESSMENT METRICS: avg CA, MF, WF, BF, STD, avg NSF, FSc.
SEARCH METHOD: ALO-SCA.
FITNESS FUNCTION: $\text{fitness} = Z_1 \times (|FS| / |N|) + Z_2 \times \mu(D)$, where $\mu(D)$: the wrapper classifier's error, $|FS|$: size of the chosen feature subset, $|N|$: size of the full feature vector, and $Z_1$, $Z_2$: constants with $Z_1 \in [0, 1]$ and $Z_2 = 1 - Z_1$.
MEANS OF HYBRIDIZATION: The suggested approach updates the top half of the population with SCA and the bottom half with ALO. The SCALO hybrid takes advantage of SCA in order to consolidate exploration and exploitation. An increased diversity of solutions is achieved by performing ALO random walks around the best solution and the solution chosen by the tournament selection operator.
CLASSIFIER USED: KNN.
DATASET USED: Zoo, Statlog credit, Lung cancer, Exactly, Exactly2, Heart, Vote, Spect Heart, Australian, Ionosphere, Water treatment, Wine, Waveform, Glass Identification, Breast cancer, Sonar, Statlog Vehicle Xab, Vowel.
APPLICATION: To enhance feature selection.

3.10 Khamees et al. (2020) [37]

AIM: A novel optimizer for attribute selection is discussed in this work. To reach an effective optimal solution, the suggested methodology uses the strengths of the SCA and CS optimization techniques to explore and examine the search region.
EXPERIMENTAL EVALUATION: The test results demonstrate the efficacy of the hybrid technique in exploring and exploiting the best feature region, with improved accuracy and running time on all datasets.
ASSESSMENT METRICS: MSE, CA, RT.
SEARCH METHOD: SCA-CS.
FITNESS FUNCTION: $\text{cost} = \alpha \times \mathrm{ErRate}(d) + \beta \times (L / T)$. Here, $\alpha$ and $\beta$ are constants, $L$: number of chosen attributes, $T$: total number of attributes.
MEANS OF HYBRIDIZATION: SCA starts the search by generating random solutions, which are used as input to CS. A number of candidate solutions from the CS algorithm are then introduced into the optimization process of SCA, where they are refined and checked regularly against the objective function. The optimal solution is retained and tested further.
CLASSIFIER USED: KNN.
DATASET USED: Breast, Leukemia, Heart, Iris.
APPLICATION: For FS in classification.
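Section 3.9 mentions mapping a continuous optimizer to a discrete representation with transfer functions. One common choice, shown here as an assumption rather than as the specific function used in [29], is an S-shaped (sigmoid) transfer function: each real-valued position component becomes the probability of selecting the corresponding feature.

```python
import math
import random
from typing import List


def s_shaped(x: float) -> float:
    """Sigmoid transfer function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def binarize(position: List[float], rng: random.Random) -> List[int]:
    """Select feature j with probability s_shaped(position[j])."""
    return [1 if rng.random() < s_shaped(x) else 0 for x in position]


rng = random.Random(42)
# Hypothetical continuous position of one search agent over four features.
continuous_position = [-50.0, 50.0, 0.0, 50.0]
mask = binarize(continuous_position, rng)
```

Strongly negative components are almost never selected and strongly positive ones almost always are, while components near zero stay close to a coin flip, which preserves some exploration in the binary space.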


Fig. 2. Number of papers per technique

4 Analysis

As per the study, the bulk of the research used the wrapper approach rather than the filter approach, owing to the higher accuracy of wrappers relative to filters. Many researchers have attempted to combine filter and wrapper strategies in order to reap the benefits of both. Figure 2 shows the number of papers per evolutionary technique used for hybridization. PSO is used by the highest number of studies (10). This is probably because PSO is derivative-free and its concept and implementation are simple compared with the others. PSO is also relatively insensitive to the nature of the objective function. In contrast with rival evolutionary approaches, PSO has a small set of parameters: an inertia weight and two acceleration coefficients. Of the 35 studies, 22 used KNN as the wrapper in their fitness computation because it is easy to understand and computationally cheap. KNN also has essentially no training phase: as a lazy learner it stores the training data and defers all computation to prediction time. Experiments on different datasets show that the suggested hybrid methods outperform the corresponding base FS techniques; hybridization seeks compatible alternatives in order to achieve the best outcomes when solving optimization tasks.
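The small parameter set attributed to PSO above, one inertia weight and two acceleration coefficients, is visible in the standard velocity update. The sketch below shows one update step for a single particle; the default values `w = 0.7` and `c1 = c2 = 1.5` are illustrative assumptions, not settings taken from the surveyed papers.

```python
import random
from typing import List, Tuple


def pso_step(
    position: List[float],
    velocity: List[float],
    personal_best: List[float],
    global_best: List[float],
    rng: random.Random,
    w: float = 0.7,    # inertia weight
    c1: float = 1.5,   # cognitive acceleration coefficient
    c2: float = 1.5,   # social acceleration coefficient
) -> Tuple[List[float], List[float]]:
    """One velocity and position update for a single particle."""
    new_velocity = [
        w * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        for x, v, pb, gb in zip(position, velocity, personal_best, global_best)
    ]
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity


rng = random.Random(1)
best = [1.0, 2.0]
# A particle already sitting on both bests only keeps its damped momentum.
pos, vel = pso_step([1.0, 2.0], [1.0, -1.0], best, best, rng)
```

No gradient of the objective appears anywhere in the update, which is the sense in which PSO is derivative-free: only the positions of the personal and global bests are needed.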

5 Conclusion

Researchers in ML and data mining have given hybrid evolutionary FS a lot of attention over the years. According to the No Free Lunch (NFL) theorem, however, there is not and never will be a single optimization strategy capable of solving all problems. To help researchers in their endeavors, we attempted a systematic literature review of the studies published from 2009 up to 2021 to highlight the main challenges and techniques used for hybrid evolutionary feature selection. The review analyzes the evolutionary algorithms used in hybridization, the classifiers used, the datasets used, the application of the hybrid algorithm, the fitness function used, the means of hybridization, and the performance of the new hybridized algorithm in comparison with existing algorithms. According to the results of this survey, significant attempts have been made to increase the performance of evolutionary wrapper FS approaches through hybridization in terms of accuracy and size of feature subsets, opening the way for future developments.

References
1. Liu, H., Zhao, Z.: Manipulating data and dimension reduction methods: feature selection. In: Encyclopedia of Complexity and Systems Science, pp. 5348–5359. Springer (2009). https://doi.org/10.1007/978-0-387-30440-3_317
2. Piri, J., Dey, R.: Quantitative association rule mining using multi-objective particle swarm optimization. Int. J. Sci. Eng. Res. 5(10), 155–161 (2014)
3. Bindu, M.G., Sabu, M.K.: A hybrid feature selection approach using artificial bee colony and genetic algorithm. In: 2020 Advanced Computing and Communication Technologies for High Performance Applications (ACCTHPA), Cochin, India, pp. 211–216 (2020). https://doi.org/10.1109/ACCTHPA49271.2020.9213197
4. Ajibade, S.-S.M., Binti Ahmad, N.B., Zainal, A.: A hybrid chaotic particle swarm optimization with differential evolution for feature selection. In: 2020 IEEE Symposium on Industrial Electronics & Applications (ISIEA), TBD, Malaysia, pp. 1–6 (2020). https://doi.org/10.1109/ISIEA49364.2020.9188198
5. Ahmed, S., Ghosh, K.K., Singh, P.K., Geem, Z.W., Sarkar, R.: Hybrid of harmony search algorithm and ring theory-based evolutionary algorithm for feature selection. IEEE Access 8, 102629–102645 (2020). https://doi.org/10.1109/ACCESS.2020.2999093
6. Bhattacharyya, T., Chatterjee, B., Singh, P.K., Yoon, J.H., Geem, Z.W., Sarkar, R.: Mayfly in harmony: a new hybrid meta-heuristic feature selection algorithm. IEEE Access 8, 195929–195945 (2020). https://doi.org/10.1109/ACCESS.2020.3031718
7. Alweshah, M., Alkhalaileh, S., Albashish, D., Mafarja, M., Bsoul, Q., Dorgham, O.: A hybrid mine blast algorithm for feature selection problems. Soft. Comput. 25(1), 517–534 (2020). https://doi.org/10.1007/s00500-020-05164-4
8. Meera, S., Sundar, C.: A hybrid metaheuristic approach for efficient feature selection methods in big data. J. Ambient. Intell. Humaniz. Comput. 12(3), 3743–3751 (2020). https://doi.org/10.1007/s12652-019-01656-w
9. Neggaz, N., Ewees, A.A., Abd Elaziz, M., Mafarja, M.: Boosting salp swarm algorithm by sine cosine algorithm and disrupt operator for feature selection. Expert Syst. Appl. 145, 113103 (2019). https://doi.org/10.1016/j.eswa.2019.113103
10. Al-Tashi, Q., Abdul Kadir, S.J., Rais, H.M., Mirjalili, S., Alhussian, H.: Binary optimization using hybrid grey wolf optimization for feature selection. IEEE Access 7, 39496–39508 (2019). https://doi.org/10.1109/ACCESS.2019.2906757
11. Jia, H., Xing, Z., Song, W.: A new hybrid seagull optimization algorithm for feature selection. IEEE Access 7, 49614–49631 (2019). https://doi.org/10.1109/ACCESS.2019.2909945
12. Jia, H., Li, J., Song, W., Peng, X., Lang, C., Li, Y.: Spotted hyena optimization algorithm with simulated annealing for feature selection. IEEE Access 7, 71943–71962 (2019). https://doi.org/10.1109/ACCESS.2019.2919991


13. Abd Elaziz, M., Ewees, A.A., Ibrahim, R.A., Lu, S.: Opposition-based moth-flame optimization improved by differential evolution for feature selection. Mathematics and Computers in Simulation 168, 48–75 (2020). ISSN 0378-4754. https://doi.org/10.1016/j.matcom.2019.06.017
14. Arora, S., Singh, H., Sharma, M., Sharma, S., Anand, P.: A new hybrid algorithm based on grey wolf optimization and crow search algorithm for unconstrained function optimization and feature selection. IEEE Access 7, 26343–26361 (2019). https://doi.org/10.1109/ACCESS.2019.2897325
15. Tawhid, M.A., Dsouza, K.B.: Hybrid binary bat enhanced particle swarm optimization algorithm for solving feature selection problems. Applied Computing and Informatics 16(1/2), 117–136 (2018). https://doi.org/10.1016/j.aci.2018.04.001
16. Rajamohana, S.P., Umamaheswari, K.: Hybrid approach of improved binary particle swarm optimization and shuffled frog leaping for feature selection. Computers and Electrical Eng. 67, 497–508 (2018). ISSN 0045-7906. https://doi.org/10.1016/j.compeleceng.2018.02.015
17. Abd Elaziz, M.E.: A Hybrid Method of Sine Cosine Algorithm and Differential Evolution for Feature Selection (2017)
18. Mafarja, M.M., Mirjalili, S.: Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing 260, 302–312 (2017). ISSN 0925-2312. https://doi.org/10.1016/j.neucom.2017.04.053
19. Mendiratta, S., Turk, N., Bansal, D.: Automatic speech recognition using optimal selection of features based on hybrid ABC-PSO. In: 2016 International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, pp. 1–7 (2016). https://doi.org/10.1109/INVENTIVE.2016.7824866
20. Menghour, K., Souici-Meslati, L.: Hybrid ACO-PSO based approaches for feature selection. International Journal of Intelligent Engineering and Systems 9, 65–79 (2016). https://doi.org/10.22266/ijies2016.0930.07
21. Hafez, A.I., Hassanien, A.E., Zawbaa, H.M., Emary, E.: Hybrid Monkey Algorithm with Krill Herd Algorithm optimization for feature selection. In: 2015 11th International Computer Engineering Conference (ICENCO), Cairo, Egypt, pp. 273–277 (2015). https://doi.org/10.1109/ICENCO.2015.7416361
22. Azmi, R., Pishgoo, B., Norozi, N., Koohzadi, M., Baesi, F.: A hybrid GA and SA algorithms for feature selection in recognition of hand-printed Farsi characters. In: 2010 IEEE International Conference on Intelligent Computing and Intelligent Systems, Xiamen, China, pp. 384–387 (2010). https://doi.org/10.1109/ICI-CISYS.2010.5658728
23. Nemati, S., Basiri, M.E., Ghasem-Aghaee, N., Aghdam, M.H.: A novel ACO–GA hybrid algorithm for feature selection in protein function prediction. Expert Systems with Applications 36(10), 12086–12094 (2009). ISSN 0957-4174. https://doi.org/10.1016/j.eswa.2009.04.023
24. Chuang, L.Y., Yang, C.H., Yang, C.H.: Tabu search and binary particle swarm optimization for feature selection using microarray data. J. Comput. Biol. 16(12), 1689–1703 (2009). https://doi.org/10.1089/cmb.2007.0211. PMID: 20047491
25. Kumar, L., Bharti, K.K.: A novel hybrid BPSO–SCA approach for feature selection. Nat. Comput. 20(1), 39–61 (2019). https://doi.org/10.1007/s11047-019-09769-z
26. Moslehi, F., Haeri, A.: A novel hybrid wrapper–filter approach based on genetic algorithm, particle swarm optimization for feature subset selection. J. Ambient. Intell. Humaniz. Comput. 11(3), 1105–1127 (2019). https://doi.org/10.1007/s12652-019-01364-5
27. Zawbaa, H., Eid, E., Grosan, C., Snasel, V.: Large-dimensionality small-instance set feature selection: a hybrid bio-inspired heuristic approach. Swarm and Evolutionary Computation 42 (2018). https://doi.org/10.1016/j.swevo.2018.02.021

Role of Hybrid Evolutionary Approaches for Feature Selection

103

28. Abualigah, L., Diabat, A.: A novel hybrid antlion optimization algorithm for multi-objective task scheduling problems in cloud computing environments. Clust. Comput. 24(1), 205–223 (2020). https://doi.org/10.1007/s10586-020-03075-5 29. Hans, R., Kaur, H.: Hybrid binary sine cosine algorithm and ant lion optimization (SCALO) approaches for feature selection problem. International J. Computational Materials Science and Eng. 09 (2019). https://doi.org/10.1142/S2047684119500210 30. Thawkar, S.: A hybrid model using teaching–learning-based optimization and Salp swarm algorithm for feature selection and classification in digital mammography. J. Ambient. Intell. Humaniz. Comput. 12(9), 8793–8808 (2021). https://doi.org/10.1007/s12652-020-02662-z 31. Houssein, E.H., et al.: Hybrid Harris hawks optimization with cuckoo search for drug design and discovery in chemoinformatics. Scientific Reports 10(1), 1-22 (2020) 32. Hussain, K., et al.: An efficient hybrid sine-cosine Harris hawks optimization for low and high- dimensional feature selection. Expert Systems with Applications 176 114778 (2021) 33. Al-Wajih, R., et al.: Hybrid binary grey wolf with harris hawks optimizer for feature selection. IEEE Access 9 31662–31677 (2021) 34. Shunmugapriya, P., Kanmani, S.: A hybrid algorithm using ant and bee colony optimization for feature selection and classification (AC-ABC Hybrid). Swarm Evol. Comput. 36, 27–36 (2017) 35. Zorarpacı, E., O¨zel, S.A.: A hybrid approach of differential evolution and artificial bee colony for feature selection. Expert Systems with Applications 62, 91–103 (2016) 36. Jona, J.B., Nagaveni, N.: Ant-cuckoo colony optimization for feature selection in digital mammo- gram. Pakistan J. biological sciences: PJBS 17(2), 266–271 (2014) 37. Khamees, M., Rashed, A.A.-B.: Hybrid SCA-CS optimization algorithm for feature selection in classification problems. In: AIP Conference Proceedings. 2290(1). AIP Publishing LLC (2020) 38. 
Piri, J., Mohapatra, P., Dey, R.: Fetal health status classification using MOGA-CD based feature selection approach. In: 2020 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT). IEEE (2020) 39. Piri, J., Mohapatra, P.: An analytical study of modified multi-objective harris hawk optimizer towards medical data feature selection. Computers in Biology and Medicine 135, 104558 (2021) 40. Piri, J., Mohapatra, P., Dey, R.: Multi-objective ant lion optimization based feature retrieval methodology for investigation of fetal wellbeing. In: 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA). IEEE (2021) 41. Piri, J., Mohapatra, P., Pradhan, M.R., Acharya, B., Patra, T.K.: A binary multi-objective chimp optimizer with dual archive for feature selection in the healthcare domain. In: IEEE Access https://doi.org/10.1109/ACCESS.2021.3138403

Evaluation of Deep Learning Models for Detecting Breast Cancer Using Mammograms

Subasish Mohapatra(B), Sarmistha Muduly, Subhadarshini Mohanty, and Santosh Kumar Moharana

Odisha University of Technology and Research (OUTR), Bhubaneswar, Odisha, India
{smohapatra,sdmohantycse}@cet.edu.in

Abstract. The convolutional neural network (CNN), a deep learning approach, has emerged as the most promising technique for detecting breast cancer in mammograms. This article explores several CNN models used to detect breast cancer by classifying mammogram images into benign, cancer, or normal classes. Our study evaluates the performance of the AlexNet, VGG16, and ResNet50 architectures, training some of them from scratch and some using transfer learning with pre-trained weights. The classifiers are trained and tested on the mini-DDSM dataset. Rotation and zooming are applied to increase the data volume, and a 90:10 train/test split is used for validation. AlexNet, trained from scratch, reached an accuracy of 65%, whereas VGG16 and ResNet50 reached accuracies of 65% and 61%, respectively, when fine-tuned from pre-trained weights. VGG16 performed significantly worse when trained from scratch, whereas AlexNet outperformed the others in that setting. VGG16 and ResNet50 performed well when transfer learning was applied.

Keywords: Deep learning · Deep convolution neural network · Medical imaging · Mammograms (MGs)

1 Introduction

Breast cancer is a common cancer in women and is the second leading cause of death worldwide. It is caused by abnormal cell growth in the breast tissue, which results in the formation of a tumor and poses a severe risk to women's health and life. A lump in the breast, nipple discharge, and a change in the shape of the breast are all signs of breast cancer. Detecting lumps in their early stages can significantly reduce the mortality rate. Various medical imaging methods such as Magnetic Resonance Imaging (MRI), mammography, breast sonography, and magnetic resonance tomography (MRT) are widely used for breast cancer diagnosis [1]. Mammography is the preferred imaging method for early breast cancer screening. It is regarded as the most reliable cancer detection method, as it involves less radiation exposure than the available alternatives [2]. Two significant findings are seen in mammogram images: masses and calcifications. Calcifications appear as coarse, granular, popcorn-like, or ring-shaped structures with higher density and a more dispersed distribution [3]. Masses appear as medium grey or white regions in the breast; their shapes can be oval, irregular, or lobular, and their margins can be circumscribed, spiculated, ill-defined, or obscured [4]. A mass is classified as either benign or malignant. Benign tumors are usually round or oval, whereas tumors with a partially round shape and an irregular outline are malignant, as shown in Fig. 1 [5].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 104–112, 2022. https://doi.org/10.1007/978-3-031-11713-8_11

Fig. 1. Samples of the mini-DDSM dataset a) Cancer b) Benign c) Normal

Mammograms can be acquired in two different views: (i) the craniocaudal view and (ii) the mediolateral oblique view (Fig. 2).

Fig. 2. a) Craniocaudal view of the patient's right breast b) Mediolateral oblique view of the patient's right breast

During the actual diagnosis process, various factors such as image quality, radiologist expertise, and the complexity of the breast structure affect cancer detection accuracy. Computer-Aided Diagnosis (CAD) systems were introduced to address this issue [6]. Recently, Artificial Intelligence based CAD systems have been used to provide better accuracy and earlier detection. This has opened many research directions for researchers and developers exploring deep learning methods for mammograms. In particular, Convolutional Neural Networks (CNNs) have been used for lesion localization, detection, and classification tasks in mammograms [7]. In recent years, deep convolutional neural networks (DCNNs) have made remarkable advances in medical fields, most notably in image classification tasks; they play an essential role in improving the performance of CAD systems for breast cancer diagnosis [8].

Radiologists miss approximately 20% of breast cancer cases because very small calcification clusters or masses are extremely difficult for specialists to recognize, which hampers the correct diagnosis of early breast cancer. This led to the development of CAD-based systems. CNNs have shown the best performance on many image classification tasks, and a few research studies have shown that CNNs can perform well on mammogram classification. An effective mammogram classifier offers several benefits: (i) a lot of the work spent annotating mammograms can be saved; (ii) the patient call-back rate is reduced; (iii) false-positive cases and unnecessary follow-up tests, which burden patients with increased health care costs, are reduced. These benefits motivated the present study.

Our primary objective is to conduct a comparative study of various CNN architectures for achieving improved accuracy in classifying benign and malignant tumors in mammogram images. The performance of the CNN models is compared using two methods: the first is to train the model from scratch, and the second is to train the model starting from pre-trained weights.

The paper is structured as follows: Sect. 2 reviews the work carried out by researchers in the field of medical image classification using deep learning methods. Section 3 describes the methodology used to carry out the experiments. Section 4 details the simulation environment, Sect. 5 discusses the results of the experiments, and Sect. 6 concludes the study.

2 Literature Review

In the past few years, several researchers have classified malignant and non-malignant classes of breast cancer using different neural network classifiers.

Li et al. proposed a neural network model called DenseNet-II. The mammogram images are pre-processed, and then a data enhancement technique is applied. Next, the first convolution layer of DenseNet is replaced by an Inception structure. A 10-fold cross-validation strategy was used. The pre-processed mammogram images are the input to the AlexNet, VGGNet, GoogLeNet, DenseNet, and DenseNet-II models. Comparing the results showed that DenseNet-II performed better than the other models, reaching an accuracy of 94.55% (sensitivity 95.6%, specificity 95.36%). The dataset was collected from the First Hospital of Shanxi Medical University [3].

Dhungel et al. presented a method that uses a cascade of CNNs and random forest classifiers for detecting masses in mammograms. In the first step, a multi-scale deep belief network was used to identify suspicious regions, which were then processed by a series of CNNs and a cascade of random forest classifiers for classification. The INbreast and DDSM-BCRP datasets were used, and the reported sensitivity for the two datasets was in the range 85%–90% [4].

Nguyen et al. classified breast cancer into benign and malignant classes using the BreakHis dataset. They built a CNN model in which the original images are resized and then used to classify the breast cancer classes. The BreakHis dataset contains 7909 breast cancer images categorized as benign or malignant, of which 2480 images are in the benign category and the remaining 5429 images are in the malignant category. It contains four subclasses under the benign category and four subclasses under the malignant category [9, 10].

Shen et al. created a deep learning algorithm based on a convolutional neural network to detect breast cancer in mammograms. They performed the experiments on digitized film mammograms from the CBIS-DDSM dataset. The best single model achieved a per-image AUC of 0.88, and four-model averaging improved the AUC to 0.91, with sensitivity and specificity of 86.1% and 80.1%, respectively. Similarly, for FFDM images from the INbreast database, the best single model achieved a per-image AUC of 0.95, and four-model averaging reached an AUC of 0.98 with a sensitivity of 86.7% and a specificity of 96.1% [11, 12].

Lévy and Jain used transfer learning, data pre-processing, and augmentation to train CNN architectures end to end. In their study, three different CNN architectures were trained: a shallow CNN, AlexNet, and GoogLeNet. GoogLeNet performed best, with an accuracy of 0.92. The DDSM dataset was used for the experiment [13].

Huynh et al. developed a classification methodology to distinguish malignant from benign lesions and compared three methods. The first method uses pre-trained CNN features with an SVM classifier, the second uses analytically extracted features from segmented tumors with an SVM, and the third is an ensemble classifier that averages the two individual classifiers. Among the three, the ensemble classifier performed significantly better than the other two, with an AUC of 0.86 [14].
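The studies above report sensitivity, specificity, and accuracy. As a reminder of how these quantities relate to a classifier's predictions, the following is a minimal sketch in plain Python. The function name `binary_metrics` and the example labels are hypothetical and are not taken from any of the cited works; AUC is not computed here because it requires prediction scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return sensitivity, specificity and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        # Fraction of truly positive (e.g. malignant) cases that were detected.
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # Fraction of truly negative (e.g. benign) cases that were cleared.
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # Fraction of all cases that were labelled correctly.
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


# Hypothetical labels: 1 = malignant, 0 = benign (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy example three of four malignant cases are found (sensitivity 0.75) while two of six benign cases are flagged incorrectly, which illustrates why sensitivity and specificity are usually reported together.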

3 Proposed Methods

The publicly available mini-DDSM dataset is used for the experiment. It is a condensed version of the popular DDSM (Digital Database for Screening Mammography) dataset. It includes 9752 mammogram images that are classified as benign, cancerous, or normal. The dataset is divided into training and testing data in a 90:10 ratio: the training set contains 8679 images across the three classes, and the testing set contains 1073 images.

Pre-processing is a crucial step for enhancing the performance of a CAD system. The images in the medical dataset used here differ in shape and size from the images required by the network classifiers, and the images fed to a network must match its input size. The images in the dataset have therefore been rescaled and resized to match the input size required by each CNN classifier. The AlexNet model requires input images of size [227 × 227 × 3], whereas the VGG16 and ResNet50 models both require input images of size [224 × 224 × 3].
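The resizing step described above can be sketched as follows. This is a minimal, library-free illustration rather than the authors' pipeline: it assumes a grayscale image stored as a nested list of 0–255 intensities, uses nearest-neighbour sampling, interprets "rescaled" as scaling intensities to [0, 1], and replicates the single channel three times. The names `INPUT_SIZES`, `resize_nearest`, and `prepare` are hypothetical.

```python
# Input sizes stated in the text for each network (height, width).
INPUT_SIZES = {"alexnet": (227, 227), "vgg16": (224, 224), "resnet50": (224, 224)}


def resize_nearest(image, out_h, out_w):
    """Resize a 2-D nested list with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def prepare(image, model_name):
    """Resize to the network's input size, scale to [0, 1], add 3 channels."""
    out_h, out_w = INPUT_SIZES[model_name]
    resized = resize_nearest(image, out_h, out_w)
    # Assumption: "rescaling" means mapping 0..255 intensities to 0..1.
    scaled = [[px / 255.0 for px in row] for row in resized]
    # Assumption: the grayscale channel is replicated to get H x W x 3.
    return [[[px, px, px] for px in row] for row in scaled]


# Tiny hypothetical 4 x 6 "image" standing in for a mammogram.
image = [[0, 255, 128, 64, 32, 16] for _ in range(4)]
alex_input = prepare(image, "alexnet")
vgg_input = prepare(image, "vgg16")
print(len(alex_input), len(alex_input[0]), len(alex_input[0][0]))
```

In practice an image library with proper interpolation would be used; the point of the sketch is only that every image ends up with the fixed shape each network expects.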


Overfitting occurs when a network fits the training data very well but generalizes poorly to test data. Because the volume of data is limited, data augmentation is applied to increase the number of mammogram images derived from the original dataset and so reduce overfitting. In our approach, we use augmentation based on geometric transformations, which is simpler than other forms of augmentation. Each image is augmented using rotation, flipping, and zooming, as these are simple to implement. Rotation augmentation rotates the image to the right or left about an axis by an angle between 1° and 359° [15]. To implement the augmentation process, we used the ImageDataGenerator class of the Keras library to generate batches of image data in real time.

A convolutional neural network is a deep learning method that is very effective for image classification, object detection, and image recognition. It consists of three types of layers: convolution, pooling, and fully connected layers. The convolution layer extracts features from the input image by applying relevant filters, the pooling layer reduces the dimensionality of the feature maps produced by the convolution layer, and the final fully connected layers classify the features extracted by the series of convolution and pooling layers. AlexNet, VGGNet, and ResNet are the well-known CNN classifiers used in our approach.

AlexNet was among the first CNN architectures to achieve breakthrough results on a large-scale classification and detection task [16]. It comprises eight weighted layers: the first five are convolutional layers, and the last three are fully connected layers. The last layer's output is fed into the softmax function, which produces a distribution over 1000 class labels [16].

VGG16 contains a series of convolution layers with ReLU as the activation function and filters of size 3 × 3. The input image to the first convolution layer has a fixed size of 224 × 224. The convolution stride is kept constant at one pixel, and the spatial padding is one pixel for the 3 × 3 convolution layers. Five max-pooling layers, each following a few convolutional layers, perform spatial pooling. A stack of three fully connected (FC) layers follows the convolutional layers. The first two FC layers have 4096 channels each, while the third has 1000 channels, one for each class in the original ImageNet configuration. For classification, the final layer is a softmax function [16].

The ResNet50 network works on the principle of taking a deep CNN and adding shortcut connections that skip a few convolution layers at a time. The shortcut connections create residual blocks, in which the input of the block is added to the output of its convolution layers. The deep residual learning framework is designed to address the degradation issue that arises in deeper networks [16].

Transfer learning is a machine learning technique in which a model developed for one task is reused as the starting point for a model on a different task. The basic idea is to apply the knowledge a model has gained from a task with a large amount of available training data to a new task with much less data. The pre-trained model can be used as a feature extractor, and its parameters can be fine-tuned to match the target dataset. Training a deep model normally takes days or months, but with transfer learning the computing time is reduced to hours when the model is moved to the target application. Currently, the most widely used pre-trained object classification models are based on ImageNet [17].
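The geometric augmentations described earlier in this section (rotation, flipping, and zooming) can be illustrated without any library. The sketch below is not the Keras implementation used in the study; it is a simplified, assumption-laden version that works on a grayscale image stored as a nested list, uses nearest-neighbour sampling, fills pixels rotated in from outside the image with 0, and draws the zoom factor from a hypothetical range of 1.0 to 1.2. All function names are illustrative.

```python
import math
import random


def rotate(image, degrees, fill=0):
    """Rotate about the image centre using inverse nearest-neighbour mapping."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_t = math.cos(math.radians(degrees))
    sin_t = math.sin(math.radians(degrees))
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            dy, dx = r - cy, c - cx
            # Source pixel that lands on (r, c) after the rotation.
            sr = int(round(cy + dy * cos_t - dx * sin_t))
            sc = int(round(cx + dy * sin_t + dx * cos_t))
            if 0 <= sr < h and 0 <= sc < w:
                out[r][c] = image[sr][sc]
    return out


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def zoom(image, factor):
    """Zoom in by cropping the centre (factor >= 1) and resizing back."""
    h, w = len(image), len(image[0])
    crop_h = max(1, min(h, int(round(h / factor))))
    crop_w = max(1, min(w, int(round(w / factor))))
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    return [
        [crop[r * crop_h // h][c * crop_w // w] for c in range(w)]
        for r in range(h)
    ]


def augment(image, rng):
    """Apply a random rotation (1-359 degrees), optional flip and zoom."""
    out = rotate(image, rng.uniform(1, 359))
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    return zoom(out, rng.uniform(1.0, 1.2))  # zoom range is an assumption


# Hypothetical 5 x 5 image and a seeded generator for reproducibility.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
augmented = augment(image, random.Random(0))
```

Each call to `augment` yields a new variant with the same shape as the input, which is how a generator can supply a different version of every training image in each epoch.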


4 Simulation Environment

The goal of this study is to assess the performance of deep neural networks in classifying mammogram images as cancerous, benign, or normal. The mini-DDSM dataset is used to evaluate the effectiveness of various CNN models. The experiments are conducted in Google Colaboratory, a free Jupyter notebook environment that runs in the cloud. It provides a zero-configuration interface for writing and executing Python code directly from the browser, as well as free GPU access. Some of the CNN models were trained from scratch during the experiment, while others were pre-trained on the ImageNet dataset. The softmax classification function is applied at the final fully connected layer of each network. The networks are optimized with the Adam optimization method. To meet each configuration's memory constraints, the batch size was kept at 16 for all models.
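Two details of this setup lend themselves to a short worked example: the softmax function that turns the final layer's outputs into class probabilities, and the number of batches per epoch implied by a batch size of 16 and the 8679 training images reported in Sect. 3. The sketch below uses only the Python standard library; the class ordering in `CLASSES` and the example logits are assumptions made for illustration.

```python
import math

# Assumed class order; the paper does not specify the label encoding.
CLASSES = ("benign", "cancer", "normal")


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the class with the highest softmax probability."""
    probs = softmax(logits)
    return CLASSES[probs.index(max(probs))]


def steps_per_epoch(num_images, batch_size):
    """Number of batches needed to see every training image once."""
    return math.ceil(num_images / batch_size)


probs = softmax([2.0, 1.0, 0.1])          # hypothetical logits
label = predict([0.2, 1.5, -0.3])         # hypothetical logits
steps = steps_per_epoch(8679, 16)         # training set size and batch size from the text
print(probs, label, steps)
```

With these figures, one epoch corresponds to 543 parameter updates, the last of which uses a partial batch.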

5 Experimental Results

Two experiments were conducted to evaluate the performance of the classifiers. Experiment I involved training the AlexNet and VGG16 classifiers from scratch, with the number of epochs and the learning rate as hyperparameters. In Experiment II, the VGG16 and ResNet network classifiers were pre-trained on the ImageNet dataset and then fine-tuned on the mini-DDSM dataset. In both experiments, the classifiers are evaluated by their classification accuracy on mammograms. Table 1 below summarizes the classification accuracy of the network classifiers trained from scratch with different learning rates for 50 and 100 epochs (Figs. 3 and 4).

Table 1. Classification accuracy of network classifiers trained from scratch

CNN       Epoch   Optimizer   Batch size   Learning rate   Accuracy
AlexNet   50      Adam        16           0.001           0.6589
AlexNet   50      Adam        16           0.06            0.6384
AlexNet   100     Adam        16           0.001           0.6589
AlexNet   100     Adam        16           0.06            0.6384
VGG-16    50      Adam        16           0.001           0.312
VGG-16    50      Adam        16           0.06            0.3756
VGG-16    100     Adam        16           0.001           0.312
VGG-16    100     Adam        16           0.06            0.312

Experiment II was conducted using the VGG16 and ResNet architectures, which were pre-trained on the ImageNet dataset and then fine-tuned on the mini-DDSM dataset (Figs. 5, 6, 7 and 8).


Fig. 3. Accuracy plot of Alex Net Model with lr 0.06

Fig. 4. Accuracy plot of Alex Net Model with lr 0.001

Table 2. Classification accuracy of network classifiers with different hyperparameters using transfer learning

CNN        Epoch   Optimizer   Batch size   Learning rate   Accuracy
VGG 16     50      Adam        16           0.001           0.657
VGG 16     50      Adam        16           0.06            0.6542
VGG 16     100     Adam        16           0.001           0.6412
VGG 16     100     Adam        16           0.06            0.6514
ResNet50   50      Adam        16           0.001           0.5927
ResNet50   50      Adam        16           0.06            0.6198
ResNet50   100     Adam        16           0.001           0.6086
ResNet50   100     Adam        16           0.06            0.6226
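The accuracies reported in Tables 1 and 2 can be compared programmatically. The sketch below simply transcribes the tabulated values into a list and picks the best configuration for each model and training regime; the names `RESULTS` and `best_per_model` and the regime labels "scratch" and "transfer" are hypothetical conveniences, not part of the original study.

```python
# (model, regime, epochs, learning rate, accuracy), transcribed from Tables 1 and 2.
RESULTS = [
    ("AlexNet", "scratch", 50, 0.001, 0.6589),
    ("AlexNet", "scratch", 50, 0.06, 0.6384),
    ("AlexNet", "scratch", 100, 0.001, 0.6589),
    ("AlexNet", "scratch", 100, 0.06, 0.6384),
    ("VGG16", "scratch", 50, 0.001, 0.312),
    ("VGG16", "scratch", 50, 0.06, 0.3756),
    ("VGG16", "scratch", 100, 0.001, 0.312),
    ("VGG16", "scratch", 100, 0.06, 0.312),
    ("VGG16", "transfer", 50, 0.001, 0.657),
    ("VGG16", "transfer", 50, 0.06, 0.6542),
    ("VGG16", "transfer", 100, 0.001, 0.6412),
    ("VGG16", "transfer", 100, 0.06, 0.6514),
    ("ResNet50", "transfer", 50, 0.001, 0.5927),
    ("ResNet50", "transfer", 50, 0.06, 0.6198),
    ("ResNet50", "transfer", 100, 0.001, 0.6086),
    ("ResNet50", "transfer", 100, 0.06, 0.6226),
]


def best_per_model(results):
    """Map (model, regime) to the row with the highest accuracy."""
    best = {}
    for row in results:
        key = (row[0], row[1])
        if key not in best or row[4] > best[key][4]:
            best[key] = row
    return best


best = best_per_model(RESULTS)
# Accuracy gained by VGG16 when moving from scratch training to transfer learning.
vgg_gain = best[("VGG16", "transfer")][4] - best[("VGG16", "scratch")][4]
for (model, regime), row in sorted(best.items()):
    print(f"{model:9s} {regime:9s} epochs={row[2]:3d} lr={row[3]} acc={row[4]}")
```

Taking the best run of each kind, VGG16 improves from 0.3756 when trained from scratch to 0.657 with transfer learning, while AlexNet trained from scratch reaches 0.6589.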

Fig. 5. Accuracy plot of VGG 16 (pre-trained) with lr 0.001

Fig. 6. Accuracy plot of VGG 16 (pre-trained) with lr 0.06


Fig. 7. Accuracy plot of ResNet (pre-trained) Model with lr 0.001

Fig. 8. Accuracy plot of ResNet (pre-trained) Model with lr 0.06

6 Conclusion

Based on the above experimental results, we conclude that network models pre-trained on a different dataset outperform the same models trained from scratch, as evidenced by the VGG16 results in Tables 1 and 2. When pre-trained on the ImageNet dataset and then fine-tuned on mini-DDSM, the VGG16 classifier performs significantly better: its accuracy was 0.65 when pre-trained and 0.31 when trained from scratch. VGG16 performs poorly when trained from scratch and appears to underfit, which could be attributed to the complexity of the VGG16 architecture. AlexNet, however, performed better when trained from scratch, with an accuracy of 0.65. In the second experiment, VGG16 and ResNet were pre-trained on ImageNet data and fine-tuned on mini-DDSM data, and VGG16 performed better than ResNet in this scenario. Because the accuracy of all the network models remains low, a larger volume of imaging data is required to train the models and improve their effectiveness. In the future, deep learning and machine learning approaches may be used to develop better prediction models for the detection of breast cancer.

References

1. Aly, G., Marey, M., El-Sayed, S., Tolba, M.: YOLO based breast masses detection and classification in full-field digital mammograms. Comput. Meth. Prog. Biomed. 200, 105823 (2021)
2. Ragab, D., Sharkas, M., Marshall, S., Ren, J.: Breast cancer detection using deep convolutional neural networks and support vector machines. PeerJ 7, e6201 (2019)
3. Li, H., Zhuang, S., Li, D., Zhao, J., Ma, Y.: Benign and malignant classification of mammogram images based on deep learning. Biomed. Signal Process. Control 51, 347–354 (2019)
4. Dhungel, N., Carneiro, G., Bradley, A.: Automated mass detection in mammograms using cascaded deep learning and random forests. In: 2015 International Conference on Digital Image Computing: Techniques and Applications (DICTA) (2015)
5. Tang, J., Rangayyan, R., Xu, J., El Naqa, I., Yang, Y.: Computer-aided detection and diagnosis of breast cancer with mammography: recent advances. IEEE Trans. Inf. Technol. Biomed. 13, 236–251 (2009)
6. Wang, Z., et al.: Breast cancer detection using extreme learning machine based on feature fusion with CNN deep features. IEEE Access 7, 105146–105158 (2019)
7. Tripathy, H.K., Mishra, S., Mallick, P.K., Panda, A.R. (eds.): Technical Advancements of Machine Learning in Healthcare. SCI, vol. 936. Springer, Singapore (2021). https://doi.org/10.1007/978-981-33-4698-7
8. Hassan, S.A., Sayed, M.S., Abdalla, M.I., Rashwan, M.A.: Breast cancer masses classification using deep convolutional neural networks and transfer learning. Multimedia Tools Appl. 79(41–42), 30735–30768 (2020). https://doi.org/10.1007/s11042-020-09518-w
9. Nguyen, P.T., Nguyen, T.T., Nguyen, N.C., Le, T.T.: Multiclass breast cancer classification using convolutional neural network. In: 2019 International Symposium on Electrical and Electronics Engineering (ISEE) (2019)
10. Chen, C., Chen, C., Mei, X., Chen, C., Ni, G., Lemos, S.: Effects of image augmentation and dual-layer transfer machine learning architecture on tumor classification. In: Proceedings of the 2019 8th International Conference on Computing and Pattern Recognition (2019)
11. Shen, L., Margolies, L., Rothstein, J., Fluder, E., McBride, R., Sieh, W.: Deep learning to improve breast cancer detection on screening mammography. Sci. Rep. 9, 1–12 (2019)
12. Oyelade, O., Ezugwu, A.: A deep learning model using data augmentation for detection of architectural distortion in whole and patches of images. Biomed. Signal Process. Control 65, 102366 (2021)
13. Lévy, D., Jain, A.: Breast mass classification from mammograms using deep convolutional neural networks (2016)
14. Huynh, B., Li, H., Giger, M.: Digital mammographic tumor classification using transfer learning from deep convolutional neural networks. J. Med. Imaging 3, 034501 (2016)
15. Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for deep learning. J. Big Data 6(1), 1–48 (2019). https://doi.org/10.1186/s40537-019-0197-0
16. Tsochatzidis, L., Costaridou, L., Pratikakis, I.: Deep learning for breast cancer diagnosis from mammograms—a comparative study. J. Imaging 5, 37 (2019)
17. Gardezi, S.J.S., Elazab, A., Lei, B., Wang, T.: Breast cancer detection and diagnosis using mammographic data: systematic review. J. Med. Internet Res. 21, e14464 (2019)

Evaluation of Crop Yield Prediction Using Arsenal and Ensemble Machine Learning Algorithms

Nikitha Pitla and Kayal Padmanandam(B)

BVRIT HYDERABAD College of Engineering for Women, Hyderabad, India
[email protected]

Abstract. Agriculture remains the prime source of livelihood and is the keystone of our country. Present challenges such as water scarcity, unpredictable costs, and weather uncertainty require farmers to equip themselves for smart farming. Specifically, crop yield is low because of uncertain climatic changes, poor irrigation facilities, decreased soil fertility, and traditional farming techniques. Farmers frequently cultivate the same crops without trying new varieties, and they apply fertilizer without knowing what quantity is needed, which leads to uncertainty. Machine learning is a successful technique for addressing these uncertainties. This article aims to predict the crop and its yield from available historical data such as weather, soil, and rainfall, and from crop yield parameters such as soil pH value, temperature, and the climate of the particular area, using various arsenal and ensemble algorithms. A comparative evaluation of the predictions made by these algorithms is presented. The proposed GUI application predicts the type of crop that can give a high yield based on the parameters supplied by the user. Farmers can use this application to grow a variety of crops suited to their constraints, thereby increasing the profit from their yield and decreasing soil pollution.

Keywords: Crop yield prediction · Machine learning · Random Forest Regressor · Support Vector Machine · Ada Boost Regressor

1 Introduction

Agricultural yield depends mainly on factors such as rainfall, temperature, and pesticide use, and accurate information about crop yield history is important for making decisions about agricultural management and for forecasting. The parameters in agriculture differ for each field and for each farmer. Every field is different in terms of its geographical location and soil composition. The quality of the field depends on the farmer and his techniques, ranging from the way he tills his land to the fertilizers and chemicals he uses on his field. In recent times, there have been various surveys and studies of agriculture across the globe. These studies show that crop yield is highest in areas with heavy pesticide use; however, this comes at the cost of more harmful toxins in the soil and degraded soil quality, documenting the trade-off between crop yield and chemical usage. Many devices today are supported by Artificial Intelligence and Machine Learning technologies, with built-in models that help every sector make reliable decisions and build better applications. Applying the same idea to our project, the main concept is to increase the throughput of the agriculture domain using the latest Machine Learning models. One further hidden but important factor affecting prediction is the amount of knowledge given to the model during training: the more parameters passed to the model during training, the better and more accurate the results. Lastly, in this work we aim to build a model that is precise, robust, and reliable, and that values quality over quantity while respecting the environment. To achieve accurate prediction despite erratic trends in weather and rainfall, machine learning models such as Gradient Boosting Regressor, Support Vector Machines, and Random Forest are used. The main objectives are:

a. Implement machine learning techniques for crop yield prediction.
b. Evaluate different climate parameters (rainfall, temperature).
c. Improve the precision of crop yield prediction.
d. Deliver a user interface for easy interaction.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 113–123, 2022. https://doi.org/10.1007/978-3-031-11713-8_12

2 Literature Review

Anakha Venugopal, Aparna S, Jinsu Mani, Rima Mathew, and Prof. Vinu Williams propose Crop Yield Prediction using Machine Learning Algorithm [1]: Agriculture is the first and most important factor for survival. Machine learning (ML) can offer a realistic and practical solution to the problem of crop productivity. This article concentrates on predicting crop yield by applying numerous machine learning techniques. The model classifiers used include logistic regression.

Mayank Champaneri, Chaitanya Chandvidkar, Darpan Chachpara, and Mansing Rathod propose Crop Prediction System using Machine Learning Algorithm [2]: Due to climatic variation in India, most agricultural crops have been badly affected in terms of performance over the past years. Predicting the yield of a crop before harvest would help policy makers and farmers take suitable actions for marketing and storage. The work addresses this problem by devising a prototype of an interactive prediction system, implemented with an easy-to-use web-based graphical user interface and a machine learning algorithm. The prediction results are made available to the farmer.

Pavan Patil, Virendra Panpatil, and Shrikant Kokate propose Crop Yield Prediction using Machine Learning [3]: Agriculture is the profession of most Indians. Farmers tend to plant the same crop, use additional fertilizer, and follow the generic choices for their agriculture. In the recent past, the use of machine learning has expanded significantly across numerous industries and research fields. In this article, the authors planned a system in which machine learning is used in agriculture to improve the conditions of farmers.

The research papers surveyed give a rudimentary idea of using ML with only one attribute. Our goal is to add more features to our system and improve the results, which may lead to better returns and allow several patterns to be recognized for prediction. This system will be useful for deciding which crop can be grown in a given area.

Many other researchers are also investing their efforts in farming and farming methodologies. Md Nazirul Islam Sarker, Md Shahidul Islam, Hilarius Murmu, and Elizabeth Rozario [4] explained how farmers can get an idea of the harvesting time for crops and of crop yield analysis using a machine learning and big data approach. K Palanivel and C Surianarayanan [5] discussed the major problem of yield prediction and proposed early yield prediction so that farmers can take preventive measures to get better productivity. Yan Li, Kaiyu Guan, Albert Yu, Bin Peng, Lei Zhao, Bo Li, and Jian Peng [6] discussed building a transparent statistical model for improving crop yield prediction and analyzed the yield prediction performance of existing models. Sami Khanal, John Fulton, Andrew Klopfenstein, Nathan Douridas, and Scott Shearer [7] explained the integration of high-resolution remotely sensed data and machine learning techniques for spatial prediction of soil properties and corn yield. Thomas van Klompenburg, Ayalew Kassahun, and Cagatay Catal [8] conducted an extensive systematic literature review of machine learning models and their importance for crop yield prediction and crop analysis. Many machine learning algorithms have been used for crop yield prediction in research; the authors investigated several studies and analyzed the models and features, providing insights for further research. Saeed Khaki and Lizhi Wang [9] discussed how crop yield is determined by multiple factors such as genotype, environment, and their interactions. Accurate yield prediction requires a basic understanding of the relationship between yield and these interacting factors, and revealing such a relationship requires both comprehensive datasets and powerful algorithms.

3 Methodology

The proposed system implements machine learning to make predictions of agricultural yield using Python, which is widely accepted for experimentation in the field of machine learning. Machine learning uses large amounts of data to gain experience and build a well-trained model through training on that data. The resulting model then predicts the output. The better the data set, the better the accuracy of the model. Machine learning approaches such as regression and classification are observed to achieve better results than traditional statistical models. Crop production mainly depends on environmental factors such as rainfall, soil type, and temperature. These factors play a vital part in increasing the crop yield. In addition, business conditions also disrupt the crop market, which aims for maximum benefit. All of these influences must be considered together to predict the yield. Hence, using machine learning practices in the agricultural domain, our proposed system is built to make predictions about crop production by studying factors such as rainfall, temperature, area, and season. Machine learning is undoubtedly one of the most significant and prevalent technologies today. It converts data into knowledge, allowing us to analyze and discover patterns that are otherwise unseen.

N. Pitla and K. Padmanandam

3.1 Basic Terminologies

• Dataset: a collection of data with different attributes, used for solving the problem.
• Features: the pieces of information in the data that help us characterize the problem. They are the input to the machine learning algorithm.
• Model: the internal representation of a phenomenon learnt by the machine learning algorithm.

4 Module Description

4.1 Module 1 - Data Acquisition

Data collection, and the measures taken while collecting the data, is a very important process. Food and agriculture related data is available in the Food and Agriculture Organization Corporate Statistical Database (FAOSTAT) as CSV files. FAOSTAT covers around 200 countries and 200 products and provides countrywide and global statistics on food and agriculture. Crops and their yield are of primary importance to every country's economic wellbeing. The data frame contains data from 1985 to 2020, and the parameters include the mean temperature of the country, the average temperature recorded, and the year. The temperature data frame starts from 1983 and ends in 2020. This difference in years would put the aggregated values at risk, so the data has to be standardized to a common series of years so that no null values occur.

4.2 Module 2 - Exploratory Data Analysis

The data frame contains data from 1990 to 2013, that is, 20-plus years of data across 100 countries. Each column has high variance in its values. When the dataset is grouped by item, the insight is that India is highest in cassava and potato production. Potato appears as the leading crop in the data set, and is the highest in 4 countries. The relationships between attributes have to be computed, and the best way is to find their correlation. It is observed that all variables are independent (Fig. 1).
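The year alignment and the correlation check described in Modules 1 and 2 can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names `align_on_common_years` and `pearson` and all sample values are hypothetical and are not taken from the authors' code or from FAOSTAT.

```python
from math import sqrt

def align_on_common_years(yield_by_year, temp_by_year):
    """Keep only the years present in both series, so no null values arise."""
    years = sorted(set(yield_by_year) & set(temp_by_year))
    return (years,
            [yield_by_year[y] for y in years],
            [temp_by_year[y] for y in years])

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up sample values: yield in hg/ha, mean temperature in degrees Celsius.
yield_by_year = {1985: 21000, 1986: 21500, 1987: 22100, 1988: 22800}
temp_by_year = {1983: 24.1, 1984: 24.3, 1985: 24.0,
                1986: 24.6, 1987: 24.2, 1988: 24.9}

years, yields, temps = align_on_common_years(yield_by_year, temp_by_year)
r = pearson(yields, temps)  # a value close to 0 suggests weak linear dependence
```

Restricting both series to their common years is one simple way to avoid the null values mentioned above; other choices, such as interpolation, are equally possible.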

Fig. 1. System architecture (stages shown: Data Gathering & Cleaning, Data Exploration, Data Preprocessing, Model Comparison based on results, Build GUI, Application)


4.3 Module 3 - Preprocessing

The process of converting raw data into a clean data set is called data preprocessing. Usually, data collected from multiple sources is inconsistent and not ready for analysis. The data frame contains two categorical columns, whose labels are category names rather than numerical values. The number of possible values is usually a fixed set; here these are the element and country values. Such label data cannot be used directly by ML algorithms, which need all input and output values to be numeric, so a numerical form has to be derived from the categorical data. The categorical values are converted with a one-hot encoder into a form that can be supplied to the ML algorithm for better predictions. One-hot encoding converts the two columns into a one-hot numeric array: it builds a binary column for every category and returns the result in matrix form. A min-max scaler is then used to scale the features so that all of them are of the same magnitude. The data is split into training and testing sets with a common 70/30 train/test split.
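The three preprocessing steps named above (one-hot encoding, min-max scaling, and a 70/30 split) can be sketched in plain Python as follows. The helper names and sample values are hypothetical; the authors most likely used library implementations, which this sketch does not attempt to reproduce.

```python
import random

def one_hot(values):
    """Return (categories, rows) with one binary column per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and split them into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Made-up sample values for illustration only.
countries = ["India", "Brazil", "India", "Kenya"]
categories, encoded = one_hot(countries)
scaled_rain = min_max_scale([800.0, 1200.0, 1000.0, 600.0])
train, test = train_test_split(list(range(10)))
```

In practice the scaler's minimum and maximum would be computed on the training part only and then applied to the test part, so that no test information leaks into training.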

Fig. 2. Dataset

4.4 Module 4 - Model Evaluation

Before deciding on the algorithm to use, it is necessary to assess, compare, and select the algorithm that best fits this particular data set. Typically, when working on a machine learning problem for a particular dataset, a model is chosen according to how well it fits a line or curve, since different techniques can solve the same optimization problem. For this project, the following models are compared by their R-squared value:

• Gradient Boosting Regressor
• Random Forest Regressor
• Support Vector Machine Regressor
• Decision Tree Regressor
• Logistic Regressor
• Ada Boost Regressor
• Extra Decision Tree Regressor

The rating is based on the regression score function R^2, the coefficient of determination. It represents the proportion of the variance in the yield of the items (crops) that is explained by the regression model, and thus shows how well the fitted curve or line matches the data points. For example, an R-squared of 60% means that the regression model explains 60% of the variance in the observed data. A higher R-squared value indicates a better fit. Here the model is a very good fit, with a score of 96%.

Feature importance is calculated from the weighted decrease in node impurity, where the weight is the probability of reaching the node. The node probability is the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature; the top 7 features are the most important for the model. In the model's decision-making, potato has the highest importance, and the data set shows that potato has the highest yield. In Fig. 2(b), cassava has the insecticidal effect, which is the third most significant feature. The importance of the feature indicating that the crop is sweet potato can also be seen in the data set. The feature indicating that the crop is grown in India is important, which stands to reason because India has the largest crop total in the data set. The features are appropriate if they influence the model's prediction of the crop yield (Fig. 3).
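The R^2 score used for the comparison can be written out directly from its definition, R^2 = 1 - SS_res / SS_tot. The sketch below uses made-up values and a hypothetical function name. It also shows why a negative score, such as the one reported for the Support Vector Regressor in Table 1, is possible: the model then fits worse than always predicting the mean.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

observed = [10.0, 20.0, 30.0, 40.0]  # made-up yields

perfect = r_squared(observed, observed)            # every point predicted exactly
mean_only = r_squared(observed, [25.0] * 4)        # always predicting the mean
worse_than_mean = r_squared(observed, [40.0, 30.0, 20.0, 10.0])
# perfect is 1.0, mean_only is 0.0, worse_than_mean is -3.0
```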

Fig. 3. Correlation matrix as a heatmap

Fig. 3. (continued) Panels (a) and (b)

5 GUI Application

To build the GUI application, the Flask framework is used to develop a web application in which users/farmers enter their queries and receive decisions. Based on the details given by the user, the system recommends the ideal crop, that is, the one expected to have the optimal yield. The predictor system works in two categories, crop prediction and crop yield prediction, and the user can choose the category based on their requirement. Crop prediction requires moisture, humidity, pH value, and rainfall as input from the user. Crop yield prediction requires the area in sq ft, crop, year, and season as input.


Fig. 4. GUI for crop prediction

Fig. 5. Inputs for the crop

5.1 Application Functionality

1. Functionality - Crop predictor. The application takes Moisture, Humidity, pH value, and Rainfall as inputs, and the algorithm suggests which crop to grow for better yield, as shown in Fig. 4 (Fig. 5).
2. Functionality - Crop yield predictor. The application takes Area, Crop-Name, Year, and Season as inputs, and the algorithm returns the crop yield for a particular year in hg/ha, as shown in Fig. 6 (Fig. 7).


Fig. 6. Crop is predicted.

Fig. 7. Yield prediction in hg/ha

6 Results and Discussion

Table 1 compares the models so that the one that best fits the dataset can be chosen. The different techniques and models are compared to find the most appropriate model, one that neither overfits nor underfits. The R square value represents the share of variance in crop yield explained by the model and shows how well the data points fit the line. The Decision Tree Regressor has the highest testing R square (about 0.99, with about 0.96 in training), and the Extra Trees Regressor has the second-highest values (about 0.97 in both training and testing).

Table 1. R square values for testing and training

Algorithm                      Training R square      Testing R square
Gradient Boosting Regressor    0.8965768919264416     0.88948458548945
Random Forest Regressor        0.6842532317855172     0.6974322238289824
Support Vector Regressor       −0.2035337648036075    −0.19842982304556
Decision Tree Regressor        0.959398430060264      0.9884855840400567
Ada Boost Regressor            0.5344203434615187     0.540921384305050
Logistic Regressor             0.6814053026393077     0.6937239932484045
Extra Trees Regressor          0.9734372847301811     0.9721924848045776
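The comparison in Table 1 can be reproduced mechanically. The sketch below copies the table's values, ranks the models by testing R square, and computes the gap between training and testing scores as a rough overfitting indicator. The helper names are hypothetical, and the gap is only one heuristic, not a method stated by the authors.

```python
# R square values copied from Table 1 as (training, testing).
TABLE_1 = {
    "Gradient Boosting Regressor": (0.8965768919264416, 0.88948458548945),
    "Random Forest Regressor": (0.6842532317855172, 0.6974322238289824),
    "Support Vector Regressor": (-0.2035337648036075, -0.19842982304556),
    "Decision Tree Regressor": (0.959398430060264, 0.9884855840400567),
    "Ada Boost Regressor": (0.5344203434615187, 0.540921384305050),
    "Logistic Regressor": (0.6814053026393077, 0.6937239932484045),
    "Extra Trees Regressor": (0.9734372847301811, 0.9721924848045776),
}

def rank_by_test_r2(scores):
    """Sort model names by testing R square, best first."""
    return sorted(scores, key=lambda name: scores[name][1], reverse=True)

def train_test_gap(scores):
    """Training minus testing R square; a large positive gap hints at overfitting."""
    return {name: train - test for name, (train, test) in scores.items()}

ranking = rank_by_test_r2(TABLE_1)
gaps = train_test_gap(TABLE_1)
```

With these values the Decision Tree Regressor ranks first on the testing score and the Extra Trees Regressor second, while the Extra Trees Regressor has the higher training score.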

7 Conclusion and Future Scope

This system is intended to address the growing rate of farmer suicides and to help farmers become financially stronger. The Crop Recommender system helps farmers predict the yield for a given crop and also helps them decide which crop to grow. Appropriate datasets were collected and studied, and models were trained on them using machine learning tools. The system tracks the user's location and fetches the needed information from the backend based on that location. Thus, the user needs to provide only limited information, such as the soil type and area. Future scope includes crop disease detection using image processing, where users can upload a picture of diseased crops and get pesticide recommendations, and a Smart Irrigation System that monitors weather, soil conditions, and plant water usage to automatically alter the watering schedule.

References

1. Reddy, D.J., Kumar, M.R.: Crop yield prediction using machine learning algorithm. In: Proceedings - 5th International Conference on Intelligent Computing and Control Systems, ICICCS 2021, vol. 9, no. 13, pp. 1466–1470 (2021). https://doi.org/10.1109/ICICCS51141.2021.9432236
2. Mishra, S., Mishra, D., Santra, G.H.: Applications of machine learning techniques in agricultural crop production: a review paper. Indian J. Sci. Technol. 9(38), 1–14 (2016). https://doi.org/10.17485/ijst/2016/v9i38/95032
3. Patil, P., Panpatil, V., Kokate, P.S.: Crop prediction system using machine learning algorithm. J. Xidian Univ. 14(6), 748–753 (2020). https://doi.org/10.37896/jxu14.6/009
4. Sarker, M.N.I., Islam, M.S., Murmu, H., Rozario, E.: Role of big data on digital farming. Int. J. Sci. Technol. Res. 9(4), 1222–1225 (2020)
5. Palanivel, K., Surianarayanan, C.: An approach for prediction of crop yield using machine learning and big data techniques. Int. J. Comput. Eng. Technol. 10(3), 110–118 (2019). https://doi.org/10.34218/ijcet.10.3.2019.013
6. Li, Y., et al.: Toward building a transparent statistical model for improving crop yield prediction: modeling rainfed corn in the U.S. Field Crops Res. 234(February), 55–65 (2019). https://doi.org/10.1016/j.fcr.2019.02.005
7. Khanal, S., Fulton, J., Klopfenstein, A., Douridas, N., Shearer, S.: Integration of high resolution remotely sensed data and machine learning techniques for spatial prediction of soil properties and corn yield. Comput. Electron. Agric. 153(July), 213–225 (2018). https://doi.org/10.1016/j.compag.2018.07.016
8. Van Klompenburg, T., Kassahun, A., Catal, C.: Crop yield prediction using machine learning: a systematic literature review. Comput. Electron. Agric. 177(July), 105709 (2020). https://doi.org/10.1016/j.compag.2020.105709
9. Khaki, S., Wang, L.: Crop yield prediction using deep neural networks. Front. Plant Sci. 10(May), 1 (2019). https://doi.org/10.3389/fpls.2019.00621

Notification Based Multichannel MAC (NM-MAC) Protocol for Wireless Body Area Network

Manish Chandra Roy1(B), Tusarkanta Samal1, and Anita Sahoo2

1 Department of Computer Science and Engineering, C.V. Raman Global University, Bhubaneswar, Odisha, India [email protected]
2 Department of Computer Science and Engineering, College of Engineering, Bhubaneswar Biju Pattanik University of Technology, Rourkela, Odisha, India

Abstract. The need for instant delivery of emergency traffic, energy efficiency, and satisfaction of QoS requirements has pushed researchers to develop multichannel MAC (Medium Access Control) protocols for WBANs (Wireless Body Area Networks). However, energy efficiency and the transmission of various types of traffic remain challenging tasks for researchers. An energy-efficient, real-time, reliable Notification based Multichannel MAC (NM-MAC) protocol for WBAN is presented in this paper. The proposed protocol uses eight channels: one control channel and seven data channels. Nodes can switch dynamically between the channels, and transmissions on the control and data channels can proceed in parallel. Nodes with different traffic types contend for the control channel to send a notification packet according to priority. A data packet is transmitted on the corresponding free channel only after a positive acknowledgement is received from the receiver. The proposed protocol has been evaluated in the Castalia simulator, and the simulation results show that the proposed NM-MAC performs better than its existing counterparts in terms of packet delay, energy consumption, etc.

Keywords: WBAN · MAC · Multichannel QoS

1 Introduction

Global population growth has become a key concern in the modern era. Among humanity's most difficult problems are the increase in chronic illness and the exorbitant cost of healthcare. As a result, people desire a healthcare system that emphasises proactive wellness and disease prevention, which calls for a solution to these serious problems. To this end, researchers have turned to the Internet of Things (IoT), a technology that connects real things to the Internet and allows data to be transmitted and received through it. Smart cities, transportation, electronics, and entertainment systems all gain from IoT applications. Sensors, artificial intelligence, advanced imaging technologies, and medical devices are all used in the medical industry's IoT deployment [1].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 124–133, 2022. https://doi.org/10.1007/978-3-031-11713-8_13


A WBAN is a network of tiny sensors implanted in the human body that collects healthcare data, such as body temperature, blood pressure, blood sugar level, and ECG [2]. In this new and demanding healthcare context, protocol designs that are highly reliable, scalable, energy efficient, and low in latency are required. Biosensors run on batteries that cannot be recharged or replaced, and sensor nodes consume a significant amount of energy during data packet transmission and reception. The healthcare sector therefore faces significant challenges in terms of energy use and data transmission reliability. Medium Access Control (MAC) is crucial in determining how much energy is used. Many researchers have attempted to make single-channel MAC protocols more practical. However, single-channel protocols have not fared well because the use of a single channel generates collisions. In addition, a number of multichannel MAC methods have been proposed to improve reliability and urgent data transfer in WBAN, in which low-priority data can be communicated over a separate frequency channel. In this work, an effort has been made to provide an energy-efficient, real-time, reliable multichannel MAC protocol for WBAN. This multichannel protocol was designed for WBAN with the goal of ensuring real-time transmission of emergency data while also maximizing energy conservation. The NM MAC protocol operates in two periods, a Notification period and a Data period, which are executed in parallel. The remainder of this paper is organized as follows. Related work on multichannel MAC protocols is presented in Sect. 2. The proposed protocol is elaborated in Sect. 3. Section 4 presents the simulation results as well as a comparison with other existing state-of-the-art protocols. The conclusion is presented in Sect. 5.

2 Related Work

This section reviews numerous multichannel MAC protocols for WBAN in detail. Because they use just one channel, single-channel MAC techniques fail in WBAN, resulting in issues such as node collisions, latency, poor energy efficiency, and interference. To ease the coexistence of WBANs, a multichannel technique has been developed in [7]: a single-radio TDMA multichannel MAC protocol with star and mesh topologies. Through data aggregation, this protocol achieves a considerable reduction in packet latency. The authors of [9] proposed 'is-MAC', a TDMA-based multichannel MAC protocol. The decentralised, beacon-enabled MAC system proposed in [10] provides stable data transfer by reserving data channels over the control channel. The spectrum handoff technique proposed in [11] aims to prevent interference between co-existing WBANs by creating and disseminating a database of suggested interfering nodes. The authors of [12] designed a multichannel MAC protocol based on the IEEE 802.15.6 standard, in which the channel number is obtained through a one-to-one channel mapping technique between the channel phase and the multiple channels. These mapping rules aid energy conservation. However, the priority of the nodes is not taken into account in this protocol. The protocol in [13] uses two channels, one designated as a separate beacon channel and the other as a data channel. The study and development of a hybrid routing protocol based on the IEEE 802.15.6 specification, which is classed as a narrow communication technology that may be used in WBANs, is presented in [16]. That work concentrated on MAC protocols for both short- and long-range communication standards for WBAN systems, which differed from previous studies. Four different MAC protocols were examined in [17], where sources of energy wastage and other key MAC protocol requirements were also discussed. The authors of [18] looked at several energy-aware MAC algorithms and their optimization methodologies, and analysed and compared WBASN communication path loss.

3 Proposed Work

This section presents the Notification based Multichannel MAC (NM MAC) protocol for WBAN (Wireless Body Area Networks), which aims to disseminate emergency information in real time and maximise energy efficiency. This is achieved in the NM MAC protocol by nodes dynamically switching their interface between the channels. The NM MAC protocol operates in two periods, a Notification period and a Data period, which are executed in parallel. In the Notification period, a node that has data contends for the control channel in order to send its notification packet.

3.1 Data Type Organisation

Fig. 1. Data type organisation

Fig. 2. Modified super frame structure

Fig. 3. Notification packet structure

NM MAC is a hybrid multichannel MAC protocol for WBAN based on IEEE 802.15.6. In general, single-channel healthcare MAC protocols suffer from transmission delay of emergency data because the channel is very limited. The proposed NM MAC protocol mitigates this problem by assigning a suitable channel to each sensor node (Figs. 1, 2 and 3).

3.2 Notification Based Multichannel MAC Protocol

Five different states of a node are considered in NM MAC: network initialization, synchronization between the nodes, channel selection, medium access, and the inactive state.


3.2.1 Initialization

In the initialization state, a network sets up its operation or a new node is added to the network. A new node contends for the medium in order to become synchronized with the network.

3.2.2 Synchronization

In NM MAC, the notification phase is used exclusively for synchronization. Scheduling in NM MAC is achieved through a periodic notification packet, which is broadcast to the nearest adjacent nodes. The duration for which the nodes transmit their notification packets is the notification (NTF) period.

3.2.3 Channel Selection

In NM MAC, the Notification (NTF) period and the Data period operate in parallel. Notifications and data are transmitted on the Control channel (CCH) and the Data channels (DCH), respectively. Each frame consists of one CCH and seven DCHs. When a node generates data, it must first send an NTF on the CCH, contending to seize a randomly chosen time slot. The node's timer is set to the product of the slot number and the slot duration, and the timer is decremented while the node waits to occupy its slot. The Notification packet contains the packet id, the destination id, and the type of packet or data. After sending the Notification packet, the node waits for the Notification acknowledgment (N-ACK) packet from the intended receiver. The receiver sends the N-ACK packet to the sender when it successfully receives a Notification packet. The N-ACK must also include the receiver's id, the idle channel number, and the period for which the channel will be used for transmission. The proposed protocol uses four types of data channels: Mixed Channel (MCH), Aperiodic Emergency Channel (AECH), Periodic Emergency Channel (PECH), and Normal Channel (NCH). The AECH, PECH, and NCH are used exclusively for transmitting aperiodic emergency data, periodic emergency data, and normal data, respectively. The Network Allocation Vector (NAV) of IEEE 802.11 is used to reserve a particular channel. After receiving the N-ACK, the sender advertises the schedule and the selected channel id, so that the choice of communication channel is broadcast to adjacent nodes. This solves the hidden terminal interference problem. If the receiver is unable to find a common free channel, it sends a null value to the sender, informing the sender after the timeout period that no free channel is available. Energy is saved through this approach. After the timeout period, the sender retries with an increased backoff time (Figs. 4 and 6).

When the Notification and N-ACK have been exchanged successfully between sender and receiver, they use the selected channel for transmitting data and acknowledgements. The sender and receiver switch to the sleep state after data transmission is complete. If a Notification packet collides on the CCH, the node may use a DCH, depending on the availability of a free data channel. Idle nodes and nodes whose transmission is over may switch to the sleep state to save energy.
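The Notification/N-ACK exchange with NAV-based channel reservation can be sketched as below. This is a simplified model under stated assumptions, not the authors' implementation: all class and function names are hypothetical, the seven data channel labels are taken from the illustration in Sect. 3.2.5, the receiver keeps a single NAV table, and it hands out the first free channel without regard to data type.

```python
from dataclasses import dataclass
from typing import Optional

# Seven data channels; labels assumed from the illustration in Sect. 3.2.5.
DATA_CHANNELS = ["MCH1", "MCH2", "MCH3", "MCH4", "AECH", "PECH", "NCH"]

@dataclass
class Notification:
    packet_id: int
    destination_id: int
    data_type: str  # e.g. "AEP", "PEP" or "NP"

@dataclass
class NAck:
    receiver_id: int
    channel: Optional[str]  # None models the "null value": no free channel
    duration: int           # period for which the channel is reserved

def slot_timer(slot_number, slot_duration):
    """Timer value of a contending node: slot number times slot duration."""
    return slot_number * slot_duration

class Receiver:
    def __init__(self, node_id):
        self.node_id = node_id
        # NAV table: time until which each data channel is reserved.
        self.nav = {ch: 0 for ch in DATA_CHANNELS}

    def handle_notification(self, ntf, now, duration):
        """Reserve the first free channel and answer with an N-ACK."""
        if ntf.destination_id != self.node_id:
            return None  # not the intended receiver
        for ch in DATA_CHANNELS:
            if self.nav[ch] <= now:
                self.nav[ch] = now + duration
                return NAck(self.node_id, ch, duration)
        return NAck(self.node_id, None, 0)  # sender must retry after a timeout

receiver = Receiver(node_id=0)
acks = [receiver.handle_notification(Notification(i, 0, "AEP"), now=0, duration=5)
        for i in range(8)]
later = receiver.handle_notification(Notification(8, 0, "PEP"), now=5, duration=5)
```

In this run the first seven notifications each obtain a distinct channel, the eighth receives a null channel because every NAV entry is still reserved, and a notification arriving once the reservations have expired can reuse MCH1.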


Fig. 4. Notification Acknowledgement (N-ACK) packet structure

Fig. 5. Communication process in NM MAC

Fig. 6. Proposed flow chart

3.2.4 Proposed Algorithm

i) Algorithm for CSMA Mechanism

Notations:
UP = User priority
CW = Contention window
CWmin = Minimum contention window
CWmax = Maximum contention window
BC = Backoff counter

1. For (UP = 7; UP ≥ 0; UP--)
2.   Print ready for transmission
3.   If K = 0 then
4.     CW = CWmin
5.     BC = rand(1, CW)
6.     While (BC)
7.       BC--
8.       If BC == 0 then transmit data
9.     End of while loop
10.  Else if K = odd then
11.    K = K + 1
12.    CW = min(2CW, CWmax)
13.    BC = rand(1, CW)
14.    If BC == 0 then transmit data
15.  End if
16. End of for loop
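One possible reading of the backoff procedure in Algorithm (i) is sketched below. The listing does not define K; this sketch assumes it counts failed transmission attempts. The contention window starts at CWmin and is doubled, capped at CWmax, after an odd number of failures, following steps 10 to 12. The function names, the success callback, and the window bounds are hypothetical illustrations and are not taken from the paper.

```python
import random

def next_contention_window(cw, failures, cw_max):
    """Double CW after an odd number of failures, capped at cw_max."""
    if failures % 2 == 1:
        return min(2 * cw, cw_max)
    return cw

def contend(cw_min, cw_max, attempt_succeeds, rng, max_attempts=8):
    """Run the backoff loop; return (slots_waited, final_cw) or (None, final_cw)."""
    cw = cw_min
    slots_waited = 0
    for failures in range(max_attempts):
        if failures > 0:
            cw = next_contention_window(cw, failures, cw_max)
        backoff = rng.randint(1, cw)     # BC = rand(1, CW)
        slots_waited += backoff          # counting BC down to zero takes BC slots
        if attempt_succeeds(failures):   # transmit once BC reaches zero
            return slots_waited, cw
    return None, cw

rng = random.Random(0)
# Illustrative window bounds for one priority level; not taken from the paper.
slots, final_cw = contend(cw_min=2, cw_max=8,
                          attempt_succeeds=lambda failures: failures >= 2,
                          rng=rng)
```

Here the first two attempts fail, so the window grows from 2 to 4 after the first failure and stays at 4 after the second. Giving higher-priority traffic smaller window bounds lets it reach a backoff of zero sooner on average.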

ii) Algorithm for Proposed Protocol

1. If (priority == 7) then
2.   Print AED
3.   If MCHstatus != busy then
4.     MCH ← AED
5.   Else
6.     AECH ← AED
7.     If the node remains in the same channel, then
8.       Node uses CSMA/CA (Algorithm i)
9.     End if
10.  End if
11. Else if (priority == 6) then
12.   Print PED
13.   If MCHstatus != busy then
14.     MCH ← PED
15.   Else
16.     PECH ← PED
17.     If the node remains in the same channel, then
18.       Node uses CSMA/CA (Algorithm i)
19.     End if
20.   End if
21. Else
22.   Print ND
23.   NCH ← ND
24.   If the node remains in the same channel, then
25.     Node uses CSMA/CA (Algorithm i)
26.   End if
27. End if
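The priority-based channel choice of Algorithm (ii) can be condensed into a small function. This is a sketch with a hypothetical function name. It assumes that periodic emergency data falls back to the Periodic Emergency Channel (PECH) when the mixed channel is busy, in line with the channel definitions in Sect. 3.2.3, and it leaves out the CSMA/CA contention step.

```python
def assign_channel(priority, mch_busy):
    """Map a user priority to (data type, channel type).

    Priority 7 is aperiodic emergency data (AED), priority 6 is periodic
    emergency data (PED), and anything else is normal data (ND).
    """
    if priority == 7:
        return ("AED", "AECH" if mch_busy else "MCH")
    if priority == 6:
        return ("PED", "PECH" if mch_busy else "MCH")
    return ("ND", "NCH")

examples = [assign_channel(7, False), assign_channel(7, True),
            assign_channel(6, True), assign_channel(3, True)]
```

Emergency traffic prefers a mixed channel and falls back to its dedicated channel only when the mixed channel is busy, whereas normal data always uses NCH.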

3.2.5 NM MAC Protocol: An Illustration

Figure 5 illustrates the complete data communication process between sender and receiver. The beacon is used for synchronization between nodes. After the beacon period, the notification phase and the data phase operate in parallel in each frame. Suppose node 1 has chosen the first slot of the notification period; its notification packet carries parameters such as the node id, the destination id, the packet type, and the ids of all free channels. Every node in the network sets a timer, and a node can send its notification packet only when its timer reaches zero. When the intended receiver gets the notification packet, it selects a free channel for data communication and sends a Notification Acknowledgement (N-ACK) to the sender. The N-ACK packet carries the packet id, the channel type, and the time of use. After the successful exchange of the Notification and N-ACK, the sender broadcasts the selected channel id to make the remaining nodes aware of which channel is in use. Here node 1 is an AEP, and Mixed channel number 1 (MCH1) is selected for its data communication after the successful exchange of the Notification and N-ACK. The remaining nodes freeze their timers and pause until the whole communication is over. At the same time, node 2 starts transmitting its Notification packet on the notification channel. MCH1 is unavailable for node 2 because it is in use by node 1 for data communication, as recorded in the NAV vector. MCH2 is therefore selected for the data communication of AEP 2 and returned to node 2 in the N-ACK. During this time frame, node 3 accesses the notification period in order to send a notification packet to its receiver. Assume that node 3 is PEP1 and is not a neighbor of node 2, so that channel 2 has been added to its list of free channels. In this case, the receiver selects a free channel by checking the NAV vector, and MCH3 is selected for data communication. During this time interval, node 4 (AEP3) and node 5 (NP1) send their Notifications in their corresponding slots, and their receivers select MCH4 and NCH, respectively, for data communication. Similarly, AECH and PECH are chosen for node 6 (AEP4) and node 7 (PEP2), respectively, because no mixed channel is available. Parallelism is thus achieved through parallel communication. Next, node 8 (PEP3) sends a Notification packet to its counterpart for data exchange. If all the channels are busy, no free channel is available. As a result, the sender gets a null value in the N-ACK, which prompts it to retry. Any node in the vicinity that hears such an N-ACK ignores it. During this time interval, however, node 1 (ED1) completes its communication, so MCH1 becomes available again once the NAV vector is updated. MCH1 is therefore reused by node 8 for communication. Similarly, because all channels except MCH3 are busy with data exchange, node 9 (AEP5) sends a notification message to its counterpart, and its receiver looks for a free channel. In this interval the data communication on channel 3 has completed, so node 9 (AEP5) selects MCH3 for its data exchange. Node 10 (NP2) searches for its own channel. During this time interval the communication on NCH has completed, so NCH is free for reuse, and node 10 selects it for data communication.

4 Simulation Results

In this part, we compare the latency, energy consumption, and normalized throughput of our proposed protocol with the existing state-of-the-art protocols IEEE 802.15.6, EIMAC, and MC MAC. The Castalia Simulator [11] is used to implement the proposed method and its counterparts. Once the simulation parameters are established, the results are used to examine and assess the performance of our proposed model in comparison to IEEE 802.15.6 and EIMAC.

Figures 7 and 8 compare the delay of the proposed protocol with that of IEEE 802.15.6, EIMAC, and MC MAC as a function of the number of nodes and the packet arrival rate. These figures show that the proposed NM MAC exhibits less delay than the others. In IEEE 802.15.6, contention increases as the number of packets increases, resulting in increased latency. MC MAC is a multichannel protocol in which a node has the option to communicate on a different channel. Therefore, MC MAC has less delay than single-channel IEEE 802.15.6. EIMAC is a multichannel protocol that provides more opportunity to transmit dense traffic because more channels are available. However, EIMAC has no reserved channel for individual data types, and a data packet has to check for an available channel first and is then transmitted depending on channel availability. So, it exhibits more delay than NM MAC. In the NM MAC protocol, dedicated channels are used for individual packet types, allowing packets to be sent immediately and without delay. This approach considerably reduces contention time and relieves the burden of contention and congestion.

Fig. 7. Delay vs number of nodes

Fig. 8. Delay vs packet arrival rate

Fig. 9. Energy consumption vs number of nodes

Fig. 10. Energy consumption vs packet arrival rate

Fig. 11. Normalised throughput vs number of nodes

Fig. 12. Throughput vs packet arrival rate

Figures 9 and 10 compare the energy consumption of the proposed protocol with that of IEEE 802.15.6, EIMAC, and MC MAC as a function of the number of nodes and the packet arrival rate. These figures show that the proposed NM MAC consumes less energy than the others. In IEEE 802.15.6, contention increases as the number of packets rises, resulting in increased latency and, in turn, increased energy consumption. MC MAC consumes less energy than single-channel IEEE 802.15.6 because of its lower delay. EIMAC is a multichannel protocol that provides more opportunity to transmit dense traffic because more channels are available. However, EIMAC has no reserved channel for individual data types, and a data packet has to check for an available channel first and is then transmitted depending on channel availability. Therefore, it consumes more energy than NM MAC. In the NM MAC protocol, dedicated channels are used for individual packet types, allowing packets to be sent immediately and without delay. This protocol considerably decreases contention and congestion pressure, as well as contention time, resulting in the lowest energy usage.


Figures 11 and 12 compare the throughput of the proposed protocol with that of IEEE 802.15.6, MC-MAC, and EIMAC as the number of nodes and the packet arrival rate vary. The figures show that the proposed NM MAC has a higher throughput than the others. In the case of IEEE 802.15.6, congestion becomes more likely and data collisions become more frequent because only a single channel is used, resulting in lower throughput. MC MAC is a multichannel protocol that allows nodes to communicate over multiple channels. As a result, MC MAC has a higher throughput than single-channel IEEE 802.15.6. EIMAC is also a multichannel protocol, which gives it a better chance of transmitting dense traffic, as more channels are available. However, EIMAC has no reserved channel for individual data types, so each data packet must first check for an available channel and is transmitted only when one is found. It therefore exhibits less throughput than NM MAC. The NM MAC protocol uses individual packet channels, allowing packets to be sent immediately, without waiting. In comparison to IEEE 802.15.6, the EIMAC protocol distributes network traffic across many channels for transmission, easing contention on a single channel and lowering the likelihood of node collisions. This dramatically decreases contention time as well as contention and congestion pressure, resulting in maximum throughput in comparison to IEEE 802.15.6.

5 Conclusion and Future Scope

The most difficult task in a WBAN is to deliver emergency traffic to remote professionals as quickly as possible, because this directly affects patients. This article therefore provides dedicated channels for transmitting both periodic and aperiodic emergency traffic, with the goal of delivering emergency traffic as quickly as possible. The simulation results show that the proposed NM MAC protocol outperforms both IEEE 802.15.6 and EIMAC in terms of energy consumption, throughput, and latency. In future, this work can be carried out on IEEE 802.15.4 and applied to real-time applications. A limitation of the protocol, however, is that a fixed dedicated channel is assigned to each individual packet or data type.


A Multi Brain Tumor Classification Using a Deep Reinforcement Learning Model

B. Anil Kumar and N. Lakshmidevi

GMR Institute of Technology, Rajam, Srikakulam, Andhra Pradesh, India
{anilkumar.b,lakshmidevi.n}@gmrit.edu.in

Abstract. A brain tumor is a disease in which abnormal cells grow in the human brain. Tumors of different types can occur in the brain and also in the spinal cord. Before doctors can choose a treatment, the first task is to classify the type of tumor. Magnetic Resonance Imaging (MRI) is generally used to determine whether a tumor is present in an image and to identify its position. Tumors are either benign or malignant. Benign tumors are non-cancerous and can be treated with medicines. Malignant tumors are dangerous: they cannot be cured with medicines and can lead to the death of the patient. MRI is used to classify these types of tumors, but evaluating MRI images takes time, and the evaluation differs from one doctor to another. Deep learning is therefore used as a further technique for classifying brain tumor images. Deep learning comprises supervised, unsupervised, and reinforcement learning mechanisms. The DL model uses a convolutional neural network for feature extraction and for classifying the brain tumor images in the dataset into glioma, meningioma, and pituitary tumors. The dataset consists of 3064 images of glioma, meningioma, and pituitary tumors. Here, a reinforcement learning mechanism classifies the images based on the agent, reward, policy, and state. The Deep Q-network, which is a reinforcement learning method, is used for better accuracy. Reinforcement learning achieved higher classification accuracy than the supervised and unsupervised mechanisms. The accuracy of brain tumor classification increased to 95.4% with reinforcement learning, compared with supervised learning. The results demonstrate the classification of the brain tumors.

Keywords: Brain tumor · MRI classification · Convolution neural networks · Deep learning · Reinforcement learning · Deep Q-network

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 134–144, 2022. https://doi.org/10.1007/978-3-031-11713-8_14

1 Introduction

1.1 Brain Tumor

The brain is one of the body's most important organs, with billions of cells [1]. Uncontrolled cell division produces an abnormal group of cells, which is referred to as a tumor [2]. Brain tumors are divided into two types: benign and malignant. A benign tumor is less dangerous because other areas of the brain are not affected. A malignant tumor is dangerous: it affects other areas of the brain and can lead to the person's death [1]. Brain tumors can cause a variety of signs and symptoms, and symptoms can occur regardless of whether the tumor is benign or malignant [3]. The most common brain tumors are glioma, meningioma, and pituitary tumors [4]. Gliomas are the most common type of brain tumor and arise from the brain's glial cells. Brain tumors can be detected and classified using a variety of imaging techniques, of which MRI is one of the most widely used non-invasive methods.

1.2 Glioma Tumor

Glioma is a type of tumor that begins in the glial cells of the brain or spine [5]. Gliomas account for over 30% of all brain and central nervous system cancers, as well as 80% of all malignant brain tumors. A glioma can be found in the brain and spinal cord, as well as in other parts of the nervous system. Depending on its location and rate of growth, a glioma can compromise brain function and be life-threatening [6]. Treatment and prognosis depend on the type of glioma. Available treatments include surgery, radiation therapy, chemotherapy, and targeted therapy. According to the WHO, gliomas are divided into four grades, from Grade I to Grade IV. Grade I tumors are benign and have a texture similar to normal glial cells; Grade II tumors have a slightly different texture; Grade III tumors are malignant and have an abnormal tissue appearance; and Grade IV tumors are the most severe stage of glioma, with tissue abnormalities visible to the naked eye [7].

1.3 Meningioma Tumor

Meningioma is a slow-growing tumor that develops from the membrane layers that surround the brain and spinal cord.
Symptoms vary by location and are caused by the tumor pressing against neighboring tissue [8]. Many cases never produce any signs or symptoms. Possible symptoms include seizures, dementia, difficulty speaking, vision problems, one-sided paralysis, and loss of bladder control. People who have had radiation, especially to the scalp, and those who have had a brain injury are more likely to develop meningiomas. Dental x-rays have been linked to an increased risk of meningioma, especially in persons who had frequent dental x-rays in the past, when the x-ray dose was higher than it is now. Up to 90% of meningiomas are benign (not cancerous). Meningiomas are most commonly found in the brain, although they can also develop in the spinal cord. Many meningiomas cause no symptoms and do not require immediate treatment. Brain tumors can be cancerous or non-cancerous; the majority of meningiomas are non-cancerous. Meningiomas are the most prevalent type of tumor arising in the central nervous system. A small percentage of meningiomas are malignant, and these tend to grow quickly. The most common symptoms are headaches, seizures, blurred vision, arm or leg weakness, numbness, and speech impairment.


1.4 Pituitary Tumors

Pituitary tumors are abnormal growths that develop in the pituitary gland over time. Some pituitary tumors cause an overproduction of hormones that control vital physiological functions. Others cause the pituitary gland to produce fewer hormones. The majority of pituitary tumors are benign growths called adenomas [9]. Adenomas stay in the pituitary gland or surrounding tissues and do not spread to other parts of the body. Pituitary tumors can be treated in a variety of ways, including removing the tumor, slowing its growth, and lowering hormone levels with drugs. Not all pituitary tumors cause symptoms. They are sometimes discovered by chance during an imaging test, such as an MRI or CT scan, that was performed for another purpose.

1.5 Dataset

The dataset contains 3064 T1-weighted contrast-enhanced images of three types of brain tumor: meningioma, glioma, and pituitary tumor [10]. Depending on their type, brain tumors vary in shape, location, and size. The dataset contains three views: axial, coronal, and sagittal. The dataset was obtained from the public access repository of The Cancer Imaging Archive (TCIA). The T1-weighted contrast-enhanced images include gliomas of different grades (Grade II, Grade III, and Grade IV) [11] (Table 1).

Table 1. Brain Tumor images dataset

S. No  Dataset           Type of tumor                         Total no of images
1      MRI images        Glioma, Meningioma, Pituitary Tumors  3064
2      Training dataset  Glioma                                1283
                         Meningioma                            637
                         Pituitary                             837
3      Testing dataset   Glioma                                143
                         Meningioma                            71
                         Pituitary                             93
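As a quick arithmetic check on Table 1, the following Python sketch sums the per-class counts of the training and testing sets and computes the share of images held out for testing. The variable and function names are illustrative and do not come from the original work.

```python
# Per-class image counts as listed in Table 1.
train = {"glioma": 1283, "meningioma": 637, "pituitary": 837}
test = {"glioma": 143, "meningioma": 71, "pituitary": 93}


def split_summary(train_counts, test_counts):
    """Return (training total, testing total, overall total, test fraction)."""
    n_train = sum(train_counts.values())
    n_test = sum(test_counts.values())
    total = n_train + n_test
    return n_train, n_test, total, n_test / total


n_train, n_test, total, test_fraction = split_summary(train, test)
print(n_train, n_test, total, round(test_fraction, 3))
```

With the counts from Table 1 this gives 2757 training images and 307 testing images, 3064 in total, so roughly 10% of the images are used for testing.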

2 Related Work

Mou et al. developed an approach for training an intelligent agent that, given a hyperspectral image, can learn a policy to select an optimal band subset without any human intervention. They proposed two different reward schemes for the environment simulation of deep reinforcement learning and compared them experimentally [12]. Li and Xu proposed a novel reinforcement learning-based framework for pre-selecting useful images (RLPS) for emotion classification in facial expression recognition (FER), consisting of two modules: an image selector and a rough emotion classifier [13].


Zhu and Zhu addressed the problem of learning a diagnostic strategy and presented a novel three-component architecture for learning such a strategy with restricted features. Feature selection based on reinforcement learning is used to learn the best feature sequence for diagnosis, taking the encoder's output as input [14]. Ni et al. proposed a selective ensemble method for image steganalysis based on the Deep Q Network (DQN), which combines reinforcement learning with convolutional neural networks, an approach that is uncommon in ensemble pruning [15]. Park et al. proposed a novel framework based on reinforcement learning of the segmenter, with three main blocks: the segmenter breaks a single word image into several character images, the switcher assigns a recognizer to each sub-image, and the recognizers recognise the sub-images assigned to them [16]. Wang et al. proposed a method in which each pixel has its own "state" and "action" and can vary its "action" based on interactions with the "environment". A spatial-polarimetric "reward" function is constructed from a local neighbourhood region to exploit both spatial and polarimetric information for more accurate classification [17]. Zhou et al. presented a review of the literature on DRL in medical imaging, beginning with a comprehensive DRL tutorial that includes the most recent model-free and model-based algorithms [18]. Zhao, Chen, and Lv proposed a model for image classification based on visual attention and convolutional neural networks (CNN). A visual attention-based image processing module highlights one region of an image while weakening the others, producing a focused image, which the CNN then classifies [19].
Furuta, Inoue, and Yamasaki proposed pixelRL, a learning method that greatly enhances performance by considering not only the future states of an individual pixel but also those of its neighbours. The proposed method can be used for pixel-wise manipulation in image processing tasks where deep RL had not been employed before [20].

3 Proposed Model

3.1 Convolution Neural Networks

The hidden layers of a convolutional neural network include convolutional layers. A convolutional layer typically computes a dot product of the convolution kernel with the layer's input matrix. This product is normally an inner product, and an activation function is then applied to it. As the convolution kernel slides along the layer's input matrix, the convolution operation produces a feature map, which is subsequently used as input for the next layer. Other layers, such as pooling layers and fully connected layers, are added after that [21] (Fig. 1).

Fig. 1. Convolution layers.

A convolutional layer is made up of a number of filters, each of which has its own set of parameters that must be learned. The height and width of the filters are smaller than those of the input volume. An activation map is generated by convolving each filter with the input volume [23]. The main benefit is that the network can learn all of the key features for each class without any human supervision and convert them to a lower dimension. Compared with other classification methods, a ConvNet requires significantly less pre-processing [22].

The filter is smaller than the input data, and a dot product is computed between the filter and each filter-sized patch of the input. Here the dot product is the element-wise multiplication of the filter-sized patch of the input with the filter, followed by a sum, which always yields a single value [24]. Because it produces a single value, the operation is often referred to as the "scalar product". The multiplication is performed between an array of input data and a two-dimensional array of weights, termed a filter or a kernel. If the input image is n × n and the filter size is f × f, then the output size is (n − f + 1) × (n − f + 1).

Padding is the process of adding layers of zeros around the input image so that the output has the same number of pixels as the input image. More precisely, padding refers to the number of pixels added to an image when it is processed by the kernel of a CNN [25]. The kernel is the neural network filter that moves across the image, scanning each pixel and transforming the information into a smaller format. Adding padding to an image processed by a CNN allows for more precise analysis. If the input size is n × n, the filter size is f × f, and p layers of padding are added, then the output size is (n + 2p − f + 1) × (n + 2p − f + 1).

The filter convolves around the input volume by shifting one unit at a time. The stride is the amount by which the filter shifts.
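The output-size formulas for a convolution with and without padding can be illustrated with a minimal pure-Python sketch of a stride-1 convolution with optional zero padding. The function names (`conv_output_size`, `pad2d`, `conv2d`) and the example image are hypothetical, and the stride-aware size formula, floor((n + 2p − f)/s) + 1, is the standard generalisation rather than something stated in the text.

```python
def conv_output_size(n, f, p=0, s=1):
    """Side length of the output for an n x n input, f x f filter,
    p layers of zero padding and stride s."""
    return (n + 2 * p - f) // s + 1


def pad2d(image, p):
    """Return a copy of a 2-D list surrounded by p layers of zeros."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded.extend([0] * width for _ in range(p))
    return padded


def conv2d(image, kernel, p=0):
    """Slide the kernel over the zero-padded image with stride 1.
    Each output value is the sum of the element-wise products of the
    kernel and one kernel-sized patch (a single scalar per position)."""
    x = pad2d(image, p)
    f = len(kernel)
    out_h = len(x) - f + 1
    out_w = len(x[0]) - f + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(f):
                for b in range(f):
                    total += x[i + a][j + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out


# Hypothetical 4 x 4 input and 3 x 3 filter of ones.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

valid = conv2d(image, kernel)        # no padding: 2 x 2 output
same = conv2d(image, kernel, p=1)    # one layer of zeros: 4 x 4 output
print(valid)
print(len(same), len(same[0]))
```

Without padding, the 4 × 4 input shrinks to 4 − 3 + 1 = 2 pixels per side; with p = 1 the output keeps the input size, since 4 + 2 − 3 + 1 = 4.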
In convolution, the stride is the number of steps the filter moves at each shift. It is one by default. The size of the output is then smaller than the size of the input. Stride is a parameter of convolutional neural networks, which are neural networks optimized for compressing image and video data, which can be rather large. The feature map output by convolutional layers has the disadvantage that it records the exact position of features in the input. This means that even small changes in the position of a feature in the input image result in a different feature map [26]. Re-cropping, rotation, shifting, and other minor adjustments to the input image can cause such changes. The pooling layer downsamples the feature map and passes it on to the fully connected layer, so that the important image features are retained while the amount of computation is reduced. Flattening converts the pooled feature map into a one-dimensional array, a single long feature vector, that can be fed into a fully connected layer. Fully connected layers are simply feed-forward neural networks and form the network's final layers. The input to the fully connected layer is the output of the final pooling or convolutional layer, which is flattened and then fed into the fully connected layer [27] (Fig. 2).
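The pooling and flattening steps can be sketched in a few lines of Python. Max pooling with a 2 × 2 window and stride 2 is assumed here for illustration; the text does not specify which pooling operation the model uses, and the function names and the example feature map are hypothetical.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Downsample a 2-D feature map by keeping the maximum of each window."""
    height = len(feature_map)
    width = len(feature_map[0])
    pooled = []
    for i in range(0, height - size + 1, stride):
        row = []
        for j in range(0, width - size + 1, stride):
            window = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


def flatten(feature_map):
    """Turn a 2-D feature map into a single one-dimensional feature vector."""
    return [value for row in feature_map for value in row]


# Hypothetical 4 x 4 feature map.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

pooled = max_pool2d(feature_map)   # 2 x 2 map
vector = flatten(pooled)           # length-4 vector for a fully connected layer
print(pooled, vector)
```

Because each window keeps only its largest value, a feature that shifts by one pixel inside a window leaves the pooled map unchanged, which is the position tolerance that pooling is meant to provide.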

Fig. 2. Proposed method [Block diagram. Training path: Brain Tumor Images (Training), Data Pre Processing, Select samples, CNN model, Labeled sampled set, O/P Image, Compare class, State, Train Agent, Classify. Testing path: Brain Tumor Images (Testing), Data Pre Processing, Feature Extraction, Tested output image, Output Result Image.]

3.2 Reinforcement Learning

Reinforcement learning is the process of learning, through interaction with the environment, how to map environmental states to actions, i.e., what to do at each discrete time step. It is a well-known technique for tackling sequential decision problems that allows computers to work out their own best actions. It has been applied in robotics, computer games, and network routing [28]. Almost all reinforcement learning methods require a reward, which defines the goal of the agent. The reward is a measure of how well an agent accomplishes a task in a particular state. The reward function maps each state to a scalar, the reward, that expresses the intrinsic desirability of that state [29]. For a classification task, an image is classified based on several key areas. Reinforcement learning uses rewards to increase weights when a connection is strong and to reduce weights when a connection is weak [30]. Because of its learning method and its specific reward function, deep reinforcement learning is well suited to learning from imbalanced data: more attention can easily be paid to the minority class by assigning it a larger reward or penalty. State, action, and reward are the three major components of reinforcement learning. The goal of the RL agent is to learn a function that maps the state space to the action space. The agent is rewarded for the actions it takes, and its goal is to maximise the accumulated reward. The environment reacts to the action $a_t$ and changes its state to $s_{t+1}$. The reward $r_{t+1}$, which the agent receives or does not receive for the selected action, is likewise determined by the transition $(s_t, a_t, s_{t+1})$ [31].

3.3 Deep Q-Learning

Q-learning is a model-free reinforcement learning algorithm for determining the value of a certain action in a given state [32]. For any finite Markov decision process (FMDP), Q-learning can find the best action-selection policy. "Q" refers to the function that the algorithm computes: the expected reward for an action taken in a given state. The algorithm defines the optimal action-value function $Q^*(s, a)$ as the maximum expected return achievable by following any policy $\pi$ after seeing some state $s$ and then taking some action $a$ [33]:

$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[R_t \mid s_t = s, a_t = a, \pi\right]$$

The Deep Q-network (DQN), which combines Q-learning with a deep neural network, is the first significant example of such an integration. Given high-dimensional data, the DQN agent can successfully learn policies using RL. A deep CNN is used to approximate the optimal action-value function. By using experience replay and a target network, the deep CNN overcomes the instability and divergence that can occur when the Q-function is approximated with a shallow NN. The DQL learns two value functions by randomly updating one of them with each experience, resulting in two sets of weights. On each update, one set of weights determines the greedy policy, while the other determines its value. Based on the training and test images, the classification can be performed with higher accuracy. We used different target decomposition methods to extract multiple features from the brain image dataset; these features form the state of the DQN algorithm. A labelled sample set was created to provide feedback on the agent's actions. During sample selection, the agent explored the environment automatically using the ε-greedy policy to determine the action and reward for each pixel.
To update the Q values, a Q-learning approach was used, and a DNN model was used to fit the Q values. The total reward for each round is used to evaluate the model training as well as the performance of the proposed method in each round of classification. The optimal action-value function obeys an important identity known as the Bellman equation. It rests on the following intuition: if the optimal value $Q^*(s', a')$ of the state $s'$ at the next time step is known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ that maximises the expected value of $r + \gamma Q^*(s', a')$, where $\gamma \in [0, 1]$ is the discount factor of the cumulative reward:

$$Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \mid s, a \,\right]$$
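The ε-greedy selection, the classification reward, and the Bellman update described in this section can be sketched with a small tabular example. This is a simplified stand-in: the paper fits the Q values with a deep network, whereas the sketch below stores them in a plain dictionary. The reward values of +1 and −1, the optional per-class weights, the learning rate `alpha`, and all function names are assumptions made for illustration.

```python
import random


def classification_reward(predicted, actual, class_weight=None):
    """+weight for a correct label, -weight for a wrong one.
    A larger weight for a minority class makes its rewards and
    penalties count more (assumed scheme, not taken from the paper)."""
    weight = 1.0 if class_weight is None else class_weight.get(actual, 1.0)
    return weight if predicted == actual else -weight


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def bellman_target(reward, next_q_values, gamma, terminal=False):
    """r + gamma * max_a' Q(s', a'), or just r at the end of an episode."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.9, terminal=False):
    """Move Q(s, a) a fraction alpha towards the Bellman target."""
    target = bellman_target(reward, q_table[next_state], gamma, terminal)
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Hypothetical two-state table with three actions
# (0 = glioma, 1 = meningioma, 2 = pituitary).
q_table = {"s0": [0.0, 0.0, 0.0], "s1": [1.0, 2.0, 0.5]}
action = epsilon_greedy(q_table["s0"], epsilon=0.0)   # purely greedy choice
reward = classification_reward(predicted=action, actual=0)
new_q = q_update(q_table, "s0", action, reward, "s1", alpha=0.5, gamma=0.9)
print(action, reward, new_q)
```

In a deep Q-network the table lookup `q_table[state]` would be replaced by a forward pass of the network, and the update would become a gradient step on the squared difference between the prediction and the Bellman target.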

3.4 Confusion Matrix

The confusion matrix is a prominent tool in classification. It can be used for multiclass as well as binary classification problems. Confusion matrices show counts of predicted versus actual values. "TN" stands for True Negative, the number of negative examples that have been correctly identified. Similarly, "TP" stands for True Positive, the number of positive examples that have been correctly identified. "FP" and "FN" denote False Positive and False Negative values, respectively: FP is the number of actual negative examples classified as positive, whereas FN is the number of actual positive examples classified as negative. For the brain tumor images, TP, TN, FP, and FN are defined as follows. If an image is predicted to contain a brain tumor and actually contains one, it is a True Positive (TP). If an image is predicted to contain no brain tumor and actually contains none, it is a True Negative (TN). If an image is predicted to contain a brain tumor but actually contains none, it is a False Positive (FP). If an image is predicted to contain no brain tumor but actually contains one, it is a False Negative (FN). After classification, the confusion matrix is created to evaluate these quantities and to assess the accuracy of the brain tumor classification. Accuracy, precision, specificity, recall, and F1-score are then used to judge whether the classification is highly accurate.
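The following sketch shows one common way to derive TP, TN, FP, and FN for each class of a multiclass confusion matrix (one class against the rest) and to compute the five metrics from them. The matrix values are invented for illustration and are not the paper's results; the function names are likewise hypothetical.

```python
def per_class_counts(matrix, k):
    """One-vs-rest counts for class k.
    matrix[i][j] = number of images of actual class i predicted as class j."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                          # class k missed
    fp = sum(matrix[i][k] for i in range(n)) - tp     # others labelled as k
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Avoid division by zero when a class has no predictions or samples."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, specificity and F1-score from the counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Invented 3-class confusion matrix (rows: actual, columns: predicted).
classes = ["glioma", "meningioma", "pituitary"]
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]

for k, name in enumerate(classes):
    print(name, metrics(*per_class_counts(confusion, k)))
```

Treating each tumor type in turn as the positive class and the other two as negative yields one set of metrics per class, which can then be averaged if a single figure per model is needed.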

4 Experimental Results

This section presents the results obtained with the proposed method. The confusion matrices, from which the accuracy, specificity, and other metrics are derived, are shown in Fig. 3.

Fig. 3. Confusion matrices


From the confusion matrices, the TP, TN, FP, and FN values of glioma, meningioma, and pituitary tumors are obtained, and the accuracy, precision, specificity, recall, and F1-score are calculated from them (Table 2).

Table 2. Performance table of proposed model (all values in %)

Model          Accuracy  Precision  Specificity  Recall  F1-score
RL-Model       95.4      90.4       97.7         95.4    93.33
BT3_VGG16      93.7      87.6       87.7         90.3    88.9
BT3_VGG19      94.5      88.9       88.9         91.6    90.2
BT3_Googlenet  92.2      85.2       85.3         87.3    86.2

5 Conclusion

Deep learning algorithms are proving to be powerful and accurate methods for automatically classifying and detecting brain tumors and lesions. This work investigated deep reinforcement learning with convolutional neural networks to produce more accurate results for brain tumor classification. Tumors are classified as glioma, meningioma, or pituitary using a dataset of images containing these three types of tumor, by means of the reinforcement learning algorithm and feature extraction. The classification accuracy was compared with that of other training mechanisms, namely supervised and unsupervised learning, and reinforcement learning achieved the highest accuracy. The proposed method, which employs processing pathways, is capable of correctly segmenting and classifying the three types of brain tumor found in the dataset: meningioma, glioma, and pituitary tumor. It achieves a classification accuracy of 95.2% for glioma tumors, 96% for pituitary tumors, and 95% for meningioma tumors. The quality of the image dataset was also found to have a significant impact on classification accuracy. The proposed approach offers a higher classification accuracy than the supervised learning mechanism. Provided that adequate MRI data on glioma, meningioma, and pituitary tumors are available, this model can be used to assist physicians in the diagnosis of brain tumors, as well as in real-world applications.

References 1. Murphy, A.M., Rabkin, S.D.: Current status of gene therapy for brain tumors. Transl. Res. 161(4), 339–354 (2013) 2. De Angelis, L.M.: Brain tumors. New England J. Med. 344(2), 114–123 (2001) 3. Cancer Stat Facts: Brain and Other Nervous System Cancer. National Cancer Institute, 31 March 2019

A Multi Brain Tumor Classification

143

4. Recurrent Brain Cancers Follow Distinctive Genetic Paths. University of California Santa Cruz. University of California San Francisco, Accessed 17 June 2015 5. Louis, D.N., et al.: The 2016 World Health Organization classification of tumors of the central nervous system: a summary. Acta Neuropathol. 131(6), 803–820 (2016) 6. Goodenberger, M.L., Jenkins, R.B.: ‘Genetics of adult glioma.’ Cancer Genet. 205(12), 613– 621 (2012) 7. Sanai, N., Chang, S., Berger, M.S.: Low-grade gliomas in adults. J. Neurosurg. 115(5), 948– 965 (2011) 8. Claus, E.B., Calvocoressi, L., Bondy, M.L., Schildkraut, J.M., Wiemels, J.L., Wrensch, M.: Dental x-rays and risk of meningioma. Cancer 118(18), 4530–4537 (2012) 9. Daly, A.F., Vanbellinghen, J., Beckers, A.: Characteristics of familial isolated pituitary adenomas. Expert. Rev. Endocrinol. Metab. 2(6), 725–733 (2007) 10. Cheng, J.: Brain Tumor Dataset, 2 April 2017. Distributed by Figshare. Accessed 30 May 2019 11. Scarpace, L., Flanders, A., Jain, R., Mikkelsen, T., Andrews, D.: Data from REMBRANDT. The cancer imaging archive (2015) 12. Mou, L., et al.: Deep reinforcement learning for band selection in hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 60, 1–14 (2021) 13. Li, H., Hua, X.: Deep reinforcement learning for robust emotional classification in facial expression recognition. Knowl.-Based Syst. 204, 106172 (2020) 14. Zhu, M., Zhu, H.: Learning a diagnostic strategy on medical data with deep reinforcement learning. IEEE Access 9, 84122–84133 (2021) 15. Ni, D., et al.: Selective ensemble classification of image steganalysis via deep Q network. IEEE Signal Process. Lett. 26(7), 1065–1069 (2019) 16. Park, J., et al.: Multi-lingual optical character recognition system using the reinforcement learning of character segmenter. IEEE Access 8, 174437–174448 (2020) 17. Wang, M., et al.: Polarimetric SAR data classification via reinforcement learning. IEEE Access 7, 137629–137637 (2019) 18. 
Zhou, S.K., et al.: Deep reinforcement learning in medical imaging: a literature review. Med. Image Anal. 73, 102193 (2021) 19. Zhao, D., Chen, Y., Lv, L.: Deep reinforcement learning with visual attention for vehicle classification. IEEE Trans. Cogn. Dev. Syst. 9(4), 356–367 (2016) 20. Furuta, R., Inoue, N., Yamasaki, T.: Pixelrl: fully convolutional network with reinforcement learning for image processing. IEEE Trans. Multimedia 22(7), 1704–1719 (2019) 21. Huang, Z.Y.: Application Research of Convolution Neural. Hubei University of Technology, pp. 41–46 (2017) 22. Chang, C.-H.: Deep and shallow architecture of multilayer neural networks. IEEE Trans. Neural Netw. Learn. Syst. 26(10), 2477–2486 (2015) 23. Venkatesan, R., Li, B.: Convolutional Neural Networks in Visual Computing: A Concise Guide, pp. 10–23. CRC Press, London (2017) 24. Ciresan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image classification. CoRR, abs/1202.2745 (2012) 25. Scherer, D., Müller, A., Behnke, S.: Evaluation of pooling operations in convolutional architectures for object recognition. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN 2010. LNCS, vol. 6354, pp. 92–101. Springer, Heidelberg (2010). https://doi.org/10.1007/ 978-3-642-15825-4_10 26. Murray, N., Perronnin, F.: Generalized max pooling. In: Proceedings of IEEE International Conference on Computer Vision Pattern Recognition, September 2014, pp. 2473–2480 (2014) 27. LeCun, Y., Kavukcuoglu, K., Farabet, C.: Convolutional networks and applications in vision. In: Proceedings of IEEE International Symposium on Circuits and System, June 2010, pp. 253–256 (2010)


28. Wiering, M., van Otterlo, M.: Reinforcement Learning: State-of-the-Art. Springer, New York (2012) 29. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015) 30. Jo, S., Sun, W., Kim, B., Kim, S., Park, J., Shin, H.: Memristor neural network training with clock synchronous neuromorphic system. Micromachines 10(6), 384 (2019) 31. Mnih, V., et al.: Playing atari with deep reinforcement learning. CoRR, abs/1312.5602 (2013) 32. Matiisen, T.: Demystifying deep reinforcement learning. neuro.cs.ut.ee Computational Neuroscience Lab, 19 December 2015. Accessed 6 Apr 2018 33. Zhao, D., Chen, Y., Lv, L.: Deep reinforcement learning with visual attention for vehicle classification. IEEE Trans. Cogn. Dev. Syst. 9(4), 356–367 (2017)

A Brief Analysis on Security in Healthcare Data Using Blockchain

Satyajit Mohapatra1, Pranati Mishra1(B), and Ranjan Kumar Dash2

1 Department of CSE, Odisha University of Technology and Research, Bhubaneswar, India

[email protected]

2 Department of CSA, Odisha University of Technology and Research, Bhubaneswar, India

[email protected]

Abstract. As technology advances, data security in healthcare must keep pace. Blockchain technology provides trust, immutability and accountability, which has a large impact on the management of patient data. Because data must be secured and privacy concerns arise, Self-Sovereign Identity management is employed to hand the data over to the respective individuals and let them manage it themselves. This paper surveys work carried out in healthcare using blockchain technology and points out the motivation for securing patient data with blockchain, with smart contracts forming the basis of a cost analysis on the Ganache and Ropsten networks. It then presents future scope in integrating Electronic Health Records and in decentralizing the application itself using a peer-to-peer network such as IPFS.

Keywords: Blockchain · Healthcare · Privacy · Decentralized · Distributed ledger

1 Introduction

Blockchain technology has attracted interest since its inception and offers a promising future for organizations striving for security and privacy. A distributed electronic cash system known as B-Money was proposed by Wei Dai in 1998 but was never launched [1]. The introduction of the digital cryptocurrency Bitcoin by "Satoshi Nakamoto" in 2008 gave rise to blockchain technology. The whitepaper, "Bitcoin: A Peer-to-Peer Electronic Cash System", presented the concept of transacting money online without the involvement of third parties such as banks. It introduced the key concept of Proof of Work and addressed double spending, i.e. it prevents a unit of digital currency from being duplicated and spent more than once [2]. Transactional data is replicated and stored on every node in a blockchain, which gives the system its distributed property. It therefore has no single point of failure, is tamper resistant and works on a peer-to-peer network [3]. The first generation, Blockchain 1.0, was based on digital currency; the second generation, 2.0, on the digital economy; and the third generation, 3.0, is referred to as the digital society, which includes government services, healthcare industries, etc. [4]. With its presence in every major field, blockchain also has considerable use in healthcare. The healthcare industry requires additional security and privacy policies because of the huge amounts of data generated every day in the form of Electronic Medical Records. This data, mainly that of patients, needs to be stored safely and securely. Moreover, IoT devices such as body sensors and mobile applications need to handle data securely, as there is a risk of malicious attack. Data sharing, interoperability and the handling of medical records remain important factors in healthcare [5].

The remainder of this article is organized as follows: Sect. 2 gives a brief overview of blockchain technology and its presence in the healthcare industry. Section 3 reviews related studies on blockchain in healthcare and the security of patient data. Section 4 presents the discussion. Section 5 sets out future work and objectives, followed by the references.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 145–155, 2022. https://doi.org/10.1007/978-3-031-11713-8_15

2 Overview of Blockchain

A blockchain is a chain of blocks containing records of transactions. Each new block points to the previous block, called its parent block, by storing the parent's hash value. The first block of a blockchain is called the genesis block and has no parent block. A block generally consists of the block version, parent block hash, timestamp, nonce and Merkle root hash [6]. A block is added to the chain only when the majority of the participating nodes agree, through a consensus mechanism, that its transactions are valid. Validation and addition of blocks are carried out by miners, who receive incentives in the form of tokens or bitcoins. The hash values generated are effectively unique, so any change to the data of an earlier block affects the blocks that follow it in the chain and reveals that tampering has occurred. For a change to go undetected it would have to be made on the majority of nodes in the blockchain network, which is practically infeasible [7] (Fig. 1).
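The parent-hash linkage described above can be illustrated with a short Python sketch. It is not taken from the paper: the field names simply follow the header fields listed in the text, `tx_root` is a simplified stand-in for a real Merkle root, and mining and consensus are left out entirely.

```python
import hashlib
import json

HEADER_FIELDS = ("version", "parent_hash", "timestamp", "nonce", "merkle_root")


def tx_root(transactions):
    """Simplified stand-in for a Merkle root: one hash over all transactions."""
    return hashlib.sha256(json.dumps(transactions, sort_keys=True).encode()).hexdigest()


def block_hash(block):
    """SHA-256 digest of the block header fields."""
    header = {name: block[name] for name in HEADER_FIELDS}
    return hashlib.sha256(json.dumps(header, sort_keys=True).encode()).hexdigest()


def make_block(parent_hash, transactions, timestamp, nonce=0):
    """Build a block whose header commits to its parent and to its transactions."""
    return {
        "version": 1,
        "parent_hash": parent_hash,  # None for the genesis block
        "timestamp": timestamp,
        "nonce": nonce,
        "merkle_root": tx_root(transactions),
        "transactions": transactions,
    }


def is_chain_valid(chain):
    """Check every block's transaction root and its link to the previous block."""
    for index, block in enumerate(chain):
        if block["merkle_root"] != tx_root(block["transactions"]):
            return False
        if index > 0 and block["parent_hash"] != block_hash(chain[index - 1]):
            return False
    return True


genesis = make_block(None, ["genesis"], timestamp=0)
block1 = make_block(block_hash(genesis), ["record A"], timestamp=1)
block2 = make_block(block_hash(block1), ["record B"], timestamp=2)
chain = [genesis, block1, block2]
print(is_chain_valid(chain))
```

Editing the transactions of `block1` breaks its own root check, and recomputing that root changes the hash of `block1`, so the `parent_hash` stored in `block2` no longer matches. This is the propagation effect the paragraph refers to.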

Fig. 1. Blockchain architecture


2.1 Properties of Blockchain

Immutability. One of the most important features, immutability means that something cannot be altered or changed. Because the ledger is held by a network of nodes rather than in central storage, a majority of nodes is required for validation. Hence, once a transaction is logged, updating or editing its value is practically impossible.

Decentralized Storage. Because a blockchain is maintained collectively by its nodes, there is no governing or central authority. It can therefore store and manage digital assets such as cryptocurrencies and important documents, and it hands people direct control over their assets.

Improved Security. In addition to decentralization, cryptography forms the backbone of security in a blockchain network. Each block has a hash value that is irreversible and depends on the value of the previous block. Any modification to a block therefore changes its hash value, which would be detected.

Transparency. Because a blockchain stores data in a distributed ledger, the information is available to everyone. However, the amount of information provided to a user can be limited by role, since not everyone needs access to all information.

Consensus Mechanism. Consensus is the algorithm that makes the decision on every transaction. For a transaction to succeed, the nodes must come to an agreement, which the consensus algorithm makes possible. Nodes might not trust each other, but they trust the consensus [8].

2.2 Types of Blockchain

Public Blockchain. In this type anyone can participate, i.e. every node has equal rights to create, validate and access block data. It is fully decentralized.

Private Blockchain. Here a single organization controls the network and decides which nodes are allowed to participate. It is partially decentralized, as not every node has full access rights.

Consortium Blockchain. A group of organizations governs the blockchain network. It is more decentralized than a private blockchain, but cooperation among the organizations is required at all times.

Hybrid Blockchain. This type of network combines public, private or consortium blockchains to implement features as needed [9].


2.3 Blockchain in Healthcare

In the healthcare industry, blockchain plays a vital role in managing electronic medical records, drug supply chain management, health insurance and biomedical research [10]. With the recent outbreak of Covid-19, the government of each nation needs to address the resulting problems with technologies such as Blockchain and AI [11]. A huge amount of medical data is generated and handled by each country according to its national policies. In a global outbreak, a global response is needed, with collaborative work from all countries to mitigate the spread of the virus. While sharing of medical data among countries can be achieved to some extent through interoperability, the privacy of the shared data remains a major concern. A standardized, globally agreed protocol is required to address the security concerns of a Global Health Information System [12]. The outbreak led to the rapid adoption of digital health technology along with changes in governance policies and regulatory demands. However, the healthcare services provided needed to be of superior quality, with proper security and management of legacy Identity Management (IdM) [13]. Digital identity is a major concern on the internet with regard to its usage and handling. Self Sovereign Identity has evolved in response and provides three key elements: individual control, security and portability. No central authority has control over the data; rather, control is handed fully to the individuals, who manage their own data [14]. Its key principles include Existence, Control, Access, Transparency and Persistence, which provide a basis for offering Self Sovereign Identity to users [15]. Patients' identities and personal medical records may be scattered across different organizations. This hampers proper diagnosis by the physician, because complete data and medical history are not available and the patient may be unaware of all the records. There is a need to give each patient a unique identity and to store all medical records under it, so that diagnosis is seamless and accurate. Self Sovereign Identity would ensure that data is accessed only by authorized personnel and managed only by the respective individuals [16].

3 Related Work

Blockchain has proved to be of great value in healthcare in terms of accountability, building trust and increasing security. An improved IEEE 802.15.6 protocol significantly reduced computational costs on resource-constrained nodes. Another protocol provided an architecture for how users can share data with others on a Pervasive Social Network (PSN) using blockchain technology. Since only the address is stored, rather than the original health data, the scheme is lightweight and no data is lost through third-party access [17]. A blockchain-based privacy risk control architecture, the Healthcare Data Gateway (HGD), enabled patients to control their own data, while Secure Multiparty Computing keeps a check on third parties accessing patient data and violating policies [18]. A Hyperledger and chaincode-based mobile healthcare application provided a user-centric data approach with tree-based processing of huge volumes of data. It is compatible with existing systems, requests are recorded, and latency is low at O(log n) [19].


With the institution-driven approach now transitioning towards a patient-driven one, patients can access their data through a simple API and are in charge of their own data management. However, there may be delays in the processing of blocks and security issues related to public keys [20].

Using the RAPTOR and TPOLE mechanisms in the SERUMS API, the authors were able to add new data sources to the data lake without conflict, which made it possible to process large heterogeneous data and remove complexities [21]. The work further provided a user-centric approach that hands full control to the patients; rules could be manipulated only by authorized users. The main challenge was to secure the privacy of users while complying with international GDPR policies [22]. The use of Flexipass and data fabrication helped SERUMS towards a new universal smart patient record format, access control, and the development and evaluation of health records [23].

A Hyperledger Fabric permissioned blockchain framework was able to comply fully with GDPR; authenticated resources and CRUD operations were immutably recorded. However, software bugs and the openness of the ledger, which allows inspection, can violate privacy [24].

The use of Service Oriented Architecture, XML and SOAP protocols ensured a circle of trust in the patient consent mechanism. It helped transform clinical documents through an interoperability module and provided direct access to the data warehouse instead of the production database [25]. An improved Bell-LaPadula model with Hyperledger Composer and Fabric ensured dynamic access control policies for clearance levels with associated smart contracts. The access policies included are read, remark, update and delete. The dynamic nature allowed any peer to join, and maintaining the transaction history was not compulsory [26].

During transmission of data from IoT-based devices such as a Body Sensor Hive, blockchain technology secures communication to and from Unmanned Aerial Vehicles (UAVs). Any hospital can participate in data usage according to the permission of use, and the scheme ensures two-phase validation of data from the source. Tampering is prevented by digital signatures, and low-power transmission is ensured [27]. A Hyperledger Fabric blockchain with data lakes created high-level access control with individual rules for accessing the data lakes. It ensured a secure patient-centric solution with a dynamic design for compliance with GDPR [28]. The use of a fuzzy cooperative game and smart contracts helped estimate the benefit of cooperation from factors such as the number of patients cured, the supply of medicines and the success probability of treatment [29]. A project based on Ethereum, Ganache and ReactJS demonstrated efficient, less time-consuming and secure access to Electronic Medical Records by institutions [30]. With the integration of blockchain and IoT with fog computing, a customer-centric model was developed that provides incentives to customers. It ensured participation on a larger scale and provided insights into customer satisfaction and retention [31]. The use of blockchain architecture ensures data security, health data ownership, robustness, transparency, data verifiability and reduced clinical processing time [32].


Another combination of blockchain and IoT technology focused on remote monitoring of patient health, drug traceability, management of medical records and tracing of fake medicines. Data security in IoT was ensured by the blockchain and its associated transactions. The approach still needed a cost survey and an exploration of security threats, and its transaction processing time was high [33]. A model built on Ethereum smart contracts together with Filecoin and Storj ensured pricing alignment and sent notifications on change. Fragmented information was discouraged, and providers were tracked for adherence to rules and regulations. Only authorized stakeholders were allowed to interact, and efficient procurement practices were implemented. Ethereum transaction charges in the form of gas fees remain a concern, but the model can be generalized to supply chain operations involving multiple stakeholders [34]. HL7 (Health Level Seven International), where the 7 refers to the seventh (application) layer of the OSI model, is an accepted and widely adopted IT standard for healthcare. In a framework built on HL7 principles, the transactions depend on subsequent transactions in a progressive time-blockchain for generating the final transaction. The Temporary Hash Signature (THS) is used for authentication without a third party [35].

Fig. 2. Ganache test network

4 Discussion

Many concepts are emerging for the application of blockchain in healthcare, but the implementation and actual deployment of models remain very low. Although applications pertaining to Self-Sovereign Identity have been developed, they have not yet been streamlined and lack integrated storage of Electronic Health Records, as storing these on the blockchain network is too costly. We implemented Medical-Chain, a blockchain-based medical record system with three main types of user: the admin, medical personnel (doctors) and patients (Fig. 2).

At the top level of access is the admin, who carries out the initial setup. First, the admin deploys the smart contract on a suitable network, after which our React front-end application can interact with it through Web3. The admin can then create medical personnel accounts with details that include doctor ID, name and associated department. Only then does the doctor have access to the system. This access can be revoked at any time, which is also controlled by the admin. Medical personnel can then securely add patient data, and patients can access their own data or grant access to medical personnel as required (Figs. 3 and 4).

The application was developed using the ReactJS framework as the front end, while the deployed smart contract served as the back end. The application was tested on both the Ganache local network and the Ropsten test network. The cost of operations on the Ganache network was found to be slightly higher than on the Ropsten network. Since the medical inscription was text only, the cost increased as the data size (number of characters) of the inscription increased (Figs. 5 and 6).

Fig. 3. Medical-Chain admin creates personnel record
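The role hierarchy described in this section can be summarized as a small in-memory model. The Python sketch below is only an illustration of the admin, doctor and patient permissions as the text describes them; it is not the authors' smart contract, and every class, method and identifier in it is hypothetical.

```python
class MedicalChainModel:
    """In-memory model of the three roles (hypothetical names, not the real contract)."""

    def __init__(self, admin):
        self.admin = admin
        self.doctors = {}   # doctor_id -> {"name", "department", "active"}
        self.records = {}   # patient_id -> list of text inscriptions
        self.grants = {}    # patient_id -> set of doctor_ids allowed to read

    @staticmethod
    def _require(condition, message):
        if not condition:
            raise PermissionError(message)

    def _is_active_doctor(self, caller):
        return self.doctors.get(caller, {}).get("active", False)

    def create_doctor(self, caller, doctor_id, name, department):
        self._require(caller == self.admin, "only the admin can create personnel")
        self.doctors[doctor_id] = {"name": name, "department": department, "active": True}

    def revoke_doctor(self, caller, doctor_id):
        self._require(caller == self.admin, "only the admin can revoke access")
        self.doctors[doctor_id]["active"] = False

    def add_record(self, caller, patient_id, inscription):
        self._require(self._is_active_doctor(caller), "only active personnel can add records")
        self.records.setdefault(patient_id, []).append(inscription)

    def grant_access(self, caller, patient_id, doctor_id):
        self._require(caller == patient_id, "only the patient can grant access")
        self.grants.setdefault(patient_id, set()).add(doctor_id)

    def read_records(self, caller, patient_id):
        granted = caller in self.grants.get(patient_id, set())
        allowed = caller == patient_id or (self._is_active_doctor(caller) and granted)
        self._require(allowed, "caller may not read these records")
        return list(self.records.get(patient_id, []))


model = MedicalChainModel(admin="admin")
model.create_doctor("admin", "doc1", "Dr. Example", "Cardiology")
model.add_record("doc1", "pat1", "text-only inscription")
model.grant_access("pat1", "pat1", "doc1")
print(model.read_records("doc1", "pat1"))
```

In an on-chain version each of these calls would be a transaction with a gas cost, which is why the size of the inscription passed to `add_record` matters for the cost analysis.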


Fig. 4. Medical personnel submits patient medical record

Cost in Ethers

Network    Contract Deployment    Doctor Creation    Patient Creation
Ganache    0.05490758             0.00139586         0.00930838
Ropsten    0.05287559             0.0012646          0.00735367

Fig. 5. Cost analysis of Ganache and Ropsten network
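To put a number on "slightly higher", the values reported with Fig. 5 can be compared directly. The short Python sketch below only performs that arithmetic on the published figures; the dictionary layout and the function name are illustrative choices, not part of the paper.

```python
# Costs in Ethers as reported with Fig. 5.
costs = {
    "Contract Deployment": {"Ganache": 0.05490758, "Ropsten": 0.05287559},
    "Doctor Creation": {"Ganache": 0.00139586, "Ropsten": 0.0012646},
    "Patient Creation": {"Ganache": 0.00930838, "Ropsten": 0.00735367},
}


def relative_excess(ganache, ropsten):
    """Percentage by which the Ganache cost exceeds the Ropsten cost."""
    return (ganache - ropsten) / ropsten * 100.0


for operation, cost in costs.items():
    excess = relative_excess(cost["Ganache"], cost["Ropsten"])
    print(f"{operation}: Ganache is {excess:.1f}% higher")
```

The gap is smallest for contract deployment and largest, in relative terms, for patient creation.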

[Chart: cost in Ethers (axis 0 to 0.012) against number of characters in the inscription (29, 66, 112) for the Ganache and Ropsten networks]

Fig. 6. Cost analysis based on amount of data in inscription.

5 Conclusion and Future Scope

The majority of papers focus on developing models, and there is hardly any implementation. The work demonstrated in this paper produces only text-based medical inscriptions, which need to be optimized because the cost depends directly on their size. Our further work therefore includes developing an application that incorporates Electronic Health Records, including files in PDF and JPEG formats, stored in IPFS, a distributed file system, while the key data is stored on the Ethereum network. Furthermore, the front-end application developed in the React framework would incorporate functionality enabling Self Sovereign Identity through blockchain. To make the whole application distributed rather than centralized, it would be hosted on a peer-to-peer network platform such as IPFS.

References

1. Dai, W.: B-Money. In: Wei Dai’s Home Page (1998). http://www.weidai.com/bmoney.txt. Accessed 7 Mar 2022
2. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system. In: Bitcoin Org. (2009). https://bitcoin.org/bitcoin.pdf. Accessed 7 Mar 2022
3. Blockchain White Paper. National Archives and Record Administration (2019). https://www.archives.gov/files/records-mgmt/policy/nara-blockchain-whitepaper.pdf
4. Murthy, C.V., Shri, M.L., Kadry, S., Lim, S.: Blockchain based cloud computing: architecture and research challenges. IEEE Access 8, 205190–205205 (2020). https://doi.org/10.1109/access.2020.3036812
5. McGhin, T., Choo, K.-K.R., Liu, C.Z., He, D.: Blockchain in healthcare applications: research challenges and opportunities. J. Netw. Comput. Appl. 135, 62–75 (2019). https://doi.org/10.1016/j.jnca.2019.02.027
6. Nofer, M., Gomber, P., Hinz, O., Schiereck, D.: Blockchain. Bus. Inf. Syst. Eng. 59(3), 183–187 (2017). https://doi.org/10.1007/s12599-017-0467-3
7. Zheng, Z., Xie, S., Dai, H.N., Chen, X., Wang, H.: Blockchain challenges and opportunities: a survey. Int. J. Web Grid Serv. 14, 352 (2018). https://doi.org/10.1504/ijwgs.2018.095647
8. Wust, K., Gervais, A.: Do you need a blockchain? In: 2018 Crypto Valley Conference on Blockchain Technology (CVCBT) (2018). https://doi.org/10.1109/cvcbt.2018.00011
9. Shrivas, M.K.: The disruptive blockchain: types, platforms and applications. Texila Int. J. Acad. Res. 17–39 (2019). https://doi.org/10.21522/tijar.2014.se.19.01.art003
10. Garrido, A., Ramírez López, L.J., Álvarez, N.B.: A simulation-based AHP approach to analyze the scalability of EHR systems using blockchain technology in Healthcare Institutions. Inform. Med. Unlocked 24, 100576 (2021). https://doi.org/10.1016/j.imu.2021.100576
11. Nguyen, D.C., Ding, M., Pathirana, P.N., Seneviratne, A.: Blockchain and AI-based solutions to combat coronavirus (covid-19)-like epidemics: a survey. IEEE Access 9, 95730–95753 (2021). https://doi.org/10.1109/access.2021.3093633
12. Biswas, S., Sharif, K., Li, F., Bairagi, A.K., Latif, Z., Mohanty, S.P.: GlobeChain: an interoperable blockchain for global sharing of healthcare data—a COVID-19 perspective. IEEE Consum. Electron. Mag. 10, 64–69 (2021). https://doi.org/10.1109/mce.2021.3074688
13. Javed, I.T., Alharbi, F., Bellaj, B., Margaria, T., Crespi, N., Qureshi, K.N.: Health-ID: a blockchain-based decentralized identity management for Remote Healthcare. Healthcare 9, 712 (2021). https://doi.org/10.3390/healthcare9060712
14. Tobin, A., Reed, D.: Inevitable rise of self-sovereign identity. In: Sovrin (2021). https://sovrin.org/library/inevitable-rise-of-self-sovereign-identity/
15. Shuaib, M., Daud, S.M., Alam, S.: Self-sovereign identity framework development in compliance with self sovereign Identity principles using components. Int. J. Mod. Agric. 10(2), 3277–3296 (2021). http://www.modern-journals.com/index.php/ijma/article/view/1155
16. Siqueira, A., Conceição, A.F., Rocha, V.: User-centric health data using self-sovereign identities. In: Anais do IV Workshop em Blockchain: Teoria, Tecnologias e Aplicações (WBlockchain 2021) (2021). https://doi.org/10.5753/wblockchain.2021.17135
17. Zhang, J., Xue, N., Huang, X.: A secure system for pervasive social network-based healthcare. IEEE Access 4, 9239–9250 (2016). https://doi.org/10.1109/access.2016.2645904
18. Yue, X., Wang, H., Jin, D., Li, M., Jiang, W.: Healthcare data gateways: found healthcare intelligence on blockchain with novel privacy risk control. J. Med. Syst. 40(10), 1–8 (2016). https://doi.org/10.1007/s10916-016-0574-6
19. Liang, X., Zhao, J., Shetty, S., Liu, J., Li, D.: Integrating blockchain for data sharing and collaboration in mobile healthcare applications. In: 2017 IEEE 28th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC) (2017). https://doi.org/10.1109/pimrc.2017.8292361
20. Gordon, W.J., Catalini, C.: Blockchain technology for healthcare: facilitating the transition to patient-driven interoperability. Comput. Struct. Biotechnol. J. 16, 224–230 (2018). https://doi.org/10.1016/j.csbj.2018.06.003
21. Bowles, J.K.F., Mendoza-Santana, J., Vermeulen, A.F., Webber, T., Blackledge, E.: Integrating healthcare data for enhanced citizen-centred care and analytics. Stud. Health Technol. Inform. 275, 17–21 (2020). https://doi.org/10.3233/shti200686
22. Bowles, J., Mendoza-Santana, J., Webber, T.: Interacting with next-generation smart patient-centric healthcare systems. In: Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization (2020). https://doi.org/10.1145/3386392.3399561
23. Janjic, V., et al.: The serums tool-chain: ensuring security and privacy of medical data in smart patient-centric healthcare systems. In: 2019 IEEE International Conference on Big Data (Big Data) (2019). https://doi.org/10.1109/bigdata47090.2019.9005600
24. Truong, N.B., Sun, K., Lee, G.M., Guo, Y.: GDPR-compliant personal data management: a blockchain-based solution. IEEE Trans. Inf. Forensics Secur. 15, 1746–1761 (2020). https://doi.org/10.1109/tifs.2019.2948287
25. Gavrilov, G., Vlahu-Gjorgievska, E., Trajkovik, V.: Healthcare data warehouse system supporting cross-border interoperability. Health Informatics J. 26, 1321–1332 (2019). https://doi.org/10.1177/1460458219876793
26. Kumar, R., Tripathi, R.: Scalable and secure access control policy for healthcare system using blockchain and enhanced Bell–LaPadula model. J. Ambient. Intell. Humaniz. Comput. 12(2), 2321–2338 (2020). https://doi.org/10.1007/s12652-020-02346-8
27. Islam, A., Young Shin, S.: A blockchain-based secure healthcare scheme with the assistance of unmanned aerial vehicle in internet of things. Comput. Electr. Eng. 84, 106627 (2020). https://doi.org/10.1016/j.compeleceng.2020.106627
28. Bowles, J., Webber, T., Blackledge, E., Vermeulen, A.: A blockchain-based healthcare platform for secure personalised data sharing. Stud. Health Technol. Inform. 281, 208–212 (2021). https://doi.org/10.3233/shti210150
29. Smirnov, A., Teslya, N.: Ambulance vehicle routing under pandemic with Fuzzy Cooperative game via smart contracts. In: Proceedings of the 7th International Conference on Vehicle Technology and Intelligent Transport Systems (2021). https://doi.org/10.5220/0010455600002932
30. Sreeraj, R., Singh, A., Anbarasu, D.V.: Preserving EMR records using blockchain. In: Annals of the Romanian Society for Cell Biology (2021). https://www.annalsofrscb.ro/index.php/journal/article/view/6480. Accessed 8 Mar 2022
31. Gul, M.J., Subramanian, B., Paul, A., Kim, J.: Blockchain for public health care in smart society. Microprocess. Microsyst. 80, 103524 (2021). https://doi.org/10.1016/j.micpro.2020.103524
32. Srivastava, S., Singh, S.V., Singh, R.B., Kumar, H.: Digital transformation of Healthcare: a blockchain study. Int. J. Innov. Sci. Technol. 8, 414–425 (2021)
33. Ratta, P., Kaur, A., Sharma, S., Shabaz, M., Dhiman, G.: Application of blockchain and internet of things in healthcare and medical sector: applications, challenges, and future perspectives. J. Food Qual. 2021, 1–20 (2021). https://doi.org/10.1155/2021/7608296
34. Omar, I.A., Jayaraman, R., Debe, M.S., Salah, K., Yaqoob, I., Omar, M.: Automating procurement contracts in the healthcare supply chain using blockchain smart contracts. IEEE Access 9, 37397–37409 (2021). https://doi.org/10.1109/access.2021.3062471
35. Khubrani, M.M.: A framework for blockchain-based smart health system. In: Turkish J. Comput. Math. Educ. (TURCOMAT) 12(9), 2609–2614 (2021). https://turcomat.org/index.php/turkbilmat/article/view/3750

A Review on Test Case Selection, Prioritization and Minimization in Regression Testing

Swarnalipsa Parida, Dharashree Rath, and Deepti Bala Mishra(B)

GITA Autonomous College, Bhubaneswar 752054, India
[email protected]

Abstract. Regression testing is applied to existing software, typically when the software is in its maintenance and operation phase. It gives assurance that the software product still works as the previous version did after changes are made to fix bugs detected after delivery. In this testing technique, the test suites are executed and updated from time to time to check the functionality of the new version as well as the old one. In this paper we survey various regression testing methods that focus on the execution of test suites in terms of test suite selection, test suite minimization and test suite prioritization. We also study regression testing techniques that researchers have developed using different test case factors, algorithms and metrics such as APFD, APSC and APCC.

Keywords: Regression testing · Test case selection · Test case minimization · Test case prioritization · APFD · APCC · SUT

1 Introduction

In the Software Development Life Cycle (SDLC), after the coding and design phase, the developed software needs to be checked to confirm that no fault is present before the developers move to the next phase of the lifecycle model. This process is known as software testing [1]. Testing plays a crucial role in software engineering: it helps address end users' new requirements and eradicate the bugs that arise in the product [2]. Regression testing, one of the important testing techniques, is performed on the updated version of the software. It helps developers as well as end users to make changes to the product. The need for regression testing arises whenever changes are made to existing software [3]. When software is modified, the tester should make sure during regression testing that the modification has no unfavorable impact on the quality of the software [1, 3]. The rest of this paper is arranged as follows: Sect. 2 describes the basic concepts of testing and the working principles of some testing techniques; Sect. 3 presents a literature review of existing work on regression testing techniques. Finally, the conclusion and some future scope are drawn in Sect. 4.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 156–163, 2022. https://doi.org/10.1007/978-3-031-11713-8_16


2 Basic Concepts

This section describes a few background concepts and definitions related to the regression testing process in software testing.

2.1 Regression Testing

When a software product is handed over to customers and they run it with their real-time data and information, it may require many changes and modifications [4]. The developers therefore have to modify the internal code so that the product meets the customers' requirements. After any type of modification the product has to be tested again, which is called regression testing [5]. Generally, in regression testing the program under test is divided into sub-programs or suites, known as test suites. These test suites are tested with the existing test cases and with new test cases that the tester develops while testing the updated version [6]. The main aim of regression testing is to ensure that the modified software does not affect the behavior of the previous version even though the code has changed. One approach, the "retest-all approach", executes all the test cases in the test suite. Because test suites grow in proportion to the number of test cases, this becomes a costly process and may affect the overall cost of software product development [7].

Fig. 1. Classification of regression testing


To avoid these complications, activities such as test case selection, test case prioritization and test case minimization are performed during regression testing, as shown in Fig. 1.

• Test Case Selection: Test case selection is used to select an appropriate subset of the existing test suite. With this technique, less time is required to test the modified version of the software product [7].
• Test Case Prioritization: Test case prioritization is used to order the test cases so as to enhance the effectiveness of testing, i.e., test cases with higher priority are executed earlier [8]. The test case prioritization problem can be stated as follows:
  Given: A test suite $T$, the set $T_{order}$ of all possible orderings of $T$, and a fitness function $f$ that maps each ordering to a real value according to some criterion.
  Problem: Find $T' \in T_{order}$ such that $f(T') \geq f(T'')$ for all $T'' \in T_{order}$ with $T'' \neq T'$.
• Test Case Minimization: The purpose of test case minimization is to recognize and remove outdated test cases from the test suites, in order to reduce the number of test cases to run and thereby the cost of the software product [9, 10]. Test case minimization can be defined as follows:
  Given: A test suite $T_s$ and a set of test requirements $r_1, r_2, \ldots, r_n$ that must be satisfied to provide the desired testing coverage of the software. The subsets $T_1, T_2, \ldots, T_n$ of $T_s$ are associated with the requirements through a traceability matrix, such that any test case $T_j$ belonging to $T_i$ can be used to test $r_i$.
  Problem: Find a representative set of test cases from $T_s$ that satisfies all the $r_i$, where the $r_i$ can represent either all the requirements of the program or only the requirements related to the modified program.

Test case prioritization can be performed in various ways [11]:

• No prioritization
• Optimal test case prioritization
• Coverage based test case prioritization
• Random test case prioritization
• Slice based test case prioritization
• Risk Based Prioritization (RBP)
• Fault Exposing Potential (FEP) based prioritization
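The minimization problem stated above is an instance of set cover, so one common way to approximate it is a greedy heuristic. The Python sketch below is only an illustration under assumed data: the traceability matrix, the test names `t1`..`t4` and the requirement names `r1`..`r4` are hypothetical, and the heuristic is not taken from any of the surveyed papers.

```python
from typing import Dict, List, Set


def greedy_minimize(coverage: Dict[str, Set[str]], requirements: Set[str]) -> List[str]:
    """Select test cases until every requirement is covered (greedy set cover)."""
    uncovered = set(requirements)
    selected: List[str] = []
    while uncovered:
        # Pick the test covering the most still-uncovered requirements;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(coverage), key=lambda t: len(coverage[t] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:
            raise ValueError(f"requirements cannot be covered: {sorted(uncovered)}")
        selected.append(best)
        uncovered -= gained
    return selected


# Hypothetical traceability matrix: test case -> requirements it exercises.
traceability = {
    "t1": {"r1", "r2"},
    "t2": {"r2", "r3"},
    "t3": {"r1", "r2", "r3"},
    "t4": {"r4"},
}
reduced = greedy_minimize(traceability, {"r1", "r2", "r3", "r4"})
print(reduced)
```

The greedy choice does not guarantee the smallest possible representative set, but it is simple and usually close; here `t3` and `t4` together satisfy all four requirements, so `t1` and `t2` can be dropped.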


Prioritization can take place in two different ways, i.e., code-based test case prioritization and model-based test case prioritization [1, 12–14]:

• Code based test case prioritization: In this process, information related to the source code is used for prioritization. Information such as branch coverage, statement coverage, condition coverage, FEP value and execution time can be taken as test case criteria.
• Model based test case prioritization: Here, the modification of test cases is done based upon the model of the system. The behavior of the system can be expressed in different modelling languages such as UML (Unified Modelling Language), FSM (Finite State Machine) and SDL (Specification Description Language).

2.2 Effectiveness of Prioritized Test

The effectiveness of a prioritized test suite can be measured using the Average Percentage of Fault Detection (APFD) metric [9, 14]. It measures the weighted average of the percentage of faults detected during the execution of the test suite. Expressed as a percentage, APFD ranges from 0 to 100, where a higher value means a faster rate of fault detection. It is therefore a metric of how quickly a test suite identifies the faults. The APFD value can be calculated using Eq. (1) [15, 16].

$$\mathrm{APFD} = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{mn} + \frac{1}{2n} \qquad (1)$$

where
TF_i = position in T of the first test case that detects fault i
T = test suite
i = index of a detected fault (i = 1, ..., m)
m = total number of faults exposed in the test suite
n = total number of test cases in T

Like APFD, various other metrics have been introduced to measure the effectiveness of prioritized test cases [17]:

• APCC: Average Percentage of Code Coverage
• APPC: Average Percentage of Path Coverage
• APBC: Average Percentage of Branch Coverage
• APFD: Average Percentage of Fault Detection
• APFDC: Average Percentage of Fault Detection with Cost metric
• PTR: Problem Tracking Report
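Eq. (1) can be evaluated directly once it is known which test case detects which fault. The following Python sketch is a minimal illustration, assuming a small hypothetical fault matrix (`fault_matrix`, tests `t1`..`t4`, faults `f1`..`f3`); the function name `apfd` and the data are not from the source.

```python
from typing import Dict, List, Set


def apfd(order: List[str], detects: Dict[str, Set[str]]) -> float:
    """Return APFD from Eq. (1) as a fraction (multiply by 100 for a percentage)."""
    n = len(order)
    first_pos: Dict[str, int] = {}  # fault -> TF_i (1-based position of first detecting test)
    for position, test in enumerate(order, start=1):
        for fault in detects.get(test, set()):
            first_pos.setdefault(fault, position)
    m = len(first_pos)
    if n == 0 or m == 0:
        raise ValueError("need at least one test case and one detected fault")
    return 1 - sum(first_pos.values()) / (m * n) + 1 / (2 * n)


# Hypothetical fault matrix: test case -> faults it exposes.
fault_matrix = {"t1": {"f1"}, "t2": {"f1", "f2"}, "t3": {"f3"}, "t4": set()}
original = ["t1", "t2", "t3", "t4"]
prioritized = ["t2", "t3", "t1", "t4"]
print(f"original:    {apfd(original, fault_matrix):.3f}")
print(f"prioritized: {apfd(prioritized, fault_matrix):.3f}")
```

With this data the original order gives 0.625, while moving `t2` and `t3` to the front raises the value to about 0.792, which reflects the earlier detection of all three faults.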


3 Related Work

A lot of work has already been done in the field of regression testing using different algorithms and approaches, and some of it is described in this section. Kaur et al. [9] designed a GA-based approach to prioritize test cases so as to achieve maximum code coverage. Different prioritization approaches, viz. code coverage prioritization and fault coverage prioritization, were discussed with respect to a time-constrained environment and the total amount of code coverage. The APCC metric was used to measure the performance of the proposed regression testing technique, and their experiments achieved the highest APCC value. Sharma et al. [8] compared different existing greedy approaches for test case prioritization and did not obtain satisfactory results. They therefore proposed a hybrid algorithm, GA with hill climbing, to find optimal results. Different coverage factors and the FEP values of test cases were taken for prioritization. The APFD metric was used to measure the performance of the proposed hybrid method, and the reported results were better in terms of the highest fault detection rate. Rhmann et al. [3] proposed an efficient technique using 0–1 integer programming that reduces the test suites and prioritizes the test cases based on statement coverage, fault identification and risk coverage. The proposed method was applied on TCP, and their experimental results give maximum fault coverage with minimum risk coverage. Deepti et al. [6] proposed an approach for test case prioritization using the Genetic Algorithm (GA). The authors also presented a novel code-based prioritization technique that minimizes the test cases while maximizing mutant coverage. The proposed technique takes prioritization factors such as statement coverage, fault exposing potential award and mutant coverage.
The experiment was carried out on a small case study, the Triangle Classifier Problem (TCP), and GA was used for both test case prioritization and minimization. Harikarthik et al. [4] proposed a technique in which new test cases can be generated and bugs can be detected as early as possible. The proposed technique, Kernel Fuzzy C-Means clustering (KFCM), was applied on the Java platform with CloudSim for regression testing. The authors used APFD to measure the fault detection rate and found satisfactory results. Deepti et al. [10] presented a method based on multi-objective GA for test case prioritization and optimization in regression testing. The authors took different test case factors, such as statement coverage data, requirement factors, risk exposure value and execution time, for test case prioritization. The minimization of test cases was done by taking execution time as the basic factor. Chi et al. [5] developed a new approach, Additional Greedy Method Call sequence (AGC), to guide test case prioritization effectively. The proposed method uses dynamic function call sequences and is compared with state-of-the-art test case prioritization techniques in different aspects. Their reported results show that the proposed AGC outperforms other existing methods in scalability. Its bug detection capability is also high, and it achieved the highest mean APFD value. Rahmani et al. [11] presented a novel approach after reviewing various regression testing methods. The authors used open-source systems under test (SUTs) over the course of their evolution. Various methods were also implemented on the SUTs using an MTS tool from SIR (Software-artifact Infrastructure Repository). The effectiveness of the proposed techniques was measured with the help of the APFD metric. Anu Bajaj and Om Prakash [13] developed a tri-level method for regression testing. In the proposed method, the authors have

applied three different nature-inspired approaches: GA, PSO, and a hybrid technique that combines PSO with the gravitational search algorithm (PSOGSA). The gravitational search algorithm uses a chaotic constant to calculate the best fitness, and this constant serves as a critical parameter to control the optimization process. The authors addressed many research questions using the proposed tri-level methods. Their reported results show that the proposed methods perform very well in solving research issues in software regression testing. The hybrid PSOGSA outperformed the other algorithms in minimizing test cases for regression testing. Soumen et al. [19] proposed a honey bee swarm intelligence algorithm for test case optimization in regression testing. The proposed algorithm was designed to enhance the fault detection rate in minimum time. This nature-inspired algorithm was implemented on two VB projects into which some faults had been injected. The authors discussed various prioritization strategies in regression testing, such as no order, random order and reverse order. The proposed method yielded the highest APFD value in comparison with the other prioritization techniques. Huang et al. [7] proposed a GA-based Modified Cost-Cognizant Test Case Prioritization (MCCTCP) to determine the most effective order of test cases. The historical records are gathered from the latest regression testing. The reported results indicate that the effectiveness of fault detection is improved by the proposed method and that historical information can provide high test effectiveness during testing. Kumar et al. [14] proposed a prioritization technique based on requirement analysis, such as requirement priority and requirement factors of varying nature. Their proposed system improves the testing process with respect to the quality, cost and effort of the software and the user's satisfaction. Sharma and Sujata [15] defined an effective model-based approach to generate and prioritize effective test cases.
They used GA to generate effective test paths based on requirement and user view analysis. They took a cost factor for a specific model and estimated the overall cost of testing the functional behavior of the model. Konsaard et al. [17] proposed total coverage-based regression test case prioritization using GA. A modified GA is used for simplification; it has the ability to change the population, which supplies the test cases used in the prioritization process. Srikanth et al. [16] proposed a technique for test case prioritization that takes two factors, Customer Priority (CP) and Fault Proneness (FP), in the domain of enterprise-level cloud applications. Their experimental results indicate that the effectiveness of test case prioritization can be improved by using CP and FP in risk-based prioritization. Wang et al. [18] proposed a Risk Based Test Case Prioritization (RI-TCP) technique that considers the transmission of information flow among the different components of a software system. Their proposed algorithm maps the software into a Class level Directed Network Model based on Information Flow (CDNMIF) according to the dependencies. The priority of each test case is given by the total sum of the risk indexes of all the barbells covered by the test case. They found that the RI-TCP technique gives a higher rate of fault detection with serious risk indicators. Table 1 gives a brief summary of the related works discussed above.


Table 1. Summary of Literature Studies

| Authors | Method used | Results |
|---|---|---|
| Kaur et al. [9], 2011 | GA | APCC = 88.3% |
| Huang et al. [7], 2011 | GA | APFD = 92.46% |
| Kumar et al. [14], 2012 | _ | APFD = 50% |
| Sharma et al. [8], 2014 | Greedy algorithm, meta heuristic algorithm | APFD = 88% |
| Konsaard et al. [17], 2015 | GA | APCC = 100% |
| Srikanth et al. [16], 2015 | PORT approach | APFD = 76% |
| Sharma et al. [15], 2015 | GA | _ |
| Rhmann et al. [3], 2016 | 0–1 Integer programming | _ |
| Wang et al. [18], 2018 | RI-TCP | APFD = 94% |
| Deepti et al. [10], 2019 | GA | APSC = 72% |
| Harikarthik et al. [4], 2019 | KFCM | APFD = 32% |
| Deepti et al. [6], 2019 | GA | APSC = 93.33% |
| Chi et al. [5], 2020 | AGC | APFD = 80.81% |
| Rahmani et al. [11], 2021 | Sorting technique | APFD = 99.71% |
| Anu Bajaj and Om Prakash [13], 2021 | GA, PSO, PSOGSA | APSC = 100% |
| Soumen et al. [19], 2021 | Honey bee swarm intelligence algorithm | APFD = 85% (Project 1), APFD = 82.5% (Project 2) |
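Several of the studies summarized in Table 1 prioritize test cases with a GA and score orderings with APFD. The Python sketch below shows the general shape of such a search. It is a generic illustration and does not reproduce the implementation of any cited study: the fault matrix `DETECTS`, the order crossover, the swap mutation, the elitist selection and all parameter values are assumptions chosen for brevity.

```python
import random
from typing import Dict, List, Set

# Hypothetical fault matrix: test case -> faults it exposes.
DETECTS: Dict[str, Set[str]] = {
    "t1": {"f1"},
    "t2": {"f1", "f2"},
    "t3": {"f3"},
    "t4": set(),
    "t5": {"f2", "f4"},
}


def fitness(order: List[str]) -> float:
    """APFD of an ordering, following Eq. (1)."""
    n = len(order)
    first_pos: Dict[str, int] = {}
    for pos, test in enumerate(order, start=1):
        for fault in DETECTS[test]:
            first_pos.setdefault(fault, pos)
    m = len(first_pos)
    return 1 - sum(first_pos.values()) / (n * m) + 1 / (2 * n)


def order_crossover(a: List[str], b: List[str], rng: random.Random) -> List[str]:
    """Keep a slice of parent a in place; fill the other slots in parent b's order."""
    i, j = sorted(rng.sample(range(len(a)), 2))
    middle = a[i:j + 1]
    rest = [t for t in b if t not in middle]
    return rest[:i] + middle + rest[i:]


def swap_mutation(order: List[str], rate: float, rng: random.Random) -> List[str]:
    """With probability `rate`, swap two randomly chosen positions."""
    child = order[:]
    if rng.random() < rate:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child


def ga_prioritize(tests: List[str], pop_size: int = 20, generations: int = 30,
                  mutation_rate: float = 0.2, seed: int = 0) -> List[str]:
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    population = [rng.sample(tests, len(tests)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]  # the better half survives unchanged
        children: List[List[str]] = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            children.append(swap_mutation(order_crossover(p1, p2, rng), mutation_rate, rng))
        population = elite + children
    return max(population, key=fitness)


best = ga_prioritize(list(DETECTS))
print(best, round(fitness(best), 3))
```

Each chromosome is a permutation of the test suite, so crossover and mutation are chosen to keep every ordering valid. For this small matrix the best achievable APFD is 0.75; in a realistic setting the fitness function would be replaced by whichever factors (coverage, FEP, risk, execution time) the study at hand uses.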

4 Conclusion and Future Scope

This paper has presented a literature survey and a detailed analysis of test case selection, test case minimization and test case prioritization in regression testing. Researchers have already developed a variety of techniques for regression testing. The paper also provides some basic concepts used in test case prioritization and minimization, and shows how different metrics such as APFD, APSC and APCC are used to measure the efficiency of the proposed methods. The existing works also show that researchers have used many factors to perform the various activities of regression testing. Test case factors such as FEP, statement coverage, code coverage, risk coverage, execution time, fault detection rate and historical data can be used for calculating the weight values. In future, we plan to consider more test case factors, such as condition coverage, risk coverage and mutant coverage, for test case prioritization. We also plan to develop an approach for regression testing that can be used for automatic test data generation as well as for test case prioritization. Optimization techniques can also be applied for minimization in regression testing.


References

1. Acharya, A., Mohapatra, D.P., Panda, N.: Model based test case prioritization for testing component dependency in CBSD using UML sequence diagram. Int. J. Adv. Comput. Sci. Appl. 1(3), 108–113 (2010)
2. Ray, M., Mohapatra, D.P.: Prioritizing program elements: a pretesting effort to improve software quality. Int. Sch. Res. Not. (2012)
3. Rhmann, W., Zaidi, T., Saxena, V.: Test cases minimization and prioritization based on requirement, coverage, risk factor and execution time. J. Adv. Math. Comput. Sci., 1–9 (2016)
4. Harikarthik, S.K., Palanisamy, V., Ramanathan, P.: Optimal test suite selection in regression testing with testcase prioritization using modified Ann and Whale optimization algorithm. Clust. Comput. 22(5), 11425–11434 (2017). https://doi.org/10.1007/s10586-017-1401-7
5. Chi, J., et al.: Relation-based test case prioritization for regression testing. J. Syst. Softw. 163, 110539 (2020)
6. Mishra, D.B., Panda, N., Mishra, R., Acharya, A.A.: Total fault exposing potential based test case prioritization using genetic algorithm. Int. J. Inf. Technol. 11(4), 633–637 (2018). https://doi.org/10.1007/s41870-018-0117-0
7. Huang, Y.C., Peng, K.L., Huang, C.Y.: A history-based cost-cognizant test case prioritization technique in regression testing. J. Syst. Softw. 85(3), 626–637 (2012)
8. Sharma, N., Purohit, G.N.: Test case prioritization techniques "an empirical study". In: 2014 International Conference on High Performance Computing and Applications (ICHPCA), pp. 1–6. IEEE, December 2014
9. Kaur, A., Goyal, S.: A genetic algorithm for regression test case prioritization using code coverage. Int. J. Comput. Sci. Eng. 3(5), 1839–1847 (2011)
10. Mishra, D.B., Mishra, R., Acharya, A.A., Das, K.N.: Test case optimization and prioritization based on multi-objective genetic algorithm. In: Yadav, N., Yadav, A., Bansal, J., Deep, K., Kim, J. (eds.) Harmony Search and Nature Inspired Optimization Algorithms. AISC, vol. 741, pp. 371–381. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-0761-4_36
11. Rahmani, A., Min, J.L., Maspupah, A.: An empirical study of regression testing techniques. In: Journal of Physics: Conference Series, vol. 1869, no. 1, p. 012080. IOP Publishing, April 2021
12. Khalilian, A., Azgomi, M.A., Fazlalizadeh, Y.: An improved method for test case prioritization by incorporating historical test case data. Sci. Comput. Program. 78(1), 93–116 (2012)
13. Bajaj, A., Sangwan, O.P.: Tri-level regression testing using nature-inspired algorithms. Innov. Syst. Softw. Eng. 17(1), 1–16 (2021). https://doi.org/10.1007/s11334-021-00384-9
14. Kumar, A., Gupta, S., Reparia, H., Singh, H.: An approach for test case prioritization based upon varying requirements. Int. J. Comput. Sci. Eng. Appl. 2(3), 99 (2012)
15. Sharma, N., Sujata, M.: Model based test case prioritization for cost reduction using genetic algorithm. Int. J. Sci. Eng. Appl. 4(3), 2319–7560 (2015)
16. Srikanth, H., Hettiarachchi, C., Do, H.: Requirements based test prioritization using risk factors: an industrial study. Inf. Softw. Technol. 69, 71–83 (2016)
17. Konsaard, P., Ramingwong, L.: Total coverage based regression test case prioritization using genetic algorithm. In: 2015 12th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pp. 1–6. IEEE, June 2015
18. Wang, Y., Zhu, Z., Yang, B., Guo, F., Yu, H.: Using reliability risk analysis to prioritize test cases. J. Syst. Softw. 139, 14–31 (2018)
19. Nayak, S., Kumar, C., Tripathi, S., Mohanty, N., Baral, V.: Regression test optimization and prioritization using Honey Bee optimization algorithm with fuzzy rule base. Soft. Comput. 25(15), 9925–9942 (2020). https://doi.org/10.1007/s00500-020-05428-z

Artificial Intelligence Advancement in Pandemic Era

Ritu Chauhan1, Harleen Kaur2(B), and Bhavya Alankar2

1 Amity University, Sector 125, Noida, UP, India

[email protected]

2 Department of Computer Science and Engineering, Jamia Hamdard, New Delhi, India

{harleen,balankar}@jamiahamdard.ac.in

Abstract. Artificial intelligence (AI) and machine learning (ML) are broad technologies that have proven valuable and are applied in several application domains. However, the current scenario has posed enormous challenges for researchers and scientists seeking to develop and implement the technology in the real world. Nevertheless, the COVID-19 outbreak has triggered intense work on such applications and on designing modules to discover knowledge from extensive databases. Digital technologies are critical for both social and economic health in the face of the coronavirus. A digital response to the COVID-19 epidemic can take many forms and be quite beneficial. In the current study, we discuss a wide range of applications of AI in the pandemic era, focusing on rapid developments for screening the population and evaluating infection risks.

Keywords: Artificial intelligence · COVID-19 · Healthcare · Machine learning

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 164–172, 2022. https://doi.org/10.1007/978-3-031-11713-8_17

1 Introduction

Coronavirus disease (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), represents an extraordinary public health crisis. On March 16, 2020, the White House teamed up with research institutions and technology companies worldwide to use artificial intelligence (AI) research to develop new data and text mining techniques. Moreover, AI is a promising and potentially powerful technology for the detection and prognosis of the disease. In the past several years, AI-based technologies have combined imaging and other data streams with huge databases of electronic health information, which may enable a personalized approach to better diagnosis and prediction of individual responses to therapies [1–5]. Similarly, numerous studies in the pandemic era have discussed adopting AI to discover knowledge from large-scale healthcare databases. The Allen Institute conducted an integral study of AI with COVID-19 databases, where the focus was to identify the varied factors responsible for the prognosis and diagnosis of the disease. The study applied AI to uncover the new factors which were the

main causes of COVID-19 under investigation [6–10]. Moreover, immense amounts of COVID-19 patient data were integrated and analyzed using advanced machine learning algorithms to better understand patterns of viral spread; the study further reported improved speed and accuracy of diagnosis and the adoption of new and effective therapeutic approaches. Such analysis can also identify the individuals who are potentially the most vulnerable. Past predictions state that the risk posed by COVID-19 has been growing at an unexpected rate. New findings suggest that a growing number of young adults are experiencing severe COVID-19 symptoms, together with an individual hereditary component. This situation shows the urgent need for a comprehensive risk assessment based on physical and physiological characteristics. A major factor may be human angiotensin-converting enzyme 2 (ACE2), which is expressed in the epithelial cells of the lungs, small intestine, heart and kidneys and serves as the receptor for the SARS-CoV-2 spike glycoprotein. It has been suggested that increased ACE2 expression, resulting from the use of ACE2 stimulants to treat hypertension and diabetes, may actually worsen the clinical outcome of COVID-19 infection. This hypothesis still needs to be tested with a rigorous design of long-term clinical observations and studies [11–15]. Accordingly, the biochemistry (ACE2 expression levels, etc.) and clinical data (age, respiratory pattern, viral load, survival, etc.) of COVID-19 patients with underlying illness can be analyzed with machine learning to identify reliable features. This supports not only risk prediction (e.g., based on ACE2) but also risk classification and prediction for a balanced plan of ongoing disease treatment and protection against COVID-19.
ACE2 gene polymorphisms, represented by various genetic variants in the human genome, have been shown to affect viral binding activity, suggesting a possible predisposition to COVID-19 infection [16–20]. Machine learning and artificial intelligence are extensively applied, and researchers are eagerly awaiting the data generated in real time by the epidemic around the world. Ensuring that data are made easily accessible is vital, yet difficult. The limited availability of COVID-19-related clinical data that can be managed and processed in an easily accessible database is a significant barrier today. Hence, it is necessary to build a digital framework to support globally coordinated efforts [21, 22]. To this end, the US government is proceeding with consortium development and funding opportunities. Through these initiatives, clinical data related to COVID-19 will be integrated with existing biobanks, such as the UK Biobank, and with existing data for those patients, such as genotype and physiological characteristics (if already in the biobank), moving towards a faster and more feasible approach to large-scale data mining by bioinformatics and computational researchers. A centralized collection of data from COVID-19 patients around the world will help future artificial intelligence and machine learning studies to develop predictive, diagnostic and therapeutic strategies for COVID-19 and similar pandemics.


2 Artificial Intelligence

2.1 Artificial Intelligence Role in the Treatment of COVID-19

The primary use of AI is undoubtedly to assist researchers in finding vaccines that can protect healthcare professionals and help contain the pandemic. Biomedicine and research rely on numerous techniques, to which various applications of statistics have already contributed. Similarly, the application of AI is an important part of this progress. AI predictions of the viral structure have already saved researchers months of experimentation. AI appears to offer significant support here, but it is limited by so-called "continuous" rules and the endless combinations involved in studying protein folding. Moderna, an American startup, stands out for its biotechnology based on messenger ribonucleic acid (mRNA), for which research on protein folding is essential. With the help of bioinformatics, and with AI as a vital component, it is possible to reduce the time required to develop a vaccine prototype that can be tested in populations. Likewise, in February 2020 the Chinese tech giant Baidu collaborated with Oregon State University and the University of Rochester on the LinearFold prediction algorithm to study the folding of the same protein. This algorithm is much faster and more efficient than the traditional algorithms used to predict the secondary structure of the virus's ribonucleic acid (RNA), and it provides researchers with additional information on how the virus spreads. LinearFold predicts the secondary structure of the COVID-19 RNA sequence in 27 s rather than 55 min. DeepMind, a subsidiary of Alphabet, Google's parent company, has likewise shared predictions of the coronavirus protein structure made with its AlphaFold AI system.
IBM, Amazon, Google and Microsoft are also providing US authorities with the computing power needed to process very large data sets in epidemiology, bioinformatics and molecular modeling.

2.2 Artificial Intelligence as a Catalyst for the Exchange of Information

On March 11, 2020, the White House OSTP in the US met with technology companies and key research groups to determine how to use artificial intelligence tools to analyze the large number of research articles published around the globe during the pandemic. In the weeks after the novel coronavirus outbreak in Wuhan, China, in December 2019, almost 2,000 research papers were published on the effects of the novel virus, potential treatments and the dynamics of the pandemic. Subsequently, Microsoft Research, the National Library of Medicine and the Allen Institute collaborated to assemble 29,000 documents related to the new virus and the wider coronavirus family. So far, 13,000 of these have been processed so that computers can read the basic data about the authors and their affiliations. Kaggle, a subsidiary of Google and a platform commonly used for hosting data science competitions, has posed 10 key questions related to COVID-19. These questions range from risk factors and non-pharmaceutical treatments to the genetic nature of the virus and vaccine development efforts.


2.3 Role of Artificial Intelligence as an Observer and Predictor in the Pandemic Evolution

The Canadian company BlueDot is known for its early AI-based detection of the virus and for its continuous review of 100 data sets, including news, airline ticketing, demographics, climate data and animal populations. BlueDot detected what was then regarded as a pneumonia outbreak in Wuhan, China, on December 31, 2019, and identified the cities most likely to experience this outbreak. A team of researchers working with Boston Children's Hospital has also developed an AI system to track the spread of the coronavirus. This system, called HealthMap, integrates data from Google searches, social networks, blogs and discussion forums. These are sources that epidemiologists do not usually use; however, they help to identify the first signs of an outbreak and to assess the response of the general public. The International Research Centre on Artificial Intelligence (IRCAI) in Slovenia has launched a "smart" media watch on the coronavirus under the auspices of UNESCO. Corona Virus Media Watch provides up-to-date national and global news based on a selection of open media. Developed with the support of the OECD and the Event Registry information extraction technology, this tool enables policy makers, the media and the general public to observe emerging trends related to the novel coronavirus in their own country and around the world. It is presented as a useful source of information for the world.

2.4 Healthcare Personnel Assistance with Artificial Intelligence

Two Chinese companies have developed AI-based coronavirus diagnostic software. The Beijing-based startup Infervision trained its software to detect lung problems using computed tomography (CT) scans. Originally used for diagnosing lung cancer, the software can also identify pneumonia related to respiratory diseases such as COVID-19.
In 34 Chinese hospitals, this technology is being used to help screen 32,000 suspected cases. Alibaba's DAMO Academy, the research arm of the Chinese company Alibaba, has also trained an artificial intelligence system to recognize the coronavirus with a reported accuracy of 96%. According to the company, this system can process the 300 to 400 scans required to diagnose the coronavirus in 20 to 30 s, a task that ordinarily takes 10 to 15 min. In South Korea, AI has reportedly reduced the time required to design test kits based on the genetic makeup of the virus from the usual 2–3 months to a few weeks. The biotech company Seegene developed and distributed its test kits using a system built for automated test development.

3 AI and ML to Combat Covid-19

3.1 Developing Novel COVID-19 Antibody Sequences for Use in Experimental Testing Using Machine Learning

Lawrence Livermore National Laboratory (LLNL) researchers have identified a first set of therapeutic antibody sequences designed to bind and neutralize SARS-CoV-2, the virus that causes COVID-19, which were created in a

matter of weeks using machine learning. The research team is performing experimental testing on the identified antibody designs. At the moment, the only way to treat COVID-19 with antibodies is to take them from the blood of patients who have fully recovered. Through an iterative computational-experimental approach, the novel antibody designs can be improved, enabling a safer and more reliable pathway for employing antibodies as potential therapeutics for people suffering from the disease. LLNL scientists described how they used the Lab's high-performance computers and an ML-driven computational platform to design antibody candidates predicted to bind with SARS-CoV-2. Starting from existing antibody structures for SARS-CoV-1, the ML algorithms propose changes to those structures to optimize them for SARS-CoV-2. The lab experts narrowed the number of viable designs down from a nearly limitless list of options to 20 initial sequences anticipated to target SARS-CoV-2. According to the researchers, free energy calculations for the first set of designs, which were used to forecast the likelihood of binding, compared favorably to analogous calculations for known SARS-CoV-1 antibodies. According to the results, the projected SARS-CoV-2 antibodies may bind to the virus's receptors and then neutralize it by blocking the virus from attaching to and entering human cells. The researchers also report that the antibody mutants scored well on various developability measures, indicating that they can likely be produced in a lab. According to the lab researchers, making the sequences and interaction calculations easily accessible to the scientific community could enable outside groups to compare human-derived antibodies against LLNL's free energy estimates and determine which ones are worth exploring further.
Adam Zemla, an LLNL scientist and co-author, used the known protein structure of SARS-CoV-1 to estimate the 3-D protein structure of SARS-CoV-2, as previously published. Subsequently, the real spike protein structure of SARS-CoV-2 was determined, confirming that the prediction was correct. In attempting to model the binding of SARS-CoV-2 with SARS-CoV-1 neutralizing antibodies, the team recognized that, despite the high level of similarity between the two viruses, SARS-CoV-1 antibodies do not readily bind to SARS-CoV-2. A Lab team headed by Faissol and data scientist Thomas Desautels used a computational platform incorporating machine learning, experimental data, bioinformatics, molecular simulations and structural biology to significantly narrow down the potential antibody designs anticipated to target SARS-CoV-2, starting from the SARS-CoV-2 protein sequence and known antibody structures for SARS-CoV-1. According to the publication, the scientists ran nearly 180,000 free energy simulations of potential antibodies with the SARS-CoV-2 Receptor Binding Domain (RBD) on two high-performance computers at LLNL, Corona and Catalyst, totaling more than 200,000 CPU hours and 20,000 GPU hours. By leveraging Livermore's HPC systems and a unique combination of these computational elements, including bioinformatics, simulation and machine learning, the team can readily follow up on mutants with intriguing predictions as they arise. After predicting SARS-CoV-2 structures, the researchers employed a specially constructed platform to calculate the binding characteristics of nearly 90,000 mutant antibodies. They chose the most promising antibody sequences and determined that they had enhanced interaction with the SARS-CoV-2 RBD with free energies as

Artificial Intelligence Advancement in Pandemic Era


low as −82 kilocalories per mole (kcal/mol), a standard measure of binding strength. The lower the number, the more likely binding is to occur. For comparison, the structure of SARS-CoV-1 bound to one of its known antibodies has an energy of −52 kcal/mol. The LLNL team is continuing to develop the platform by running more computations with different types of antibodies known to bind to SARS-CoV-1. They are also running higher-fidelity molecular dynamics computations to improve prediction accuracy and looking into binding “hotspots,” areas where a small number of contact sites dominate binding. According to the team, the first platform relied on free-energy estimates from a standard molecular modeling method that provided a good balance of efficiency and agility; to determine the binding free energy of the new antibody designs, they are now using a more precise but computationally intensive technique. The team is performing these higher-fidelity simulations using a methodology developed through a collaboration between LLNL and Harvard University, which was backed by internal Laboratory Directed Research and Development (LDRD) funding.

3.2 CT Scan Checks with AI for COVID-19

According to Chinese researchers, artificial intelligence (AI) can detect COVID-19 from CT images. At least two research groups have published studies claiming to show that deep learning can assess radiographic features for correct COVID-19 diagnosis faster than existing laboratory tests, saving crucial time for disease management. COVID-19 emerged in Wuhan, China, towards the end of 2019 and has since spread around the world. The World Health Organization declared the outbreak a pandemic in March 2020, and over 130,000 fatalities had been reported worldwide at the time of reporting. The virus can cause anything from asymptomatic pneumonia to severe pneumonia with acute breathing difficulties and multi-organ failure.
A reverse transcription polymerase chain reaction (RT-PCR) test on a respiratory sample, typically a nasopharyngeal swab, is commonly used to diagnose COVID-19; however, there are questions about the sensitivity and availability of these assays. CT scans have been found to be effective at detecting COVID-19-related changes in the lungs, potentially allowing faster diagnosis than existing RT-PCR testing. However, COVID-19 shares imaging characteristics with other kinds of pneumonia, making the two difficult to distinguish. Researchers from the Tianjin Medical University Cancer Institute used CT images from 180 people who had conventional viral pneumonia well before the COVID-19 epidemic and 79 people with confirmed COVID-19 to build an AI technique to detect the virus. The patients’ images were allocated at random to train or test the deep learning model. The researchers report that their model correctly recognised COVID-19 from computed tomography images with an accuracy of 89.5%. The accuracy of two radiologists who also analysed the images was roughly 55%. According to the researchers, the findings show that AI can effectively support diagnosis from a CT scan. Another team from China used chest CT images from 400 patients with COVID-19, over 1400 people with community-acquired pneumonia, and more than 1000 people without pneumonia to train a deep-learning model to identify COVID-19.


R. Chauhan et al.

When they evaluated their AI on CT scans from 450 patients, 20% of whom had COVID-19, they found that it had an accuracy of roughly 90%. The researchers claim that this demonstrates that deep learning can distinguish COVID-19 from community-acquired pneumonia and other lung illnesses. Although the United States currently does not recommend CT as a main diagnostic tool, preliminary results from China indicated that CT had a high specificity and was accurate in detecting COVID-19. There has been a lot of controversy about that paper since then, including concerns about possible selection bias.

3.3 AI as an Aid in the Transmission of COVID-19

In Greece, a new machine learning approach to COVID-19 testing has yielded promising results. Between August and November 2020, the Eva system dynamically used recent test results collected at the Greek border to detect asymptomatic COVID-19 cases among incoming international passengers and limit their spread, contributing to a reduction in the number of cases and deaths in the country. Eva identified 1.85 times as many asymptomatic, infected travellers as would have been detected by standard random testing. During the peak tourist season of August and September, its detection rate was two to four times higher than that of random checks. The work lays out a framework for using AI and real-world data to achieve public health goals such as border screening during a pandemic. Eva also holds the prospect of stretching the already overburdened testing infrastructure in most nations, strained by the rapid spread of new coronavirus strains. Given the limited budget for tests, the central question was whether the tests could be used more efficiently by applying dynamic monitoring to identify more infected travellers. One of the most significant issues confronting countries as they cope with COVID-19 is the lack of information.
Because testing every arrival would be both expensive and complex, most governments either screen inbound travellers from specified countries or carry out random COVID-19 testing. Eva also allowed Greece to predict when a country would experience a spike in COVID-19 infections nine days earlier than ML-based algorithms using only publicly available data could have. Eva’s core technology is a “contextual bandit” algorithm, an ML framework designed for sequential decision making that takes into account practical issues such as time-varying data and port-specific test budgets. The approach strikes a balance between the need for high-quality estimates of COVID-19 prevalence across countries and the allocation of limited tests to catch potentially infected travellers.
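
The trade-off between monitoring prevalence and catching infected travellers can be illustrated with a much simpler scheme than the one Eva uses. The sketch below is not Eva’s algorithm: it is a non-contextual Thompson sampling allocator in Python, assuming a Beta posterior over the positivity rate of each origin. The origin names, the counts, the budget and the functions `allocate_tests` and `update` are hypothetical and chosen only for illustration.

```python
import random


def allocate_tests(posteriors, budget, rng):
    """Split `budget` tests across origins by Thompson sampling.

    `posteriors` maps an origin to the [alpha, beta] parameters of a Beta
    posterior over its positivity rate. Each test is assigned to the origin
    whose sampled rate is highest in that draw.
    """
    allocation = {origin: 0 for origin in posteriors}
    for _ in range(budget):
        draws = {o: rng.betavariate(a, b) for o, (a, b) in posteriors.items()}
        allocation[max(draws, key=draws.get)] += 1
    return allocation


def update(posteriors, origin, positives, negatives):
    """Fold newly observed test results into the origin's Beta posterior."""
    posteriors[origin][0] += positives
    posteriors[origin][1] += negatives


rng = random.Random(0)
# Hypothetical origins, each starting from a uniform Beta(1, 1) prior.
posteriors = {"A": [1, 1], "B": [1, 1], "C": [1, 1]}
update(posteriors, "A", positives=1, negatives=99)   # A looks low-risk
update(posteriors, "B", positives=30, negatives=70)  # B looks high-risk
# C has not been tested yet, so its rate remains highly uncertain.
plan = allocate_tests(posteriors, budget=100, rng=rng)
```

In this toy run, origin B receives more tests than origin A because its observed positivity rate is higher, while the untested origin C still receives tests because its rate is uncertain. A contextual bandit extends this idea by conditioning the estimates on traveller and country features.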

4 Conclusion

More than a year has passed since the COVID-19 outbreak began. AI-based applications were already in use before the pandemic for diagnosis and drug development, and have since been applied to forecasting the spread of the disease and tracking population movement. New applications have also been developed to address the novel coronavirus and are now part of healthcare provision. AI has made a significant contribution to fighting this pandemic, and ML-based technology is playing a substantial role against COVID-19.


Medical experts have started using machine learning to study the virus, test potential treatments and analyze the public health impacts. A review of the AI-based technologies used over the course of the pandemic reveals the range of approaches available and the extent to which each application meets its goals. Data validation is another crucial factor that determines a tool’s performance; if potential biases in the data are not specifically screened for, they can lead to unfair applications. Artificial intelligence has proven to be a valuable resource for disseminating knowledge about the pandemic, for example clinical trial information and new insights into the illness and its progression. Thanks to AI advancements, real-time data can now be exchanged among professionals such as doctors, scientists, research labs, and medical personnel, which is a significant contribution. In addition, AI has made a big contribution to vaccine development during the pandemic. Someday the pandemic will be over; however, its impact on the future economy, global health, manufacturing, education, political relations, etc. will certainly remain. It is important to anticipate what the problems caused by the disease will look like in the future so that strategies can be planned now. AI approaches can be useful in both respects: predicting the problems that may occur and suggesting ways of handling them.

References

1. Abdel-Basset, M., Chang, V., Mohamed, R.: HSMA_WOA: a hybrid novel slime mould algorithm with whale optimization algorithm for tackling the image segmentation problem of chest X-ray images. Appl. Soft Comput. 95, 106642 (2020). https://doi.org/10.1016/j.asoc.2020.106642
2. Lu Wang, L., et al.: CORD-19: the COVID-19 open research dataset. arXiv:32510522 (2020)
3. Dhiman, G., Chang, V., Singh, K.K., Shankar, A.: Adopt: automatic deep learning and optimization-based approach for detection of novel coronavirus COVID-19 disease using X-ray images. J. Biomol. Struct. Dyn. 0(0), 1–13 (2021). https://doi.org/10.1080/07391102.2021.1875049
4. Gupta, M., Jain, R., Taneja, S., Chaudhary, G., Khari, M., Verdú, E.: Real-time measurement of the uncertain epidemiological appearances of COVID-19 infections. Appl. Soft Comput. 101, 107039 (2021). https://doi.org/10.1016/j.asoc.2020.107039
5. Keshavarzi Arshadi, A., et al.: Artificial intelligence for COVID-19 drug discovery and vaccine development. Front. Artif. Intell. 3, 65 (2020). https://doi.org/10.3389/frai.2020.00065
6. Mei, X., et al.: Artificial intelligence–enabled rapid diagnosis of patients with COVID-19. Nat. Med. 26(8), 1224–1228 (2020). https://doi.org/10.1038/s41591-020-0931-3
7. Snider, M.: Tests expand on whether wearables could predict corona-virus (2020). https://medicalxpress.com/news/2020-05-wearables-coronavirus.html
8. Yang, Z., Bogdan, P., Nazarian, S.: An in silico deep learning approach to multi-epitope vaccine design: a SARS-CoV-2 case study. Sci. Rep. 11(1), 3238 (2021). https://doi.org/10.1038/s41598-021-81749-9
9. van Sloun, R.J.G., Demi, L.: Localizing b-lines in lung ultrasonography by weakly supervised deep learning, in-Vivo results. IEEE J. Biomed. Health Inform. 24(4), 957–964 (2020). https://doi.org/10.1109/JBHI.2019.2936151


10. Kavadi, D.P., Patan, R., Ramachandran, M., Gandomi, A.H.: Partial derivative nonlinear global pandemic machine learning prediction of COVID 19. Chaos Solitons Fractals 139, 110056 (2020)
11. Li, J., Xu, Q., Shah, N., Mackey, T.K.: A machine learning approach for the detection and characterization of illicit drug dealers on instagram: model evaluation study. J. Med. Internet Res. 21(6), e13803 (2019). https://doi.org/10.2196/13803.100155. https://doi.org/10.1016/j.patter.2020.100155
12. Polyzos, S., Samitas, A., Spyridou, A.E.: Tourism demand and the COVID-19 pandemic: an LSTM approach. Tour. Recreat. Res. 0(0), 1–13 (2020). https://doi.org/10.1080/02508281.2020.1777053
13. Ismael, A.M., Şengür, A.: Deep learning approaches for COVID-19 detection based on chest X-ray images. Expert Syst. Appl. 164, 114054 (2021). https://doi.org/10.1016/j.eswa.2020.114054
14. Harari, Y.N.: Yuval Noah Harari: The world after corona-virus (2020). https://www.ft.com/content/19d90308-6858-11ea-a3c9-1fe6fedcca75
15. Karim, M.R., Döhmen, T., Cochez, M., Beyan, O., Rebholz-Schuhmann, D., Decker, S.: Deepcovidexplainer: explainable COVID-19 diagnosis from chest X-ray images. In: 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1034–1037. IEEE (2020)
16. Hall, L.O., Paul, R., Goldgof, D.B., Goldgof, G.M.: Finding COVID-19 from chest X-rays using deep learning on a small dataset, preprint on webpage at arXiv:2004.02060 (2020)
17. Farooq, M., Hafeez, A.: COVID-resnet: a deep learning framework for screening of COVID-19 from radiographs, preprint on webpage at arXiv:2003.14395 (2020)
18. Soltani, P., Patini, R.: Retracted COVID-19 articles: a side-effect of the hot race to publication. Scientometrics 125(1), 819–822 (2020)
19. Apuzzo, M., Kirkpatrick, D.D.: COVID-19 changed how the world does science together. N. Y. Times, 1 (2020)
20. Chauhan, R., Kaur, H., Chang, V.: An optimized integrated framework of big data analytics managing security and privacy in healthcare data. Wirel. Pers. Commun. (2020). https://link.springer.com/article/10.1007/s11277-020-07040-8
21. Chauhan, R., Kaur, H., Alankar, B.: Air quality forecast using convolutional neural network for sustainable development in urban environments. J. Sustain. Cities Soc. (2021). https://www.sciencedirect.com/science/article/abs/pii/S2210670721005163
22. Chauhan, R., Kaur, H., Chang, V.: Advancement and applicability of classifiers for variant exponential model to optimize the accuracy for deep learning. J. Ambient Intell. Hum. Comput. (2017). https://doi.org/10.1007/s12652-017-0561-x

Predictive Technique for Identification of Diabetes Using Machine Learning

Ritu Chauhan1, Harleen Kaur2(B), and Bhavya Alankar2

1 Center for Computational Biology and Bioinformatics, Amity University, Sector 125, Noida, India
[email protected]
2 Department of Computer Science and Engineering, Jamia Hamdard, New Delhi, India
{harleen,balankar}@jamiahamdard.ac.in

Abstract. The digital era has expanded enormously, and with it the volume of data generated. Researchers and scientists are devoting considerable effort to identifying technologies that can be applied broadly to discover relevant information that benefits end users. In the past decade, AI (Artificial Intelligence) and ML (Machine Learning) have emerged as promising technologies that have opened wide prospects in varied applications. Our study focuses on the application of ML to diabetes mellitus, where the aim is to determine the factors relevant to the prognosis and diagnosis of the disease. We have designed and implemented the proposed approach using a decision tree to discover hidden patterns and information in large-scale diabetes mellitus databases.

Keywords: Decision tree · Classification · Machine learning · Artificial intelligence · Healthcare databases

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 173–180, 2022. https://doi.org/10.1007/978-3-031-11713-8_18

1 Introduction

Diabetes is a chronic disorder characterized by an abnormal blood glucose level, caused by either ineffective utilization or insufficient production of insulin. The prevalence of diabetes in 2010 was estimated at 285 million people (6.4% of adults) worldwide. By 2030, that number is expected to grow to 552 million. Based on the current growth rate of the disorder, it is estimated that 1 in 10 adults will develop diabetes by 2040 [1]. The prevalence of diabetes in South Korea has also increased significantly. According to a recent report, 13.7% of South Korean adults have diabetes and almost a quarter have prediabetes. Diabetes frequently goes undetected because people with diabetes are often unaware of the disease or are asymptomatic. Almost one third of diabetics are unaware of their condition. Uncontrolled diabetes causes serious long-term damage to various organs and body systems, including the kidneys, heart, nerves, blood vessels, and eyes. Hence, early detection of the disease allows people at risk to take preventive measures to slow its progression and improve their quality of life. Research is being carried out in several areas, such as machine learning (ML) and artificial intelligence (AI), to reduce the effects of diabetes and improve the quality of patient care. ML-based methods for predicting the development of diabetes have been reported in various studies [2]. In general, artificial intelligence is positioned to be the catalyst that accelerates advancements throughout the healthcare system and has the capacity to transform healthcare. It is regarded as one of the most promising technological advances for improving patient outcomes and increasing the productivity of healthcare delivery, and it is expected to be an important factor in speeding up the development of life-saving therapies. Moreover, AI is being used in various healthcare sectors to improve the accuracy and timeliness of operational and administrative activities, such as claims processing and patient administration. This frees up time for healthcare personnel, which can then be used to improve the quality of patient care and experience. In telemedicine, for example, AI-based consultation and prescription give patients in rural locations (or major cities with severe traffic jams) access to care at low cost and with a quality similar to a physical visit to a doctor. During the pandemic, there was a tremendous need to automate basic decision-making procedures for advising patients because of shortages of important resources such as medical professionals and sometimes even space in hospitals and healthcare institutions. In this regard, AI plays a critical role, particularly in assisting clinicians in the early diagnosis of COVID-19 cases by promptly assessing unusual symptoms and alerting the relevant patients and healthcare authorities.
So, implementing proper AI technology throughout the healthcare sector can benefit doctors, administrative personnel, and, of course, patients [26–28]. In the current study, we applied classification-based techniques to diabetes mellitus databases for the diagnosis and prognosis of the disease. The data were obtained from the National Institute of Diabetes and Digestive and Kidney Diseases and carefully examined. The analysis was based on measurements of diabetes patients gathered during examination, including the number of pregnancies. The data has several parameters that support the study, including age, insulin level, and other attributes. The rest of the paper is organized as follows. Section 2 briefly discusses the role of AI in healthcare with relevant literature. Section 3 discusses the role of machine learning in healthcare databases with predictive data analytics. Section 4 presents the exploration of the data and the results obtained. Finally, the conclusion is given at the end.

2 Literature Review

Most studies suggest that the increase in white blood cell count is due to chronic inflammation during hypertension. A family history of diabetes is not associated with BMI and insulin. However, increased BMI is not always associated with abdominal obesity [4]. A single parameter is not very effective in diagnosing diabetes accurately and can be misleading in the decision-making process. It is necessary to combine several parameters to predict diabetes early and effectively. Some existing techniques do not give


effective results when multiple parameters are used to predict diabetes [3]. In our study, diabetes is predicted with the help of correlations among significant and diverse attributes. The diagnosis of diabetes has been investigated by combining ANN, RF, and K-means [5]. Ahmad compared the prediction accuracy of the multilayer perceptron (MLP) neural network against the ID3 and J48 algorithms [8]. The results showed that a pruned J48 tree performed with higher accuracy, 89.3% compared to 81.9%. Marcano-Cedeño proposed an artificial metaplasticity multilayer perceptron (AMMLP) as a predictive model for diabetes, with a best result of 89.93% [9]. All previous studies used the same Pima Indian diabetes data set as test data. The Waikato Environment for Knowledge Analysis (WEKA) was the main data analysis tool of choice for most researchers [9]. To obtain more meaningful and useful information, it is necessary to make reasonable choices of preprocessing methods and parameters. Vijayan V. explored the benefits of different preprocessing methods for predicting DM. The preprocessing methods were principal component analysis (PCA) and discretization. The study concluded that preprocessing improves the accuracy of the naive Bayes classifier and the decision tree (DT) and decreases the accuracy of the support vector machine (SVM) [10]. Wei analyzed the risk factors for type 2 diabetes (DM2) based on the FP-growth and Apriori algorithms. Guo used the area under the receiver operating characteristic (ROC) curve, sensitivity, and specificity as predictive measures to validate the test results [11]. An effective prediction algorithm is needed to build a model that fits everyone. A similar study proposed a solution based on an Android application to overcome the lack of knowledge about DM [12].
This application used the DT classifier to predict a user’s diabetes level. The system also provided information and tips on diabetes. Diabetes, a non-communicable disease, causes long-term complications and serious health problems. A report from the World Health Organization addresses diabetes and its complications, which have physical, financial and economic consequences for patients and their whole families. Studies show that uncontrolled diabetes has led to the death of around 1.2 million people [3]. Risk factors associated with diabetes, such as cardiovascular disease and other illnesses, have killed about 2.2 million people. Diabetes is a disease caused by prolonged elevated sugar levels in the blood. One study describes different classifiers and proposes a decision support system that uses the AdaBoost algorithm with decision stumps as the base classifier for classification. Support vector machines, naive Bayes, and decision trees were also used as base classifiers for AdaBoost to compare accuracy. The accuracy obtained for the AdaBoost algorithm with a decision stump as the base classifier is 80.72%, which is higher than that of the support vector machine, naive Bayes and decision tree [14].
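
As a minimal sketch of the technique described above, the following Python code implements discrete AdaBoost with decision stumps as the base classifier. It is not the implementation evaluated in [14]: the one-feature toy data, the function names and the number of rounds are assumptions made for illustration, and labels are encoded as +1 (diabetic) and -1 (non-diabetic).

```python
import math


def fit_stump(X, y, w):
    """Return (error, feature, threshold, polarity) of the stump with the
    lowest weighted error. A stump predicts `polarity` when the feature
    value exceeds the threshold and `-polarity` otherwise."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                pred = [polarity if row[j] > t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best


def adaboost(X, y, rounds):
    """Train `rounds` weighted stumps; returns a list of (alpha, j, t, polarity)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, t, polarity = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, j, t, polarity))
        pred = [polarity if row[j] > t else -polarity for row in X]
        # Increase the weight of misclassified samples, then renormalise.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted majority vote of all stumps."""
    score = sum(a * (pol if row[j] > t else -pol) for a, j, t, pol in ensemble)
    return 1 if score > 0 else -1


# Hypothetical one-feature data that no single threshold can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, rounds=3)
predictions = [predict(ensemble, row) for row in X]
```

No single stump separates this toy data, but the weighted vote of three stumps classifies every training point correctly, which is the reason for boosting weak base classifiers.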

3 Prediction of Healthcare Data Using AI and ML

Digitization has put immense pressure on researchers and clinicians to discover information from large-scale databases. Data is growing exponentially, so the challenge is to acquire, exchange, analyze and transmit data meaningfully to end users. In general, traditional technology was unable to discover patterns in large


scale databases, so a robust technology was required that can consistently handle the volume of data and generate meaningful models for the prognosis and diagnosis of disease. AI and ML have opened wide opportunities for researchers to identify hidden patterns and knowledge in healthcare databases, and they have proved effective under the demanding conditions of several application domains. In particular, they have opened wide opportunities in the domain of healthcare databases. Diabetes mellitus (DM) is a chronic disease characterized by a high level of sugar in the blood. Almost 50% of all diabetics carry a hereditary family factor, which is one of the most important characteristics of DM. The failure of the pancreas to produce sufficient insulin and the ineffective use of insulin in the body are the pathological causes of diabetes. There are two types of DM. The etiology of type 1 diabetes (T1DM) is that damaged beta cells in the pancreas fail to secrete insulin, preventing blood sugar levels from being lowered in time. Insulin resistance and insulin deficiency are the etiology of type 2 diabetes (DM2), which is also called non-insulin-dependent DM. During the last thirty years of development in China, where the number of diabetics is increasing, people have started to realize that this chronic disease has a profound effect on the daily lives of every family and everyone. The proportion of diabetics in the general population is increasing, with the number of diabetic men growing at a higher rate than that of diabetic women. According to some official statistics, the number of diabetics in China in 2017 was around 110 million. The International Diabetes Federation (IDF) publishes the latest data on DM in the Diabetes Atlas (seventh edition). In 2015, the number of diabetics worldwide was approaching 415 million.
The growing population of diabetics is projected to approach 642 million, or 1 in 10 adults, by 2040. Focusing on high-risk groups of patients with DM is essential to reduce the morbidity and impact of DM [23–25]. Machine learning is helping to overcome the problems that huge amounts of data pose. Healthcare organizations may use machine learning to address rising medical needs, improve operations, and reduce spending. Machine learning is among the most common forms of AI. It analyses and discovers patterns in massive data sets in order to aid decision-making, which makes data analytics an appropriate field of study here. The data set contains 215,544 records relating to patient visits. The outcome variable is diabetes, encoded as a binary variable: 0 indicates a patient without DM and 1 a patient with DM. The predictors of interest are sex, age (at test date), BMI (body mass index), TG (triglycerides), FBS (fasting blood sugar), SBP (systolic blood pressure), HDL (high-density lipoprotein) and LDL (low-density lipoprotein). Since patients can have multiple records representing multiple visits to a medical center, we took each patient’s last visit to obtain a data set of 13,317 patients. During the exploratory data analysis step, some extreme values for BMI and TG were found, and these values were excluded to obtain the final analysis data set containing 13,309 patients. Approximately 20.9% of the patients in this data set have DM. About 40% of the patients are men and the remaining 60% are women. The patients in this data set are between 18 and 90 years of age. Age is also coded as a categorical


variable represented by four categories: youth, moderate age, partial old and old. Approximately 44.6% of the patients are of moderate age, between 40 and 64 years; 47.8% are older, 65 to 84 years of age; 4.8% are 85 years of age or older; and 2.9% are under 40. The body mass index was calculated by dividing the patient’s weight (in kilograms) by the square of the patient’s height (in meters). BMI ranges from 11.2 to 70 with a median of 28.9. The distributions of BMI, FBS, HDL, and TG are all skewed. Table 1 shows that the medians of BMI, FBS and TG of the group of patients with DM are above those of the group of patients without DM; the median HDL is higher for the group of patients without DM, while the median LDL, median SBP, and median age are similar [18–25].
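
The data-preparation steps described above can be sketched as follows. This Python example is an illustration only: the record fields (`patient_id`, `visit_date`, `age`), the sample records and the function names are hypothetical, and the age boundaries are taken from the ranges quoted in the text (under 40, 40 to 64, 65 to 84, 85 and over), assuming they map onto the four categories in that order.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def age_category(age):
    """Map an age in years to one of the four categories named in the text."""
    if age < 40:
        return "youth"
    if age < 65:
        return "moderate age"
    if age < 85:
        return "partial old"
    return "old"


def last_visit_per_patient(records):
    """Keep only the most recent record of each patient.

    Dates are ISO strings (YYYY-MM-DD), so string comparison orders them
    chronologically.
    """
    latest = {}
    for rec in records:
        pid = rec["patient_id"]
        if pid not in latest or rec["visit_date"] > latest[pid]["visit_date"]:
            latest[pid] = rec
    return list(latest.values())


# Hypothetical visit records, invented for illustration.
records = [
    {"patient_id": 1, "visit_date": "2015-03-01", "age": 52},
    {"patient_id": 1, "visit_date": "2016-07-15", "age": 53},
    {"patient_id": 2, "visit_date": "2014-11-30", "age": 87},
]
latest = last_visit_per_patient(records)
categories = {rec["patient_id"]: age_category(rec["age"]) for rec in latest}
```

Reducing the records to one row per patient before modelling keeps patients with many visits from dominating the data set, which is the motivation for the last-visit rule in the text.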

4 Results

Because data is growing at an exponential rate, considerable effort is being devoted to understanding its complex nature. In the proposed study, the data was obtained from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDKD). The data consists of measurements of diabetes patients gathered during examination, including the number of pregnancies. It has several parameters to support the study, including age, insulin level and other attributes, which are described in detail in Table 1.

Table 1. Description of data

Attributes                 Description of data
Pregnancies                Number of times the patient has been pregnant
Glucose                    Plasma glucose concentration at 2 h in an oral glucose tolerance test
BloodPressure              Diastolic blood pressure (mm Hg)
SkinThickness              Triceps skin fold thickness
Insulin                    2-h serum insulin
BMI                        Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction   Diabetes pedigree function
Age                        Age in years
Outcome                    The outcome for the prognosis of disease

After preprocessing, the data was used for classification. In Fig. 1, a decision tree has been constructed in which parameters such as pregnancies, insulin level, glucose level, age and other features were evaluated against the target feature, the outcome of the disease. Figure 1a shows that the root node is glucose; the decision tree was built retrospectively from patients diagnosed with diabetes mellitus and the factors that are influential for the prognosis of the disease. In the tree, the dark


color indicates the nodes in which diabetes is confirmed. Figure 1 represents the largest group of cases likely to have pregnancy-induced diabetes: patients with a glucose level higher than 125 and lower than 200, more than two pregnancies, and an age between 20 and 30 years have the highest chance of a diabetes prognosis.

Fig. 1. Decision tree with confirmed cases of diabetes mellitus
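
To illustrate how a decision tree can arrive at a root node such as glucose, the sketch below searches for the single split that minimises the weighted Gini impurity, a criterion used by CART-style trees [18]. The paper does not state which criterion or software was used, so this is an assumption rather than the authors’ implementation; the toy rows are invented, are not taken from the NIDDKD data, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(rows, labels, feature_names):
    """Return (impurity, feature, threshold) of the split with the lowest
    weighted Gini impurity, trying midpoints between sorted feature values."""
    best = None
    n = len(rows)
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, name, t)
    return best


# Invented toy rows: [Glucose, BMI, Age]; label 1 means diabetes confirmed.
features = ["Glucose", "BMI", "Age"]
rows = [
    [80, 24.0, 25], [90, 31.0, 41], [100, 27.5, 33], [115, 35.0, 52],
    [130, 29.0, 29], [150, 26.0, 47], [170, 33.0, 36], [190, 38.0, 58],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
root = best_split(rows, labels, features)
```

On this toy data the lowest-impurity split is on Glucose, so Glucose would become the root node; a full tree repeats the same search recursively on each side of the split.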

Figure 2 represents the cases least likely to have pregnancy-induced diabetes: if the glucose level is higher but BMI is controlled, the likelihood of diabetes mellitus is very low.

Fig. 2. Patients not suffering from diabetes

Figure 3 shows a scatter plot of the outcome of the prognosis of diabetes mellitus. Blue indicates a minimal chance of diabetes mellitus, whereas red indicates that the prognosis is on the higher side when correlated features are taken into account, such as BMI higher than 29.3, glucose level < 50, age and pregnancies.


Fig. 3. Scatter plot of outcome

5 Conclusion

AI and ML have grown rapidly in the past decade, driven by the spread of global digital platforms across every domain of knowledge. The current study examines the overall patterns in the prognosis and diagnosis of diabetes mellitus in female patients. It applies a decision tree algorithm to detect patterns that may be associated with the occurrence of the disease. The results suggest that the patterns identified correspond to the occurrence of the disease.

References

1. International Diabetes Federation (IDF): DIABETES ATLAS, 7th edn. (2015)
2. https://www.sciencedirect.com/science/article/pii/S2352914817301405
3. The International Diabetes Federation (IDF): [Internet]. http://www.idf.org/complicationsdiabetes
4. http://en.wikipedia.org/wiki/Data_mining#cite_note-acm-1
5. Diabetes mellitus prediction model based on data mining unlocked (2018). https://doi.org/10.1016/j.imu.2017.12.006
6. https://www.sciencedirect.com/science/article/pii/S1877050915004500
7. https://ac.els-cdn.com/S1877050915004500/1-s2.0-S1877050915004500-main.pdf?_tid=f721250c-d935-497cb84c-986b803ab30&acdnat=1520326508_3495c7cac8e512ab149acea41f03627f
8. Woldaregay, A.Z., Årsand, E., Botsis, T., Albers, D., Mamykina, L., Hartvigsen, G.: Diabetes. J. Med. Internet Res. 21, e11030 (2019)
9. Maniruzzaman Kumar, N., Abedin, M., Islam, S., Suri, H.S., El-Baz, A.S., Suri, J.S.: Comparative approaches for classification of diabetes mellitus data: machine learning paradigm. Comput. Methods Programs Biomed. 152, 23–34 (2017)
10. Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., Chouvarda, I.: Comput. Struct. Biotechnol. J. 15, 104–116 (2017)
11. VeenaVijayan, V., Anjali, C.: Decision support systems for predicting diabetes mellitus – a review. In: Proceedings of 2015 Global Conference on Communication Technologies (GCCT 2015)

180

R. Chauhan et al.

12. Wei, Z., Ye, G., Wang, N.: Analysis for risk factors of type 2 diabetes mellitus based on FP-growth algorithm. China Med. Equip. 13(5), 45–48 (2016) 13. Guo, Y.: Application of artificial neural network to predict individual risk of type 2 diabetes mellitus. J. Zhengzhou Univ. 49(3), 180–183 (2014) 14. Chauhan, R., Kaur, H., Chang, V.: Advancement and applicability of classifiers for variant exponential model to optimize the accuracy for deep learning. J. Ambient Intell. Hum. Comput. (2017). https://doi.org/10.1007/s12652-017-0561-x 15. Chauhan, R., Kaur, H.: A feature based reduction technique on large scale databases. Int. J. Data Anal. Tech. Strateg. 9(3), 207 (2017) 16. Chauhan, R., Kaur, H., Alam, A.M.: Data clustering method for discovering clusters in spatial cancer databases. Int. J. Comput. Appl. Spec. Issue 10(6), 9–14 (2010) 17. Chauhan, R., Kaur, H., Chang, V.: An optimized integrated framework of big data analytics managing security and privacy in healthcare Data. Wirel. Pers. Commun. 117(1), 87–108 (2020) 18. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth International Group, Belmont (1984) 19. Ash, C., Farrow, J.A.E., Wallbanks, S., Collins, M.D.: Phylogenetic heterogeneity of the genus bacillus revealed by comparative analysis of small subunit ribosomal RNA sequences. Lett. Appl. Microbiol. 13, 202–206 (1991) 20. Audic, S., Claverie, J.M.: The significance of digital gene expression profiles. Genome Res. 7, 986–995 (1997) 21. Wan, V., Campbell, W.: Support vector machines for speaker verification and identification. In: Neural Networks for Signal Processing X. Proceedings of the 2000 IEEE Signal Processing Society Workshop (Cat. No.00TH8501) (2000) 22. Chapelle, O., Haffner, P., Vapnik, V.: Support vector machines for histogram-based image classification. IEEE Trans. Neural Netw. 10(5), 1055–1064 (1999) 23. 
Lee, J.W., Lee, J.B., Park, M., Song, S.H.: An extensive evaluation of recent classification tools applied to microarray data. Comput. Stat. Data Anal. 48, 869–885 (2005) 24. Yeung, K.Y., Bumgarner, R.E., Raftery, A.E.: Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics 21, 2394–2402 (2005) 25. American Diabetes Association: Standards of medical care in diabetes—2011. Diabetes Care 34(Suppl. 1), S11–61 (2011). https://doi.org/10.2337/dc11-S011 26. Chauhan, R., Kaur, H., Chang, V.: An optimized integrated framework of big data analytics managing security and privacy in healthcare data. wireless personal communication (2020). https://link.springer.com/article/10.1007/s11277-020-07040-8 27. Chauhan, R., Kaur, H., Alankar, B.: Air quality forecast using convolutional neural network for sustainable development in urban environments. J. Sustain. Cities Soc (2021). https:// www.sciencedirect.com/science/article/abs/pii/S2210670721005163 28. Chauhan, R., Kaur, H., Chang, V.: Advancement and applicability of classifiers for variant exponential model to optimize the accuracy for deep learning. J. Ambient Intell. Hum. Comput. (2017). https://doi.org/10.1007/s12652-017-0561-x. {SCI IF: 7.588}. https://link. springer.com/article/10.1007%2Fs12652-017-0561-x

Prognosis of Prostate Cancer Using Machine Learning

Ritu Chauhan1, Neeraj Kumar1, Harleen Kaur2(B), and Bhavya Alankar2

1 Center for Computational Biology and Bioinformatics, Amity University, Sector 125, Noida, Uttar Pradesh, India
{rchauhan,Nkumar8}@amity.edu
2 Department of Computer Science and Engineering, Jamia Hamdard, New Delhi, India
{harleen,balankar}@jamiahamdard.ac.in

Abstract. Prostate cancer is a serious disease that affects a large number of middle-aged and older men every year. The majority of cases occur in men over the age of 65. The prostate is a small gland located in a man's lower abdomen, beneath the bladder and surrounding the urethra. It is regulated by the hormone testosterone and produces seminal fluid, also known as semen, the sperm-containing fluid that leaves the urethra during ejaculation. When an abnormal, malignant growth of cells, known as a tumor, forms in the prostate, it is called prostate cancer. Machine learning (ML) is a subset of AI concerned with creating and deploying algorithms that analyze data and its properties without being given a task explicitly defined by predetermined inputs from the environment. In terms of labeling, ML can be divided into three types: supervised, unsupervised, and reinforcement learning. In terms of features, ML methods can be divided into handcrafted and non-handcrafted feature-based techniques. Data mining is closely related to predictive analysis, a branch of statistics that makes extensive use of complex computations and algorithms across a wide range of problems. Data mining meets an important need: recognizing patterns in datasets for problems related to a specific area or domain.

Keywords: Gradient boosting · Random forest · Clinical data

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 181–190, 2022. https://doi.org/10.1007/978-3-031-11713-8_19

1 Introduction

In the simplest terms, cancer refers to cells that grow out of control and invade other tissues. Cells become cancerous through the accumulation of defects, or mutations, in their DNA. The following can likewise damage DNA and lead to cancer:

• Certain inherited genetic defects (for instance, BRCA1 and BRCA2 mutations),
• Infections,
• Environmental factors (for instance, air pollution), and
• Poor lifestyle choices, such as smoking and heavy alcohol use.

If a cell is severely damaged and cannot repair itself, it undergoes so-called programmed cell death, or apoptosis [1]. Cancer occurs when damaged cells grow, divide, and spread abnormally instead of self-destructing as they should. The trillions of cells in a healthy body grow and divide as the body needs them to function from day to day. Each healthy cell has its own life cycle, reproducing and dying off in a way that is determined by the type of cell. Cancer disrupts this cycle and causes abnormal cell growth. Changes, or mutations, in DNA are the cause. Mutations occur frequently in DNA, but cells usually correct these mistakes [2].

Integrating artificial intelligence (AI) into cancer care could improve the accuracy and speed of diagnosis, aid clinical decision-making, and lead to better clinical outcomes. AI-driven clinical care has the potential to reduce health inequities, particularly in low-resource settings. Instead of traditional genomic sequencing, AI algorithms can be used to identify specific gene mutations from tumor pathology images [3]. For example, NCI-funded researchers at New York University used deep learning (DL) to analyze pathology images of lung tumors collected from The Cancer Genome Atlas. Variables that influence cancer rates and types include age, gender, race, environmental factors, diet, and inherited characteristics. The World Health Organization (WHO), for example, provides the following general data on cancer worldwide:

Fig. 1. Common types of cancer among males and females


• Cancer has emerged as a leading cause of death worldwide. It accounts for 8.2 million deaths (about 22% of all deaths not related to communicable diseases; WHO's most recent data).
• Lung, liver, stomach, colon, and breast cancer cause the most cancer deaths every year.
• Deaths from cancer worldwide are projected to keep rising, with an estimated 13.1 million deaths in 2030 (about a 70% increase) [4] (Fig. 1).

1.1 Correlation Between Machine Learning and Healthcare Databases

The medical, or healthcare, sector has long been an early adopter of technological advances and has benefited greatly from them. Nowadays, machine learning (a subtype of artificial intelligence) plays a vital role in many health domains, including the development of advanced procedures, the handling of patient data and records, and the treatment of chronic diseases [5]. AI will affect doctors and emergency care, since it will play a key role in clinical decision support, allowing early detection of illness and custom-tailored therapy to ensure optimal outcomes. Many healthcare organizations have started to take advantage of machine learning. For example, Quotient Health, located in Denver, Colorado, has used machine learning to create software that aims to “diminish the expense of supporting EMR [electronic clinical records] frameworks” by improving and standardizing the way those systems are designed. Prognos, located in New York, states that it holds 19 billion records for 185 million patients in its Prognos Registry. With the help of machine learning, Prognos' AI platform facilitates early disease detection, pinpoints treatment requirements, highlights opportunities for clinical trials, notes gaps in care, and identifies other factors for a variety of conditions [6].

2 Literature Review

Prostate cancer is the most common cancer and the second leading cause of death in men. It progresses more slowly and less aggressively than many other cancers. The likelihood of survival is relatively good if prostate cancer is detected early. In the United States, prostate cancer has a 5-year survival rate of about 98%. In the first stage, the tumor affects only the prostate and has not spread to other tissues. At stage 4, the tumor has spread to tissue outside the prostate and may have moved to other parts of the body. Variables that may account for part of the heterogeneity in the expected course and outcome of a disease are referred to as prognostic factors in cancer [7]. Prognostic indicators are important not just for understanding the natural history and progression of the disease, but also for predicting the results of various therapies, or even of no therapy at all. The evidence base for the diagnosis and treatment of prostate cancer is continually evolving [8].


Prostate cancer is the most common noncutaneous malignancy and the second leading cause of cancer death in men. In the United States, 90% of men with prostate cancer are over 60 years of age, and the disease is typically detected with the prostate-specific antigen (PSA) blood test. PSA is a protein produced in the prostate by both normal and malignant cells. A higher PSA level in the blood has been linked to an elevated risk of prostate cancer. If PSA results are higher than 4 ng/mL, many doctors will refer the patient for additional testing [9]. PSA is measured in a blood sample taken from a vein in the arm. It is normal to have a trace quantity of PSA in the blood. A higher-than-normal level, on the other hand, could indicate prostate infection, inflammation, enlargement, or malignancy.

Standard therapies for clinically localized prostate cancer include watchful waiting, surgery to remove the prostate gland (radical prostatectomy), external beam radiation therapy, and interstitial radiation therapy (brachytherapy). Radiotherapy is a potent therapeutic option with excellent oncological outcomes and significant technological advances over the previous two decades [10]. Advanced prognostic models for prostate cancer (PCA) are increasingly being built using Artificial Neural Networks (ANN). To develop a machine learning model, all that is required is structured datasets containing input parameters and outcomes, together with a basic understanding of PCA [11]. In this situation, AI might play a key role, first in interpreting such a massive amount of data, and then in creating machine learning algorithms that could help urologists limit the frequency of needless prostate biopsies while still catching aggressive PCA early. Prostate cancer is highly prevalent in men, and its natural history shows tremendous diversity [12]. Hence, more indolent tumors can safely be monitored without immediate radical treatment, an approach known as active surveillance [13].

Machine learning (ML) is a branch of artificial intelligence that entails developing and deploying algorithms that assess data and its attributes without being given a specific task based on predetermined inputs from the environment. In terms of classification, ML can be divided into three types: supervised, unsupervised, and reinforcement learning [14]. In terms of features, ML methods are classified as handcrafted or non-handcrafted techniques. The major strength of machine learning is its ability to analyze and use enormous quantities of data far more efficiently than humans can with classical statistical analysis methods. Deep learning (DL) is a kind of ML that enables machines to learn from experience and to understand the environment in terms of a hierarchy of concepts [15]. Recently, there has also been a surge of interest in web and hypertext mining, as well as in mining social networks, security and law enforcement data, bibliographic references, and epidemiological records. Data mining has been widely used in the business field, and AI can perform data analysis and pattern discovery, thereby playing a vital role in data mining applications [16–19].


3 Outline of Research

The study drew on several sources, including citations, research papers, data sets, and government sites containing prostate cancer data. This information revealed how prostate cancer rates have changed over the years. The study relies on secondary data collected from the SEER Explorer site, which reports prostate cancer data for the US population. It is a conclusive study: it gives insight into the results and conclusions and provides recommendations towards the end [17] (Tables 1 and 2). Data were retrieved from the SEER site to analyze prostate cancer across different years and age groups in the US population.

Table 1. Prostate cancer data taken from the SEER site, arranged by year with Excel tools; the average for each year was then taken to obtain these values.

        Rate type
Year    Observed SEER incidence rate    U.S. mortality rate
2000    164.49                          30.83
2001    167.07                          29.86
2002    166.89                          28.55
2003    154.55                          27.86
2004    62.86                           27.13
2005    59.82                           26.73
2006    62.61                           24.71
2007    63.02                           25.28
2008    59.83                           23.81
2009    56.61                           22.66
2010    53.66                           22.49
2011    52.57                           21.81
2012    44.61                           20.05
2013    42.39                           19.60
2014    38.77                           19.54
2015    39.59                           19.81
2016    40.62                           19.56
2017    41.48                           18.76
2018    39.97                           18.95
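As an illustration of how the yearly rates in Table 1 can be summarized, the following minimal Python sketch computes the relative change between the first and last years. The helper name `percent_change` is hypothetical and not part of the original study; only the 2000 and 2018 values from Table 1 are used.

```python
# Minimal sketch: relative change in the Table 1 rates between 2000 and 2018.
# `percent_change` is an illustrative helper, not part of the original study.

def percent_change(start: float, end: float) -> float:
    """Return the change from `start` to `end` as a percentage of `start`."""
    return (end - start) / start * 100.0

# Values copied from the first and last rows of Table 1.
incidence_2000, incidence_2018 = 164.49, 39.97
mortality_2000, mortality_2018 = 30.83, 18.95

incidence_change = percent_change(incidence_2000, incidence_2018)
mortality_change = percent_change(mortality_2000, mortality_2018)

print(f"Incidence rate change 2000-2018: {incidence_change:.1f}%")
print(f"Mortality rate change 2000-2018: {mortality_change:.1f}%")
```

Both changes are negative, which matches the downward trend visible in the table.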


Table 2. Prostate cancer data taken from the SEER site and arranged by age with Excel tools.

            Rate type
Age         Observed SEER incidence rate    U.S. mortality rate
All ages    56.39                           23.58

Ages

not having heart disease. Figure 3 shows the confusion matrix of K-NN and Fig. 4 shows the confusion matrix of the random forest, with 20% of the data set used as testing data. To compute the performance metrics of each algorithm, we used the following formulas for accuracy and misclassification, where $T_p$, $T_n$, $F_p$, and $F_n$ denote the numbers of true positives, true negatives, false positives, and false negatives:

$$\text{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \tag{2}$$

$$\text{Misclassification} = \frac{F_p + F_n}{T_p + T_n + F_p + F_n} \tag{3}$$

Fig. 2. Count of people affected
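The accuracy and misclassification formulas (2) and (3) can be sketched in a few lines of Python. The cell counts below are the numbers recovered from the confusion-matrix figures for the (80–20)% split. Their assignment to true negatives, false positives, false negatives, and true positives assumes a row-by-row reading of each matrix, which the extracted text does not confirm. Accuracy depends only on which counts lie on the diagonal, and under this reading it reproduces the reported 67.21% and 90.16%.

```python
# Minimal sketch of formulas (2) and (3).
# ASSUMPTION: the counts are read row by row as (tn, fp, fn, tp);
# the original figures do not state this layout explicitly.

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified samples, formula (2)."""
    return (tp + tn) / (tp + tn + fp + fn)

def misclassification(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of wrongly classified samples, formula (3)."""
    return (fp + fn) / (tp + tn + fp + fn)

# Counts recovered from the confusion-matrix figures (61 test samples each).
knn = {"tn": 20, "fp": 7, "fn": 13, "tp": 21}
random_forest = {"tn": 23, "fp": 4, "fn": 2, "tp": 32}

for name, cm in (("K-NN", knn), ("Random forest", random_forest)):
    acc = accuracy(**cm) * 100
    err = misclassification(**cm) * 100
    print(f"{name}: accuracy = {acc:.2f}%, misclassification = {err:.2f}%")
```

By construction the two quantities sum to 1, so either one determines the other.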

N. Panda et al.

Fig. 3. Confusion matrix for K-NN (axes: true labels vs. predicted labels; cell counts as extracted: 20, 7, 13, 21)

Fig. 4. Confusion matrix for random forest (axes: true labels vs. predicted labels; cell counts as extracted: 23, 4, 2, 32)

Table 1. Performance metrics of both the models

Split ratio                K-NN                               Random forest
(training–testing) %       Accuracy   Precision   Recall      Accuracy   Precision   Recall
70–30                      64.39      59.50       60.22       88.64      87.28       88.54
80–20                      67.21      62.48       65.68       90.16      88.73       91.20
90–10                      65.5       66.32       66.50       87.32      86.89       87.80

The accuracy scores attained by the K-NN and random forest algorithms for different split ratios of the data set are shown in Table 1. With a split ratio of (80–20)%, random forest achieved better performance metrics than K-NN. Figure 5 compares the accuracy of K-NN and random forest with a split ratio of (80–20)%. Figures 6 and 7 depict the ROC curves of both models with a split ratio of (80–20)%.

Fig. 5. Accuracy comparison of both the models with split ratio of (80–20)% (x-axis: machine learning models, K-NN and random forest; y-axis: accuracy (%))

Application of Machine Learning Model Based Techniques

Fig. 6. ROC of K-NN model

Fig. 7. ROC of random forest model

6 Conclusion and Future Work

This research proposes a model for predicting heart disease using machine learning. Two algorithms, K-NN and random forest, are used to train on and analyze a dataset containing the test results of different persons. We tested the accuracy of both algorithms; random forest achieved the better accuracy, approximately 90.16% with the (80–20)% split ratio. This may help in real-time applications for predicting the disease. In the future, the study can be extended to design a better machine learning model that improves classification accuracy by considering more attributes and a larger dataset. Other machine learning models, such as decision tree, SVM, ANN, and CNN, can also be implemented for classification.


Software Effort and Duration Estimation Using SVM and Logistic Regression

Sasanko Sekhar Gantayat1(B) and V. Aditya2

1 GMR Institute of Technology, Rajam, Andhra Pradesh, India
[email protected]
2 Cognizant Technology Private Limited, Chennai, India

Abstract. Predicting software development cost is one of the important tasks before starting the actual development phase of a project. Software products are mostly acceptable to end users when they are developed within a lower budget. In the field of project management, software cost estimation is considered one of the most challenging areas. Machine learning algorithms are used to handle these types of problems, and they increase project success rates. Using machine learning algorithms with software simulation allows developers to show the working of a program to customers, which could improve project cost estimation methods and lead to better resource allocation and utilization. The proposed effort and duration estimation models are therefore meant to be used as a decision support tool when developing and implementing software systems. The ISBSG dataset is used for this implementation. The results of the machine learning models can be used to predict software costs with a high accuracy rate.

Keywords: ISBSG · Software project estimation · Effort · Duration estimation · SVM · Logistic regression · Prediction

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 275–288, 2022. https://doi.org/10.1007/978-3-031-11713-8_28

1 Introduction

Software cost estimation is a fundamental step carried out in the initial phase of the software development process. Its primary goal is to obtain clear project details and specifications that help stakeholders manage the project in terms of human resources, assets, software, data, and even the feasibility study. Accurate estimation results help the project manager improve the estimate of the project cost. However, errors in the project cost estimation process affect project delivery. A project with an incorrect or imprecise estimate will face problems with delivery timing, required resources, budget, or even quality or operations, and sometimes the project may fail or be aborted. Cost estimation is therefore a significant part of software projects, and it continues to be a complex problem in the software engineering field. Consequently, many studies have been conducted to enhance the estimation process and obtain more accurate and reliable results.

In many scientific studies, ML techniques are being used and implemented in different fields. ML can be an appropriate way to build the proposed model because of its ability to learn from historical data and adapt to the wide variations that accompany software project development. In this work, ML techniques are used to evaluate and compare the results of applying such methods to a dataset. By applying ML techniques to the dataset, it can be concluded whether or not they can be applied effectively to software cost estimation data. If so, it is possible to determine which technique scored the best results and to decide whether an ML model can be developed to evaluate and estimate software cost.

The aim of the project is to address these limitations and narrow the gap between state-of-the-art research findings and the potential deployment of robust ML algorithms in practice for estimating effort and duration at the initial stage of the software development lifecycle. To this end, a comprehensive approach is presented, from data preparation to the models' implementation and maintenance, that ensures their usability as well as high estimation accuracy and robustness to noise within the data. For that purpose, a practical and effective approach to preparing data and building models is applied and presented, based on ML predictive algorithms and the ISBSG dataset, which provides the most reliable source of a large volume of recent software projects from various industries.

2 Objective

Software estimation is one of the most challenging areas of project management. The purpose of this work is to narrow the gap between up-to-date research results and implementations within organizations by proposing effective and practical ML techniques and maintenance approaches that draw on research findings and best practices from various industries. This can be achieved by applying the ISBSG dataset, careful data preparation, ML techniques (Support Vector Machines, Logistic Regression, etc.), and cross-validation. To compute effort and duration, SLOC (source lines of code) values are taken from the ISBSG dataset.

3 Challenges

The main challenge is the non-availability of balanced datasets: in software development, different information is available for different projects, and it is not the same in all cases.

• Numerous methods are available for dealing with noisy data.
• Noisy and inconsistent data may seriously affect the predictive accuracy of ML models.
• The data are of low quality.
• In particular, the large number of missing values and outliers may lead to inconsistent and unreliable results.


4 Proposed System

The proposed system estimates effort and duration using different ML models on the ISBSG dataset. The aim of this project is to compare ML techniques such as SVM and logistic regression in terms of accuracy. The proposed system helps industries learn how customers assess their projects and improve the quality of those projects. In this paper, performance metrics such as F1, recall, precision, and support are calculated. Along with the performance metrics, regression metrics such as MSE, MAE, and RMSE are also calculated. Based on SLOC (source lines of code), the COCOMO estimation modes (organic, semi-detached, and embedded) are used to compute the effort and duration for software project estimation.

Numerous methods are available for dealing with noisy data. Noisy and inconsistent data may seriously affect the predictive accuracy of ML models. Data quality is low; in particular, the large number of missing values and outliers may lead to inconsistent and unreliable results. Data preparation is therefore a critical task in the process of building ML models, in which preprocessing can consist of data selection, cleaning, reduction, transformation, and feature selection. For SVM, the radial basis function (RBF) kernel is used.
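The COCOMO-based computation of effort and duration from SLOC mentioned above can be sketched as follows. The chapter does not list the coefficients it uses, so this sketch assumes the standard Basic COCOMO values from Boehm's 1981 model. The function name `cocomo_basic` and the 10,000-SLOC example project are hypothetical.

```python
# Minimal sketch of Basic COCOMO.
# ASSUMPTION: standard Basic COCOMO coefficients (Boehm, 1981);
# the chapter itself does not state which coefficients were used.
from typing import Tuple

# mode -> (a, b, c, d) for effort = a * KLOC^b and duration = c * effort^d
COCOMO_MODES = {
    "organic": (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (3.6, 1.20, 2.5, 0.32),
}

def cocomo_basic(sloc: int, mode: str) -> Tuple[float, float]:
    """Return (effort in person-months, duration in months) for a project."""
    a, b, c, d = COCOMO_MODES[mode]
    kloc = sloc / 1000.0          # COCOMO works in thousands of lines of code
    effort = a * kloc ** b
    duration = c * effort ** d
    return effort, duration

# Hypothetical example project of 10,000 source lines of code.
for mode in COCOMO_MODES:
    effort, duration = cocomo_basic(10_000, mode)
    print(f"{mode:>13}: effort = {effort:6.2f} PM, duration = {duration:5.2f} months")
```

For the same size, the embedded mode yields the largest effort because both its multiplier and its exponent are the highest of the three modes.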

5 Literature Survey

Using ML algorithms with software simulation could increase the project success rate. Among other authors, Pospieszny applied ensembles of ML algorithms to software effort and duration estimation [1]. The author used the RBF kernel in the SVM classifier. The following equations are used in the computation:

$$K(x, x_i) = e^{-\gamma (x - x_i)^2} \tag{1}$$

$$g(z) = \frac{1}{1 + e^{-z}} \tag{2}$$

Here, $\gamma$ ranges from 0 to 1, and its default value is 0.1.

Jianglin Huang et al. studied data preprocessing, including the handling of missing data, for ML-based software cost estimation; missing data are common in software engineering datasets. Data preprocessing is an essential stage that affects the accuracy of ML methods. They analyzed the effectiveness of four data preprocessing techniques, namely MDT, scaling, FS, and CS, and their interactions with ML methods with respect to predictive accuracy [2].

$$[0, 1]\,\text{interval} = \frac{\text{actual value} - \min(\text{all values})}{\max(\text{all values}) - \min(\text{all values})} \tag{3}$$
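The three formulas numbered (1) to (3) above can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, the kernel width is called `gamma` with the default of 0.1 quoted in the text, and inputs are treated as feature vectors so that the squared difference becomes a squared Euclidean distance.

```python
# Minimal sketch of the RBF kernel (1), the logistic function (2),
# and min-max scaling to the [0, 1] interval (3).
# Function names are illustrative; gamma = 0.1 follows the default in the text.
import math
from typing import List, Sequence

def rbf_kernel(x: Sequence[float], xi: Sequence[float], gamma: float = 0.1) -> float:
    """K(x, xi) = exp(-gamma * squared Euclidean distance between x and xi)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * sq_dist)

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def min_max_scale(values: Sequence[float]) -> List[float]:
    """Map values linearly onto the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so the formula would divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(rbf_kernel([0.0], [1.0]))         # similarity of two points one unit apart
print(sigmoid(0.0))                     # midpoint of the logistic curve
print(min_max_scale([2.0, 4.0, 6.0]))   # smallest value maps to 0, largest to 1
```

Identical points have kernel value 1, and the value decays towards 0 as the points move apart. A larger `gamma` makes that decay faster.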

The above formula computes the scaled value used when handling missing values of numerical type. The drawback of this technique is that it cannot handle missing categorical data, which could instead be handled by the most frequent value method. ML approaches can be an appropriate estimation technique because they can increase the accuracy of estimation by training the estimation rules and repeating the run cycles.


S. S. Gantayat and V. Aditya

There are many factors that affect software effort estimates, such as team size, concurrency, intensity, fragmentation, software complexity, computer platform and different site characteristics in the case of software development [3].

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (P_i - A_i)^2}{N}} \tag{4}$$

Here $P_i$ and $A_i$ denote the predicted and actual values for project $i$, and $N$ is the number of projects. The disadvantage of this measure is that RMSE is not a good indicator of average model performance and may be a misleading indicator of average error. Many efforts have been made to present summarized analyses of software cost estimation. The main objective is to study in depth and evaluate the different methods used for predicting software development cost, in order to increase our understanding of this area of research. Some methods, such as 10-fold cross-validation, are used to improve software cost estimation for effort and cost (Raza Tayyab [4]). The drawback is that software process improvement faces difficulties with such effort estimation because of the lack of data for analysis and the lack of information about the data types. Salvador García showed that data preprocessing in data mining is a broad area that combines techniques from different fields to improve the quality of databases for learning and knowledge extraction tasks. A large number of data preprocessing algorithms are presented in that paper. The Expectation-Maximization (EM) algorithm, a meta-algorithm that maximizes the likelihood of the data by modeling dependent random variables, is used for missing-value imputation in the dataset. The benefit of this technique is that data preprocessing helps to avoid spurious results [5]. Effort estimates are frequently affected by several factors, both intrinsic and extrinsic.
The results of the current study can enable researchers to face new research challenges and align their work with contemporary research. They may also use other topic modeling techniques to identify hidden patterns and trends in a large bibliographic dataset. The advantage of this technique is that projects built with attention to intrinsic and extrinsic factors are easily portable and have improved performance [6]. Marta Fernández-Diego presented the results of a systematic review of the use of ISBSG. The International Software Benchmarking Standards Group (ISBSG) maintains a software development repository with more than 6000 software projects. This dataset makes it possible to estimate a project's size, effort, duration, and cost. That work presents a snapshot of the current use of ISBSG in software research. The benefit of using the ISBSG dataset is that it offers a wealth of information about practices from a wide range of organizations, applications, and development types, which constitutes its main potential. ISBSG is more suitable for research related to effort and productivity than for research on defects [8]. Erik Stensrud mentions in his paper that, to obtain an MMRE that is close to the MRE for a project of a given size, the dataset should be divided into subsamples of projects of approximately equal size and the MMRE reported for each subsample. The MRE value changes with project size, whereas the MRE average is taken over all projects. The formula below is used to compute the mean relative

Software Effort and Duration Estimation Using SVM and Logistic Regression


error of a software project [9].

$$\mathrm{MRE} = 1 - e^{-\text{residual}} \tag{5}$$

The main advantage is that the error term, which is independent of project size, is expressed as a mean, so the average can be applied to projects of any size. Ladron-de-Guevara et al. used the ISBSG dataset to examine models for estimating a software project's size, effort, duration, and cost. The aim of that paper is to determine which variables in the ISBSG dataset have been used in software engineering to build effort estimation models, and to what extent. The use of ISBSG variables for filtering, as dependent variables and as independent variables is described. The 20 variables (out of 71) most frequently used as independent variables for effort estimation are identified and analyzed in that paper. In general, the ISBSG dataset is widely used for estimating the effort and duration of software projects because it contains a large number of industry software projects with all the necessary attributes [10].

6 System Design

6.1 Data Set

For this paper, a subset of projects from the ISBSG Release 2015 dataset is used, to reflect recent changes in software development. For the selection and review of the data, the dependent variables are chosen first. For effort, Normalized Work Effort is used to represent the total effort required to carry out a project. For duration, the real elapsed time was obtained by subtracting Project Inactive Time from Project Elapsed Time in the ISBSG dataset. From 125 variables, grouped into 15 categories, only a subset was chosen that may influence the prediction of effort and duration of software development at an early stage of the lifecycle. The ISBSG dataset also contains many missing values. The dataset is unbalanced and is used without balancing, since in many cases it is difficult to balance the dataset; the experiments in this paper are therefore carried out on the unbalanced dataset (Table 1).

Table 1. Selected variables for effort and duration estimation.

| Variable | Description | Type | Categories | Role |
|---|---|---|---|---|
| Industry sector | Organization type | Nominal | 14 | Input |
| Application type | Addressed app. type | Nominal | 16 | Input |
| Development type | Enhancement | Nominal | 3 | Input |
| Development platform | PC, Mid range | Nominal | 4 | Input |
| Language type | Programming language | Nominal | 3 | Input |
| Package customization | Whether the project involves package customization (PC) | Nominal | 3 | Input |
| Relative size | Functional points | Nominal | 7 | Input |
| Architecture | System architecture | Flag | 6 | Input |
| Agile | Agile used | Nominal | 2 | Input |
| Used methodology | Development methodology used | Nominal | 3 | Input |
| Resource level | Team effort | Nominal | 4 | Input |
| Effort | Total project work in months | Continuous | - | Output |
| Duration | Total project elapsed time | Continuous | - | Output |

6.2 Architecture

The following flowchart shows the steps in executing the model for predicting effort and duration. The ISBSG dataset is preprocessed to avoid noisy data. In this step, noisy data is mainly handled by filling with the most frequent item in the column (Fig. 1).

Fig. 1. Proposed methodology
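The most-frequent-item filling used in the preprocessing step, together with the label encoding applied afterwards, might look as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, missing entries are assumed to be represented by `None`, and codes are assigned in sorted label order, which the source does not specify.

```python
from collections import Counter


def fill_most_frequent(column):
    """Replace missing entries (None) with the most frequent observed value."""
    observed = [value for value in column if value is not None]
    if not observed:
        # Nothing to learn the mode from; return the column unchanged.
        return list(column)
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if value is None else value for value in column]


def label_encode(column):
    """Map each distinct label to an integer code (assumed: sorted label order)."""
    codes = {label: index for index, label in enumerate(sorted(set(column)))}
    return [codes[value] for value in column], codes


# Hypothetical values for a "Development platform" column with one gap.
platform = ["PC", None, "Multi", "PC"]
filled = fill_most_frequent(platform)
encoded, codes = label_encode(filled)
```

Filling before encoding keeps the missing marker from becoming a category of its own.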


6.3 Machine Learning Algorithms

6.3.1 SVM

Support vector machines (SVMs) are supervised machine learning algorithms for classification and regression. For classification, an SVM separates the classes with a hyperplane. A kernel function is used to map the data so that the classes can be separated. SVMs support multiple continuous and categorical variables (Fig. 2).

Fig. 2. SVM plane

6.3.2 Logistic Regression

Logistic regression is a supervised machine learning classification algorithm. In a classification problem, the target variable (the required output, Y) takes discrete values for a given set of inputs (X). It is similar to linear regression in that it fits a linear function to the data, but it additionally applies the sigmoid function so that the output can be interpreted as a probability (Fig. 3).

$$g(z) = \frac{1}{1 + e^{-z}} \tag{6}$$

Fig. 3. Logistic regression
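The sigmoid of Eq. (6) and the way a threshold turns its output into a class label can be illustrated as below. The weights, bias and default threshold of 0.5 are hypothetical values chosen for illustration, not parameters reported in this paper.

```python
import math


def sigmoid(z):
    """Eq. (6), written in a numerically stable form for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= threshold else 0


# Hypothetical one-feature model: positive inputs lean towards class 1.
positive = predict_label([2.0], [1.0], 0.0)
negative = predict_label([-2.0], [1.0], 0.0)
stricter = predict_label([0.0], [1.0], 0.0, threshold=0.6)
```

Raising the threshold makes the model predict class 1 less often, which typically trades recall for precision.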


In the logistic regression model, one of the most important concepts is the threshold value, which is needed to turn the predicted probability into a class label. The threshold is chosen with regard to two measures, recall and precision; the ideal result is obtained when both are equal to 1. Logistic regression falls into the following categories: (1) Binomial, in which the output variable takes one of two values, 1 or 0. (2) Multinomial, in which the output variable takes more than two values. (3) Ordinal, in which the output variable takes ordered values.

6.4 Performance Metrics

The following performance metrics are used for the analysis of the data: confusion matrix, accuracy, recall, precision, F1 measure, mean absolute error, mean squared error and root mean squared error.
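As an illustration of how the classification metrics listed above follow from the four cells of a binary confusion matrix, consider the sketch below. The function name and the counts in the example are hypothetical and are not taken from the experiments in this paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 8 true positives, 2 false positives,
# 2 false negatives and 8 true negatives.
example = classification_metrics(tp=8, fp=2, fn=2, tn=8)
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.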

7 Implementation

7.1 Data Preprocessing

Raw (real-world) data is often incomplete and cannot be passed directly to a model. In this paper, missing values in categorical data are filled with the most frequent element in the column. Label encoding converts the labels into a numeric, machine-readable form on which machine learning algorithms can operate, which supports supervised learning. The steps for preprocessing the data are: missing values elimination, check for categorical data, data standardization and data splitting.

7.2 COCOMO Model for Effort and Duration

The Constructive Cost Model (COCOMO) uses a basic regression formula with parameters that are derived from historical project data and current project characteristics. The first level, basic COCOMO, is good for quick, early, rough order-of-magnitude estimates of software costs, but its accuracy is limited because it lacks factors to account for differences in project attributes (cost drivers). Intermediate COCOMO takes these cost drivers into account, and detailed COCOMO additionally accounts for the influence of individual project phases [12], [13]. Basic COCOMO computes software development effort (and cost) as a function of program size. Program size is expressed in estimated thousands of source lines of code (SLOC). COCOMO applies to three classes of software projects. Organic projects have small teams of experienced people. Semi-detached projects have medium-sized teams with mixed experience working with a mix of rigid and less-than-rigid requirements. Embedded projects are developed within a set of “tight” constraints (hardware, software, operational, etc.) and combine characteristics of organic and semi-detached projects. The basic COCOMO equations take the form:


$$\text{Effort Applied } (E) = a\,(\mathrm{KLOC})^{b} \quad \text{[man-months]}$$

$$\text{Development Time } (D) = c\,(\text{Effort Applied})^{d} \quad \text{[months]}$$

$$\text{People Required } (P) = \frac{\text{Effort Applied}}{\text{Development Time}} \quad \text{[count]}$$

where KLOC is the estimated number of delivered lines of code for the project, expressed in thousands. The coefficients a, b, c and d are given in the following table [11] (Table 2).

Table 2. COCOMO table

| Type of Software Project | a | b | c | d |
|---|---|---|---|---|
| Organic | 3.2 | 1.05 | 2.5 | 0.38 |
| Semi-Detached | 3.0 | 1.12 | 2.5 | 0.35 |
| Embedded | 2.8 | 1.20 | 2.5 | 0.32 |

For the implementation, 15 of the 123 attributes are selected. These are Architecture, ApplicationType, DevelopmentType, DevelopmentPlatform, LanguageType, RelativeSize, UsedMethodology, AgileMethodUsed, ResourceLevel, PackageCustomisation, IndustrySector, Effort, Projectelapsedtime, Projectinactivetime and Duration. The two columns Project elapsed time and Project inactive time are then dropped, as they are not needed in the effort and duration calculations. Once no missing data remains in the attributes, label encoding is applied to the 13 remaining attributes, covering Architecture, Application Type, Development Type, Development Platform, Language Type, Relative Size, Used Methodology, Agile Method Used, Resource Level, Package Customisation and Industry Sector. After preprocessing with most-frequent-value filling, the following table is obtained (Fig. 4). Statistical information on Effort and Duration (Tables 3, 4, 5, 6):

| Statistic | Effort | Duration |
|---|---|---|
| Mean | 5005.3112 | 7.0483 |
| Standard Deviation | 16773.1245 | 6.9788 |
| Normal Distribution | 2.2788e-05 | 1.3538e-40 |

The dataset is split into training and testing datasets in the ratio 80:20. The X-training and X-testing datasets take columns 1 to 12, and the Y-training and Y-testing datasets take columns 12 to 14.

Table 3. ISBSG sample dataset (ISBSG data, Release February 2015). Columns: ISBSG Project ID, Industry Sector, Application Type, Development Type, Development Platform, Language Type, Relative Size, Architecture, Development Methodologies, Agile Method Used, Used Methodology, Resource Level. Sample rows: projects 10279, 11278, 11497, 11738, 11801 and 13026.

Fig. 4. Graphs of effort-frequency and duration-frequency

Table 4. Preprocessing dataset. Columns: Architecture, Application Type, Development Type, Development Platform, Language Type, Relative Size, Used Methodology, Resource Level, Industry Sector, Effort, Project elapsed time, Project inactive time, Duration.


Table 5. Label encoded dataset. Columns: Architecture, Application Type, Development Type, Development Platform, Language Type, Relative Size, Used Methodology, Agile Method Used, Resource Level, Package Customisation, Industry Sector, Effort.

Table 6. Binary coded dataset. Columns: Architecture, Application Type, Development Type, DevelopmentPlatform, LanguageType, RelativeSize, UsedMethodology, AgileMethodUsed, ResourceLevel, PackageCustomisation, Industry Sector, Effort, Duration.

7.3 Classifier SVC

In SVC, the parameters used are: cost C = 1.0, decision function shape = one-vs-rest, polynomial degree = 3, gamma = auto_deprecated, kernel = rbf (Gaussian function), tolerance = 0.001. The hyperparameter gamma is selected automatically based on the decision function (Tables 7, 8).

Table 7. Confusion matrix

| Confusion Matrix | True Positive | True Negative |
|---|---|---|
| True Positive | 185 | 5 |
| True Negative | 61 | 1101 |

Table 8. Classification report

| Classification Report | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.20 | 0.80 | 0.50 | 190 |
| 1 | 0.86 | 1.00 | 0.92 | 1162 |
| Macro Avg | 0.53 | 0.90 | 0.71 | 676 |
| Weighted Avg | 0.77 | 0.97 | 0.86 | 1025 |


7.4 Logistic Regression

For logistic regression, the dataset is split into training and testing sets in the ratio 80:20. The X_training dataset has 5408 records with 11 features, and the X_testing dataset has 1352 records with 11 features. The Y_train dataset has 5408 records with 1 feature, and the Y_test dataset has 1352 records with 1 feature. After cross-validation, TP: [1578 1577 1577], TN: [0 0 0], FN: [0 0 0], FP: [226 225 225], which gives an accuracy of 87.5% (Table 9).

Table 9. Errors

| Types | MAE | MSE | ME | RMSE |
|---|---|---|---|---|
| Errors | 0.14 | 0.14 | 1.00 | 0.37 |
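The accuracy quoted for logistic regression can be checked directly from the per-fold cross-validation counts reported in this section. The short sketch below assumes that the three entries of each list correspond to three folds and that the overall accuracy is the mean of the per-fold accuracies; the variable names are illustrative.

```python
# Per-fold counts as reported for logistic regression after cross-validation.
tp = [1578, 1577, 1577]
tn = [0, 0, 0]
fn = [0, 0, 0]
fp = [226, 225, 225]

# Accuracy of a fold = correctly classified / all classified.
fold_accuracy = [
    (tp_k + tn_k) / (tp_k + tn_k + fp_k + fn_k)
    for tp_k, tn_k, fp_k, fn_k in zip(tp, tn, fp, fn)
]
mean_accuracy = sum(fold_accuracy) / len(fold_accuracy)
```

Under these assumptions the mean of the three fold accuracies rounds to 0.875, in agreement with the 87.5% stated above.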

8 Results and Discussion

The performance of the ML algorithms, i.e. logistic regression and SVM, and their metrics are shown below. Incorrect predictions may lead to project failure, so there is a need to reduce the risk of project failure and increase the project success rate. Using ML models, we can achieve accurate results for software effort and duration estimation (Tables 10, 11, 12).

Table 10. Classification accuracy for software effort and duration estimation

| Classification Techniques | Before Cross Validation | After Cross Validation |
|---|---|---|
| Support Vector Machine | 0.85 | 0.87 |
| Logistic Regression | 0.85 | 0.88 |

Table 11. Performance metrics for software effort and duration estimation

| Precision | Recall | F1-Score | Support |
|---|---|---|---|
| 0.77 | 0.89 | 0.86 | 1352 |
| 0.43 | 0.50 | 0.46 | 1352 |
| 0.74 | 0.86 | 0.79 | 1352 |


Table 12. Regression metrics for software effort and duration estimation

| Regression Metrics | Error Rate |
|---|---|
| Mean Absolute Error | 0.14 |
| Mean Squared Error | 0.14 |
| Max Error | 1.00 |
| Root Mean Squared Error | 0.37 |
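When the predicted and actual values are 0/1 class labels, every error is either 0 or 1, so MAE and MSE coincide and RMSE is the square root of that common value. The sketch below illustrates this relationship with hypothetical labels (50 test cases, 7 of them misclassified); it is not the data used in the experiments.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, MSE, max error and RMSE for two equal-length numeric sequences."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    count = len(errors)
    mae = sum(errors) / count
    mse = sum(e * e for e in errors) / count
    return {"mae": mae, "mse": mse, "max_error": max(errors), "rmse": math.sqrt(mse)}


# Hypothetical binary labels: 7 of 50 predictions are wrong.
y_true = [1] * 50
y_pred = [0] * 7 + [1] * 43
metrics = regression_metrics(y_true, y_pred)
```

With an error fraction of 0.14, this gives MAE = MSE = 0.14, a maximum error of 1 and an RMSE of about 0.37, the same pattern as in Table 12.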

Calculating Effort and Duration

The LOC of the project is used to calculate the effort and duration of the project. From this, we can also predict the corresponding project details. The mean KLOC is 469. Using the coefficients of Table 2, the effort in person-months, the duration and the number of persons required to complete the project are calculated; for an organic project, effort = 2041 and duration = 45 (Table 13).

Table 13. Effort calculation in person months

| KLOC | a | b | Effort | c | d | Duration | Persons Required |
|---|---|---|---|---|---|---|---|
| 469 | 3.2 | 1.05 | 2041 | 2.5 | 0.38 | 45 | 45 |
| 469 | 3 | 1.12 | 2943 | 2.5 | 0.35 | 40 | 73 |
| 469 | 2.8 | 1.20 | 4493 | 2.5 | 0.32 | 36 | 124 |
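The figures in Table 13 can be reproduced from the basic COCOMO equations of Sect. 7.2 with the coefficients of Table 2. The sketch below is illustrative only: the function name is hypothetical, and truncating effort and duration to whole numbers before dividing is an assumption inferred from the tabulated values rather than a rule stated in the text.

```python
# (a, b, c, d) per project class, as printed in Table 2.
COEFFICIENTS = {
    "organic": (3.2, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (2.8, 1.20, 2.5, 0.32),
}


def basic_cocomo(kloc, project_class):
    """Return (effort, duration, persons) for a size given in KLOC.

    Assumption: results are truncated to whole numbers, which matches Table 13.
    """
    a, b, c, d = COEFFICIENTS[project_class]
    effort = int(a * kloc ** b)        # person-months
    duration = int(c * effort ** d)    # months
    persons = effort // duration       # head count
    return effort, duration, persons


estimates = {name: basic_cocomo(469, name) for name in COEFFICIENTS}
```

Many textbook statements of basic COCOMO use a = 2.4 for organic and a = 3.6 for embedded projects; the sketch keeps the Table 2 values so that the Table 13 figures are reproduced.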

9 Conclusion

Using machine learning algorithms with software simulation allows a developer to show the working of a program to the customer, which could improve the methods of project estimation. Effort and duration estimation models such as those proposed in this work are widely used and are regarded as a decision support tool for small and large organizations. Estimating the software project success rate with machine learning models would contribute to better resource allocation in future.

References

1. Pospieszny, P.: An effective approach for software project effort and duration estimation with machine learning algorithms. J. Syst. Software 137, 184–196 (2018). https://doi.org/10.1016/j.jss.2017.11.066
2. Jianglin, H.: An empirical analysis of preprocessing for machine learning-based software cost estimation. Inf. Softw. Technol. 67, 108–127 (2016). https://doi.org/10.1016/j.infsof.2015.07.004
3. Rekha, T.: Machine learning methods of effort estimation and its performance evaluation criteria. Int. J. Comput. Sci. Mob. Comput. 6, 61–67 (2017)
4. Tayyab, M.R., Usman, M., Ahmad, W.: A machine learning based model for software cost estimation. In: Bi, Y., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2016. LNNS, vol. 16, pp. 402–414. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-56991-8_30
5. Saljoughinejad, R., Khatibi, V.: A new optimized hybrid model based on COCOMO to increase the accuracy of software cost estimation. J. Advances Comp. Eng. Technol. 4, 27–40 (2018)
6. Salvador, G.: Tutorial on practical tips of the most influential, preprocessing algorithms in mining. Knowl.-Based Syst. 98, 1–29 (2016). https://doi.org/10.1016/j.knosys.2015.12.006
7. Lehtinen Timo, O.A.: Perceived causes of software project failures – an analysis of their relationships. Inf. Softw. Technol. 56, 623–643 (2014). https://doi.org/10.1016/j.infsof.2014.01.015
8. Arslan, F.: A review of machine learning models for software cost estimation. Review of Computer Eng. Res. 6(2), 64–75 (2019). https://doi.org/10.18488/journal.76.2019.62.64.75
9. Mall, R.: Fundamentals of Software Engineering. PHI Learning Pvt. Ltd., India (2009)
10. Pankaj, J.: An Integrated Approach to Software Engineering. Springer Science & Business Media, India (2012)
11. Pressman, R.S.: Software Engineering: A Practitioner’s Approach, 9th edn. Tata McGraw-Hill, India (2009)
12. International Software Benchmarking Standards Group (2013). https://www.isbsg.org/2015/07/01/new-release-r13-of-de-data/

A Framework for Ranking Cloud Services Based on an Integrated BWM-Entropy-TOPSIS Method

Soumya Snigdha Mohapatra and Rakesh Ranjan Kumar

Department of Computer Science and Engineering, C V Raman Global University, Mahura, Bhubaneswar, India

Abstract. Cloud computing has grown as a computing paradigm in the last few years. Due to the explosive increase in the number of cloud services, QoS (quality of service) has become an important factor in service filtering. Moreover, comparing the functionality of cloud services with different performance metrics is a non-trivial problem. Therefore, optimal cloud service selection is quite challenging and extremely important for users. In the existing approaches to cloud service selection, the user's preferences are provided by the user in a quantitative form. Because of fuzziness and subjectivity, it is difficult for users to express clear preferences. To address this challenge, in this paper we propose a hybrid Multi-Criteria Decision Making (MCDM) methodology to help the decision maker evaluate different cloud services based on subjective and objective assessments. To do that, we introduce a novel Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) method that combines subjective and objective aspects. We use the entropy weight method for the objective assessment in order to reduce the influence of erroneous or fraudulent Quality of Service (QoS) information. For the subjective assessment, we employ a systematic MCDM method called the Best Worst Method (BWM). Finally, a numerical example is presented to validate the effectiveness and feasibility of the proposed methodology.

Keywords: Cloud computing · Multicriteria Decision Making (MCDM) · TOPSIS · Best Worst Method · Entropy

1 Introduction

Cloud computing has grown into a burgeoning computing paradigm that is transforming how computing, storage, and on-demand service solutions are managed and delivered [1]. Cloud computing offers its users three service models: infrastructure as a service (IaaS), platform as a service (PaaS) and software as a service (SaaS). Cloud computing is categorized into four types: private cloud, public cloud, hybrid cloud and community cloud [2, 3]. The cornerstone of cloud computing is that users can access cloud services from anywhere, at any time, on a subscription basis. Cloud

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 289–297, 2022. https://doi.org/10.1007/978-3-031-11713-8_29


computing provides the “pay-as-you-use” pricing model, in which cloud users are charged only for the resources they consume. Many of the world’s largest IT firms (including IBM, eBay, Microsoft, Google, and Amazon) have moved their existing business solutions to the cloud because of the advantages it offers [4]. A growing number of cloud service providers (CSPs) now offer their customers a wide range of options for selecting the best cloud service for their individual functional needs. Many CSPs offer identical services, but at varied pricing and quality levels, and with a wide range of additional options and features. A supplier may, for example, be cheap for storage but expensive for compute. Because of the wide range of cloud service options available, it can be difficult for customers to determine which CSP is best suited to their specific needs. Incorrect selection of a CSP can lead to service failure, data security or integrity breaches, and non-compliance with cloud storage standards in the future. Cloud service selection usually involves matching customer needs to the features of cloud services offered by various CSPs. The growing number of CSPs and their variable service offerings, pricing, and quality have made it difficult to compare them and choose the best one for the user’s needs. To determine which CSP is the best fit for a cloud user’s needs, a wide range of evaluation criteria for distinct cloud services from multiple CSPs must be considered. Thus, selecting the best CSP is a difficult Multi-Criteria Decision Making (MCDM) problem in which various alternatives must be evaluated and ranked using a variety of criteria based on specific user preferences [5]. To address these issues, we propose a cloud service selection approach that integrates subjective and objective weighting methods according to their specified importance, in order to overcome the limitations of using either kind of weighting alone.
The suggested methodology combines TOPSIS, BWM, and the Entropy method. The Entropy method is used to objectively evaluate discrepancies between QoS criteria, whereas the BWM method, based on expert opinion, is used to calculate the subjective weights of the QoS criteria. Finally, in order to produce a suitable ranking of cloud service providers, the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) is applied. Using a real-world use case, the integrated method was shown to outperform the most often used MCDM strategy (i.e., AHP). The proposed approach beats AHP in terms of computing complexity and consistency, making it more efficient and trustworthy. The remainder of this article is organized as follows. Section 2 discusses the related work. The proposed cloud service ranking framework is discussed in Sect. 3. Section 4 explains the proposed methodology for optimal cloud service selection. In Sect. 5, a numerical case study and a set of experiments are included to demonstrate the feasibility of the proposed methodology. In addition, the results and their validation are discussed. Finally, Sect. 6 presents the concluding remarks and future scope.

2 Related Works

In this section, we first review some of the notable cloud service selection models based on QoS.


A thorough review of the literature reveals that the application of MCDM-based techniques for cloud service selection and ranking has received a significant amount of attention. Frequently used MCDM techniques include AHP [6], analytic network process [7], Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) [8], Best Worst Method (BWM) [9] and outranking [10]. These decision-making methods can be divided into stochastic, deterministic, and fuzzy methods depending on the type of data they use and how many decision-makers participate in the decision process, i.e. single or multiple (group). Using a modified DEA and SDEA model, the author of [11] shows how to pick the best cloud service from among a variety of options based on the needs of the customer. Using a fuzzy ontology, a new fuzzy decision-making framework is proposed in [12], which is capable of handling fuzzy information and finding the best cloud service. The authors use fuzzy AHP and TOPSIS to calculate QoS weights and measure cloud service performance. Lang et al. [13] proposed a Delphi method to identify and classify the QoS criteria for cloud service provider evaluation. In [14], a framework called TRUSS was presented for the identification of trustworthy cloud services. A brokerage-based cloud service selection framework was proposed in [15]. Ding et al. [16] proposed collaborative filtering to make time-sensitive service recommendations. The objective was to expedite the process of identifying cloud service providers with a higher level of customer satisfaction.

3 Proposed Cloud Service Selection Framework

This section introduces the proposed broker-based framework (OPTCLOUD) for cloud service selection, shown in Fig. 1. The framework consists of two distinct components: (i) the Cloud Broker and (ii) the Cloud Service Directory.

– Cloud Broker: The suggested framework is built around the cloud broker. It performs a variety of activities, as illustrated in Fig. 1, including cloud service discovery and cloud service ranking. It interacts with the cloud service directory to filter the cloud services that meet the cloud user’s requirements. The cloud broker’s cloud service ranking module ranks the filtered cloud services according to the significance of each QoS parameter provided by the cloud user. For each cloud service, a ranking is generated using the proposed methodology. The cloud service discovery module is used to discover and store information about the various cloud services available.
– Cloud Service Directory: This is a database that stores information about cloud service providers and their services on various QoS attributes. Data on cloud providers’ service standards and performance is stored here. It maintains the cloud service providers’ detailed artifacts and assets, including the SLA and the functional and non-functional standards. The cloud broker uses this directory to pre-screen services when looking for candidate services that meet its customer’s needs. The cloud broker service also uses this component to validate the claims of cloud service providers.


Fig. 1. Proposed Framework.

4 Cloud Service Selection Methodology

In this section, we present an integrated MCDM methodology that combines TOPSIS, BWM, and the Entropy method. The objective weight of the QoS criteria is evaluated using the Entropy method, whereas the subjective weight of the QoS criteria is determined using the BWM method. In order to establish a proper ranking of cloud service providers, the objective and subjective weights are combined and used in the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS). The steps of the proposed approach are as follows:

Step 1: Create a Decision Matrix: We create a decision matrix DM of size m × n, in which m is the number of eligible cloud service alternatives, denoted by $CSP_i$, that satisfy the cloud customer’s functional and non-functional requirements, and n is the number of QoS criteria for determining the best cloud service provider. It is shown in Eq. 1.

$$DM = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix} \tag{1}$$

where $x_{ij}$ represents the QoS value delivered by $CSP_i$ on QoS criterion j.


Step 2: Objective QoS Weights Calculation: Because it is derived from the measured data alone, the objective weight approach avoids human bias and can provide more accurate results. In this step, the entropy approach is used to determine the objective weight of each criterion. Drawing on probability theory, the entropy approach measures the degree of disorder in a given set of information. A dataset with lower probability values carries more information and has higher entropy. To determine the objective weight, we follow the steps proposed in [17].

Step 3: Apply Best Worst Method for QoS criteria weight calculation: A cloud customer's subjective preference for certain QoS criteria also plays a role in choosing which service to use. Each cloud service alternative is evaluated on n QoS criteria, so the customer's preferences form an n-dimensional vector with one entry per criterion. We use BWM to assess the subjective weight of each QoS criterion. BWM, which was developed by Rezaei, is one of the most extensively used MCDM approaches [9]. Compared with other well-known MCDM methods such as AHP, it produces more consistent results with fewer pairwise comparisons [18]. For n QoS criteria, AHP requires n(n−1)/2 comparisons, whereas BWM needs only 2n−3. To determine the subjective weight, we follow the steps proposed in [19].

Step 4: Combined Subjective and Objective Weight: In this step, we combine the subjective weight with the entropy-based objective weight. As shown in Eq. 2, the combined weight is the product of the two.

$$cw_j = ew_j \cdot sw_j, \quad j = 1, 2, \ldots, n \tag{2}$$

where $cw_j$ is the combined weight, $ew_j$ is the objective weight, and $sw_j$ is the subjective weight of criterion $j$. This combined weight is used in the TOPSIS method.

Step 5: TOPSIS method for cloud service selection and ranking: In this step, we apply TOPSIS to select and rank the cloud service alternatives. TOPSIS, proposed by Hwang and Yoon [20], is one of the most widely used approaches to MCDM problems. To obtain the final result, we follow the TOPSIS steps outlined in [21].
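The pipeline in Steps 2, 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the entropy and TOPSIS formulas are the common textbook variants (sum normalisation for entropy, vector normalisation for TOPSIS), which may differ in detail from [17] and [21]. The subjective BWM weights are taken as given input, because BWM itself requires solving an optimisation problem. All function names, the `benefit` flags and the toy data are hypothetical.

```python
import math

def entropy_weights(dm):
    """Objective weights from the Shannon entropy of each criterion column."""
    m, n = len(dm), len(dm[0])
    divergences = []
    for j in range(n):
        col = [row[j] for row in dm]
        total = sum(col)
        probs = [v / total for v in col]
        # Entropy normalised by ln(m) so that it lies in [0, 1].
        e = -sum(p * math.log(p) for p in probs if p > 0) / math.log(m)
        divergences.append(1.0 - e)
    s = sum(divergences)
    if s == 0:  # every column is uniform: fall back to equal weights
        return [1.0 / n] * n
    return [d / s for d in divergences]

def combined_weights(ew, sw):
    """Eq. 2: element-wise product of objective and subjective weights."""
    return [e * s for e, s in zip(ew, sw)]

def topsis(dm, weights, benefit):
    """Closeness score of each alternative (higher is better).

    benefit[j] is True when larger values of criterion j are preferred.
    """
    n = len(dm[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in dm)) for j in range(n)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in dm]
    best = [max(r[j] for r in v) if benefit[j] else min(r[j] for r in v)
            for j in range(n)]
    worst = [min(r[j] for r in v) if benefit[j] else max(r[j] for r in v)
             for j in range(n)]
    scores = []
    for r in v:
        d_best = math.sqrt(sum((r[j] - best[j]) ** 2 for j in range(n)))
        d_worst = math.sqrt(sum((r[j] - worst[j]) ** 2 for j in range(n)))
        total = d_best + d_worst
        scores.append(d_worst / total if total > 0 else 0.5)
    return scores

if __name__ == "__main__":
    # Toy data (hypothetical): 3 alternatives, 2 criteria.
    dm = [[9.0, 1.0], [5.0, 5.0], [1.0, 9.0]]
    sw = [0.6, 0.4]            # assumed BWM output
    benefit = [True, False]    # criterion 1: higher is better; criterion 2: lower is better
    cw = combined_weights(entropy_weights(dm), sw)
    print(topsis(dm, cw, benefit))
```

Because the TOPSIS closeness score is a ratio of distances, multiplying all combined weights by the same constant leaves the ranking unchanged, so Eq. 2 can be used without renormalising the product.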

5 A Case Study with Experiment
To demonstrate the effectiveness and validity of the proposed methodology, we apply it to a real-world cloud service selection scenario. The primary goal of our research is to identify the most appropriate cloud service from a variety of available options. The case study uses data from CloudHarmony.com [22]. The CloudHarmony dataset measures the overall performance of various cloud services using a set of dynamically launched virtual machines (VMs) over a predetermined amount of time. The numerical calculations of the proposed methodology are presented as follows.


S. S. Mohapatra and R. R. Kumar

For this case study, we considered ten alternative cloud services from the CloudHarmony dataset, denoted CSP1, CSP2, ..., CSP10. Eight quality criteria (Q1–Q8) are used to evaluate these services. The QoS decision matrix for the ten cloud service alternatives is shown in Table 1.

Table 1. Decision matrix for ten cloud alternatives

Cloud service   Q1        Q2   Q3     Q4   Q5   Q6    Q7   Q8
CSP1            307.75    71   2.1    71   73   78    84   2.37
CSP2            498.5     91   4.8    91   60   89    82   31.17
CSP3            283.74    86   3.3    87   53   89    66   96.78
CSP4            130.33    63   7.8    63   73   78    84   14.66
CSP5            3610.2    96   1.4    99   67   100   77   6.4
CSP6            1314.75   78   3.5    79   73   78    84   30.75
CSP7            2561.33   87   1.2    96   67   78    72   28
CSP8            297.38    71   1.9    72   73   89    75   6.38
CSP9            498.5     91   4.8    91   60   89    82   31.17
CSP10           305.4     93   12.2   98   73   100   84   7

Following that, we used the proposed approach (Sect. 4) to evaluate the cloud services and determine their final ranking. The overall ranking of all cloud alternatives is shown in Table 2. According to these results, CSP3 is the best cloud service alternative among all of the available options.

Table 2. Final rank

Cloud service provider   Ranking
CSP1                     9
CSP2                     2
CSP3                     1
CSP4                     10
CSP5                     7
CSP6                     6
CSP7                     8
CSP8                     8
CSP9                     3
CSP10                    5


5.1 Experiments
In this subsection, we provide a comprehensive case study to demonstrate the feasibility, efficiency, and viability of the suggested methodology for cloud service selection.

Comparison with Other Existing MCDM Methods: We compared our results with other popular MCDM methods, namely AHP_Entropy_TOPSIS, BWM_TOPSIS, and AHP_TOPSIS [19, 21, 23]. Figure 2 shows the rankings produced by the different methods. In most cases, the outcomes of the suggested methodology closely resemble those of the other techniques. This agreement supports the accuracy of the proposed approach.

Sensitivity Analysis of Result: Sensitivity analysis is used here to verify the robustness and efficiency of the proposed scheme. It examines how the ranking of a cloud service provider changes when different weight values are used. Each time a criterion weight is changed, we rerun the entire procedure and re-evaluate the ranks of the cloud service providers. We conducted the analysis by swapping the weight of one criterion with the weight of another, which yielded fifteen distinct experiments, named E1, ..., E15. Figure 3 shows the outcomes of the 15 experiments. CSP3 emerged as the best service in 14 out of 15 experiments, and CSP2 took second place in 13 out of 15. The sensitivity analysis thus shows that the rank of a cloud service provider varies with the weights of the associated criteria. We can therefore infer that the suggested method is reliable and ranks alternatives rationally, in accordance with the preferences expressed by stakeholders.

Fig. 2. Ranking of cloud service provider with different methods


Fig. 3. Result of sensitivity analysis.

6 Concluding Remarks
Finding the best cloud service is a challenge for cloud users when many QoS criteria are involved. In this study, we proposed a novel cloud service selection methodology that combines Entropy, BWM, and TOPSIS. This hybrid MCDM strategy uses the Best Worst and Entropy methods to calculate the QoS criteria weights and TOPSIS to rank the cloud service alternatives. The contribution provides a new framework for the cloud service selection process. The feasibility and efficiency of the proposed scheme are demonstrated through a series of experiments with real datasets. We compare the proposed methodology with other MCDM methods to show that it outperforms them. Additionally, we perform a sensitivity analysis to confirm that the proposed cloud service selection methodology is robust and consistent. In future work, we will extend the methodology to include other MCDM methods for determining the most reliable cloud service.

References
1. Mell, P., Grance, T., et al.: The NIST Definition of Cloud Computing (2011)
2. Buyya, R., Yeo, C.S., Venugopal, S., Broberg, J., Brandic, I.: Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Futur. Gener. Comput. Syst. 25(6), 599–616 (2009)
3. Armbrust, M., et al.: A view of cloud computing. Commun. ACM 53(4), 50–58 (2010)
4. Buyya, R., Yeo, C.S., Venugopal, S.: Market-oriented cloud computing: Vision, hype, and reality for delivering IT services as computing utilities. In: Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications, 2008. HPCC 2008, pp. 5–13. IEEE (2008)
5. Garg, S.K., Versteeg, S., Buyya, R.: A framework for ranking of cloud computing services. Futur. Gener. Comput. Syst. 29(4), 1012–1023 (2013)
6. Saaty, T.L.: Decision making with the analytic hierarchy process. Int. J. Serv. Sci. 1(1), 83–98 (2008)
7. Saaty, T.L., Vargas, L.G.: Models, methods, concepts and applications of the analytic hierarchy process. Int. Ser. Oper. Res. Manage. Sci. 34, 1–352 (2001)
8. Sirisawat, P., Kiatcharoenpol, T.: Fuzzy AHP-TOPSIS approaches to prioritizing solutions for reverse logistics barriers. Comput. Ind. Eng. 117, 303–318 (2018)
9. Rezaei, J.: Best-worst multi-criteria decision-making method. Omega 53, 49–57 (2015)
10. De Leeneer, I., Pastijn, H.: Selecting land mine detection strategies by means of outranking MCDM techniques. Eur. J. Oper. Res. 139(2), 327–338 (2002)
11. Jatoth, C., Gangadharan, G.R., Fiore, U.: Evaluating the efficiency of cloud services using modified data envelopment analysis and modified super-efficiency data envelopment analysis. Soft. Comput. 21(23), 7221–7234 (2016). https://doi.org/10.1007/s00500-016-2267-y
12. Sun, L., Ma, J., Zhang, Y., Dong, H., Hussain, F.K.: Cloud-fuser: Fuzzy ontology and MCDM based cloud service selection. Futur. Gener. Comput. Syst. 57, 42–55 (2016)
13. Lang, M., Wiesche, M., Krcmar, H.: Criteria for selecting cloud service providers: A Delphi study of quality-of-service attributes. Inf. Manage. 55(6), 746–758 (2018)
14. Tang, M., Dai, X., Liu, J., Chen, J.: Towards a trust evaluation middleware for cloud service selection. Futur. Gener. Comput. Syst. 74, 302–312 (2017)
15. Sundareswaran, S., Squicciarini, A., Lin, D.: A brokerage-based approach for cloud service selection. In: Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing, pp. 558–565. IEEE (2012)
16. Ding, S., Li, Y., Wu, D., Zhang, Y., Yang, S.: Time-aware cloud service recommendation using similarity-enhanced collaborative filtering and ARIMA model. Decis. Support Syst. 107, 103–115 (2018)
17. Kumar, R.R., Kumar, C.: Designing an efficient methodology based on entropy-TOPSIS for evaluating efficiency of cloud services. In: Proceedings of the 7th International Conference on Computer and Communication Technology, pp. 117–122 (2017)
18. Saaty, T.L.: How to make a decision: The analytic hierarchy process. Eur. J. Oper. Res. 48(1), 9–26 (1990)
19. Kumar, R.R., Kumari, B., Kumar, C.: CCS-OSSR: A framework based on hybrid MCDM for optimal service selection and ranking of cloud computing services. Clust. Comput. 24(2), 867–883 (2021)
20. Hwang, C.-L., Yoon, K.: Multiple attribute decision making: Methods and applications a state-of-the-art survey, vol. 186. Springer Science & Business Media (2012)
21. Kumar, R.R., Shameem, M., Khanam, R., Kumar, C.: A hybrid evaluation framework for QoS based service selection and ranking in cloud environment. In: Proceedings of the 2018 15th IEEE India Council International Conference (INDICON), pp. 1–6. IEEE (2018)
22. Cloud Harmony Reports. http://static.lindsberget.se/state-of-the-cloud-compute-0714.pdf. Accessed 12 Mar 2017
23. Kumar, R.R., Mishra, S., Kumar, C.: A novel framework for cloud service evaluation and selection using hybrid MCDM methods. Arab. J. Sci. Eng. 43(12), 7015–7030 (2018)

An Efficient and Delay-Aware Path Construction Approach Using Mobile Sink in Wireless Sensor Network

Piyush Nawnath Raut and Abhinav Tomar(B)

Netaji Subhas University of Technology, Delhi 110078, India
{piyushn.cs20,abhinav.tomar}@nsut.ac.in

Abstract. Using mobile sinks (MSs) for data collection in wireless sensor networks (WSNs) is a prevalent method for mitigating the hotspot problem. Numerous algorithms have been proposed for data collection in WSNs using an MS and rendezvous points (RPs). However, the positions of the RPs affect connectivity, network lifetime, delay, and other factors that substantially impact WSN performance in critical applications. In this view, we propose an algorithm to solve the NP-hard problem of finding an optimal path while balancing energy consumption in a delay-bound application such as fire detection. The proposed algorithm uses a virtual polygon path and a minimum spanning tree to divide the network and select optimal rendezvous points for the mobile sink. A convex hull-based algorithm generates the mobile sink's optimal path through the RPs. We have performed extensive simulations and compared our algorithm with an existing algorithm to demonstrate its efficiency. The results show that the proposed algorithm reduces hop counts by 15% relative to the compared algorithm and improves network lifetime.

Keywords: Mobile sink · Data collection · Virtual path · WSN · Delay-aware

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 298–307, 2022. https://doi.org/10.1007/978-3-031-11713-8_30

1 Introduction
Wireless sensor networks (WSNs) are used for monitoring applications such as environmental monitoring, fire detection, military security, agriculture monitoring, and patient monitoring in health care. A WSN contains sensors spread over an area to monitor that region. These sensors are powered by batteries and have limited resources. The data collected by the sensor nodes must be transferred to a sink, where the processing is done. Data transmission consumes a considerable amount of the sensor nodes' energy and hence plays a substantial part in determining the network's lifetime. The lifetime of a node affects the lifetime of the whole network [6]. The sink cannot always be within a node's communication range; hence, the nodes need to transfer data to the sink over multiple hops. The nodes closer to the sink must relay a high volume of data because of multi-hop transmission, so they deplete their energy faster than other nodes in the WSN. This is referred to as the "hotspot" problem.

The mobile sink (MS) has emerged as a promising solution to the hotspot problem. The MS traverses the whole network by visiting rendezvous points (RPs) selected throughout it, and the sensor nodes transfer their data to the MS when it visits the nearest RP. Over the years, many researchers have proposed methods for selecting the rendezvous points so as to balance the network's energy effectively. However, the mobile sink solution has a disadvantage. Since the MS has to traverse the whole network, a significant delay is introduced in the data collection process, and this additional delay is not acceptable in every application. One such application is fire detection, in which the MS has to traverse the complete WSN within a given time while also keeping the network's energy consumption balanced. If the MS path is short, the data collection delay is small, but sensor nodes need more hops to reach an RP, which increases the energy spent on multi-hop transmission. Decreasing the number of hops requires a longer path, which increases the data collection delay. There is therefore scope for further research on constructing a path that balances delay and hop counts.

This paper proposes a novel virtual polygon-based approach for efficient data collection in delay-bound WSNs. The motivation for using a virtual polygon is to create an ideal virtual path that lies at about the same distance from the center of the network and from its boundaries, so as to balance the network's energy consumption. We then construct the actual path by referencing the virtual polygon path. The contributions are summarized as follows.
1. An efficient data collection approach is proposed for time-critical applications.
2. Considering the uneven distribution of the nodes, a virtual path is constructed that can scale from a square to a rectangular field.
3. Taking the virtual path as a reference, a tour for the mobile sink is generated that balances the network energy within a given delay by reducing the hop counts required for data transmission.
4. Extensive simulations are performed to demonstrate the effectiveness of our algorithm over the existing algorithm [1]. The proposed algorithm gives lower hop counts, fewer dead nodes, and a longer network lifetime than the compared algorithm [1].

The rest of the paper is arranged as follows. Section 2 briefly reviews related work. Section 3 describes the network model, the energy model, and the problem statement. The algorithms are presented in Sect. 4. The simulation results are discussed in Sect. 5. Section 6 concludes the paper.

2 Related Work
Many researchers have proposed data collection techniques [2, 3, 8, 9] that rely on an MS and RPs. In [2], the authors proposed an algorithm for delay-sensitive applications. The nodes are assigned weights based on their degree, the number of packets they relay to the RP, and their distance from the route in hops. The disadvantage, however, is that the tour construction method (Christofides' heuristic) takes $O(n^3)$ time and is called $O(n^2)$ times, so the total time complexity is $O(n^5)$, which is very high. In [1], the authors presented a method that increases network lifetime by using a minimum spanning tree and a virtual path. It selects efficient RP placements using cost functions that consider various parameters. It uses the same TSP algorithm as [2], called $n$ times, making the total time complexity $O(n^4)$ in the worst case. W. Wen et al. [4] proposed an energy-aware path construction method for a mobile sink. It uses a minimum spanning tree rooted at the base station and identifies efficient data collection points. It uses a convex polygon to construct a path, which takes $O(n^2)$ time, so the algorithm's time complexity is $O(n^2)$. However, the algorithm does not consider the scenario in which the nodes are deployed in components disconnected from each other. A similar algorithm is used for path construction in [5], which selects rendezvous points based on the value of a benefit index of each node. Punriboon, C. et al. [12] proposed a fuzzy logic-based method for path construction in 2021. In 2005, Heinzelman, W.R. et al. [10] proposed a routing protocol that uses clustering to balance the energy consumption of nodes. The nodes in a cluster select a cluster head, which takes on the burden of collecting data from all other nodes in that cluster. The clusters and cluster heads are randomized to balance the energy consumption of the sensors. Another paper [11] also uses clustering to balance energy more efficiently in the network. Most of the existing algorithms focus on balancing the network's energy consumption. The few algorithms that balance both network energy and delay have considerable time complexity. We propose a virtual polygon path-based method for constructing a path that considers both delay and energy consumption.

3 Preliminaries 3.1 Network Model

Fig. 1. Network Model

The network model consists of a group of sensor nodes randomly distributed over a rectangular field. Figure 1 depicts the model. The sensor nodes are homogeneous, i.e., all sensors have the same hardware and configuration. The network also contains an MS and rendezvous points. Sensors collect data and send it to a rendezvous point by multi-hop data transmission. To gather the data, the mobile sink travels to the RPs. It is assumed that there are no obstacles in the way of the mobile sink in the monitoring region, and that the sensor nodes have adequate buffer space to retain the data until the sink visits to collect it. The mobile sink is considered to have an infinite supply of energy.

3.2 Energy Model
We use the radio energy model from [7] for data transmission. Assuming that a packet of $b$ bits is to be transferred over a distance $d$, the energy required for data transmission, $E_{Tx}$, is calculated using the following equation.

$$E_{Tx} = b E_{elec} + b \varepsilon_{fs} d^2 \tag{1}$$

"The electronics energy, $E_{elec}$, depends on factors such as the digital coding, modulation, filtering, and spreading of the signal." [7]. The energy consumed by a node receiving a packet of $b$ bits is calculated using the following equation.

$$E_{Rx} = b E_{elec} \tag{2}$$
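As a worked illustration of the two energy equations, the following sketch evaluates them with the parameter values listed in the paper's simulation table (E_elec = 50 × 10⁻⁹, E_fs = 10 × 10⁻¹², 4000-bit packets). The function names are hypothetical, and the interpretation of the two constants as per-bit and per-bit-per-square-metre quantities is an assumption based on the form of the equations.

```python
E_ELEC = 50e-9   # electronics energy per bit (J/bit), from the simulation table
EPS_FS = 10e-12  # free-space amplifier energy (J/bit/m^2), from the simulation table

def tx_energy(bits, distance, e_elec=E_ELEC, eps_fs=EPS_FS):
    """Energy (J) to transmit `bits` over `distance` metres: b*E_elec + b*eps_fs*d^2."""
    return bits * e_elec + bits * eps_fs * distance ** 2

def rx_energy(bits, e_elec=E_ELEC):
    """Energy (J) to receive `bits`: b*E_elec."""
    return bits * e_elec

def hop_energy(bits, distance):
    """Energy (J) spent by the network on one hop: sender plus receiver."""
    return tx_energy(bits, distance) + rx_energy(bits)

if __name__ == "__main__":
    # One 4000-bit packet sent over a 30 m hop.
    print(tx_energy(4000, 30), rx_energy(4000), hop_energy(4000, 30))
```

For a 4000-bit packet over 30 m, the transmit cost is 4000 × 50 × 10⁻⁹ + 4000 × 10 × 10⁻¹² × 900 = 2.36 × 10⁻⁴ J and the receive cost is 2.0 × 10⁻⁴ J, so every additional hop costs roughly 0.44 mJ of a node budget that starts at 2 J.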

3.3 Problem Description
The objective is to maximize the lifetime of the network while designing a path that satisfies the acceptable delay for data collection by the mobile sink. Let $T$ be the time required to collect data using the mobile sink. This time includes the time required to travel the complete path, $T_{travel}$, and the time required for data collection at each rendezvous point. Let there be $r$ rendezvous points in the network, and let $T_i$ be the time spent by the sink at rendezvous point $i$. Then,

$$T = T_{travel} + \sum_{i=1}^{r} T_i \tag{3}$$

In a delay-bound situation, let $T_{max}$ be the maximum tolerable time for gathering data from all sensors in the network using the mobile sink. The chosen path must therefore satisfy

$$T \le T_{max} \tag{4}$$

To find the optimal path length, the mobile sink is assumed to move at a constant speed of $v$ m/s, and the data transmission is assumed to occur instantly. The maximum permissible path length for the mobile sink is then

$$L_{max} = T_{max} \cdot v \tag{5}$$

We now divide the network and select an RP from each sub-network. We keep dividing the network until the final path length $L$ is as close as possible to $L_{max}$ without exceeding it.

$$L \le L_{max} \tag{6}$$

This paper presents an algorithm to find the best path satisfying the condition in Eq. 6.
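The delay constraint above reduces to a simple length check on a candidate tour. The sketch below is an illustration only: it assumes the RPs are given as (x, y) tuples already in visiting order and that the sink returns to its starting RP. The function names are hypothetical.

```python
import math

def tour_length(rps):
    """Length of the closed tour that visits the RPs in the given order."""
    k = len(rps)
    return sum(math.dist(rps[i], rps[(i + 1) % k]) for i in range(k))

def collection_time(rps, speed, dwell_times):
    """Eq. 3: travel time plus the time spent at each rendezvous point."""
    return tour_length(rps) / speed + sum(dwell_times)

def max_path_length(t_max, speed):
    """Eq. 5: longest tour the sink can complete within t_max."""
    return t_max * speed

def satisfies_delay_bound(rps, t_max, speed):
    """Eq. 6, under the stated assumption that data transfer is instantaneous."""
    return tour_length(rps) <= max_path_length(t_max, speed)

if __name__ == "__main__":
    square = [(0, 0), (100, 0), (100, 100), (0, 100)]
    print(tour_length(square), satisfies_delay_bound(square, t_max=275, speed=2))
```

With the simulation values v = 2 m/s and L_max = 550 m, the implied delay bound is T_max = 275 s, so a 400 m tour is admissible while any tour longer than 550 m is rejected.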


4 Proposed Algorithm
Data collection with the proposed algorithm is carried out in two phases. The first phase consists of RP selection and path construction. In the second phase, the MS visits the RPs and transmits a small message to the sensor nodes within its communication range. After receiving the message, the sensor nodes transmit their data packets using CSMA/CA (Carrier Sense Multiple Access with Collision Avoidance). The proposed algorithm divides the network into sectors and creates a virtual polygon route. The network is then divided using an MST, and a cost function is used to select a rendezvous point from each sub-network. A convex hull-based approach is then used to calculate the cost of the tour through the rendezvous points. We repeat the process until we obtain the best possible path satisfying the delay constraint.

4.1 Generation of Virtual Path
To generate the virtual path, we first divide the network into eight sectors and then use these sectors to create a virtual polygon path. To divide the network, we first calculate the center point $C$ of the network.

$$C(x, y) = \left( \frac{x_1 + x_2 + \cdots + x_n}{n}, \; \frac{y_1 + y_2 + \cdots + y_n}{n} \right) \tag{7}$$

In the above equation, $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ are the x and y coordinates of the sensor nodes. Next, we determine the four boundary points of the network, which are the corners of the rectangular network: (min(x), min(y)), (min(x), max(y)), (max(x), max(y)), and (max(x), min(y)), where min and max denote the minimum and maximum coordinate values among all nodes. Four line segments drawn from the center point to the boundary points divide the network area into four sectors. A horizontal and a vertical line passing through the center further divide the network area into eight sectors. To construct the virtual path, eight virtual points (VPs) are identified, one per sector, and the virtual path is the polygon connecting these eight points. The VP of each sector is calculated as follows.

$$VP(sector(i)) = \left( \frac{\sum_{j=1}^{k} x_j}{k}, \; \frac{\sum_{j=1}^{k} y_j}{k} \right) \tag{8}$$

where $i$ is the sector number, ranging from 1 to 8, and $k$ is the number of nodes belonging to that sector. Adjacent points are then connected to create the virtual polygon path, as shown in Fig. 2.

4.2 Network Division and RP Generation
First, an MST is constructed using Kruskal's algorithm on the graph obtained by connecting the nodes that are within communication range of each other. To balance energy consumption, the network should be divided into roughly equal parts, so we divide the MST by removing an edge such that the resulting sub-MSTs have almost equal diameters. For RP generation, one point is selected from each sub-MST formed by this division; this point is the rendezvous point. We use a cost function [1] to calculate each node's suitability as an RP based on two parameters. Since the data must be collected within a delay bound, nodes closer to the virtual path are a better fit as rendezvous points. Hence, the value of the cost function increases as the distance from the virtual path decreases. A node with a higher degree is also more suitable as a rendezvous point, since a high degree means that many nodes are nearby. Selecting such a node reduces both its routing overhead and the hop counts.

$$cost(s_i) = \frac{degree(s_i)}{dist(s_i, VP)} \tag{9}$$

The complete path connecting all RPs is shown in Fig. 3.
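The virtual path generation of Sect. 4.1 and the cost of Eq. 9 can be sketched as follows. This is one possible reading of the description rather than the authors' code: sector membership is decided by the polar angle of each node around the center C, sectors that contain no node are skipped (an assumption, since the paper does not say how empty sectors are handled), and the linear normalisation of the cost terms is omitted for brevity. All function names are hypothetical.

```python
import math
from bisect import bisect_right

TWO_PI = 2 * math.pi

def centroid(points):
    """Eq. 7 / Eq. 8: arithmetic mean of the x and y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def sector_bounds(nodes, c):
    """Sorted start angles of the eight sectors around the center c."""
    xs = [x for x, _ in nodes]
    ys = [y for _, y in nodes]
    corners = [(max(xs), max(ys)), (min(xs), max(ys)),
               (min(xs), min(ys)), (max(xs), min(ys))]
    # Horizontal and vertical lines through c ...
    angles = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
    # ... plus the four segments from c to the bounding-box corners.
    angles += [math.atan2(cy - c[1], cx - c[0]) % TWO_PI for cx, cy in corners]
    return sorted(angles)

def virtual_points(nodes):
    """One virtual point per non-empty sector, in counter-clockwise order."""
    c = centroid(nodes)
    bounds = sector_bounds(nodes, c)
    sectors = [[] for _ in bounds]
    for p in nodes:
        angle = math.atan2(p[1] - c[1], p[0] - c[0]) % TWO_PI
        sectors[bisect_right(bounds, angle) - 1].append(p)
    return [centroid(s) for s in sectors if s]

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.dist(p, a)
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.dist(p, (a[0] + t * dx, a[1] + t * dy))

def distance_to_path(p, vps):
    """Distance from p to the closed virtual polygon through vps."""
    k = len(vps)
    return min(point_segment_distance(p, vps[i], vps[(i + 1) % k])
               for i in range(k))

def rp_cost(degree, dist, eps=1e-9):
    """Eq. 9 without normalisation; eps guards against a zero distance."""
    return degree / max(dist, eps)
```

A candidate RP for a sub-MST would then be the node that maximises `rp_cost(degree, distance_to_path(node, vps))`, where `degree` counts the neighbours within communication range.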

Lemma 1. The algorithm has a time complexity of $O(n)$, where $n$ is the number of sensor nodes.

Proof 1. There are eight edges in the virtual path (VP), so the loop in step 3 runs eight times. Steps 4 to 17 all run in constant time. The total time complexity is therefore $O(8n) = O(n)$.

Linear normalization is performed on both the numerator and the denominator of Eq. 9, and the node with the highest cost in the current sub-MST is then chosen as its RP. We use the algorithm described in [4] to generate the tour. It uses a convex hull to generate the tour and has a time complexity of $O(r^2)$.

Lemma 2. The algorithm has a time complexity of $O(r^3)$, where $r$ is the number of rendezvous points.

Proof 2. Steps 1 to 3 take $O(n)$ time. Step 4 takes $O(n \log n)$ time. Steps 7 to 16 run $r$ times, as each iteration adds one RP. Steps 7 to 15 take at most $O(n)$ time, and step 16 takes $O(r^2)$ time. The total time complexity is therefore $O(n \log n + r \cdot \max(r^2, n))$. As the number of RPs increases, the $r \cdot r^2$ term dominates, and the time complexity becomes $O(r^3)$.
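The MST construction that Sect. 4.2 attributes to Kruskal's algorithm can be sketched as below. This is a generic Kruskal implementation with a union-find structure, not the authors' code: nodes are (x, y) tuples, an edge exists only between nodes within communication range, and the function name is hypothetical. The edge list is built by checking all node pairs, which is quadratic in the number of nodes; the paper does not describe how its edge list is built.

```python
import math

def kruskal_mst(nodes, comm_range):
    """Return MST edges (i, j, weight) of the graph linking nodes within range.

    If the range graph is disconnected, the result is a spanning forest.
    """
    n = len(nodes)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            w = math.dist(nodes[i], nodes[j])
            if w <= comm_range:
                edges.append((w, i, j))
    edges.sort()  # Kruskal processes edges in order of increasing weight

    parent = list(range(n))

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different components
            parent[ri] = rj
            mst.append((i, j, w))
    return mst

if __name__ == "__main__":
    sample = [(0, 0), (10, 0), (20, 0), (10, 10)]
    print(kruskal_mst(sample, comm_range=12))
```

Removing one edge from the returned tree splits it into two sub-MSTs, which is the division step the paper applies repeatedly when selecting RPs.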

Fig. 2. Virtual Polygon Path

Fig. 3. Final tour with data paths


5 Simulation Results
5.1 Simulation Setup
We simulated the algorithm in MATLAB R2020b on an Intel i7-1065G7 CPU with 16 GB RAM. We compared it with the MST-based circle virtual path approach [1]. In the compared algorithm, a circle is used as the virtual path, generated as the average bisector of lines intersecting the center of the network. It also uses a heuristic TSP algorithm with $O(n^3)$ time complexity to calculate the final tour. Both algorithms are simulated and analyzed on the same data, and their results are compared in terms of total hop count and network lifetime.

Table 1. Simulation parameters

Parameter             Value
Area of network       310 × 155 m²
Communication range   30 to 50 m
Nodes count           150 to 300
L_max                 550 m
Packet size           4000 bits
Sensor node energy    2 J
E_elec                50 × 10⁻⁹ Joules
E_fs                  10 × 10⁻¹² Joules
v                     2 m/s

We report each value as the average over 15 iterations, because hop counts can vary considerably with the placement of the sensor nodes. The simulation parameters are listed in Table 1.

5.2 Results and Discussion
As the sensor node's communication range increases, hop counts decrease (Fig. 4a). The suggested approach gives its largest improvement when the range is small. As the communication range increases, more of the area is covered by direct communication, and hence the difference between the hop counts of the two methods shrinks. Similarly, hop counts increase as the node density increases (Fig. 4b). The network's energy consumption is directly proportional to the total hop count required for data transmission. As the rounds of data collection increase, the number of dead nodes in the network increases. We performed simulations over 15 different networks to find the average number of dead nodes as a function of the number of data collection rounds. Even though all nodes initially have the same amount of energy, Fig. 5 shows that, as the rounds increase, the proposed method has fewer dead nodes than the compared algorithm [1]. Because of the appropriate positioning of the RPs, the suggested technique consumes energy in a more balanced way across the nodes. Because the suggested technique has fewer dead nodes than the compared algorithm, more packets are delivered to the sink, resulting in higher throughput. We simulated the network lifetime for a single network of 150 sensor nodes with a communication range of 50 m. The data is routed along the shortest path to the closest RP in a multi-hop manner without aggregation. Compared with the MST-based circle virtual path algorithm [1], the proposed algorithm gives a better network lifetime (see Fig. 6). This difference is due to the difference in hop counts between the two methods.

Fig. 4. (a) Hop count vs. communication range (in meters). (b) Hop count vs. node density. Both panels compare MST Circle Path and MST Polygon Path.

Fig. 5. Dead nodes vs. rounds (average dead nodes, MST Circle Path and MST Polygon Path)

Fig. 6. Network lifetime comparison (nodes alive vs. rounds, MST Circle Path and MST Polygon Path)

6 Conclusion
We have proposed and implemented a technique for generating a mobile sink path that can be used in delay-bound WSN applications. A virtual polygon path and an MST are used to create the path, and efficient RP placements are selected using a cost function. The suggested approach outperforms the existing algorithm that uses a circular virtual path in terms of overall hop counts. It gives about a 15% reduction in hop counts relative to the compared algorithm when run on a rectangular network. In addition, the time complexity of the proposed algorithm is $O(r^3)$, which is lower than that of the compared algorithm. We have assumed that the mobile sink cannot visit all sensor nodes individually within the given delay; if it can, the value of $r$ can grow up to $n$. This situation could be avoided by stopping the algorithm once a specified percentage of nodes can transfer their data via a single hop. Because the suggested technique uses only one mobile sink, it may not be appropriate for very large wireless sensor networks with delay restrictions. A further limitation of the algorithm is that it is not energy-aware. The lifetime of the network may be extended even further by letting the algorithm choose rendezvous points dynamically, taking the remaining energy of the sensors into account.

References
1. Nitesh, K., Azharuddin, M., Jana, P.: Minimum spanning tree-based delay-aware mobile sink traversal in wireless sensor networks: Delay-aware mobile sink traversal in WSN. Int. J. Commun. Syst. 30, e3270 (2017)
2. Salarian, H., Chin, K.-W., Naghdy, F.: An energy-efficient mobile-sink path selection strategy for wireless sensor networks. IEEE Trans. Veh. Technol. 63, 2407–2419 (2014)
3. Anwit, R., Jana, P.K., Tomar, A.: Sustainable and optimized data collection via mobile edge computing for disjoint wireless sensor networks. IEEE Trans. Sustain. Comput., 1 (2021)
4. Wen, W., Zhao, S., Shang, C., Chang, C.-Y.: EAPC: Energy-aware path construction for data collection using mobile sink in wireless sensor networks. IEEE Sens. J. 18, 890–901 (2018)
5. Wen, W., Dong, Z., Chen, G., Zhao, S., Chang, C.Y.: Energy efficient data collection scheme in mobile wireless sensor networks. In: Proceedings of the 2017 31st International Conference on Advanced Information Networking and Applications Workshops (WAINA), pp. 226–230. IEEE (2017)
6. Temene, N., Sergiou, C., Georgiou, C., Vassiliou, V.: A survey on mobility in wireless sensor networks. Ad Hoc Netw. 125, 102726 (2022)
7. Heinzelman, W.B., Chandrakasan, A.P., Balakrishnan, H.: An application-specific protocol architecture for wireless microsensor networks. IEEE Trans. Wirel. Commun. 1, 660–670 (2002)
8. Anwit, R., Tomar, A., Jana, P.K.: Tour planning for multiple mobile sinks in wireless sensor networks: A shark smell optimization approach. Appl. Soft Comput. 97, 106802 (2020)
9. Anwit, R., Tomar, A., Jana, P.K.: Scheme for tour planning of mobile sink in wireless sensor networks. IET Commun. 14, 430–439 (2020)
10. Heinzelman, W.R., Chandrakasan, A., Balakrishnan, H.: Energy-efficient communication protocol for wireless microsensor networks. In: Proceedings of the 33rd Annual Hawaii International Conference on System Sciences. IEEE Computer Society (2005)
11. Aydin, M.A., Karabekir, B., Zaim, A.H.: Energy efficient clustering-based mobile routing algorithm on WSNs. IEEE Access 9, 89593–89601 (2021)
12. Punriboon, C., So-In, C., Aimtongkham, P., Leelathakul, N.: Fuzzy logic-based path planning for data gathering mobile sinks in WSNs. IEEE Access 9, 96002–96020 (2021)

Application of Different Control Techniques of Multi-area Power Systems

Smrutiranjan Nayak1(B), Sanjeeb Kumar Kar1, and Subhransu Sekhar Dash2

1 Department of Electrical Engineering, ITER, Siksha ‘O’ Anusandhan (Deemed to Be University), Bhubaneswar, Odisha, India
[email protected], [email protected]
2 Department of Electrical Engineering, Government College of Engineering, Keonjhar, Odisha, India
[email protected]

Abstract. A firefly algorithm (FA) is proposed for load frequency control (LFC) of multi-area power systems. Initially, a two equal-area non-reheat thermal system is considered, and the optimal gains of the proportional integral derivative (PID) controller are optimized using the firefly algorithm. The superiority of the proposed approach is demonstrated by comparing the results with some recently published techniques such as the genetic algorithm (GA), bacterial foraging optimization algorithm (BFOA), differential evolution (DE), particle swarm optimization (PSO), hybrid BFOA-PSO, and Ziegler–Nichols-based controllers for the same interconnected power system. Further, the proposed approach is extended to a three-unequal-area thermal system considering the generation rate constraint (GRC) and governor dead-band (GDB). Investigations reveal that the PID controller gives a much better response than the integral (I) and proportional integral (PI) controllers.

Keywords: Firefly algorithm · Multi-area power system · Load frequency control · Proportional-integral controller · Proportional integral derivative controller

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 308–314, 2022. https://doi.org/10.1007/978-3-031-11713-8_31

1 Introduction

An electric energy system must be maintained at a desired operating level characterized by nominal frequency and voltage profile, which is achieved by close control of the real and reactive powers generated by the controllable sources of the system. One aspect is active power and frequency control, while the other is reactive power and voltage control [1]. Otherwise, the power system is eventually driven towards an unstable condition [2]. Therefore, the main task of load frequency control (LFC) is to keep the frequency constant against randomly varying active power loads, which are also referred to as unknown disturbances. A typical large-scale power system is composed of several areas of generating units. These areas are interconnected to reduce the cost of electricity and to improve the reliability of the power supply. However, in the event of a load change, the area subject to the load change should balance it without external assistance; otherwise, there would be economic conflicts between the areas. Consequently, each area needs a separate load frequency controller to regulate the tie-line power [3, 4]. Several classical controller structures, such as integral (I), proportional integral (PI), integral derivative (ID), proportional integral derivative (PID), and integral double derivative (IDD), have been applied and their performance compared for an automatic generation control (AGC) system [5]. A suboptimal AGC regulator was proposed for a two equal-area thermal system with reheat turbines interconnected through parallel AC/DC links, using an output feedback control strategy [6, 7]. A coordinator design scheme was proposed to improve the combined operation of pitch angle regulation and the dump-load AGC of an isolated power system with wind generators and a flywheel energy storage system [8]. Another model was proposed for LFC: the model was derived from the detailed area model, and a balanced truncation method was applied to reduce the order of the controllers [9, 10]. It is clear from the literature survey that the performance of the power system depends not only on the artificial intelligence techniques employed but also on the controller structure and the chosen objective function. Hence, proposing and implementing new high-performance heuristic optimization algorithms for real-world problems is always welcome. This study presents both the application of a powerful computational intelligence technique, the firefly algorithm, to optimize the controller parameters of a load frequency control system and a comprehensive analysis of its tuning performance compared with other recently reported modern optimization techniques as well as classical techniques.
Initially, a two equal-area non-reheat thermal system is considered, and the superiority of the proposed approach is demonstrated by comparing the results with some recently published modern heuristic optimization techniques such as differential evolution (DE), genetic algorithm (GA), bacterial foraging optimization algorithm (BFOA), particle swarm optimization (PSO), hybrid PSO-BFOA, and conventional Ziegler-Nichols (ZN) based controllers for the same interconnected power system. To show the ability of the proposed approach to handle unequal power systems and physical constraints, the approach is extended to a three-unequal-area thermal system with GRC and GDB [11–13]. Finally, a sensitivity analysis is performed, which shows the robustness of the optimized controller parameters when the system parameters and operating load are varied from their nominal values.

2 Materials and Methods

2.1 System Examined

The two-area non-reheat thermal power system is shown in Fig. 2. The governor dead-band (GDB) affects the dynamics of the power system. A three-area test system is also considered, which consists of three thermal units of different capacities, one in each area, as shown in Fig. 3 [3, 4].

2.2 Controller Structure and Objective Function

The structure of the proposed PID controller is given in Fig. 1.


The respective area control errors (ACE) are given by

$$e_1(t) = ACE_1 = B_1 \Delta F_1 + \Delta P_{Tie} \tag{1}$$

$$e_2(t) = ACE_2 = B_2 \Delta F_2 - \Delta P_{Tie} \tag{2}$$

Here $\Delta P_{Tie}$ is the tie-line power error, $\Delta F$ is the deviation of the generator output frequency, and $B$ is the frequency bias parameter. The objective function is given by

$$J = ITAE = \int_{0}^{t_{sim}} \left( |\Delta F_i| + |\Delta P_{Tie-i-k}| \right) \cdot t \, dt \tag{3}$$

where $\Delta F_i$ is the frequency change in area i, $\Delta P_{Tie-i-k}$ is the incremental change in tie-line power between area i and area k, and $t_{sim}$ is the simulation time span.
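As a rough illustration of Eqs. (1) to (3), the following sketch evaluates the area control errors and approximates the ITAE integral from sampled signals with the trapezoidal rule. The function names, the numerical values, and the choice to sum the frequency deviations of both areas in the integrand are assumptions made for this example; they are not taken from the chapter.

```python
def area_control_errors(b1, b2, df1, df2, dp_tie):
    """Eqs. (1)-(2): ACE_1 = B1*dF1 + dPtie and ACE_2 = B2*dF2 - dPtie."""
    ace1 = b1 * df1 + dp_tie
    ace2 = b2 * df2 - dp_tie
    return ace1, ace2


def itae(times, df1, df2, dp_tie):
    """Approximate Eq. (3) for a two-area system with the trapezoidal rule.

    All arguments are equal-length sequences of samples. Summing |dF1| and
    |dF2| in the integrand is an assumption of this sketch.
    """
    integrand = [
        (abs(f1) + abs(f2) + abs(p)) * t
        for t, f1, f2, p in zip(times, df1, df2, dp_tie)
    ]
    total = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        total += 0.5 * (integrand[k] + integrand[k - 1]) * dt
    return total


# Illustrative (hypothetical) values: bias 0.425 in both areas.
ace1, ace2 = area_control_errors(0.425, 0.425, -0.02, -0.01, 0.005)
j = itae([0.0, 1.0, 2.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

In a tuning loop, the sampled signals would come from a time-domain simulation of the power system for a given set of controller gains, and the returned value would serve as the cost to be minimized.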

Fig. 1. Structure of PID controller.

Fig. 2. Two-area non-reheat thermal system

Fig. 3. Three-unequal-area system with reheat, GRC and GDB

2.3 Firefly Technique

The flowchart of the algorithm is shown in Fig. 4. The firefly algorithm (FA) is a population-based algorithm developed by Yang [14, 15]. It is based on three rules:

1. All fireflies are unisex and are attracted to other fireflies regardless of their sex.
2. The attractiveness of a firefly is proportional to its brightness.
3. The brightness of a firefly is determined by the objective function to be optimized.

The light intensity varies with distance as

$$I(r) = I_0 e^{-\gamma r} \tag{4}$$

where $I_0$ is the original light intensity, $\gamma$ is the absorption coefficient, and $r$ is the distance. The attractiveness $\beta$ of a firefly is stated as

$$\beta = \beta_0 e^{-\gamma r^2} \tag{5}$$

where $\beta_0$ is the attractiveness at $r = 0$.
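The rules and Eq. (5) can be sketched as a small minimizer. This is a minimal illustration, not the authors' implementation: the movement step (attraction towards a brighter firefly plus a random perturbation) follows the commonly described form of the firefly algorithm, and the parameter values, the function name `firefly_minimize`, and the quadratic test cost are assumptions. In the setting of this chapter, the cost would instead be the ITAE obtained by simulating the system with candidate controller gains.

```python
import math
import random


def firefly_minimize(cost, bounds, n_fireflies=15, n_iter=60,
                     beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
    """Minimize cost(x) inside box bounds with a basic firefly algorithm."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_fireflies)]
    val = [cost(p) for p in pos]
    best_val = min(val)
    best_pos = list(pos[val.index(best_val)])

    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if val[j] < val[i]:  # firefly j is brighter (lower cost)
                    r2 = sum((a - b) ** 2 for a, b in zip(pos[i], pos[j]))
                    beta = beta0 * math.exp(-gamma * r2)  # Eq. (5)
                    for d in range(dim):
                        lo, hi = bounds[d]
                        step = beta * (pos[j][d] - pos[i][d])
                        step += alpha * (rng.random() - 0.5) * (hi - lo)
                        # Clamp the new position to the search box.
                        pos[i][d] = min(hi, max(lo, pos[i][d] + step))
                    val[i] = cost(pos[i])
                    if val[i] < best_val:
                        best_val, best_pos = val[i], list(pos[i])
        alpha *= 0.97  # gradually shrink the random step
    return best_pos, best_val


def sphere(x):
    """Placeholder cost with its minimum at the origin."""
    return sum(v * v for v in x)


bounds = [(-2.0, 2.0)] * 3  # e.g. three gains to tune (hypothetical ranges)
best_pos, best_val = firefly_minimize(sphere, bounds, seed=1)
```

The best solution is tracked separately from the population, so the returned cost is never worse than the best initial firefly.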


Fig. 4. Flowchart of firefly algorithm

3 Results and Discussions

The performance of the proposed FA-optimized PI/PID controller is compared with that of several techniques: DE, PSO, and hybrid BFOA-PSO. The proposed FA-optimized PID controller is robust and delivers satisfactory performance when the disturbance location changes. The first aim of this work is to demonstrate the advantage of the FA-tuned PID controller over the FA-tuned PI controller, the hBFOA-PSO-tuned PI controller, the DE-tuned PI controller, and the BFOA-tuned PI controller. The second aim is to design PID controllers that are robust and perform better [16–19]. The resulting frequency deviations are shown in Fig. 5 and Fig. 6.


Fig. 5. Frequency deviation of area 2 for a 10% change in area 1 and a 20% change in area 2

Fig. 6. Frequency deviation of area 1 for a 10% change in area 1 and area 2

4 Conclusion

In this study, the FA is used as an optimization technique to tune the LFC system. A widely used two equal-area power system is examined in the first instance, and the parameters of the PI and PID controllers are optimized by the FA using an ITAE-based fitness function. The FA is applied to tune the PID gains for the AGC of the power system. The advantage of the FA-tuned PID controller over the FA-tuned PI controller, the hBFOA-PSO-tuned PI controller, the DE-tuned PI controller, and the BFOA-tuned PI controller is shown, and the benefits of the PID controller over the PI and I controllers are also illustrated. Several control strategies have been proposed by several researchers for AGC of interconnected power systems to achieve improved performance.

References 1. Elgerd, O.I.: Electric Energy Systems Theory—An Introduction. Tata McGraw Hill, New Delhi (2000)


2. Shayeghi, H., Shayanfar, H.A., Jalili, A.: Load frequency control strategies: a state-of-the-art survey for the researcher. Int. J. Energy Convers. Manag. 50(2), 344–353 (2009) 3. Saikia, L.C., Nanda, J., Mishra, S.: Performance comparison of several classical controllers in AGC for a multi-area interconnected thermal system. Int. J. Elect. Power Energy Syst. 33, 394–401 (2011) 4. Nayak, S., Kar, S., Dash, S.S.: A hybrid search group algorithm and pattern search optimized PIDA controller for automatic generation control of interconnected power system. In: Dash, S.S., Das, S., Panigrahi, B.K. (eds.) Intelligent Computing and Applications. AISC, vol. 1172, pp. 309–322. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-5566-4_27 5. Shabani, H., Vahidi, B., Ebrahimpour, M.A.: Robust PID controller based on the imperialist competitive algorithm for load frequency control of power systems. ISA Trans. 52, 88–95 (2012) 6. Sahu, R.K., Panda, S., Padhan, S.: A hybrid firefly algorithm and pattern search technique for automatic generation control of multi-area power systems. Int. J. Elect. Power Energy Syst. 64, 9–23 (2015) 7. Kundur, P.: Power System Stability and Control. Tata McGraw Hill, New Delhi (2009) 8. Hamid, Z.B.A., et al.: Optimal sizing of distributed generation using firefly algorithm and loss sensitivity for voltage stability improvement. Indo. J. Electr. Eng. Comp. Sci. 17(2), 720–727 (2020) 9. Panda, S., Mohanty, B., Hota, P.K.: Hybrid BFOA- PSO algorithm for automatic generation control of linear and non-linear interconnected power systems. Appl. Soft Comput. 13(12), 4718–4730 (2013) 10. Nayak, S.R., Kar, S.K., Dash, S.S.: Performance comparison of the hSGA-PS procedure with PIDA regulator in AGC of power system. In: ODICON-2021, pp. 1–4 (2021) 11. Rout, U.K., Sahu, R.K., Panda, S.: Design and analysis of differential evolution algorithm based automatic generation control for interconnected power system. Ain Shams Eng. J. 
4(3), 409–421 (2013) 12. Moradian, M., Soltani, J., Arab-Markadeh, G.R., Shahinzadeh, H., Amirat, Y.: A new grid connected constant frequency three-phase induction generation system under unbalancedvoltage conditions. Electronics 10(8), 938 (2021) 13. Huang, C., Xianzhong, K.D., Zang, Q.: Robust load frequency controller design based on a new strict model. Elect. Power Compon. Syst. 41(11), 1075–1099 (2013) 14. Yang, X.S., Hosseini, S.S.S., Gandomi, A.H.: Firefly algorithm for solving non-convex economic dispatch problems with valve loading effect. Appl. Soft Compt. 12, 1180–1186 (2012) 15. Saikia, L.C., Sahu, S.K.: Automatic generation control of a combined cycle gas turbine plant with classical controllers using firefly algorithm. Int. J. Elect. Power Energy Syst. 53, 27–33 (2013) 16. Padhi, J.R., Debnath, M.K., Pal, S., Kar, S.K.: AGC investigation in wind-thermal-hydrodiesel power system with 1 plus fractional order integral plus derivative controller. Int. J. Recent Technol. Eng. 8(1), 281–286 (2019) 17. Mohanty, P.K., Sahu, B.K., Pati, T.K., Panda, S., Kar, S.K.: Design and analysis of fuzzy PID controller with derivative filter for AGC in multi-area interconnected power system. IET Gener. Transm. Distrib. 10(15), 3764–3776 (2016) 18. Sahu, B.K., Pati, T.K., Nayak, J.R., Panda, S., Kar, S.K.: A novel hybrid LUS–TLBO optimized fuzzy-PID controller for load frequency control of multi-source power system. Electr. Power Energy Syst. 74, 58–69 (2016) 19. Biswal, K., Swain, S., Tripathy M.C., Kar, S.K.: Modeling and performance improvement of fractional-order band-pass filter using fractional elements. IETE J. Res. (2021)

Analysis of an Ensemble Model for Network Intrusion Detection

H. S. Gururaja1(B) and M. Seetha2

1 Department of Information Science and Engineering, B.M.S. College of Engineering, Bengaluru, India
[email protected]
2 Department of Computer Science and Engineering, G.Narayanamma Institute of Technology & Science, Hyderabad, India

Abstract. Network security is mission-critical, not only for business continuity but also for the large and growing number of systems and applications that run continuously over networks to deliver services. Intrusion detection systems have emerged as one of the best ways to improve network security. Traditional intrusion detection systems are rule-based and are not effective in detecting new and previously unknown intrusion events. Data mining techniques and machine learning algorithms have recently gained attention as an alternative approach to proactively detect network security breaches. In this research work, experiments are performed on the NSL-KDD dataset using various data mining algorithms. Decision Tree, Naïve Bayes, Random Forest, and Logistic Regression classifiers were used. The results obtained showed that the models are biased against classes with low distribution in the dataset.

Keywords: Intrusion detection system · Multiclass classification · Binary classification · Ensemble classifier

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 315–325, 2022. https://doi.org/10.1007/978-3-031-11713-8_32

1 Introduction

Intrusion generally refers to malicious activities directed at a computer network system to compromise its integrity, availability, and confidentiality. Network security is critical because modern information technology depends on it to drive businesses and services. Security can be enforced on the network through intrusion detection systems (IDS). These are security devices or software usually deployed by large and medium organizations to enforce security policies and monitor the network perimeter against security threats and malicious activities. Other related systems include the firewall and the intrusion prevention system (IPS). Essentially, an intrusion detection device or application scrutinizes all incoming and outgoing network traffic and inspects packets (both header and payload) for known and unknown events. Detected known events and violations are usually logged in a central security information and event management (SIEM) system [1]. Malicious activities or unknown events may be configured to alert the system administrator, or the related packets may be dropped, depending on the configuration enabled on the intrusion detection system.

Security breaches cannot be completely prevented. Consequently, intrusion detection becomes critical for organizations to deal proactively with security threats in their networks. However, many existing intrusion detection systems are rule-based and are not very effective at detecting a new intrusion event that has not been encoded in the existing rules. Besides, developing intrusion detection rules is time-consuming and is limited to knowledge of known intrusions only. Supervised and unsupervised learning techniques help in identifying and separating known and new intrusions from records or data. It is therefore worthwhile to investigate the application of data mining methods [10, 11] as a viable alternative approach to detect known and potential network intrusions. To make the experience as natural as possible after the gesture has been identified, it transfers the output to an audio device to speak the interpreted output. This makes the communication feel more natural rather than just reading out content on a screen. The main motivation for this work is to predict the type of network security breach with greater accuracy and precision, to help protect users from data security breaches.

2 Literature Survey

With the popularity of the Internet and the widespread use of networks, the number of attacks has increased, and numerous hacking tools and intrusion methods have gained traction. Within a network, one way to counter suspicious activity is to use an intrusion detection system (IDS) [1]. IDSs are broadly categorized as misuse detectors or anomaly detectors according to their detection model. Misuse detectors rely on models of known attacks, whereas anomaly detectors build profiles of normal user behavior and classify deviations from those profiles as intrusions. However, the number of intrusions increases drastically as the speed and complexity of networks grow, particularly when such networks are opened to the general public. Intrusion detection aims to catch network attacks and is one of the essential means of addressing network security problems. The two major indicators for evaluating intrusion detection systems are detection precision and stability [1]. With changing technology and the huge growth of Internet traffic, it is becoming difficult for any intrusion detection system to offer a reliable service. It has been established that attacks exhibit behavioral patterns that can be learned from earlier observations.

3 Implementation

3.1 KDD Dataset

The NSL-KDD dataset is an improved version of KDDCUP-99. It was obtained from the University of New Brunswick, Canada, for the purpose of experimentation. The dataset is an extended version of the KDDCUP dataset (1999) from the DARPA Intrusion Detection Evaluation Program. Redundant records in the initial versions of the KDD dataset led to poor intrusion detection [7]. The NSL-KDD dataset used for analysis consists of 22,544 records in the test set and 125,973 records in the training set, with a total of 42 attributes for each connection, including the class label that maps to the attack types.

3.2 Data Load and Pre-processing

During the data pre-processing stage, the different attack classes were mapped as shown in Fig. 1.

Fig. 1. Intrusion types and subclasses

Denial of Service (DoS): An attack in which an adversary directs a deluge of traffic requests at a system in order to make the computing or memory resources too busy or too full to handle legitimate requests, and in the process denies legitimate users access to the machine.

Probe Attack (Probe): Probing a network of computers to gather information that can be used to compromise its security controls.

User to Root Attack (U2R): An exploit in which the adversary starts out with access to a normal user account on the system (gained by a dictionary attack, social engineering, or password sniffing) and is able to exploit some vulnerability to gain complete access to the machine.

Remote to Local Attack (R2L): An attack in which an adversary who can send network-layer packets to a system over some network, but who does not have credentials for that system, exploits some vulnerability to gain local access as a user of that system.
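The mapping from individual attack labels to these four classes can be sketched as a lookup table. The dictionary below lists only a subset of the attack labels found in the NSL-KDD training data and is meant as an illustration; the complete mapping used by the authors is the one in Fig. 1, and the names `ATTACK_CLASS` and `map_attack_class` are hypothetical.

```python
from collections import Counter

# Illustrative subset of NSL-KDD attack labels grouped into the four classes.
ATTACK_CLASS = {
    "normal": "Normal",
    # Denial of Service
    "neptune": "DoS", "smurf": "DoS", "back": "DoS",
    "teardrop": "DoS", "pod": "DoS", "land": "DoS",
    # Probe
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    # User to Root
    "buffer_overflow": "U2R", "rootkit": "U2R",
    "loadmodule": "U2R", "perl": "U2R",
    # Remote to Local
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L", "phf": "R2L",
    "multihop": "R2L", "warezmaster": "R2L", "warezclient": "R2L", "spy": "R2L",
}


def map_attack_class(label):
    """Return the attack class for a raw label, or 'Unknown' if unmapped."""
    return ATTACK_CLASS.get(label.strip().lower(), "Unknown")


raw_labels = ["normal", "neptune", "satan", "rootkit", "guess_passwd", "smurf"]
class_counts = Counter(map_attack_class(label) for label in raw_labels)
```

Counting the mapped labels, as in the last line, is also a quick way to inspect the class distribution before any resampling.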


3.3 Exploratory Data Analysis

Basic exploratory data analysis was carried out to, among other things, understand the descriptive statistics of the dataset, find missing values and redundant features, explore the data types and structure, and investigate the distribution of attack classes in the dataset [3].

3.4 Standardization of Numerical Attributes

The numerical features in the dataset were extracted and standardized to have zero mean and unit variance, a common requirement of many machine learning algorithms [6, 12].

3.5 Encoding Categorical Attributes

The categorical features in the dataset need to be encoded as integers before they are passed to the various ML algorithms [12].

3.6 Data Sampling

Certain attack classes such as U2R and R2L are sparsely distributed in the dataset, while others such as Normal, DoS, and Probe are well represented, which inherently leads to an imbalanced dataset. While this scenario is not unexpected in data mining tasks [4] that involve identifying or classifying deviations from normal patterns in a given dataset, research has shown that supervised learning algorithms are often biased against the target class that is weakly represented in a given dataset. The sampling table is given in Fig. 2.
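The standardization and integer encoding steps described in Sects. 3.4 and 3.5 can be sketched in plain Python as follows. The chapter cites scikit-learn [12] but does not list the exact calls used, so the helper names `standardize` and `label_encode` and the sample values are assumptions for illustration only.

```python
import statistics


def standardize(values):
    """Scale a numeric column to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


def label_encode(values):
    """Map each category to an integer code (codes follow sorted order)."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


# Hypothetical sample values for one numeric and one categorical column.
z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
codes, classes = label_encode(["tcp", "udp", "icmp", "tcp"])
```

The scaling parameters and the category-to-code mapping should be computed on the training set and then reused unchanged on the test set, so that both sets are transformed consistently.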

Fig. 2. Sampling table

Approaches such as random data sampling and cost-sensitive learning [8] have been suggested to address the problem of imbalanced datasets. Sampling modifies an imbalanced dataset so that its class distribution becomes balanced. Oversampling replicates records of classes with low frequency, whereas random undersampling removes records of classes with high frequency from the original dataset.
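A minimal sketch of the two random sampling strategies is given below. It balances every class to the size of the largest class (oversampling) or the smallest class (undersampling); these targets, the function names, and the toy labels are assumptions made for this example and are not taken from the sampling table in Fig. 2.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Replicate random records of each class up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class equal to the smallest class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data: row values are simply the record indices.
labels = ["Normal"] * 6 + ["DoS"] * 3 + ["U2R"]
rows = list(range(len(labels)))
over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
```

Oversampling only duplicates existing records, so it balances the class counts without adding new information about the minority classes.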


Cost-sensitive techniques address the imbalanced learning problem by using cost metrics for misclassified data rather than by creating balanced distributions through sampling [8]. Studies have shown that classifiers trained on balanced data achieve significantly better classification performance than those trained on imbalanced data [9]. Popular sampling techniques include the synthetic minority oversampling technique (SMOTE) and random oversampling and undersampling.

3.7 Feature Selection

Feature selection is a key data pre-processing step in data mining tasks. It involves selecting important features as a subset of the original features according to certain criteria, in order to reduce dimensionality and thereby improve the efficiency of data mining algorithms [10]. Most data include irrelevant, redundant, or noisy features. Feature selection reduces the number of features, removes irrelevant, redundant, or noisy ones, and brings tangible benefits for applications: it speeds up the data mining algorithm [11], improves learning accuracy, and leads to better model comprehensibility. There are two approaches to selecting features:

1. A wrapper uses the intended learning algorithm itself to evaluate the usefulness of features.
2. A filter evaluates features according to heuristics based on general characteristics of the data.

The wrapper approach generally produces better feature subsets but runs much more slowly than a filter. The implementation uses a wrapper approach in which a Random Forest classifier [2] with a function to compute feature importance was trained to extract feature importance from the training dataset. A second step used recursive feature extraction, also based on the Random Forest algorithm [2], to extract the top 10 features relevant to accurate classification of the classes in the training set.
3.8 Data Partition

After feature selection, the resampled dataset is partitioned into two target classes (normal and attack) for each attack class in the dataset, to allow binary classification. For multiclass classification no such partition is required, so this step is skipped for that approach.

3.9 Train Models

The training dataset was used to train the following classifiers: Support Vector Machine (SVM), Decision Tree, Naive Bayes, Logistic Regression, and k-Nearest Neighbor (KNN). In addition, an ensemble classifier was trained to combine and average the predictions of the individual classifiers. The idea behind this classifier is to combine conceptually different machine learning algorithms [5] and use a majority (hard) vote or the average predicted probability to predict the class labels. Such a classifier can be useful for a set of equally well-performing models in order to balance out their individual weaknesses. Two approaches were employed. The first is a multiclass classification in which the dataset as a whole, with its 5 classes (i.e., DoS, Normal, Probe, R2L, U2R), was used to train the models. Since Support Vector Machine and Logistic Regression are primarily designed for binary classification, these 2 methods were omitted in this approach. The second approach involved creating separate binary classifiers for each attack group. To achieve this, the dataset is partitioned into 4 groups, each with 2 classes (Normal and an attack group), as mentioned earlier. For each such group, all the above-mentioned classifiers, along with the ensemble of these classifiers, were trained. This improves the detection rates at the cost of each model being able to detect only the attack group it was trained on.

3.10 Evaluate Models

The trained models were evaluated using 5-fold cross validation. The concept of k-fold cross validation involves generating a validation set out of the training dataset to evaluate the model before it is exposed to test data. In k-fold cross validation, the training dataset is randomly divided into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as validation data for evaluating the model, and the remaining (k – 1) subsamples are used as training data. The cross validation process is repeated k times (the folds), with each of the k subsamples used exactly once as validation data. The k results from the folds can then be averaged or combined to produce a single estimate. The advantage of this method is that all samples are used for both training and validation.
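The two ensemble voting rules and the k-fold partitioning can be sketched without any machine learning library. The function names, the toy predictions, and the way folds are formed (striding over a shuffled index list) are assumptions made for this illustration; the chapter does not specify the implementation used.

```python
import random
from collections import Counter


def hard_vote(predictions):
    """Majority label per sample from a list of per-classifier label lists."""
    return [Counter(sample).most_common(1)[0][0]
            for sample in zip(*predictions)]


def soft_vote(probabilities, classes):
    """Label with the highest class probability averaged over classifiers."""
    n_clf = len(probabilities)
    voted = []
    for sample in zip(*probabilities):
        avg = [sum(p[c] for p in sample) / n_clf for c in range(len(classes))]
        voted.append(classes[avg.index(max(avg))])
    return voted


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists; each sample validates once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


# Toy outputs of three classifiers for two samples.
hard = hard_vote([["DoS", "Normal"], ["DoS", "Probe"], ["Normal", "Probe"]])
# Toy class probabilities of two classifiers, ordered as (Normal, Attack).
soft = soft_vote([[[0.9, 0.1], [0.2, 0.8]], [[0.4, 0.6], [0.4, 0.6]]],
                 ["Normal", "Attack"])
splits = list(k_fold_indices(23, k=5))
```

When the number of samples is not divisible by k, the folds produced here differ in size by at most one record.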

4 Results and Discussions

All the models evaluated achieved an average accuracy of 99% on the training set, while performance on the test set averaged more than 80% across the four attack groups investigated. The test accuracy for ‘Normal_U2R’ was above 90% for all models. However, the confusion matrices indicate that the ‘U2R’ detection rate was very poor for the Decision Tree, Random Forest, k-Nearest Neighbor, and Logistic Regression models. The detection rate for the attack class ‘R2L’ is also close to poor, because ‘Normal’ and other classes with higher distribution received more attention from the models to the detriment of the low-distribution classes. The results generally showed that sampling may improve model training accuracy. Due to the large bias in the datasets, poor detection of U2R and R2L attacks is inevitable. Figure 3 gives the detection rates for the binary classifiers.


Fig. 3. Detection rates for binary classifier

Figures 4, 5, 6, 7, 8 and 9 give the confusion matrices for SVM, Naïve Bayes, Decision Trees, Random Forest, KNN and Logistic Regression, respectively.

Fig. 4. Confusion matrix for SVM


Fig. 5. Confusion matrix for Naïve Bayes

Fig. 6. Confusion matrix for Decision Trees

Fig. 7. Confusion matrix for Random Forest

Figures 10 and 11 give the confusion matrices for the Ensemble multiclass and Ensemble binary classifiers, respectively.


Fig. 8. Confusion matrix for KNN

Fig. 9. Confusion matrix for Logistic Regression

Fig. 10. Confusion matrices of the Ensemble multiclass classifiers


Fig. 11. Confusion matrices of Ensemble binary classifiers

5 Conclusion

One of the key lessons learnt here is that, for classification models, accuracy is not a good measure of model performance when the dataset is imbalanced. The accuracy of a model may be high even though the class with fewer samples is not classified effectively. If 10% of the input samples represent attacks and the model predicts that none of the samples is an attack, it would have an accuracy of 90%, but such a model is of no use because it fails to detect any attacks. Thus, other metrics such as the confusion matrix are more realistic than accuracy for evaluating classifier performance. The use of machine learning models has greatly improved the accuracy with which attacks can be predicted. To build a more dependable system, even more complex models such as ANNs and deep learning techniques may be needed. These models are computationally more involved and are expected to give better results. These better results come at a price, though: such models have very high computational demands and require advanced hardware, so specially dedicated systems would be needed to run them. Overall, future work holds exciting innovations to explore in order to achieve better results.
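The 10% example above can be checked with a few lines of code. The sketch builds 100 labels of which 10 are attacks, lets a trivial model predict "normal" for every sample, and compares accuracy with the attack detection rate (recall). The helper name `confusion_counts` and the label strings are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Return (TP, FN, FP, TN) counts for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fn, fp, tn


# 10% of the samples are attacks; the model never predicts an attack.
y_true = ["attack"] * 10 + ["normal"] * 90
y_pred = ["normal"] * 100

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
detection_rate = tp / (tp + fn)  # recall on the attack class
```

Here the accuracy is 0.9 while the detection rate is 0.0, which is exactly the situation the confusion matrix exposes and the accuracy figure hides.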

6 Future Scope

This research work can be extended and improved in several ways in future iterations of the method. The lack of data was a major cause of less accurate results, especially for the U2R and R2L attack groups. Random oversampling can only reduce the imbalance in the dataset; in reality, the new data samples created do not provide any information gain over the original dataset. Any datasets that become available in the future with a higher number of samples for these attack groups may be used to improve the detection accuracy, since the models generally become more accurate with more data. Hence, an expansion of the NSL-KDD dataset, or any other similarly comprehensive dataset, can help improve prediction rates. In this research work, we have broadly classified 4 attack groups, but each individual attack may have attributes or characteristics that cannot be generalized to all the attacks belonging to the same group. Classifying attacks under such umbrella groups may work for most applications, but some may require the specifics of each subcategory. It may be better to train for each individual attack type rather than for a general attack group, since each attack type may have unique characteristics that may be completely different. However, this will drastically increase the computational complexity and will require specially designed high-powered systems to obtain results in a feasible time.

References

1. Peddabachigari, S., Abraham, A., Grosan, C., Thomas, J.: Modeling of intrusion detection system using hybrid intelligent systems. J. Netw. Comput. Appl. 30, 114–132 (2007)
2. Zhang, J., Zulkernine, M., Haque, A.: Random-forest based network intrusion detection systems. IEEE Trans. Syst. Man Cybern. 38, 649–659 (2008)
3. Panda, M., Abraham, A., Das, S., Patra, M.R.: Network intrusion detection system: a machine learning approach. Intell. Decis. Technol. 5(4), 347–356 (2011)
4. Dokas, P., Ertoz, L., Kumar, V., Lazarevic, A., Srivastava, J., Tan, P.N.: Data mining for network intrusion detection. In: Proceedings of NSF Workshop on Next Generation Data Mining, pp. 21–30 (2002)
5. Data Science Association: Introduction to Machine Learning. The Wikipedia Guide (2016)
6. Swamynathan, M.: Mastering Machine Learning with Python in Six Steps. Apress, New York (2017)
7. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP 99 data set. In: Proceedings of the IEEE Symposium on Computational Intelligence in Security and Defense Applications, pp. 1–6 (2009)
8. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
9. Weiss, G.M., Provost, F.: The effect of class distribution on classifier learning: an empirical study. Technical Report ML-TR-43, Department of Computer Science, Rutgers University (2001)
10. Wang, J.: Data Mining Opportunities and Challenges, pp. 80–105. Idea Group Publishing (2003)
11. Liu, H., Setiono, R., Motoda, H., Zhao, Z.: Feature selection: an ever-evolving frontier in data mining. In: JMLR: Workshop and Conference Proceedings, vol. 10, pp. 4–13 (2010)
12. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

D2D Resource Allocation for Joint Power Control in Heterogeneous Cellular Networks

Hyungi Jeong1 and Wanying Guo2(B)

1 Gachon University, Seongnam, South Korea
2 Sungkyunkwan University, Seoul, South Korea

Abstract. To address the interference caused when D2D (Device-to-Device) users and micro cellular users multiplex the resources of macro cellular users in HetNets (heterogeneous networks), a resource allocation scheme with joint power control is proposed. First, under the constraints on the user SINR (Signal to Interference and Noise Ratio) and the transmit power, the optimal transmit power is derived from the system interference model for each D2D user and micro cellular user that multiplexes the channel resources of a macro cellular user. Second, the users' channel selection is formulated as a two-sided matching problem between users and channels, and the Gale-Shapley algorithm is used to obtain a stable matching. Finally, the stable matching is used as the initial condition of an exchange-search algorithm, which optimizes it further. Simulation results show that the total system capacity and energy efficiency exceed 93% and 92% of the optimal solution, respectively. Compared with random resource allocation, the scheme with exchange search but without power control, and the scheme with power control but without exchange search, the average improvement is more than 48%, 15% and 4%, respectively.

Keywords: D2D · Gale-Shapley algorithm · Heterogeneous networks · Resource allocation

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 326–337, 2022. https://doi.org/10.1007/978-3-031-11713-8_33

1 Introduction

With the development of communication technology and the widespread use of the Internet of Things (IoT) and multimedia applications, the data traffic of mobile communication networks has grown explosively. D2D communication offers a high wireless transmission rate and low delay, and has very broad application prospects [1–3]. At present, the 5G network is a hybrid communication network composed of macro cells, micro cells and D2D links [4]. Because D2D users, micro cellular users and macro cellular users suffer serious co-channel interference when sharing communication resources, how to manage interference effectively and improve the utilization of spectrum resources is a current research hot spot.

The interference management of D2D communication in cellular networks has been studied from several angles. [5] proposes a greedy heuristic algorithm in which D2D users select the resource block according to the amount of interference caused by multiplexing the channel resources of cellular users, but it does not control the users' transmit power. [6] proposes a resource allocation scheme that considers only individual interference, and designs an iterative algorithm to solve the interference problem between D2D users and cellular users. The scheme guarantees the quality of service for cellular users, but not for D2D users. [7] proposes a D2D-only resource management scheme based on network-assisted control in a heterogeneous network, which can effectively improve the performance of the system. [8] proposes an interference-aware resource allocation algorithm based on knapsack theory. Although the objective of this algorithm is to minimize the system interference while maintaining the total system rate, in many cases it cannot provide a feasible allocation scheme. [9] proposes a game theory-based joint clustering strategy and power control scheme to minimize transmission time, but does not allocate channel resources. [10] proposes a resource allocation algorithm based on valuation theory; although it partly improves the system capacity, it fails to achieve good results in poor system environments. [11] and [12] propose power control and resource allocation schemes based on non-cooperative game theory, in which the cellular users and D2D users each independently decide their own power to maximize their respective energy efficiency. [13] first guarantees the service quality of the cellular users by limiting the transmit power of D2D users, and then uses a Relax-based algorithm to solve the resource allocation problem for the D2D users.
[14] proposed a resource allocation scheme based on the Gale-Shapley algorithm, which suffers from the following deficiencies: 1) Only a system environment consisting of macro cellular users and D2D users was considered, ignoring the interference introduced by micro cellular users. 2) The scheme simply achieves a stable matching and does not maximize the system capacity. 3) The users' transmit power is fixed, which does not maximize the system performance.

To address the shortcomings of the scheme in [14], this paper proposes a resource allocation scheme with joint power control for a heterogeneous cellular network environment. The scheme first derives, from the system interference model and constraints, the optimal transmit power for each D2D user and micro cellular user that multiplexes the channel resources of a macro cellular user. Next, the Gale-Shapley algorithm is used to obtain a stable matching. Finally, the allocation is further optimized using the exchange search algorithm. Simulation results show that the total system capacity approximates the optimal solution while ensuring the users' quality of service.

2 System Model and Problem Statement

2.1 System Model

A heterogeneous cellular network model consisting of one macro base station and multiple micro base stations is shown in Fig. 1. The micro cellular base stations follow a homogeneous Poisson distribution with density $\lambda_s$. Three kinds of users are randomly distributed in the system: $H = \{1, 2, \ldots, D\}$, $M = \{1, 2, \ldots, C\}$ and $W = \{1, 2, \ldots, J\}$, which denote the sets of D2D users, macro cellular users and micro cellular users, respectively. In this paper, we study the allocation of uplink channel resources for D2D and micro cellular users. The total number of channel resources is N, and the base station is assumed to have all the channel state information.

Fig. 1. System model

The binary variables $X_c^n$, $X_j^n$ and $X_d^n$ equal 1 when channel n is assigned to macro cellular user c, micro cellular user j and D2D user d, respectively, and equal 0 otherwise. It is assumed that the channel occupied by a macro cellular user can be multiplexed by only one micro cellular user or one D2D user, and that micro cellular users and D2D users do not share the same channel resources. The SINR of D2D user d when multiplexing channel n is:

$$\mathrm{SINR}_d^n = \frac{P^d G^{d_t,d_r}}{X_c^n P^c G^{c,d_r} + N_0} \tag{1}$$

Here $P^d$ is the transmit power of D2D user d, $P^c$ is the transmit power of macro cellular user c, $G^{d_t,d_r}$ is the channel gain between the transmitter $d_t$ and the receiver $d_r$ of D2D user d, $G^{c,d_r}$ is the channel gain between macro cellular user c and the D2D receiver $d_r$, and $N_0$ is the noise power. The SINR of micro cellular user j is:

$$\mathrm{SINR}_j^n = \frac{P^j G^{j,b_j}}{X_c^n P^c G^{c,b_j} + N_0} \tag{2}$$

where $b_j$ is the micro cellular base station accessed by micro cellular user j, $P^j$ is the transmit power of micro cellular user j, $G^{j,b_j}$ is the channel gain between micro cellular user j and micro cellular base station $b_j$, and $G^{c,b_j}$ is the channel gain between macro cellular user c and micro cellular base station $b_j$. The SINR of macro cellular user c on channel n is:

$$\mathrm{SINR}_c^n = \frac{P^c G^{c,B}}{X_j^n P^j G^{j,B} + X_d^n P^d G^{d,B} + N_0} \tag{3}$$


where $P^c$ is the transmit power of macro cellular user c, $G^{c,B}$ is the channel gain between macro cellular user c and the macro cellular base station B, $G^{d,B}$ is the channel gain between D2D user d and B, and $G^{j,B}$ is the channel gain between micro cellular user j and B. Because micro cellular users and D2D users do not share the same channel, $X_d^n$ and $X_j^n$ cannot both equal 1.

2.2 Problem Statement

We assume that the N channel resources are fully occupied by the macro cellular users, and that the macro cellular users transmit at the maximum transmit power $P_M$, that is, $P^c = P_M$. On the premise of guaranteeing the communication quality of macro cellular user c, the goal is to maximize the total system capacity by selecting the optimal transmit power for the D2D users and micro cellular users and allocating channel resources to them. The objective function and constraints of the optimization problem, obtained from the Shannon formula, are given in (4)–(11), where lb denotes the base-2 logarithm:

$$\max_{\substack{X_c^n, X_j^n, X_d^n \\ (c \in M,\ j \in W,\ d \in H)}} T(X_j^n, X_d^n, X_c^n) = \max_{\substack{X_c^n, X_j^n, X_d^n \\ (c \in M,\ j \in W,\ d \in H)}} \sum_{n=1}^{N} \left[ X_c^n\,\mathrm{lb}(1 + \mathrm{SINR}_c^n) + X_j^n\,\mathrm{lb}(1 + \mathrm{SINR}_j^n) + X_d^n\,\mathrm{lb}(1 + \mathrm{SINR}_d^n) \right] \tag{4}$$

$$\mathrm{SINR}_c^n \ge \mathrm{SINR}_{c,\mathrm{threshold}}^n, \quad \forall c \in M \tag{5}$$

$$\mathrm{SINR}_j^n \ge \mathrm{SINR}_{j,\mathrm{threshold}}^n, \quad \forall j \in W \tag{6}$$

$$\mathrm{SINR}_d^n \ge \mathrm{SINR}_{d,\mathrm{threshold}}^n, \quad \forall d \in H \tag{7}$$

$$\sum_{n=1}^{N} X_j^n \le 1, \quad \sum_{j=1}^{J} X_j^n \le 1, \quad \forall j \in W \tag{8}$$

$$\sum_{n=1}^{N} X_d^n \le 1, \quad \sum_{d=1}^{D} X_d^n \le 1, \quad \forall d \in H \tag{9}$$

$$X_d^n X_j^n = 0 \tag{10}$$

$$0 \le P^d \le P_M, \quad 0 \le P^j \le P_M \tag{11}$$

Here, Eqs. (5), (6) and (7) ensure that the SINRs of macro cellular users, micro cellular users and D2D users are no less than their threshold values. Equations (8) and (9) ensure that each micro cellular user or D2D user can reuse the channel resource of only one macro cellular user, and that each channel resource can be reused by only one D2D user or micro cellular user. Equation (10) ensures that micro cellular users and D2D users cannot simultaneously reuse the channel resources of the same macro cellular user, and (11) indicates that the transmit power of micro cellular and D2D users cannot exceed the maximum transmit power. The resource allocation problem defined by (4)–(11) is a mixed-integer nonlinear programming problem, which can be solved optimally by exhaustive (ergodic) search. However, exhaustive search is too complex, so a low-complexity sub-optimal algorithm that approximates the optimal solution is required.
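The SINR expressions and the per-channel term of the objective can be made concrete with a small numerical sketch. The function names, argument names and the numeric gains below are illustrative assumptions and do not come from the paper; the formulas follow Eqs. (1)–(3), with Eq. (2) read as the micro cellular counterpart of Eq. (1).

```python
from math import log2


def sinr_d2d(p_d, g_dt_dr, p_c, g_c_dr, n0, x_c=1):
    """Eq. (1): SINR of a D2D pair reusing a channel occupied by macro user c."""
    return p_d * g_dt_dr / (x_c * p_c * g_c_dr + n0)


def sinr_micro(p_j, g_j_bj, p_c, g_c_bj, n0, x_c=1):
    """Eq. (2): SINR of micro cellular user j at its base station b_j."""
    return p_j * g_j_bj / (x_c * p_c * g_c_bj + n0)


def sinr_macro(p_c, g_c_B, n0, p_j=0.0, g_j_B=0.0, x_j=0,
               p_d=0.0, g_d_B=0.0, x_d=0):
    """Eq. (3): SINR of macro user c; at most one of x_j, x_d is 1 (Eq. (10))."""
    return p_c * g_c_B / (x_j * p_j * g_j_B + x_d * p_d * g_d_B + n0)


def channel_rate(sinr_c, sinr_reuser):
    """One summand of objective (4): macro user plus the single reusing user."""
    return log2(1 + sinr_c) + log2(1 + sinr_reuser)


def feasible(sinr_c, sinr_reuser, th_c, th_reuser):
    """Threshold constraints (5)-(7) for one channel and one reusing user."""
    return sinr_c >= th_c and sinr_reuser >= th_reuser


# Hypothetical gains and powers, chosen only to exercise the formulas.
s_d = sinr_d2d(p_d=0.1, g_dt_dr=1e-6, p_c=0.2, g_c_dr=1e-9, n0=1e-10)
s_c = sinr_macro(p_c=0.2, g_c_B=1e-8, n0=1e-10, p_d=0.1, g_d_B=1e-9, x_d=1)
rate = channel_rate(s_c, s_d)
```

Evaluating `feasible` before adding `rate` to the total mirrors how constraints (5)–(7) restrict which user and channel pairs may contribute to (4).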

3 Resource Allocation Algorithm for Joint Power Control

3.1 Power Control

From Eqs. (1), (3), (5), (7) and (11), the range of the transmit power of D2D user d when multiplexing channel n is:

$$P_{\min}^d \le P^d \le P_{\max}^d \tag{12}$$

where

$$P_{\min}^d = \max\left\{ \frac{\mathrm{SINR}_{d,\mathrm{threshold}}^n P_M G^{c,d_r} + N_0\,\mathrm{SINR}_{d,\mathrm{threshold}}^n}{G^{d_t,d_r}},\ 0 \right\} \tag{13}$$

$$P_{\max}^d = \min\left\{ \frac{P_M G^{c,B} - N_0\,\mathrm{SINR}_{c,\mathrm{threshold}}^n}{G^{d,B}\,\mathrm{SINR}_{c,\mathrm{threshold}}^n},\ P_M \right\} \tag{14}$$

The capacity of channel n is:

$$C_n(P^d) = W\,\mathrm{lb}(1 + \mathrm{SINR}_d^n) + W\,\mathrm{lb}(1 + \mathrm{SINR}_c^n) = W\,\mathrm{lb}(1 + \mathrm{SINR}_d^n + \mathrm{SINR}_c^n + \mathrm{SINR}_c^n\,\mathrm{SINR}_d^n) \tag{15}$$

From Eq. (15), to maximize the capacity of channel n, we need to obtain the optimal transmit power $P^d$ of D2D user d on channel n through Eq. (16):

$$\max_{P^d}\{C1_n(P^d)\} = \max_{P^d}\{\mathrm{SINR}_c^n + \mathrm{SINR}_d^n + \mathrm{SINR}_c^n\,\mathrm{SINR}_d^n\} \tag{16}$$

Setting the derivative of $C1_n(P^d)$ with respect to $P^d$ equal to 0 yields a quadratic equation in $P^d$. The discriminant of this equation is:

$$\Delta = 4 G^{d_t,d_r} G^{B,c} P_M \left( G^{B,c} N_0 + P_M G^{B,c} G^{c,d_r} - G^{d_t,d_r} N_0 \right) \tag{17}$$

When $\Delta > 0$, we obtain two roots $P_1^d$ and $P_2^d$. The transmit power for D2D user d is:

$$p^d = \begin{cases} P_{\max}^d, & \Delta \ge 0,\ P_2^d \le P_{\min}^d \\ \arg\max_{P^d \in \{P_{\min}^d,\, P_{\max}^d\}} C1_n(P^d), & \Delta \ge 0,\ P_{\min}^d \le P_2^d \le P_{\max}^d \\ P_{\min}^d, & \Delta \ge 0,\ P_2^d \ge P_{\max}^d \\ P_{\max}^d, & \Delta < 0 \end{cases} \tag{18}$$


Similarly, the transmit power interval $[P_{\min}^j, P_{\max}^j]$ is obtained when micro cellular user j multiplexes channel n. The optimal transmit power $p^j$ is:

$$p^j = \begin{cases} P_{\max}^j, & \Delta \ge 0,\ P_2^j \le P_{\min}^j \\ \arg\max_{P^j \in \{P_{\min}^j,\, P_{\max}^j\}} C1_n(P^j), & \Delta \ge 0,\ P_{\min}^j \le P_2^j \le P_{\max}^j \\ P_{\min}^j, & \Delta \ge 0,\ P_2^j \ge P_{\max}^j \\ P_{\max}^j, & \Delta < 0 \end{cases} \tag{19}$$
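The power control step can be sketched numerically. In every case of the piecewise rule the selected power is one of the two interval endpoints, so the sketch below simply evaluates the channel capacity at both endpoints and keeps the better one instead of evaluating the discriminant. That simplification, the function and parameter names, the use of the macro user's threshold in the upper bound (which is what constraint (5) combined with Eq. (3) implies), and the numeric values are all assumptions made for illustration.

```python
from math import log2


def d2d_power_interval(th_d, th_c, p_M, g_dt_dr, g_c_dr, g_c_B, g_d_B, n0):
    """Feasible transmit power interval of a D2D user, as in Eqs. (12)-(14)."""
    p_min = max(th_d * (p_M * g_c_dr + n0) / g_dt_dr, 0.0)
    p_max = min((p_M * g_c_B - n0 * th_c) / (g_d_B * th_c), p_M)
    return p_min, p_max


def channel_capacity(p_d, p_M, g_dt_dr, g_c_dr, g_c_B, g_d_B, n0, W=1.0):
    """Capacity of a channel shared by a macro user and one D2D user, Eq. (15)."""
    sinr_d = p_d * g_dt_dr / (p_M * g_c_dr + n0)
    sinr_c = p_M * g_c_B / (p_d * g_d_B + n0)
    return W * log2(1 + sinr_d) + W * log2(1 + sinr_c)


def best_endpoint_power(th_d, th_c, p_M, g_dt_dr, g_c_dr, g_c_B, g_d_B, n0):
    """Return the interval endpoint with the larger capacity, or None if empty."""
    p_min, p_max = d2d_power_interval(
        th_d, th_c, p_M, g_dt_dr, g_c_dr, g_c_B, g_d_B, n0)
    if p_min > p_max:
        return None  # the SINR thresholds cannot both be met on this channel
    gains = (p_M, g_dt_dr, g_c_dr, g_c_B, g_d_B, n0)
    return max((p_min, p_max), key=lambda p: channel_capacity(p, *gains))


# Hypothetical link budget, chosen only to exercise the functions.
params = dict(th_c=2.0, p_M=0.2, g_dt_dr=1e-6, g_c_dr=1e-9,
              g_c_B=1e-8, g_d_B=1e-9, n0=1e-10)
interval = d2d_power_interval(th_d=2.0, **params)
chosen = best_endpoint_power(th_d=2.0, **params)
blocked = best_endpoint_power(th_d=1000.0, **params)
```

A pair for which `best_endpoint_power` returns `None` would be treated as unable to reuse that channel. The same functions apply to a micro cellular user once the D2D gains are replaced by the corresponding micro cellular gains.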

3.2 Resource Allocation Algorithm Based on Gale-Shapley

Using the optimal transmit power of each D2D user and micro cellular user on each channel obtained above, the channel capacity achieved by each user on each channel is computed with Shannon's formula. Consider the D2D users and the micro cellular users as Player A, and the channels as Player B. The matching problem between Player A and Player B can then be solved with the Gale-Shapley algorithm to optimize the allocation of channel resources. Each D2D user and micro cellular user establishes Prelist, its preference list over channels, based on the capacity it obtains on each channel. If m D2D users and n micro cellular users access the system, the indices [1, m] represent the D2D users and [m + 1, m + n] represent the micro cellular users. Each channel establishes Chlist, its preference list over users, based on the total channel capacity obtained when each user multiplexes it. We define the following parameters:

• Ass(k) is the set of D2D users or micro cellular users that have already been matched to channel k.
• Pre = {δ1, δ2, ..., δD, δD+1, ..., δD+K} is the list of channels that the current D D2D pairs and K micro cellular users most want to multiplex.
• nomatch is the set of users without a match.
• Channeltotal is the total number of channels.

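Table 1. Soft coalition algorithm.

| Number | User                  | Preference level 1 | 2 | 3 | 4 | 5 | 6 |
|--------|-----------------------|--------------------|---|---|---|---|---|
| 1      | D2D User 1            | 5                  | 3 | 4 | 1 | 6 | φ |
| 2      | D2D User 2            | 1                  | 2 | 4 | 3 | 6 | 5 |
| 3      | Micro Cellular User 1 | 2                  | 3 | 5 | 1 | φ | φ |
| 4      | Micro Cellular User 2 | 5                  | 4 | 6 | 3 | 1 | 2 |
| 5      | Micro Cellular User 3 | 4                  | 2 | 1 | 3 | 5 | 6 |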
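A preference list like the rows of Table 1 can be derived from per-channel capacities. The sketch below assumes that φ marks a position with no remaining acceptable channel, for example because the SINR thresholds cannot be met there; that reading, the helper name `build_prelist` and the capacity values are hypothetical.

```python
def build_prelist(capacity_by_channel, feasible_channels, n_channels):
    """Rank feasible channels by decreasing capacity; pad with None (phi)."""
    ranked = sorted(
        (ch for ch in capacity_by_channel if ch in feasible_channels),
        key=lambda ch: capacity_by_channel[ch],
        reverse=True,
    )
    return ranked + [None] * (n_channels - len(ranked))


# Made-up capacities of one user on channels 1..6; channel 2 is infeasible.
capacities = {1: 4.0, 2: 9.0, 3: 7.5, 4: 6.1, 5: 8.2, 6: 3.3}
prelist = build_prelist(capacities, feasible_channels={1, 3, 4, 5, 6}, n_channels=6)
```

With these made-up numbers the result has the same shape as the first row of Table 1: five ranked channels followed by one empty position.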


The following is the resource allocation algorithm based on Gale-Shapley.

First, the system parameters are initialized, and the matching between the channels and the D2D users and micro cellular users is initialized as empty. Each accessing D2D user or micro cellular user requests the channel resource it most wants according to its preference list. If this channel resource is already multiplexed by another user, the channel keeps the more suitable of the two users according to its own preference list and rejects the other. The rejected user updates its preference list, and in the next loop it sends a communication request to the highest-priority channel remaining in its list. Finally, some D2D users and micro cellular users are unable to find suitable channel resources because of the SINR threshold. We treat such a user as having found an appropriate resource and remove it from the set nomatch.
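The steps above can be sketched as a user-proposing Gale-Shapley loop in which every channel holds at most one reusing user. This is a minimal sketch under stated assumptions, not the paper's listing: `prelist` and `chlist` mirror Prelist and Chlist, the function name is hypothetical, and users that exhaust their list are returned separately instead of being removed from nomatch.

```python
from collections import deque


def gale_shapley(prelist, chlist):
    """Match users to channels; returns (channel -> user, unmatched users).

    prelist: user -> channels, most preferred first (feasible channels only).
    chlist:  channel -> users, most preferred first.
    """
    rank = {ch: {u: r for r, u in enumerate(users)}
            for ch, users in chlist.items()}
    next_choice = {u: 0 for u in prelist}   # index of the next channel to ask
    ass = {}                                # current channel -> user matching
    free = deque(prelist)                   # users still looking for a channel
    unmatched = set()

    while free:
        u = free.popleft()
        if next_choice[u] >= len(prelist[u]):
            unmatched.add(u)                # every listed channel rejected u
            continue
        ch = prelist[u][next_choice[u]]
        next_choice[u] += 1
        ch_rank = rank.get(ch, {})
        if u not in ch_rank:
            free.append(u)                  # channel does not accept this user
            continue
        current = ass.get(ch)
        if current is None:
            ass[ch] = u                     # channel was free
        elif ch_rank[u] < ch_rank[current]:
            ass[ch] = u                     # channel prefers the newcomer
            free.append(current)
        else:
            free.append(u)                  # newcomer is rejected
    return ass, unmatched


# Toy instance with three users competing for two channels.
prelist = {"A": [1, 2], "B": [1, 2], "C": [1]}
chlist = {1: ["B", "A", "C"], 2: ["A", "B", "C"]}
matching, leftover = gale_shapley(prelist, chlist)
```

Each user asks each channel at most once, so the loop ends after at most as many requests as there are entries in all preference lists.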


3.3 Search and Exchange Algorithm

From the Gale-Shapley based resource allocation algorithm above, we obtain a stable matching between users and channels from the set Ass. However, this matching does not maximize the total system capacity. Therefore, starting from this matching result, we further improve the total capacity of the system with the following Search and Exchange Algorithm.
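Since the algorithm listing itself is not reproduced in the text, the following is only one plausible reading of a search-and-exchange step: a pairwise swap local search that starts from the stable matching and exchanges the channels of two users whenever the exchange is feasible and increases the summed capacity. The function name, the `capacity` table keyed by (user, channel) and the toy values are hypothetical.

```python
from itertools import combinations


def search_and_exchange(ass, capacity):
    """Improve a channel -> user matching by capacity-increasing pairwise swaps.

    capacity maps (user, channel) to the channel capacity under the user's
    optimal power; pairs missing from the table are treated as infeasible.
    """
    ass = dict(ass)                     # leave the caller's matching untouched
    improved = True
    while improved:
        improved = False
        for ch1, ch2 in combinations(sorted(ass), 2):
            u1, u2 = ass[ch1], ass[ch2]
            if (u1, ch2) not in capacity or (u2, ch1) not in capacity:
                continue                # the swap would violate a threshold
            current = capacity[(u1, ch1)] + capacity[(u2, ch2)]
            swapped = capacity[(u1, ch2)] + capacity[(u2, ch1)]
            if swapped > current:
                ass[ch1], ass[ch2] = u2, u1
                improved = True
    return ass


# Toy capacities: swapping the two users raises the total from 9 to 13.
start = {1: "B", 2: "A"}
capacity = {("B", 1): 5.0, ("A", 2): 4.0, ("B", 2): 7.0, ("A", 1): 6.0}
result = search_and_exchange(start, capacity)
```

Every accepted swap strictly increases the total capacity and the number of matchings is finite, so the loop terminates. It stops at a local optimum, which need not coincide with the exhaustive-search solution.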

4 Simulation and Results

4.1 Simulation Setup

To verify the effectiveness of the proposed scheme, we compare the performance of the five schemes in Table 2 in an LTE-FDD heterogeneous cellular network system. The specific parameter settings are shown in Table 3 [15].

Table 2. Resource allocation schemes.

Table 3. Simulation parameters.

4.2 Results

Figure 2 shows how the total capacity of the system changes with the number of access users when the number of macro cellular users is fixed. As the figure shows, the total capacity of the system keeps increasing as the number of access users increases. Scheme 4 is the scheme proposed in this paper. Because it applies power control and combines Algorithms 1 and 2, the total system capacity it obtains approximately reaches that of Scheme 5. The optimal exhaustive search algorithm needs to go through all the assignments and therefore requires too much computation, whereas the random resource allocation algorithm, which has no objective function to optimize, has low computational complexity but poor performance.

Fig. 2. System capacity comparison of the 5 schemes

Figure 3 shows the relationship between the total system capacity and the macro cellular user SINR threshold. As the SINR threshold increases, the total system capacity decreases steadily, because a higher SINR threshold means that some macro cellular users cannot communicate properly. As can be seen from Eqs. (1), (2) and (3), a higher SINR threshold also prevents some D2D and micro cellular users from communicating normally, since they must avoid causing excessive interference to the macro cellular users, so the total system capacity is reduced further.


Fig. 3. SINR threshold vs. the system capacity

Figure 4 shows the relationship between the system energy efficiency and the number of access users when the number of macro cellular users is fixed. As the figure shows, as the number of access users increases, the resource reuse scheme improves the system throughput, but the total power consumed by the users also increases. Therefore, the energy efficiency decreases under Schemes 1 and 2. In Schemes 3, 4 and 5, the energy efficiency is improved by optimizing the transmit power of the access users. Scheme 5 achieves the optimal system energy efficiency, but it must go through all cases and requires too much computation.

Fig. 4. System energy efficiency comparison of the 5 schemes


5 Conclusion

This paper proposes a resource allocation scheme with joint power control that allows D2D users and micro cellular users to multiplex the uplink channel resources of macro cellular users in a heterogeneous network. The scheme first dynamically adjusts the transmit power of D2D users and micro cellular users according to the system interference model and constraints. Then the Gale-Shapley based resource allocation algorithm and the Search and Exchange algorithm are used to optimize the channel allocation. Experimental results show that the proposed scheme achieves a nearly optimal total system capacity and effectively improves the system energy efficiency. The next step is to construct a joint objective function for the power control and the total capacity of the system and to study the optimal solution to this objective function, in order to further improve the energy efficiency.

References

1. Adnan, M.H., Ahmad Zukarnain, Z.: Device-to-device communication in 5G environment: issues, solutions, and challenges. Symmetry 12(11), 1762 (2020)
2. Waqas, M., Niu, Y., Li, Y., Ahmed, M., Jin, D., Chen, S., et al.: A comprehensive survey on mobility-aware D2D communications: principles, practice and challenges. IEEE Commun. Surv. Tutorials 22(3), 1863–1886 (2019)
3. Bennis, M., Debbah, M., Poor, H.V.: Ultrareliable and low-latency wireless communication: tail, risk, and scale. Proc. IEEE 106(10), 1834–1853 (2018)
4. Della Penda, D., Abrardo, A., Moretti, M., Johansson, M.: Distributed channel allocation for D2D-enabled 5G networks using potential games. IEEE Access 7, 11195–11208 (2019)
5. Sawyer, N., Smith, D.B.: Flexible resource allocation in device-to-device communications using Stackelberg game theory. IEEE Trans. Commun. 67(1), 653–667 (2018)
6. Tehrani, M.N., Uysal, M., Yanikomeroglu, H.: Device-to-device communication in 5G cellular networks: challenges, solutions, and future directions. IEEE Commun. Mag. 52(5), 86–92 (2014)
7. Tsai, A.-H., Wang, L.-C., Huang, J.-H., Lin, T.-M.: Intelligent resource management for device-to-device (D2D) communications in heterogeneous networks. In: The 15th International Symposium on Wireless Personal Multimedia Communications, pp. 75–79. IEEE (2012)
8. Hoang, T.D., Le, L.B., Le-Ngoc, T.: Energy-efficient resource allocation for D2D communications in cellular networks. IEEE Trans. Veh. Technol. 65(9), 6972–6986 (2015)
9. Karaoglu, B., Heinzelman, W.: Cooperative load balancing and dynamic channel allocation for cluster-based mobile ad hoc networks. IEEE Trans. Mob. Comput. 14(5), 951–963 (2014)
10. Zhou, Z., Dong, M., Ota, K., Wu, J., Sato, T.: Energy efficiency and spectral efficiency tradeoff in device-to-device (D2D) communications. IEEE Wirel. Commun. Lett. 3(5), 485–488 (2014)
11. Ali, M., Qaisar, S., Naeem, M., Mumtaz, S.: Energy efficient resource allocation in D2D-assisted heterogeneous networks with relays. IEEE Access 4, 4902–4911 (2016)
12. Mahmood, N.H., Lauridsen, M., Berardinelli, G., Catania, D., Mogensen, P.: Radio resource management techniques for eMBB and mMTC services in 5G dense small cell scenarios. In: 2016 IEEE 84th Vehicular Technology Conference (VTC-Fall), pp. 1–5. IEEE, September 2016
13. Islam, M.T., Taha, A.E.M., Akl, S., Abu-Elkheir, M.: A stable matching algorithm for resource allocation for underlaying device-to-device communications. In: 2016 IEEE International Conference on Communications (ICC), pp. 1–6. IEEE, May 2016
14. Chang, W., Jau, Y.T., Su, S.L., Lee, Y.: Gale-Shapley-algorithm based resource allocation scheme for device-to-device communications underlaying downlink cellular networks. In: 2016 IEEE Wireless Communications and Networking Conference, pp. 1–6. IEEE, April 2016
15. Chaudhury, P., Mohr, W., Onoe, S.: The 3GPP proposal for IMT-2000. IEEE Commun. Mag. 37(12), 72–81 (1999)

Prediction of Covid-19 Cases in Kerala Based on Meteorological Parameters Using BiLSTM Technique

Jerome Francis, Brinda Dasgupta, G. K. Abraham, and Mahuya Deb(B)

Department of Advanced Computing, St Joseph’s College (Autonomous), Bangalore, Karnataka, India

Abstract. The need for accurate and early prediction of Covid19 cases is increasing by the day. The coronavirus pandemic was declared a health emergency of international concern by the World Health Organization in 2020. India recorded a record high of more than 40,000 new cases on April 30, 2021. The current study examines the possible relationship between Covid19 transmission and the meteorological parameters humidity, temperature, rain, wind speed, and pressure. The weather data were collected from NASA archives. The daily numbers of confirmed Covid cases, deaths and recoveries were extracted from a public website (https://data.covid19india.org/). Aggregated daily data were combined for a specific state for approximately nine months from 01-01-2021 and converted into a multivariate time series data frame. We then tried to build a Bi-LSTM architecture that gives the highest accuracy in predicting the number of confirmed Covid cases for n days based on the given data.

Keywords: Bi-directional LSTM · Bi-LSTM · Input gate · Forget gate · Output gate

1 Introduction

In 2019, the coronavirus was first reported in Wuhan, China. The first cases went unnoticed because they began as common pneumonia of unknown cause. It did not take long before the outbreak became an international crisis. Currently, more than 250 million people have been infected and more than 5 million people have died. As a precautionary measure, lockdowns were imposed in many countries, resulting in the closure of public places and restrictions on other activities. Research shows that the coronavirus is mainly transmitted by respiratory droplets and person-to-person contact. The spread of the disease can be reduced if its rate of transmission is determined. This will help governments plan better public health policies to deal with the consequences of the pandemic [1]. Climatic and environmental factors such as pollution strongly influence the transmissibility of Covid19 [9]. They affect the spatial and temporal distribution of infectious diseases.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 338–347, 2022. https://doi.org/10.1007/978-3-031-11713-8_34


Research results show that in the early stages of the pandemic, the sharp increase in Covid19 cases was positively associated with temperature and negatively associated with humidity [11]. However, the findings are contradictory, because in some cases a negative linear relationship has also been found between temperature and aggregate daily cases of Covid19. Some studies show that Covid19 spreads faster in temperate and cold climates than in hot and tropical climates. Recent research indicates that urban areas with high levels of air pollution are more prone to serious infections and deaths due to the presence of extremely high levels of pollutants. There is thus a relationship between environmental factors and the pandemic, and it is important to understand how transmission is catalyzed by environmental conditions. We therefore undertake this study to analyze the role of environmental and climate variables in the spread of Covid19 [2].

2 Literature Review

A recent study analyzes and predicts future Covid19 cases based on meteorological parameters using long short-term memory (LSTM). The results indicate that daily Covid cases show a positive correlation with specific humidity and an inverse association with maximum temperature across different geographical locations in India. The data were used to build univariate and multivariate time series forecasting models based on the LSTM approach. The model was further tuned and used to predict COVID-19 cases from 1st July 2020 to 31st July 2020. According to the results, univariate LSTM performs comparatively better for short-term forecasts (1 day), whereas multivariate LSTM performs better for medium-range forecasts (2–7 days). Specific humidity plays a pivotal role in the estimates, primarily in western and northwestern areas of India. Temperature, on the other hand, greatly influences the forecasts in the southern and eastern areas of India [2].

Epidemiological models cannot account for the influence of climate variables on virus spread. Important relationships between seasonal aspects of weather, airborne virus transmission and pandemics in a given year are explored. It has been shown that seasonal weather can cause two outbreaks in a year, as in the case of the worldwide COVID19 pandemic. The results show that two pandemics per year are inevitable. Outbreaks are independent of the seasons and appear to be closely related to changes in relative humidity, wind speed and temperature [4].

India was massively hit by the Covid-19 pandemic in the year 2020. There was a sudden spike in the number of active cases. The most sudden change in this pattern was observed during mid-April.
Several meteorological parameters such as daily maximum temperature, daily minimum temperature, daily mean temperature, dew point temperature, wind speed and relative humidity were recorded from March 01, 2020 to June 04, 2020 over 9 majorly affected cities. The data were analyzed to better understand the peak of Covid19 infections on a given day and on days 7, 10, 12, 14 and 16 after cases were detected. Spearman’s correlation showed a significantly weaker association with wind speed, daily maximum temperature, daily minimum temperature and daily mean temperature, but a stronger association with a lag of 14 days. In addition, using support vector regression, the peak in confirmed cases was successfully estimated with a delay of 12–16 days. The analysis supports the concept of an incubation period of 14 ± 02 days in India. The literature indicates that the transmission dynamics of the coronavirus do not correlate well (linearly) with any single weather factor; rather, transmission is affected by a certain weather regime. Therefore, a multivariable nonlinear approach should be used [5].

The transmission of Covid-19 increases manifold with higher population density. The threat is not restricted to public health; the economy is also affected immensely. Twenty densely populated cities whose infection count exceeded 500 as of 15 May 2020 were considered as the epicenters. The paper states that the daily number of Covid19 cases has a strong covariance with temperature, which is responsible for about 65–85% of the explained variance. The study further elaborates on the combined temperature and humidity profiles that can support the rapid growth of COVID-19 cases in the early stages. These results are significant for estimating future Covid19 peaks and thus for modeling cities according to environmental conditions. Another key parameter taken into account is the alarming level of CO2 emissions. High levels of CO2 emissions cause extreme changes in climatic conditions, to which zoonotic viruses are very susceptible. The epicenters of Covid19 have been observed to be located directly above CO2 emission hotspots. Therefore, we can conclude that extreme weather conditions can have a notable influence on the transmission dynamics of the Covid19 pandemic. Strong measures to reduce greenhouse gas emissions are of primary importance in preventing such future pandemics [6].

The role played by temperature, relative humidity and absolute humidity in the transmission of COVID-19 is yet to be figured out.
Data analysis was performed on the daily averaged meteorological data for the last three years (2017–2019) recorded in the months of March, April, and May. A similar timeline was considered for the year 2020. The research findings indicated a positive association between daily COVID-19 cases and temperature, and a mixed association with relative and absolute humidity over the Indian region. The relation of aerosols (AOD) and other pollutants (NO2) with COVID-19 cases during lockdown was also studied. There was a significant decline in aerosols (AOD) and NO2, with maximum declines of about 60% and 45%, respectively. A notable decrease in surface PM2.5, PM10 and NO2 was also observed in six mega cities during the lockdown period. The results suggest that Covid-19 has higher chances of transmission in warm, humid regions or during summer/monsoon months. The study further states the need for an effective public health policy to lower the local transmission of the virus [7].

Another study examined the correlation between meteorological parameters and the COVID19 pandemic in Mumbai. Sample data were collected during the months of April to July 2020. The methods used to assess the association of COVID19 with meteorological parameters are the Spearman rank correlation test, the two-tailed p-test and artificial neural network (ANN) techniques. The results of the study indicate a significant correlation between COVID19 cases and a number of meteorological parameters such as temperature, relative humidity and surface pressure [8]. The parameters that were found to be significantly correlated were then used to model and predict COVID19 infection using artificial neural network techniques [8].


3 Methodology

3.1 Data for Model Development

The climate data, containing daily aggregates of temperature, precipitation, pressure, wind speed and humidity for all the districts of Kerala, were collected from a NASA repository (https://power.larc.nasa.gov/data-access-viewer/). Large volumes of data are required for model construction in neural network applications. In this research, the data used for model development included the number of confirmed Covid19 positive cases and weather parameter measurements such as temperature, humidity, wind speed, pressure and rain. The Covid19 data collected include the daily numbers of confirmed cases, deaths and recoveries (https://data.covid19india.org/). The effect of lockdown differs across the states of India. Hence, to reduce the bias in our data and to build an accurate model, we decided to focus only on the state of Kerala, which had the highest number of active Covid19 cases in India when this paper was being written. For each district of Kerala, we collected the weather data from the NASA website. We then tried to predict the daily Covid19 positive cases in each district based on the past confirmed Covid cases of that district along with its weather data.

3.2 Modeling Framework

In recent years, the superior performance of unidirectional LSTM compared to state-of-the-art recurrent neural networks (RNNs) has attracted a lot of attention. Unlike normal RNNs, LSTMs can learn patterns with long dependencies. As a result, LSTM has been shown to perform better than RNN in forecasting time series data. An extension of the LSTM model, known as the bidirectional LSTM (BiLSTM), has evolved from the traditional algorithm to extract more from the available data. This model learns from the input data twice, once in the forward direction and once in the reverse direction. The architectures of these models are presented in Fig. 1.
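The conversion of daily records into supervised samples for a recurrent model, as described in Sect. 3.1, can be sketched with a sliding window. The helper name `make_windows`, the column order and the window lengths are assumptions for illustration; the paper does not state the lookback it used.

```python
def make_windows(rows, target_index, lookback, horizon=1):
    """Split a multivariate daily series into (input window, future target) pairs.

    rows: one list of features per day, e.g. [cases, temperature, humidity].
    target_index: position of the value to forecast (confirmed cases here).
    """
    inputs, targets = [], []
    for start in range(len(rows) - lookback - horizon + 1):
        window = rows[start:start + lookback]
        future = rows[start + lookback:start + lookback + horizon]
        inputs.append(window)
        targets.append([day[target_index] for day in future])
    return inputs, targets


# Six synthetic days with two features each: [cases, temperature].
series = [[day, 10 * day] for day in range(6)]
X, y = make_windows(series, target_index=0, lookback=3, horizon=1)
```

Each element of `X` is a `lookback`-day block of all features and the matching element of `y` holds the confirmed cases for the following `horizon` days, which is the shape a sequence model expects for n-day forecasting.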

Fig. 1. Uni-LSTM/BiLSTM architecture

J. Francis et al.

3.3 LSTM – Long Short Term Memory
LSTM is an extension of the RNN. RNNs are networks with loops in them, allowing information to persist. The LSTM contains a forget gate that can be used to train individual neurons on what is important and how long it remains important. An ordinary LSTM unit consists of a block input $z_t$, an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and a memory cell $c_t$.

The forget gate $f_t$ removes information that is no longer useful in the cell state using Eq. (1). The input at a given time $x_t$ and the previous cell output $h_{t-1}$ are fed to the gate and multiplied by weight matrices, followed by the addition of the bias. The result is passed through a sigmoid function that returns a number between 0 and 1. If the output is 0, the information is forgotten for a given cell state; if the output is 1, the information is retained for future use.

Adding useful information to the cell state is performed by the input gate $i_t$ using Eq. (2). First, the information is controlled by the sigmoid function, which filters the values to be stored, similar to the forget gate. Then, a vector of new candidate values is generated from $h_{t-1}$ and $x_t$ by the block input $z_t$ using Eq. (4), which outputs values from −1 to +1. The vector values and the controlled values are multiplied to obtain useful information using Eq. (5). The output gate $o_t$ decides which information in the cell is used to calculate the output of the LSTM unit using Eqs. (3) and (6).

In the model we built, a sigmoid function is used in the input gate, with a dropout of 0.2 for regularization. All layers other than the output layer have 50 units. The ADAM optimizer is used for compiling the LSTM. The loss is calculated using the mean squared error.

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \tag{1}$$

$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i) \tag{2}$$

$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o) \tag{3}$$

$$z_t = \sigma_c(W_z x_t + U_z h_{t-1} + b_z) \tag{4}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot z_t \tag{5}$$

$$h_t = o_t \odot \sigma_c(c_t) \tag{6}$$

Here $\sigma_g$ is the sigmoid function, $\sigma_c$ is tanh, $\odot$ is the element-wise product, $f_t$ is the forget gate, $i_t$ is the input gate, $o_t$ is the output gate, $z_t$ is the block input, $c_t$ is the cell state, and $h_t$ is the hidden state.
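The gate computations described in this section can be sketched for a single time step as follows. This is a minimal illustration in plain Python, not the authors' implementation: the cell-state update is assumed to be the standard one (old state scaled by the forget gate plus candidates scaled by the input gate), and `zero_params` is a hypothetical helper used only to make the example easy to check by hand.

```python
import math


def sigmoid(v):
    """Logistic sigmoid, used as the gate activation."""
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Compute W x + U h + b with plain lists (W, U: matrices; x, h, b: vectors)."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + b_k
        for w_row, u_row, b_k in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of an LSTM unit.

    params maps each of "f", "i", "o", "z" to a (W, U, b) triple.
    """
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(v) for v in affine(W, x_t, U, h_prev, b)]

    f_t = gate("f", sigmoid)    # forget gate
    i_t = gate("i", sigmoid)    # input gate
    o_t = gate("o", sigmoid)    # output gate
    z_t = gate("z", math.tanh)  # block input: candidate values in [-1, 1]
    # Assumed standard update: keep part of the old state, add gated candidates.
    c_t = [f * c + i * z for f, c, i, z in zip(f_t, c_prev, i_t, z_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_params(n_in, n_hidden):
    """Hypothetical helper: all-zero weights, so each gate is 0.5 and z_t is 0."""
    def triple():
        return ([[0.0] * n_in for _ in range(n_hidden)],
                [[0.0] * n_hidden for _ in range(n_hidden)],
                [0.0] * n_hidden)
    return {name: triple() for name in ("f", "i", "o", "z")}


h_t, c_t = lstm_step([3.0], [0.0, 0.0], [1.0, -1.0], zero_params(1, 2))
print(c_t)  # [0.5, -0.5]
```

With all-zero weights every gate evaluates to 0.5 and the candidate vector is zero, so the new cell state is simply half of the previous one, which makes the role of the forget gate visible in isolation.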

3.4 Bi-LSTM – Bidirectional Long Short Term Memory
Bidirectional long short term memory (Bi-LSTM) gives a neural network access to sequence information in both directions: backward (future to past) and forward (past to future). The input flows in two directions, which makes a Bi-LSTM different from the regular LSTM. With the regular LSTM, the input flows in one direction only, either backward or forward. In a Bi-LSTM, the input flows in both directions, so that both future and past information are preserved.

As in the LSTM model, a sigmoid function is used in the input gate, and a dropout of 0.2 is used for regularization. The ADAM optimizer is used for compiling the model. The loss is calculated using the mean squared error.

$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) \tag{7}$$

$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) \tag{8}$$

$$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) \tag{9}$$

$$\tilde{c}_t = \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c) \tag{10}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{11}$$

$$h_t = o_t \odot \tanh(c_t) \tag{12}$$

Here $f_t$ is the forget gate, $i_t$ is the input gate, $o_t$ is the output gate, $\tilde{c}_t$ is the candidate cell state, $c_t$ is the cell state, and $h_t$ is the hidden state (Fig. 2).
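The two-direction flow described in this section can be illustrated with a small wrapper that runs any recurrent step function over the sequence once forward and once reversed, then pairs the two states for each time step. This is a sketch under stated assumptions: the function names are hypothetical, and a running sum stands in for the LSTM step so that the result can be verified by hand.

```python
def run_direction(step, inputs, state):
    """Apply a recurrent step over inputs in order, collecting the state after each step."""
    outputs = []
    for x in inputs:
        state = step(x, state)
        outputs.append(state)
    return outputs


def bidirectional(step_fwd, step_bwd, inputs, init_fwd, init_bwd):
    """Pair forward (past to future) and backward (future to past) states per time step."""
    forward = run_direction(step_fwd, inputs, init_fwd)
    backward = run_direction(step_bwd, list(reversed(inputs)), init_bwd)
    backward.reverse()  # re-align backward states with the original time order
    return list(zip(forward, backward))


def running_sum(x, state):
    """Toy recurrent step standing in for an LSTM cell."""
    return state + x


states = bidirectional(running_sum, running_sum, [1, 2, 3], 0, 0)
print(states)  # [(1, 6), (3, 5), (6, 3)]
```

At each time step the first element summarises everything seen so far and the second summarises everything still to come, which is the sense in which a bidirectional layer preserves both past and future information.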

Fig. 2. Bi-LSTM activation function

For each district, the covid and weather data are first combined into a single data frame. The data was then divided into two parts: the first 80% of the sequence data is used to train the model, which is then tested on the final 20%. The LSTM networks are made up of four layers: a sequence input layer (with one feature), Uni-LSTM/Bi-LSTM layers (with 300 hidden units), a fully connected layer (with one response), and a regression layer. Table 1 shows the hyper-parameter settings for the model. Multiple hyper-parameters were examined to determine the best combination of values for accuracy. The tanh and sigmoid functions were used as the state and gate activation functions, respectively.

3.5 Evaluation of Bi-LSTM
The accuracy of the model is calculated using the mean absolute percentage error (MAPE). MAPE is the average absolute difference between the model's predicted output (K1) and the expected true output (K), relative to the true output and expressed as a percentage:

$$\mathrm{MAPE}(\%) = \frac{1}{n} \sum_{i=1}^{n} \frac{|K_i - K1_i|}{K_i} \times 100 \tag{13}$$

$$\mathrm{accuracy}(\%) = 100 - \mathrm{MAPE} \tag{14}$$
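The chronological 80/20 split and the evaluation in Eqs. (13) and (14) can be sketched as follows. The function names and the sample numbers are illustrative assumptions and do not come from the study's data; note also that Eq. (13) is undefined on days where the true count is zero, which a real pipeline would have to handle.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered sequence without shuffling: first part trains, the rest tests."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def mape(actual, predicted):
    """Mean absolute percentage error, Eq. (13). Requires non-zero actual values."""
    total = sum(abs(k - k1) / k for k, k1 in zip(actual, predicted))
    return 100.0 * total / len(actual)


def accuracy_from_mape(mape_percent):
    """Accuracy as defined in Eq. (14)."""
    return 100.0 - mape_percent


train, test = chronological_split(list(range(10)))
print(len(train), len(test))  # 8 2

# Made-up true and predicted daily counts, for illustration only.
error = mape([100, 200, 400], [110, 190, 400])
print(round(error, 2), round(accuracy_from_mape(error), 2))  # 5.0 95.0
```

Keeping the split chronological matters for forecasting: shuffling before splitting would let the model see days from the test period during training.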

Table 1. Model hyper-parameters for Bi-LSTM

| Hyper-parameter | Value |
|---|---|
| Gradient decay factor | 0.9 |
| Initial learning rate | 0.05 |
| Minimum batch size | 128 |
| Maximum epochs | 300 |
| Drop-rate | 0.2 |
| Training optimizer | Adaptive moment estimation optimizer |

4 Results
Bi-LSTM achieves high prediction accuracy up to 55 days (20% of the total data) into the future. Figs. 3 and 4 show how the Bi-LSTM model performed during training and testing. The predictions from the given data were accurate, with a MAPE of only 0.01%.

Fig. 3. Model performance on train data

Fig. 4. Model performance on test data

Table 2 shows the results obtained for the different LSTM architectures we tried. The hybrid architecture using three Bi-LSTM layers and one LSTM layer showed the highest accuracy, 99.99%. A three-layer Bi-LSTM model showed an accuracy of 99.98%. Since all the models were tuned using similar parameters, we conclude that the hybrid architecture works best with the given data.

Table 2. Accuracy and MAPE of the LSTM architectures evaluated

| Models | Accuracy % | MAPE % |
|---|---|---|
| 1 LSTM layer | 94.23 | 5.77 |
| 2 LSTM layers | 95.82 | 4.18 |
| 3 LSTM layers | 96.49 | 3.51 |
| 4 LSTM layers | 93.88 | 6.12 |
| 1 Bi-LSTM layer | 97.27 | 2.73 |
| 2 Bi-LSTM layers | 99.35 | 0.65 |
| 3 Bi-LSTM layers | 99.98 | 0.02 |
| 1 Bi-LSTM layer, 1 LSTM layer | 99.35 | 0.65 |
| 2 Bi-LSTM layers, 1 LSTM layer | 99.98 | 0.02 |
| 3 Bi-LSTM layers, 1 LSTM layer | 99.99 | 0.01 |
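As a quick consistency check, the reported figures can be verified against the definition accuracy = 100 − MAPE, and the best architecture selected by lowest MAPE. The values below are copied from Table 2; the dictionary layout and the shortened labels are only illustrative.

```python
# (accuracy %, MAPE %) per architecture, copied from Table 2.
results = {
    "1 LSTM": (94.23, 5.77),
    "2 LSTM": (95.82, 4.18),
    "3 LSTM": (96.49, 3.51),
    "4 LSTM": (93.88, 6.12),
    "1 Bi-LSTM": (97.27, 2.73),
    "2 Bi-LSTM": (99.35, 0.65),
    "3 Bi-LSTM": (99.98, 0.02),
    "1 Bi-LSTM + 1 LSTM": (99.35, 0.65),
    "2 Bi-LSTM + 1 LSTM": (99.98, 0.02),
    "3 Bi-LSTM + 1 LSTM": (99.99, 0.01),
}

# Every row should satisfy accuracy + MAPE = 100 (up to floating-point rounding).
consistent = all(abs(acc + err - 100.0) < 1e-6 for acc, err in results.values())

# The best model is the one with the smallest MAPE.
best = min(results, key=lambda name: results[name][1])
print(consistent, best)  # True 3 Bi-LSTM + 1 LSTM
```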

5 Conclusion
Bidirectional and unidirectional LSTM models were built to predict the number of daily confirmed covid cases up to 55 days into the future. The models were evaluated based on the MAPE score obtained during testing. A simple and thorough strategy was adopted to evaluate the performance of different architectures and modelling parameters. The results showed better performance for the bidirectional LSTM model than for the unidirectional LSTM model. The results also demonstrated the challenges of hyper-parameter tuning of the LSTM algorithms. The 4-layer hybrid BiLSTM model outperformed the other models, with an accuracy of 99.99% and the lowest MAPE. While it is acknowledged that more comprehensive testing is required on a much larger and cleaner dataset, this contribution demonstrates the potential of BiLSTM and hybrid LSTMs. Future directions in this research include unbiased collection of more accurate weather data from weather stations. The accuracy of the model can be improved by using more data for training and by including new features in the dataset.

References
1. Satrioa, C.B.A., Darmawan, W., Nadia, B.U., Hanafiah, N.: Time series analysis and forecasting of coronavirus disease in Indonesia using ARIMA and PROPHET. Procedia Comput. Sci. 179, 524–532 (2020)
2. Salgotra, R., Gandomi, M., Gandomi, A.H.: Time series analysis and forecast of the COVID-19 pandemic in India using genetic programming. Chaos, Solitons Fractals 138, 109945 (2020)
3. Dbouk, T., Drikakis, D.: On respiratory droplets and face masks. Phys. Fluids 32, 063303 (2020)
4. Gupta, A., Pradhan, B., Maulud, K.N.A.: Estimating the impact of daily weather on the temporal pattern of COVID-19 outbreak in India. Earth Syst. Environ. 4, 523–534 (2020)
5. Sasikumar, K., Nath, D., Nath, R., Chen, W.: Impact of extreme hot climate on COVID-19 outbreak in India. GeoHealth 4, e2020GH000305 (2020)
6. Kumar, S.: Effect of meteorological parameters on spread of COVID-19 in India and air quality during lockdown. Sci. Total Environ. 745, 141021 (2020)
7. Kumar, R.R., Kumar, G.: A correlation study between meteorological parameters and COVID-19 pandemic in Mumbai, India. Diabetes Metab. Syndr. Clin. Res. Rev. 14, 1735–1742 (2020)
8. Allam, M.S., Sultana, R.: Influences of climatic and non-climatic factors on COVID-19 outbreak: a review of existing literature. Environ. Challenges 5, 100255 (2021)
9. Ma, Y., et al.: Effects of temperature variation and humidity on the death of COVID-19 in Wuhan, China. Sci. Total Environ. 724, 138226 (2020)
10. Prata, D.N., Rodrigues, W., Bermejo, P.H.: Temperature significantly changes COVID-19 transmission in (sub) tropical cities of Brazil. Sci. Total Environ. 729, 138862 (2020)
11. Comunian, S., Dongo, D., Milani, C., Palestini, P.: Air pollution and COVID-19: the role of particulate matter in the spread and increase of COVID-19's morbidity and mortality. Int. J. Environ. Res. Public Health 17(12), 4487 (2020)

Monitoring COVID-Related Face Mask Protocol Using ResNet DNN

Atlanta Choudhury and Kandarpa Kumar Sarma
Department of ECE, Gauhati University, Guwahati 781014, Assam, India
{atlantachoudhury07,kandarpaks}@gauhati.ac.in

Abstract. Several epidemics of different coronavirus illnesses have occurred around the world in the previous two decades. These epidemics frequently resulted in respiratory tract infections, which have been deadly in certain cases. With the advent of COVID-19, a coronavirus disease, we are currently facing an enigmatic health disaster. Airborne transmission is one of COVID-19's modes of transmission. Humans become infected when they breathe in the droplets released by an infected person through sneezing, singing, breathing, speaking or coughing. As a result, public health officials have made face masks mandatory, which can limit disease transmission by 45%. In face recognition, which is often used for security verification, face masks create a difficult problem because the algorithms were generally trained on human faces without masks, whereas now, owing to the COVID-19 pandemic, they must detect faces with masks. The proposed model consists of one component, which is based on the ResNet-50 deep transfer learning (DTL) model and is intended for feature extraction. The model is used to train a ResNet-50 based architecture that can recognize masked faces well. The findings of this research might be easily integrated into existing facial recognition algorithms that are used to detect faces as part of the security verification process. This research examines the performance of the ResNet-50 algorithm in determining in real time who is wearing a face mask. The proposed technique is found to yield good accuracy (94.0%).

Keywords: COVID-19 · Deep learning (DL) · Deep transfer learning (DTL) · ResNet-50 · Face mask

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 348–355, 2022. https://doi.org/10.1007/978-3-031-11713-8_35

1 Introduction
Many countries have implemented new requirements for wearing face masks (medical, cloth, N-95 face mask) as a result of the coronavirus (COVID-19) outbreak. Governments have begun to develop new techniques for managing space, social distance, and supplies for medical personnel and ordinary citizens. In addition, governments have ordered hospitals and other organizations to implement new transmission prevention measures in order to stop COVID-19 from spreading. COVID-19 has a transmission rate of around 2.4 [1, 2]. However, the rate of transmission may vary depending on the measures governments adopt and how they are implemented. COVID-19 is spread mainly by airborne droplets and close contact; therefore, governments have begun enacting new regulations requiring people to wear face masks. Wearing face masks has the purpose of lowering the rate of transmission and spread. The World Health Organization (WHO) has suggested that people and medical personnel use personal protective equipment (PPE) kits to isolate themselves from infected people. However, most countries' capacity to increase PPE production was severely limited at the beginning of the pandemic [3]. COVID-19 is becoming a major public health and economic concern due to the virus's negative effects on people's quality of life, contributing to acute respiratory illnesses, death, and financial crises around the world [4]. According to [5], COVID-19 has infected almost six million people in more than 180 countries, with a fatality rate of 3%. COVID-19 spreads quickly in crowded places and through close contact. In many nations, governments face enormous obstacles and hazards in protecting citizens from the coronavirus [6]. Because many nations require people to wear face masks in public, masked face detection is critical for face applications like object detection [7]. To combat the COVID-19 pandemic, governments will need guidance and surveillance of people in public places, particularly in congested regions, to ensure that face mask laws are followed. This might be accomplished by combining surveillance technologies with artificial intelligence models (Fig. 1).

Fig. 1. Block diagram of face mask datasets for ResNet-50 algorithm

Fig. 2. Block diagram of wearing mask or no mask

2 ResNet-50 Based Method
ResNet-50, a convolutional neural network with 50 layers, is one of the ResNet versions. It has 48 convolution layers, as well as one Max Pool and one Average Pool layer. ResNet-50's architecture is depicted in Fig. 3. ResNet [5] is a network based on the deep residual learning framework. It solves the vanishing gradient problem, even for incredibly deep neural networks. Despite having 50 layers, ResNet-50 contains approximately 23 million trainable parameters, which is significantly fewer than other architectures. The reasons for its performance are still up for debate. However, explaining residual blocks and how they work is the simplest way to grasp the concept. Consider a neural network block with input x that should learn the true distribution H(x). The difference (or residual) between the two is denoted as:

$$R(x) = \text{Output} - \text{Input} = H(x) - x$$

Rearranging gives:

$$H(x) = R(x) + x$$

The residual block is attempting to learn the true output H(x). Because of the identity connection that carries x, the layers in the block learn the residual R(x). In a typical network, the layers learn the actual output H(x), whereas in a residual network, the layers learn the residual R(x). It has also been found that learning the residual between the output and the input is easier than learning the output directly. Because the identity connections bypass the layers and add no complexity to the design, the residual model allows activations from prior layers to be reused.
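The relation H(x) = R(x) + x can be made concrete with a toy residual block. This is a simplified sketch, not the ResNet-50 block itself: it assumes two small fully connected layers for R(x) instead of the convolution and batch-normalization layers of the real architecture, and all names are hypothetical.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def linear(W, b, x):
    """Compute W x + b with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def residual_block(x, W1, b1, W2, b2):
    """Return H(x) = R(x) + x, where R is two linear layers with a ReLU in between."""
    r = linear(W2, b2, relu(linear(W1, b1, x)))  # the learned residual R(x)
    return [r_k + x_k for r_k, x_k in zip(r, x)]  # identity (skip) connection adds x


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [0.0, 0.0]

# With zero weights R(x) = 0, so the block passes its input through unchanged.
print(residual_block([1.0, -2.0], zeros, no_bias, zeros, no_bias))  # [1.0, -2.0]

# With identity weights R(x) = relu(x), so H(x) = relu(x) + x.
print(residual_block([1.0, -2.0], identity, no_bias, identity, no_bias))  # [2.0, -2.0]
```

The first call shows why residual learning helps very deep networks: a block that has learned nothing still behaves as the identity, so adding it cannot make the representation worse.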

Fig. 3. Architecture of ResNet-50 for facemask detection

3 Methodology
Since the goal of our work is to detect people wearing a mask, we used a transfer learning technique in which a model established for another face recognition problem is applied to ours. We employ transfer learning on a convolutional neural network model based on the ResNet-50 architecture. We chose ResNet-50 because of its performance in numerous image recognition tasks, notably because it outperforms other models in terms of time and memory. The machine learning model is divided into three parts:

Fig. 4. Block diagram of ResNet-50 learning model

(a) Preprocessing Unit
(b) Learning Algorithm
(c) Evaluation and Prediction Unit (Fig. 4)

Preprocessing Unit
One of the most important processes in every machine learning application is preparing the datasets. We use the Kaggle dataset, which includes faces both with and without masks. It contains approximately 7000 photos, divided into 3725 images of persons wearing face masks and 3828 images of people not wearing face masks. The photos are of various sizes, colors, and contrast to cover all situations. After being converted to CSV files, the data were separated into three groups: the training set contains 6042 images (80%), the testing set contains 755 images (10%), and the validation set contains 755 images (10%).

Learning Algorithm
After the preprocessing operations, we provide all the data to train the ResNet-50 algorithm and observe its performance.

Evaluation and Prediction Unit
To avoid over-fitting, dropout is applied. After training and testing, we compare the data using the Mean Square Error (MSE). We also use the Adaptive Moment Estimation (Adam) optimizer, which adapts the learning rate, to obtain the best accuracy for masked face detection (MFD).
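The 80/10/10 split described under the preprocessing unit can be checked with a few lines of arithmetic. The image counts are taken from the text; rounding each share down to a whole image is an assumption, since the text does not state how fractional images were handled.

```python
with_mask, without_mask = 3725, 3828
total = with_mask + without_mask

# Integer shares, rounded down (assumed rounding rule).
train = total * 80 // 100
test = total * 10 // 100
validation = total * 10 // 100
unassigned = total - (train + test + validation)

print(total, train, test, validation, unassigned)  # 7553 6042 755 755 1
```

The computed sizes match the 6042/755/755 figures quoted above. Under this rounding rule one of the 7553 images is left over, and the text does not say which set it was assigned to.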

4 Experimental Results
Table 2 compares the ResNet-50 accuracy reported in several research papers with that of our proposed model. The model accuracy and model loss graphs for the ResNet-50 learning model are shown in Figs. 5 and 6. Table 1 shows the results of the different parameters calculated for the ResNet-50 learning method. The results show that this learning technique has an accuracy of 94%, where

$$\text{Accuracy} = \frac{\text{No. of correctly detected}}{\text{Total number of samples}}$$

Table 1. Performance evaluation of ResNet-50

| Sl. no | Model | Class | Precision | Recall | F1 score | Support | Confusion matrix | Accuracy |
|---|---|---|---|---|---|---|---|---|
| 1 | RESNET-50 | 0 | 0.99 | 0.87 | 0.93 | 745 | [651 94] | 94% |
| | | 1 | 0.89 | 0.99 | 0.94 | 765 | [4 761] | 94% |
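The per-class figures in Table 1 can be recomputed from the confusion matrix. The sketch below assumes that rows are true classes and columns are predicted classes, which is consistent with the reported supports of 745 and 765; the function name is hypothetical.

```python
def per_class_metrics(cm):
    """Return ({class: (precision, recall, f1, support)}, accuracy) for a square confusion matrix.

    Rows are assumed to be true classes and columns predicted classes.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    metrics = {}
    for k in range(n):
        true_positive = cm[k][k]
        support = sum(cm[k])                         # samples whose true class is k
        predicted = sum(cm[r][k] for r in range(n))  # samples predicted as class k
        precision = true_positive / predicted
        recall = true_positive / support
        f1 = 2 * precision * recall / (precision + recall)
        metrics[k] = (precision, recall, f1, support)
    return metrics, correct / total


metrics, accuracy = per_class_metrics([[651, 94], [4, 761]])
for label, (precision, recall, f1, support) in metrics.items():
    print(label, round(precision, 2), round(recall, 2), round(f1, 2), support)
# 0 0.99 0.87 0.93 745
# 1 0.89 0.99 0.94 765
print(round(accuracy, 2))  # 0.94
```

The recomputed values agree with Table 1: 1412 of the 1510 test images are classified correctly, which is about 93.5% and rounds to the reported 94%.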

The majority of related work focuses solely on masked face classification using ResNet-50. Table 2 shows a performance comparison of different approaches in terms of Accuracy (AC) (Fig. 7).

Table 2. Performance comparison of ResNet-50 model in terms of accuracy

| Ref | Model | Classification | Accuracy |
|---|---|---|---|
| [1] | Resnet-50 | Yes | 89% |
| [2] | Resnet-50 | Yes | 81% |
| [3] | Resnet-50 | Yes | 88.9% |
| [4] | Resnet-50 | Yes | 95.8% |
| [5] | Resnet-50 | Yes | 99% |
| [6] | Resnet-50 | Yes | 98.2% |
| [7] | Resnet-50 | Yes | – |
| [8] | Resnet-50 | Yes | 98.7% |
| [9] | Proposed model | Yes | 94% |

Fig. 5. Training and validation accuracy of ResNet-50

Fig. 6. Training and validation loss in ResNet-50

Fig. 7. Model summary of ResNet-50

5 Conclusion
Here we presented the detection of masked faces using ResNet-50 and compared the results with those of different papers published in conferences and journals. In this work, the ResNet-50 model is used for image detection, and it produces high-performance outcomes. The detection performance of the proposed model is improved by introducing a mean IoU to estimate the best number of anchor boxes. For training and validating our new model in a supervised condition, a new dataset has to be created based on two public masked face datasets. The results show that the accuracy of the ResNet-50 algorithm is 94%. In future work, we could enlarge the dataset used here by simulating masks on our unmasked images and vice versa, and then compare the outcome with the current results.

References
1. Mandal, B., Okeukwu, A., Theis, Y.: Masked face recognition using ResNet50. In: Computer Vision and Pattern Recognition, 19 April 2021. https://doi.org/10.48550/arXiv.2104.08997
2. Loey, M., Manogaran, G., Taha, M.H.N., Khalifa, N.E.M.: Fighting Against COVID-19: A Novel Deep Learning Model Based on YOLOv2 with ResNet-50 for Medical Face Mask Detection. Elsevier, 6 November 2020
3. Hariri, W.: Efficient masked face recognition method during the COVID-19 pandemic. SIViP 16, 605–612 (2021). https://doi.org/10.1007/s11760-021-02050-w
4. Balasubramanian, V.: Facemask detection algorithm on COVID community spread control using efficient net algorithm. J. Soft Comput. Paradigm (JSCP) 03(02), 110–122 (2021). http://irojournals.com/jscp/, https://doi.org/10.36548/jscp.2021.2.005
5. Kalpe, A., Singh, A., Kholamkar, H., Pathave, P., Phaltankar, V., Gupta, A.K.: A survey: different techniques of face mask detection. Int. Res. J. Eng. Technol. (IRJET) 8(05) (2021). p-ISSN: 2395-0072
6. Sethi, S., Kathuria, M., Kaushik, T.: Face mask detection using deep learning: an approach to reduce risk of Coronavirus spread. J. Biomed. Inform. 120, 103848 (2021). https://doi.org/10.1016/j.jbi.2021.103848
7. Nithyashree, K., Kavitha, T.: Face mask detection in classroom using deep convolutional neural network. Turkish J. Comput. Math. Educ. 12(10), 1462–1466 (2021)
8. Sadeddin, S.: Face mask detection trained on ResNet-50. Wolfram Notebook

Author Index

A
Abraham, G. K., 338
Aditya, V., 275
Ajay, Pasili, 20
Alankar, Bhavya, 164, 173, 181, 191

B
Baghel, Amit, 247
Banu, K. Sharmila, 56
Barisal, Swadhin Kumar, 10
Behera, S. K., 1
Bisoy, Sukant Kishoro, 82
Bohidar, Sankalpa, 256

C
Chandra, Pushpendra Kumar, 247
Chauhan, Ritu, 164, 173, 181, 191
Choudhury, Atlanta, 348

D
Das, Lopamudra, 232
Das, Nilima R., 1
Das, Shom Prasad, 43, 219
Dasgupta, Brinda, 338
Dash, J. K., 209
Dash, Ranjan Kumar, 145
Dash, Subhransu Sekhar, 308
Deb, Mahuya, 338
Dey, Raghunath, 92
Dharmasastha, K. N. S., 56

F
Francis, Jerome, 338

G
Gantayat, Sasanko Sekhar, 275
Godslove, Julius Femi, 65
Guo, Wanying, 326
Gururaja, H. S., 315

J
Jeong, Hyungi, 326
Jha, Pragya, 82

K
Kalaichevlan, G., 56
Kar, Sanjeeb Kumar, 308
Kaur, Harleen, 164, 173, 181, 191
Kishore, Pushkar, 10
Kumar Chandra, Pushpendra, 20
Kumar, B. Anil, 134
Kumar, Neeraj, 181
Kumar, Rakesh Ranjan, 289

L
Lakshmidevi, N., 134
Lincy, B., 56

M
Mallick, Ayeshkant, 1
Mallick, Ranjan Kumar, 256
Mirza, Arisha, 191
Mishra, Deepti Bala, 156
Mishra, Pranati, 145
Mohanty, Prithviraj, 266
Mohanty, Subhadarshini, 104
Mohapatra, Puspanjali, 92
Mohapatra, Satyajit, 145
Mohapatra, Soumya Snigdha, 289
Mohapatra, Srikanta Kumar, 32
Mohapatra, Subasish, 104
Moharana, Santosh Kumar, 104
Monisha, Chippada, 20
Muduly, Sarmistha, 104

N
Nahak, Narayan, 256
Nanda, Sarita, 232
Nanda, Sony, 232
Nayak, Ajit Kumar, 65
Nayak, Bhagyalaxmi, 232
Nayak, Gayatri, 10
Nayak, Smrutiranjan, 308
Negi, Satish Kumar, 20, 247

P
Padmanandam, Kayal, 113
Palai, G., 200
Panda, G. B., 209
Panda, Nibedan, 266
Panda, Niranjan, 92
Panda, S., 209
Parida, Swarnalipsa, 156
Pavan Kumar, Koli, 20
Piri, Jayashree, 92
Pitla, Nikitha, 113
Pradhan, Jitesh, 32, 43, 219
Priyadarshi, Ankur, 32
Prusty, Sashikanta, 72
Prusty, Sushree Gayatri Priyadarsini, 72

R
Rao, G. Nageswara, 266
Rath, Dharashree, 156
Rath, Suneel Kumar, 43, 219
Raut, Piyush Nawnath, 298
Roy, Manish Chandra, 124

S
Sahoo, Anita, 124
Sahu, Madhusmita, 43, 82, 219
Samal, S. R., 200
Samal, Tusarkanta, 124
Sarangi, Prakash Kumar, 32
Sarma, Kandarpa Kumar, 348
Satapathy, Samarjeet, 256
Seetha, M., 315
Sharma, Santosh Kumar, 32
Swain, Ayusee, 200
Swain, Kaliprasanna, 200
Swain, S. K., 200

T
Tomar, Abhinav, 298
Tripathy, B. K., 56
Tulsibabu, Sai, 266

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. N. Mohanty et al. (Eds.): METASOFT 2022, AISSE 1, pp. 357–358, 2022. https://doi.org/10.1007/978-3-031-11713-8