Intelligent and Cloud Computing: Proceedings of ICICC 2021 (Smart Innovation, Systems and Technologies, 286)
ISBN: 9811698724, 9789811698729

This book features a collection of high-quality research papers presented at the International Conference on Intelligent and Cloud Computing (ICICC 2021), organized by Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, India, during October 22–23, 2021.


English · Pages: 667 [626] · Year: 2022


Table of contents:
Committees
Chief Patron
Patrons
Finance Chair
General Chair
International Advisory Committee
Program Chair
Program Committee
Organizing Chair
Organizing Co-chair
Conference Coordinator
Convener
Co-convenor
Technical Program Committee Chairs
Preface
Contents
About the Editors
Part I Cloud Computing, Image Processing, and Software
1 An Intelligent Online Attendance Tracking System Through Facial Recognition Technique Using Edge Computing
1.1 Introduction
1.1.1 Motivation and Contribution
1.1.2 Organization of the Paper
1.2 Related Works
1.2.1 In the Year 2017
1.2.2 In the Year 2018
1.2.3 In the Year 2019
1.2.4 In the Year 2020
1.3 Significance of Edge Computing for Online Attendance Tracking System
1.4 Methodology
1.4.1 Student Registration
1.4.2 Faculty Registration
1.4.3 Image Capture
1.4.4 Face Detection
1.4.5 Extraction of Face Encodings
1.4.6 Face Recognition
1.4.7 Attendance Updation
1.5 Practical Implementation
1.6 Results and Discussion
1.7 Conclusion and Future Work
References
2 Cloud Analytics: An Outline of Tools and Practices
2.1 Introduction
2.2 Related Work
2.3 Exploring the Ideology in Cloud-Based Analytics
2.4 Cloud Analytics Platforms
2.4.1 Qlik Sense Enterprise
2.4.2 Tableau
2.4.3 IBM Watson Analytics
2.4.4 Microsoft Power BI
2.4.5 TIBCO Spotfire X
2.4.6 SAS Business Analytics
2.5 Discussions
2.5.1 Benefits to Cloud-Based Analytics
2.5.2 Overview of Emerging Trends
2.6 Conclusion
References
3 An Enhanced Fault-Tolerant Load Balancing Process for a Distributed System
3.1 Introduction
3.2 Background
3.2.1 Related Work
3.3 Proposed Enhanced Load Balancing Scheme with Resource Failure Rate Consideration
3.4 Simulated Results
3.4.1 Results
3.5 Conclusion
References
4 HeartFog: Fog Computing Enabled Ensemble Deep Learning Framework for Automatic Heart Disease Diagnosis
4.1 Introduction
4.2 Related Works
4.3 HeartFog Architecture
4.3.1 Technology Used
4.3.2 Hardware Components Used
4.3.3 Software Components Used
4.4 HeartFog Design
4.4.1 Dataset
4.4.2 Heart Patient Data Pre-processing
4.4.3 Ensemble DL Application
4.4.4 Android Interface and Communication
4.4.5 Experimental Set-Up
4.4.6 Implementation
4.5 Results and a Discussion
4.6 Conclusion and Future Scope
References
5 A Cloud Native SOS Alert System Model Using Distributed Data Grid and Distributed Messaging Platform
5.1 Introduction
5.2 Related Work
5.3 Preliminaries
5.3.1 ZooKeeper as Cluster Coordinator
5.3.2 Kafka as a Data Streaming Platform
5.3.3 Ignite as the In-Memory Data Grid
5.4 Proposed Architecture
5.5 Implementation
5.6 Conclusion
References
6 Efficient Peer-to-Peer Content Dispersal by Spontaneously Newly Combined Fingerprints
6.1 Introduction
6.2 Literature Review
6.3 Proposed Work
6.3.1 Overview
6.3.2 Framework Architecture
6.3.3 Content Uploading and Splitting
6.3.4 Fingerprint Detection
6.3.5 Content Distribution
6.3.6 Anonymous Communication Protocol
6.3.7 Identification of Illegal Distribution
6.4 Experimental Results
6.4.1 Content Uploading and Splitting
6.4.2 Fingerprint Generation
6.4.3 Content Distribution
6.4.4 Identifying and Preventing Illegal Redistribution
6.5 Conclusion
References
7 An Integrated Facemask Detection with Face Recognition and Alert System Using MobileNetV2
7.1 Introduction
7.2 Related Works
7.3 Methodology
7.3.1 Facemask Detection Model
7.3.2 The Face-Recognition Model
7.3.3 Database
7.3.4 Web Application
7.4 Results
7.5 Conclusion
References
8 Test Case Generation Using Metamorphic Relationship
8.1 Introduction
8.2 Usability of Successful Test Cases
8.3 The Formalization and Insight of Metamorphic Testing
8.4 Conclusion
References
Part II IoT/Network
9 IRHMP: IoT-Based Remote Health Monitoring and Prescriber System
9.1 Introduction
9.2 Previous Work
9.2.1 Architecture
9.2.2 Communication Protocol
9.3 IRHMP—IoT-Based Remote Health Monitoring and Prescriber System
9.3.1 Architecture
9.3.2 Functionalities Doctor
9.3.3 Implementation Working
9.4 Conclusion
References
10 On Boundary-Effects at Cellular Automata-Based Road-Traffic Model Towards Uses in Smart City
10.1 Introduction
10.2 Review of the State-of-the-Art Available Works
10.3 Proposed Empirical Investigation
10.3.1 Search for More ECA Rules Toward Uses in Road-Traffic Simulation
10.3.2 Investigations on ECA Dynamics Toward Road-Traffic Simulation at Various Fixed-Boundary Situations
10.4 Detailed Discussions
10.5 Conclusive Remarks
References
11 An Optimized Planning Model for Management of Distributed Microgrid Systems
11.1 Introduction
11.1.1 Components of a Micro Grid
11.1.2 Type of Micro Grid
11.2 Literature Survey
11.3 Proposed Model
11.3.1 Methodology
11.4 Results and Discussion
11.4.1 For PV System
11.4.2 For Wind System
11.4.3 For Fuel Cells Power Generating Systems
11.5 Conclusions
References
12 Hard and Soft Fault Detection Using Cloud Based VANET
12.1 Introduction
12.2 Literature Study
12.3 Proposed Approach
12.3.1 Hard Permanent Fault Detection
12.3.2 Soft Fault Diagnosis
12.3.3 Fault Status Transmission Through Vehicular Cloud
12.4 Simulation Experiments and Discussions
12.5 Conclusion
References
13 A Novel Intelligent Street Light Control System Using IoT
13.1 Introduction
13.2 Related Works
13.3 State of Art
13.3.1 Node MCU
13.3.2 PIR Sensor
13.3.3 Channel Relay
13.3.4 LDR
13.4 Proposed System
13.5 Features of Smart Street Light System
13.6 Results and Discussion
13.7 Conclusion and Future Scope
References
14 Enhancing QoS of Wireless Edge Video Distribution using Friend-Nodes
14.1 Introduction
14.2 System Model
14.3 Video Dissemination Architecture
14.4 Simulation and Results
14.5 Conclusion
References
15 Groundwater Monitoring and Irrigation Scheduling Using WSN
15.1 Introduction
15.2 Materials and Method
15.2.1 System Design
15.2.2 Data Collection
15.2.3 Hardware
15.2.4 Software and Database
15.2.5 Groundwater Level Measurement and Pumping Time
15.2.6 Crop Water Demand
15.2.7 Irrigation Water Requirements
15.2.8 Energy Consumption
15.2.9 Process Flow Diagram
15.2.10 Dashboard
15.3 Results
15.3.1 Discharge and Recharge of Groundwater
15.3.2 Electricity Consumption
15.3.3 Crop Water Demand
15.4 Conclusion
References
16 A Soft Coalition Algorithm for Interference Alignment Under Massive MIMO
16.1 Introduction
16.2 System Model
16.3 Proposed Model
16.4 Experiment Results
16.5 Conclusion
References
Part III Optimization and Nature Inspired Methods
17 Improved Grasshopper Optimization Algorithm Using Crazy Factor
17.1 Introduction
17.2 Grasshopper Optimization Algorithms
17.2.1 Basic Grasshopper Optimization Algorithm (GOA)
17.2.2 Improved GOA with Crazy Factor (Crazy-GOA)
17.3 Result Analysis
17.4 Conclusion
References
18 Reliability Estimation Using Fuzzy Failure Rate
18.1 Introduction
18.2 Reliability Using Fuzzy System
18.3 Discussion
18.4 Illustration
18.5 Conclusion
References
19 A Sine Cosine Learning Algorithm for Performance Improvement of a CPNN Based CCFD Model
19.1 Introduction
19.2 Proposed SCA-CPNN-Based CCFD Model
19.2.1 Evaluation Criteria
19.3 Experimental Result Discussion
19.4 Conclusion
References
20 Quality Control Pipeline for Next Generation Sequencing Data Analysis
20.1 Introduction
20.2 Background and Related Work
20.3 Datasets
20.4 Methods and Material
20.4.1 Duplicate Removal and Trimming
20.4.2 NGS Dataset
20.4.3 NULL Value Removal
20.4.4 Normalization
20.4.5 Dimensionality Reduction
20.5 Result Discussion
20.5.1 The Differentially Expressed Genes and Outlier Detection Visualization
20.5.2 The Differentially Expressed Gene Selection Using Dimension Reduction
20.6 Conclusion and Future Work
References
21 Fittest Secret Key Selection Using Genetic Algorithm in Modern Cryptosystem
21.1 Introduction
21.1.1 Background
21.1.2 Motivation
21.1.3 Contribution
21.2 Preliminaries
21.2.1 Illustration of AVK
21.2.2 Genetic Algorithm
21.3 Proposed Work
21.4 Experiment Examples
21.5 Experiment Results
21.6 Performance Analysis
21.7 Randomness Verification
21.8 Conclusion
21.9 Future Work
References
22 Variable Step Size Firefly Algorithm for Automatic Data Clustering
22.1 Introduction
22.2 Related Work
22.3 Variable Step Size Firefly Algorithm
22.4 Result Analysis
22.5 Conclusion
References
23 GWO Based Test Sequence Generation and Prioritization
23.1 Introduction
23.1.1 Motivation
23.2 Basic Concept
23.2.1 GWO Algorithm
23.2.2 Adjacency Matrix
23.2.3 Objective Function/Fitness Function
23.3 Proposed Approach
23.3.1 Overview
23.3.2 Generation of Control Flow Graph
23.4 Implementation
23.4.1 Experimental Result
23.4.2 Prioritization
23.5 Comparison with Related Work
23.6 Conclusion and Future Work
References
Part IV Intelligent Computing
24 Artificial Intelligent Approach to Predict the Student Behavior and Performance
24.1 Introduction
24.2 Related Work
24.3 Existing System
24.4 Proposed System
24.5 Module Description
24.5.1 Feature Extraction
24.5.2 Emotion Expression Classification
24.5.3 Current Emotion Detection System
24.5.4 Facial Expression Recognition
24.6 Conclusion
References
25 Graph Based Automatic Keyword Extraction from Odia Text Document
25.1 Introduction
25.2 Literature Survey
25.3 Unsupervised Techniques for Ranking and Keyword Extraction
25.3.1 Graph Based Text-Rank Model
25.3.2 TF-IDF
25.4 Implementation and Experimental Results
25.5 Result Analysis
25.6 Conclusion
References
26 An Attempt for Wordnet Construction for Odia Language
26.1 Introduction
26.2 Related Work
26.3 Methods for Wordnet Construction and Procedure for Odia Wordnet Construction
26.4 Experimental Setup
26.5 Advantages and Disadvantages of Expansion Approach
26.6 Application
26.7 Conclusion and Future Works
References
27 A Deep Learning Approach for Face Mask Detection
27.1 Introduction
27.2 Related Works
27.3 Framework Used
27.4 Proposed Methodology
27.5 Results and Discussion
27.6 Conclusions and Future Scope
References
28 A Computational Intelligence Approach Using SMOTE and Deep Neural Network (DNN)
28.1 Introduction
28.2 Methodology Used
28.2.1 Dataset
28.2.2 Over-Sampling
28.2.3 DNN
28.2.4 Performance Metrics
28.3 Proposed Model
28.4 Experiments
28.5 Conclusion
References
29 Face Mask Detection in Public Places Using Small CNN Models
29.1 Introduction
29.2 Related Work
29.3 Small CNN Architectures: MobileNetv2 and ShuffleNet
29.3.1 MobileNetv2
29.3.2 ShuffleNet
29.4 Materials and Methodology
29.4.1 Dataset Description
29.4.2 Methodology
29.5 Result and Discussion
29.6 Conclusion
References
30 LSTM-RNN-Based Automatic Music Generation Algorithm
30.1 Introduction
30.2 Related Work
30.2.1 Melody-RNN
30.2.2 Biaxial-RNN
30.2.3 WaveNet
30.3 System Architecture
30.3.1 LSTM
30.3.2 LSTM with Attention
30.3.3 Encoder-Decoder
30.3.4 Encoder-Decoder with Attention
30.4 IV Implementation
30.5 Performance Evaluation
30.6 Conclusion
References
31 A KNN-PNN Decisioning Approach for Fault Detection in Photovoltaic Systems
31.1 Introduction
31.1.1 Types of PV Systems
31.1.2 Effect of Weather on PV Power Generation
31.1.3 Faults in PV Modules
31.2 Literature Survey
31.3 Proposed Model
31.3.1 Dataset Used
31.3.2 Working
31.4 Results and Discussion
31.4.1 Performance Evaluation for Dataset A
31.4.2 Performance Evaluation for Dataset B
31.5 Conclusions
References
32 Detecting Phishing Websites Using Machine Learning
32.1 Introduction
32.2 Literature Survey
32.3 Methodology
32.4 Results and Discussion
32.5 Conclusion and Future Work
References
33 Designing of Financial Time Series Forecasting Model Using Stochastic Algorithm Based Extreme Learning Machine
33.1 Introduction
33.2 Methodology
33.2.1 SLFN with ELM
33.2.2 SCA Algorithm
33.2.3 ELM-SCA Model
33.3 Experimental Study
33.4 Conclusion
References
34 Twin Support Vector Machines Classifier Based on Intuitionistic Fuzzy Number
34.1 Introduction
34.2 Background Review
34.2.1 Intuitionistic Fuzzy Membership Calculation
34.2.2 IFTSVM
34.3 Proposed Twin Support Vector Machines Classifier Based on Intuitionistic Fuzzy Number
34.4 Numerical Experiment
34.5 Conclusion
References
35 Automatic Detection of Epileptic Seizure Based on Differential Entropy, E-LS-TSVM, and AB-LS-SVM
35.1 Introduction
35.2 Materials and Methods
35.2.1 Clinical Dataset
35.2.2 Preprocessing
35.2.3 Feature Extraction
35.2.4 Performance Computation
35.3 Classifiers
35.4 Results and Discussion
35.5 Conclusion
References
36 Classification of Arrhythmia ECG Signal Using EMD and Rule-Based Classifiers
36.1 Introduction
36.2 Proposed Method
36.2.1 Clinical Dataset
36.2.2 Feature Extraction
36.2.3 Classifiers Used
36.2.4 Performance Measurements
36.3 Results and Discussion
36.4 Comparative Analysis
36.5 Conclusion
References
37 A Comparative Analysis of Data Standardization Methods on Stock Movement
37.1 Introduction
37.2 Related Work
37.3 Methods and Materials
37.3.1 Data Sets
37.3.2 Normalization
37.3.3 Technical Indicators
37.3.4 Support Vector Machines
37.3.5 Artificial Neural Network
37.3.6 K Nearest Neighbor
37.4 Proposed Methodology
37.5 Results and Discussion
37.6 Conclusion and Future Work
References
38 Implementation of Data Warehouse: An Improved Data-Driven Decision-Making Approach
38.1 Motivation of the Work
38.2 Introduction
38.3 Literature Review
38.4 State of the Art
38.5 Experimental Analysis
38.6 Conclusion and Future Scope
References
39 An Empirical Comparison of TOPSIS and VIKOR for Ranking Decision-Making Models
39.1 Introduction
39.2 MCDM Process
39.2.1 A Brief Overview of TOPSIS
39.2.2 A Brief Overview of VIKOR
39.2.3 Comparative Study of TOPSIS and VIKOR
39.3 Simulation
39.4 Conclusion
References
40 An Efficient Learning Model Selection for Dengue Detection
40.1 Introduction
40.2 Related Works
40.3 Proposed Work
40.3.1 Symptoms
40.3.2 Datasets Resources
40.3.3 Criteria for Percentage Selection of Trained and Tested Data
40.3.4 Proposed Algorithm for Evolution of Best ML Technique
40.3.5 Process Flow Chart
40.3.6 ML Algorithms Used
40.4 Results and Discussion
40.5 Conclusion
References
41 A Modified Convolution Neural Network for Covid-19 Detection
41.1 Introduction
41.2 Modified CNN Model for Covid-19 Detection from Chest X-Ray Images
41.2.1 Performance Matrices
41.3 Experimental Result Discussion
41.4 Conclusion
References
42 Bi-directional Long Short-Term Memory Network for Fake News Detection from Social Media
42.1 Introduction
42.2 Related Work
42.3 Methodology
42.4 Results
42.5 Conclusion
References
43 Effect of Class Imbalanceness in Credit Card Fraud
43.1 Introduction
43.2 Preliminary Work
43.3 Proposed Credit Card Fraud Detection Model
43.3.1 Proposed SMOTEMLCFDS Model
43.4 Results and Discussions
43.4.1 Experimental Setup
43.4.2 Description of the Dataset
43.4.3 Performance Metrics
43.4.4 Result Analysis
43.5 Conclusions
References
44 Effect of Feature Selection on Software Fault Prediction
44.1 Introduction
44.2 Literature Survey
44.3 Methodology
44.3.1 Datasets
44.3.2 Classification Techniques
44.3.3 Evaluation Criteria
44.4 Results and Discussions
44.5 Conclusion and Future Work
References
45 Deep Learning-Based Cell Outage Detection in Next Generation Networks
45.1 Introduction
45.1.1 Motivations
45.1.2 Contributions
45.2 Our Proposal
45.3 Performance Analysis
45.4 Conclusion
References
46 Image Processing and ArcSoft Based Data Acquisition and Extraction System
46.1 Introduction
46.2 System Analysis
46.2.1 System Function Structure
46.2.2 System Hardware Requirements
46.3 System Design
46.3.1 Process Design
46.3.2 Data Extraction Design
46.4 Implementation
46.4.1 Environment Deployment
46.4.2 Function Realization of Monitoring Terminal
46.4.3 Function Realization of Data Extraction
46.5 Conclusion
References
Part V Intelligent Computing (eHealth)
47 Machine Learning Model for Breast Cancer Tumor Risk Prediction
47.1 Introduction
47.2 Literature Review
47.3 Experimental Setup
47.4 Result Analysis
47.5 Conclusion
References
48 Comparative Analysis of State-Of-the-Art Classifier with CNN for Cancer Microarray Data Classification
48.1 Introduction
48.2 Related Work
48.3 Materials and Methodology
48.3.1 Dataset Description
48.3.2 Convolutional Neural Network
48.4 Proposed Work
48.5 Result and Analysis
48.6 Conclusion
References
49 Comparative Study of Machine Learning Algorithms for Breast Cancer Classification
49.1 Introduction
49.2 Related Work
49.3 Overview of Machine Learning Models
49.4 Dataset Description
49.5 Proposed Model
49.5.1 Exploratory Data Analysis and Data Preprocessing
49.5.2 Model Evaluation and Results
49.6 Conclusion
References
50 miRNAs as Biomarkers for Breast Cancer Classification Using Machine Learning Techniques
50.1 Introduction
50.2 Dataset Used
50.3 Proposed Model
50.3.1 Data Preparation
50.3.2 Feature Selection
50.3.3 Classifier Models
50.3.4 Evaluation Parameters
50.4 Results
50.4.1 Biological Relevance Analysis
50.5 Conclusion
References
51 A Computational Intelligence Approach for Cancer Detection Using Artificial Neural Network
51.1 Introduction
51.2 Literature Survey
51.3 Proposed Model
51.4 Result Analysis
51.4.1 Neural Network Model for Proposed Work
51.4.2 Dataset Description and Result Analysis
51.5 Conclusion
References
52 Brain MRI Classification for Detection of Brain Tumors Using Hybrid Feature Extraction and SVM
52.1 Introduction
52.2 Proposed Model
52.3 Experimental Study
52.4 Result Analysis
52.5 Conclusion
References
53 Enhancing the Prediction of Breast Cancer Using Machine Learning and Deep Learning Techniques
53.1 Introduction
53.2 Related Works
53.2.1 Machine Learning Techniques
53.2.2 Deep Learning Techniques
53.3 Proposed System
53.4 Experimental Results
53.4.1 Machine Learning
53.4.2 Deep Learning
53.5 Conclusion
References
54 Performance Analysis of Deep Learning Algorithms Toward Disease Detection: Tomato and Potato Plant as Use-Cases
54.1 Introduction
54.1.1 Contributions
54.2 Performance Analysis of Deep Learning Algorithms
54.2.1 Dataset
54.2.2 Data Augmentation
54.2.3 Transfer Learning
54.3 Experimental Settings
54.4 Results and Discussion
54.5 Conclusions
References
55 Classification of Human Facial Portrait Using EEG Signal Processing and Deep Learning Algorithms
55.1 Introduction and Related Studies
55.2 Methodology
55.3 Performance Analysis and Results
55.4 Conclusion and Future Works
References
56 A Semantic-Based Input Model for Patient Symptoms Elicitation for Breast Cancer Expert System
56.1 Introduction
56.2 Review of Literature
56.3 Materials and Method
56.4 Result and Discussion
56.4.1 Evaluation Metrics
56.4.2 Comparative Analysis of Symptoms Count
56.4.3 Comparative Analysis of Precision of Symptoms Generated on the Existing and Proposed Models
56.4.4 Comparative Analysis of Precision and Accuracy of Diagnosis Using Modified ST Algorithm
56.5 Conclusion
References
Appendix
Author Index


Smart Innovation, Systems and Technologies 286

Debahuti Mishra · Rajkumar Buyya · Prasant Mohapatra · Srikanta Patnaik (Editors)

Intelligent and Cloud Computing
Proceedings of ICICC 2021

Smart Innovation, Systems and Technologies Volume 286

Series Editors:
Robert J. Howlett, Bournemouth University and KES International, Shoreham-by-Sea, UK
Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK

The Smart Innovation, Systems and Technologies book series encompasses the topics of knowledge, intelligence, innovation and sustainability. The aim of the series is to make available a platform for the publication of books on all aspects of single and multi-disciplinary research on these themes in order to make the latest results available in a readily-accessible form. Volumes on interdisciplinary research combining two or more of these areas are particularly sought.

The series covers systems and paradigms that employ knowledge and intelligence in a broad sense. Its scope is systems having embedded knowledge and intelligence, which may be applied to the solution of world problems in industry, the environment and the community. It also focusses on the knowledge-transfer methodologies and innovation strategies employed to make this happen effectively. The combination of intelligent systems tools and a broad range of applications introduces a need for a synergy of disciplines from science, technology, business and the humanities. The series will include conference proceedings, edited collections, monographs, handbooks, reference books, and other relevant types of book in areas of science and technology where smart systems and technologies can offer innovative solutions.

High quality content is an essential feature for all book proposals accepted for the series. It is expected that editors of all accepted volumes will ensure that contributions are subjected to an appropriate level of reviewing process and adhere to KES quality principles.

Indexed by SCOPUS, EI Compendex, INSPEC, WTI Frankfurt eG, zbMATH, Japanese Science and Technology Agency (JST), SCImago, DBLP. All books published in the series are submitted for consideration in Web of Science.

More information about this series at https://link.springer.com/bookseries/8767

Debahuti Mishra · Rajkumar Buyya · Prasant Mohapatra · Srikanta Patnaik (Editors)

Intelligent and Cloud Computing
Proceedings of ICICC 2021

Editors

Debahuti Mishra, Department of Computer Science and Engineering, Siksha 'O' Anusandhan Deemed to be University, Bhubaneswar, India

Rajkumar Buyya, CLOUDS Laboratory, The University of Melbourne, Melbourne, VIC, Australia

Prasant Mohapatra, University of California, Davis, CA, USA

Srikanta Patnaik, Department of Computer Science and Engineering, Siksha 'O' Anusandhan Deemed to be University, Bhubaneswar, India

ISSN 2190-3018 · ISSN 2190-3026 (electronic)
Smart Innovation, Systems and Technologies
ISBN 978-981-16-9872-9 · ISBN 978-981-16-9873-6 (eBook)
https://doi.org/10.1007/978-981-16-9873-6

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Committees

Chief Patron
Prof. (Dr.) Manojranjan Nayak, President, Siksha 'O' Anusandhan Deemed to be University

Patrons
Prof. (Dr.) Ashok Kumar Mahapatra, Vice-Chancellor, Siksha 'O' Anusandhan Deemed to be University
Prof. (Dr.) Damodar Acharya, Chairman (Advisory Committee), Siksha 'O' Anusandhan Deemed to be University

Finance Chair
Prof. (Dr.) Manas Kumar Mallik, Director, FET (ITER), Siksha 'O' Anusandhan Deemed to be University

General Chair
Prof. (Dr.) Prasant Mohapatra, Vice-Chancellor for Research, University of California Davis, CA


International Advisory Committee
Prof. (Dr.) Cagatay Catel, Wageningen University and Research, The Netherlands
Prof. (Dr.) Sanjiv K. Bhatia, University of Missouri, St. Louis, USA
Prof. (Dr.) Joseph Davis, The University of Sydney, Australia
Prof. (Dr.) Aryabartta Sahu, IIT Guwahati, India
Prof. (Dr.) Rabi N. Mahapatra, Texas A&M University, USA
Prof. (Dr.) Gadadhar Sahoo, IIT/ISM Dhanbad, India
Prof. (Dr.) Peng-Yeng Yin, National Chi Nan University, Taiwan
Prof. (Dr.) Sandeep Kumar Garg, IIT Roorkee, India
Prof. (Dr.) Pedro Soto Acosta, University of Murcia, Spain
Prof. (Dr.) Pabitra Mohan Khilar, NIT Rourkela, India
Prof. (Dr.) Annappa, NIT Surathkal, India
Prof. (Dr.) Shyi-Ming Chen, National Taiwan University of Science and Technology, Taipei
Prof. (Dr.) Ashok Kumar Das, IIIT Hyderabad, India
Prof. (Dr.) Sanjay Jain, National University of Singapore, Singapore
Prof. (Dr.) Tapan Kumar Das, Vellore Institute of Technology, India
Prof. (Dr.) Zhihua Cui, Taiyuan University of Science and Technology, China
Prof. (Dr.) Soubhagya Sankar Barpanda, Vellore Institute of Technology, India
Prof. (Dr.) Wai-Kiang Yeap, Auckland University of Technology, New Zealand
Prof. (Dr.) D. J. Nagendra Kumar, Vishnu Institute of Technology, India
Prof. (Dr.) Mohammad Hassan Shafazand, Shahid Bahonar University of Kerman, Iran
Prof. (Dr.) Mayuri Kundu, Vishnu Institute of Technology, India
Prof. (Dr.) Chan-Su Lee, Yeungnam University, South Korea
Prof. (Dr.) Ishwar K. Sethi, Oakland University, USA
Prof. (Dr.) Pradipta Kishore Dash, Director (Research and Consultancy), Siksha 'O' Anusandhan Deemed to be University, India
Prof. (Dr.) Ganapati Panda, Former Deputy Director, IIT Bhubaneswar, India
Prof. (Dr.) B. P. Sinha, Distinguished Professor (CSE), FET (ITER), Siksha 'O' Anusandhan Deemed to be University, India

Program Chair
Prof. (Dr.) Srikanta Pattnaik, Director, International Relation and Publication, Siksha 'O' Anusandhan Deemed to be University


Program Committee
Prof. (Dr.) Pradipta Kumar Nanda, Pro-Vice-Chancellor, Siksha 'O' Anusandhan Deemed to be University
Prof. (Dr.) Amiya Kumar Rath, Adviser (NAAC) and Professor, Department of CSE, VSSUT, Burla, India
Prof. (Dr.) Pradeep Kumar Sahoo, Dean, FET (ITER), Siksha 'O' Anusandhan Deemed to be University
Prof. (Dr.) Binod Kumar Pattanayak, FET (ITER), Siksha 'O' Anusandhan Deemed to be University
Prof. (Dr.) Ajit Kumar Nayak, HoD (CS and IT), FET (ITER), Siksha 'O' Anusandhan Deemed to be University
Prof. (Dr.) Tripti Swarnkar, HoD (CA), FET (ITER), Siksha 'O' Anusandhan Deemed to be University

Organizing Chair
Prof. (Dr.) Debahuti Mishra, Siksha 'O' Anusandhan Deemed to be University

Organizing Co-chair
Dr. Manoranjan Parhi, Department of Computer Science and Engineering

Conference Coordinator
Dr. Sarada Prasanna Pati, Department of Computer Science and Engineering

Convener
Dr. Bharat Jyoti Ranjan Sahu, Department of Computer Science and Engineering

Co-convenor
Dr. Kaberi Das, Department of Computer Science and Engineering


Technical Program Committee Chairs
Dr. Bichitrananda Patra, Department of Computer Science and Engineering
Dr. Dibyasundar Das, Department of Computer Science and Engineering
Dr. Nikunja Bihari Kar, Department of Computer Science and Engineering

Preface

The 2nd International Conference on Intelligent and Cloud Computing (ICICC 2021) was organized by the Department of Computer Science and Engineering, ITER, Siksha 'O' Anusandhan (Deemed to be University), during October 22–23, 2021. ICICC 2021 provided a forum that brought together researchers, academics, and industry practitioners to meet and exchange their ideas and recent research achievements in all aspects of intelligent and cloud computing, together with their applications in the contemporary world. The conference aimed at attracting contributions on system and network design that can support existing and future applications and services. In recent years, intelligent and cloud computing have attracted significant attention both in research and in industry. This approach corresponds to natural human vision and offers an effective way to represent, generate, and implement various contemporary achievements. The conference received a good response, with a large number of submissions. The relevant papers accepted for publication have been broadly divided into three major parts: (i) intelligent computing, (ii) cloud computing, and (iii) data science and its applications. ICICC 2021 not only encouraged the submission of unpublished original articles in the field of intelligent and cloud computing but also considered several cutting-edge applications across organizations and firms while scrutinizing the relevant papers.

Debahuti Mishra, Bhubaneswar, India
Rajkumar Buyya, Melbourne, Australia
Prasant Mohapatra, Davis, USA
Srikanta Patnaik, Bhubaneswar, India


Contents

Part I Cloud Computing, Image Processing, and Software

1 An Intelligent Online Attendance Tracking System Through Facial Recognition Technique Using Edge Computing
Manoranjan Parhi, Abhinandan Roul, and Bravish Ghosh

2 Cloud Analytics: An Outline of Tools and Practices
Gunseerat Kaur, Tejashwa Kumar Tiwari, and Apoorva Tyagi

3 An Enhanced Fault-Tolerant Load Balancing Process for a Distributed System
Deepak Kumar Patel and Chitaranjan Tripathy

4 HeartFog: Fog Computing Enabled Ensemble Deep Learning Framework for Automatic Heart Disease Diagnosis
Abhilash Pati, Manoranjan Parhi, and Binod Kumar Pattanayak

5 A Cloud Native SOS Alert System Model Using Distributed Data Grid and Distributed Messaging Platform
Biswaranjan Jena, Sukant Kumar Sahoo, and Srikanta Kumar Mohapatra

6 Efficient Peer-to-Peer Content Dispersal by Spontaneously Newly Combined Fingerprints
T. R. Saravanan, G. Nagarajan, R. I. Minu, Samarjeet Borah, and Debahuti Mishra

7 An Integrated Facemask Detection with Face Recognition and Alert System Using MobileNetV2
Gopinath Pranav Bhargav, Kancharla Shridhar Reddy, Alekhya Viswanath, BAbhi Teja, and Akshara Preethy Byju

8 Test Case Generation Using Metamorphic Relationship
Sampa Chau Pattnaik, Mitrabinda Ray, and Yahya Daood

Part II IoT/Network

9 IRHMP: IoT-Based Remote Health Monitoring and Prescriber System
Rahul Chakraborty, Asheesh Balotra, Sanika Agrawal, Tanishq Ige, and Sashikala Mishra

10 On Boundary-Effects at Cellular Automata-Based Road-Traffic Model Towards Uses in Smart City
Arnab Mitra

11 An Optimized Planning Model for Management of Distributed Microgrid Systems
Jagdeep Kaur, Simerpreet Singh, Manpreet Singh Manna, Inderpreet Kaur, and Debahuti Mishra

12 Hard and Soft Fault Detection Using Cloud Based VANET
Biswa Ranjan Senapati, Rakesh Ranjan Swain, and Pabitra Mohan Khilar

13 A Novel Intelligent Street Light Control System Using IoT
Saumendra Pattnaik, Sayan Banerjee, Suprava Ranjan Laha, Binod Kumar Pattanayak, and Gouri Prasad Sahu

14 Enhancing QoS of Wireless Edge Video Distribution using Friend-Nodes
Khder Essa, Sachin Umrao, Kaberi Das, Shatarupa Dash, and Bharat J. R. Sahu

15 Groundwater Monitoring and Irrigation Scheduling Using WSN
S. Gilbert Rozario and V. Vasanthi

16 A Soft Coalition Algorithm for Interference Alignment Under Massive MIMO
Wanying Guo, Hyungi Jeong, Nawab Muhammad Faseeh Qureshi, and Dong Ryeol Shin

Part III Optimization and Nature Inspired Methods

17 Improved Grasshopper Optimization Algorithm Using Crazy Factor
Paulos Bekana, Archana Sarangi, Debahuti Mishra, and Shubhendu Kumar Sarangi

18 Reliability Estimation Using Fuzzy Failure Rate
Sampa Chau Pattnaik, Mitrabinda Ray, and Mitali Madhusmita Nayak

19 A Sine Cosine Learning Algorithm for Performance Improvement of a CPNN Based CCFD Model
Rajashree Dash, Rasmita Rautray, and Rasmita Dash

20 Quality Control Pipeline for Next Generation Sequencing Data Analysis
Debasish Swapnesh Kumar Nayak, Jayashankar Das, and Tripti Swarnkar

21 Fittest Secret Key Selection Using Genetic Algorithm in Modern Cryptosystem
Chukhu Chunka, Avinash Maurya, and Parashjyoti Borah

22 Variable Step Size Firefly Algorithm for Automatic Data Clustering
Mandakini Priyadarshani Behera, Archana Sarangi, Debahuti Mishra, and Srikanta Kumar Mohapatra

23 GWO Based Test Sequence Generation and Prioritization
Gayatri Nayak, Mitrabinda Ray, Swadhin Kumar Barisal, and Bichitrananda Patra

Part IV Intelligent Computing

24 Artificial Intelligent Approach to Predict the Student Behavior and Performance
G. Nagarajan, R. I. Minu, T. R. Saravanan, Samarjeet Borah, and Debahuti Mishra

25 Graph Based Automatic Keyword Extraction from Odia Text Document
Mamata Nayak and Nilima Das
. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23.2 Basic Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23.2.1 GWO Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


23.2.2 Adjacency Matrix
23.2.3 Objective Function/Fitness Function
23.3 Proposed Approach
23.3.1 Overview
23.3.2 Generation of Control Flow Graph
23.4 Implementation
23.4.1 Experimental Result
23.4.2 Prioritization
23.5 Comparison with Related Work
23.6 Conclusion and Future Work
References


Part IV Intelligent Computing
24 Artificial Intelligent Approach to Predict the Student Behavior and Performance
G. Nagarajan, R. I. Minu, T. R. Saravanan, Samarjeet Borah, and Debahuti Mishra
24.1 Introduction
24.2 Related Work
24.3 Existing System
24.4 Proposed System
24.5 Module Description
24.5.1 Feature Extraction
24.5.2 Emotion Expression Classification
24.5.3 Current Emotion Detection System
24.5.4 Facial Expression Recognition
24.6 Conclusion
References
25 Graph Based Automatic Keyword Extraction from Odia Text Document
Mamata Nayak and Nilima Das
25.1 Introduction
25.2 Literature Survey
25.3 Unsupervised Techniques for Ranking and Keyword Extraction
25.3.1 Graph Based Text-Rank Model
25.3.2 TF-IDF
25.4 Implementation and Experimental Results
25.5 Result Analysis
25.6 Conclusion
References


26 An Attempt for Wordnet Construction for Odia Language
Tulip Das and Smita Prava Mishra
26.1 Introduction
26.2 Related Work
26.3 Methods for Wordnet Construction and Procedure for Odia Wordnet Construction
26.4 Experimental Setup
26.5 Advantages and Disadvantages of Expansion Approach
26.6 Application
26.7 Conclusion and Future Works
References
27 A Deep Learning Approach for Face Mask Detection
Dibya Ranjan Das Adhikary, Vishek Singh, and Pawan Singh
27.1 Introduction
27.2 Related Works
27.3 Framework Used
27.4 Proposed Methodology
27.5 Results and Discussion
27.6 Conclusions and Future Scope
References
28 A Computational Intelligence Approach Using SMOTE and Deep Neural Network (DNN)
Madhusmita Sahu and Rasmita Dash
28.1 Introduction
28.2 Methodology Used
28.2.1 Dataset
28.2.2 Over-Sampling
28.2.3 DNN
28.2.4 Performance Metrics
28.3 Proposed Model
28.4 Experiments
28.5 Conclusion
References
29 Face Mask Detection in Public Places Using Small CNN Models
Prabira Kumar Sethy, Susmita Bag, Millee Panigrahi, Santi Kumari Behera, and Amiya Kumar Rath
29.1 Introduction
29.2 Related Work
29.3 Small CNN Architectures: MobileNetv2 and ShuffleNet
29.3.1 MobileNetv2
29.3.2 ShuffleNet
29.4 Materials and Methodology
29.4.1 Dataset Description


29.4.2 Methodology
29.5 Result and Discussion
29.6 Conclusion
References


30 LSTM-RNN-Based Automatic Music Generation Algorithm
R. I. Minu, G. Nagarajan, Samarjeet Borah, and Debahuti Mishra
30.1 Introduction
30.2 Related Work
30.2.1 Melody-RNN
30.2.2 Biaxial-RNN
30.2.3 WaveNet
30.3 System Architecture
30.3.1 LSTM
30.3.2 LSTM with Attention
30.3.3 Encoder-Decoder
30.3.4 Encoder-Decoder with Attention
30.4 Implementation
30.5 Performance Evaluation
30.6 Conclusion
References


31 A KNN-PNN Decisioning Approach for Fault Detection in Photovoltaic Systems
Kirandeep Kaur, Simerpreet Singh, Manpreet Singh Manna, Inderpreet Kaur, and Debahuti Mishra
31.1 Introduction
31.1.1 Types of PV Systems
31.1.2 Effect of Weather on PV Power Generation
31.1.3 Faults in PV Modules
31.2 Literature Survey
31.3 Proposed Model
31.3.1 Dataset Used
31.3.2 Working
31.4 Results and Discussion
31.4.1 Performance Evaluation for Dataset A
31.4.2 Performance Evaluation for Dataset B
31.5 Conclusions
References
32 Detecting Phishing Websites Using Machine Learning
A. Sreenidhi, B. Shruti, Ambati Divya, and N. Subhashini
32.1 Introduction
32.2 Literature Survey
32.3 Methodology
32.4 Results and Discussion


32.5 Conclusion and Future Work
References
33 Designing of Financial Time Series Forecasting Model Using Stochastic Algorithm Based Extreme Learning Machine
Sarbeswara Hota, Arup Kumar Mohanty, Debahuti Mishra, Pranati Satapathy, and Biswaranjan Jena
33.1 Introduction
33.2 Methodology
33.2.1 SLFN with ELM
33.2.2 SCA Algorithm
33.2.3 ELM-SCA Model
33.3 Experimental Study
33.4 Conclusion
References
34 Twin Support Vector Machines Classifier Based on Intuitionistic Fuzzy Number
Parashjyoti Borah, Ranjan Phukan, and Chukhu Chunka
34.1 Introduction
34.2 Background Review
34.2.1 Intuitionistic Fuzzy Membership Calculation
34.2.2 IFTSVM
34.3 Proposed Twin Support Vector Machines Classifier Based on Intuitionistic Fuzzy Number
34.4 Numerical Experiment
34.5 Conclusion
References
35 Automatic Detection of Epileptic Seizure Based on Differential Entropy, E-LS-TSVM, and AB-LS-SVM
Sumant Kumar Mohapatra and Srikanta Patnaik
35.1 Introduction
35.2 Materials and Methods
35.2.1 Clinical Dataset
35.2.2 Preprocessing
35.2.3 Feature Extraction
35.2.4 Performance Computation
35.3 Classifiers
35.4 Results and Discussion
35.5 Conclusion
References


36 Classification of Arrhythmia ECG Signal Using EMD and Rule-Based Classifiers
Prakash Chandra Sahoo and Binod Kumar Pattanayak
36.1 Introduction
36.2 Proposed Method
36.2.1 Clinical Dataset
36.2.2 Feature Extraction
36.2.3 Classifiers Used
36.2.4 Performance Measurements
36.3 Results and Discussion
36.4 Comparative Analysis
36.5 Conclusion
References
37 A Comparative Analysis of Data Standardization Methods on Stock Movement
Binita Kumari and Tripti Swarnkar
37.1 Introduction
37.2 Related Work
37.3 Methods and Materials
37.3.1 Data Sets
37.3.2 Normalization
37.3.3 Technical Indicators
37.3.4 Support Vector Machines
37.3.5 Artificial Neural Network
37.3.6 K Nearest Neighbor
37.4 Proposed Methodology
37.5 Results and Discussion
37.6 Conclusion and Future Work
References
38 Implementation of Data Warehouse: An Improved Data-Driven Decision-Making Approach
L. S. Abinash Nayak, Kaberi Das, Srilekha Hota, Bharat Jyoti Ranjan Sahu, and Debahuti Mishra
38.1 Motivation of the Work
38.2 Introduction
38.3 Literature Review
38.4 State of the Art
38.5 Experimental Analysis
38.6 Conclusion and Future Scope
References


39 An Empirical Comparison of TOPSIS and VIKOR for Ranking Decision-Making Models
Sidharth Samal and Rajashree Dash
39.1 Introduction
39.2 MCDM Process
39.2.1 A Brief Overview of TOPSIS
39.2.2 A Brief Overview of VIKOR
39.2.3 Comparative Study of TOPSIS and VIKOR
39.3 Simulation
39.4 Conclusion
References
40 An Efficient Learning Model Selection for Dengue Detection
Miranji Katta, R. Sandanalakshmi, Gubbala Srilakshmi, and Ramkumar Adireddi
40.1 Introduction
40.2 Related Works
40.3 Proposed Work
40.3.1 Symptoms
40.3.2 Datasets Resources
40.3.3 Criteria for Percentage Selection of Trained and Tested Data
40.3.4 Proposed Algorithm for Evolution of Best ML Technique
40.3.5 Process Flow Chart
40.3.6 ML Algorithms Used
40.4 Results and Discussion
40.5 Conclusion
References
41 A Modified Convolution Neural Network for Covid-19 Detection
Rasmiranjan Mohakud and Rajashree Dash
41.1 Introduction
41.2 Modified CNN Model for Covid-19 Detection from Chest X-Ray Images
41.2.1 Performance Matrices
41.3 Experimental Result Discussion
41.4 Conclusion
References


42 Bi-directional Long Short-Term Memory Network for Fake News Detection from Social Media
Suprakash Samantaray and Abhinav Kumar
42.1 Introduction
42.2 Related Work


42.3 Methodology
42.4 Results
42.5 Conclusion
References


43 Effect of Class Imbalanceness in Credit Card Fraud
Adyasha Das and Sharmila Subudhi
43.1 Introduction
43.2 Preliminary Work
43.3 Proposed Credit Card Fraud Detection Model
43.3.1 Proposed SMOTE_ML_CFDS Model
43.4 Results and Discussions
43.4.1 Experimental Setup
43.4.2 Description of the Dataset
43.4.3 Performance Metrics
43.4.4 Result Analysis
43.5 Conclusions
References


44 Effect of Feature Selection on Software Fault Prediction
Vinod Kumar Kulamala, Priyanka Das Sharma, Preetipunya Rout, Vanita, Madhuri Rao, and Durga Prasad Mohapatra
44.1 Introduction
44.2 Literature Survey
44.3 Methodology
44.3.1 Datasets
44.3.2 Classification Techniques
44.3.3 Evaluation Criteria
44.4 Results and Discussions
44.5 Conclusion and Future Work
References
45 Deep Learning-Based Cell Outage Detection in Next Generation Networks
Madiha Jamil, Batool Hassan, Syed Shayaan Ahmed, Mukesh Kumar Maheshwari, and Bharat Jyoti Ranjan Sahu
45.1 Introduction
45.1.1 Motivations
45.1.2 Contributions
45.2 Our Proposal
45.3 Performance Analysis
45.4 Conclusion
References


46 Image Processing and ArcSoft Based Data Acquisition and Extraction System
Yanglu, Asif Khan, Amit Yadav, Qinchao, and Digita Shrestha
46.1 Introduction
46.2 System Analysis
46.2.1 System Function Structure
46.2.2 System Hardware Requirements
46.3 System Design
46.3.1 Process Design
46.3.2 Data Extraction Design
46.4 Implementation
46.4.1 Environment Deployment
46.4.2 Function Realization of Monitoring Terminal
46.4.3 Function Realization of Data Extraction
46.5 Conclusion
References
Part V Intelligent Computing (eHealth)


47 Machine Learning Model for Breast Cancer Tumor Risk Prediction
Lambodar Jena, Lara Ammoun, and Bichitrananda Patra
47.1 Introduction
47.2 Literature Review
47.3 Experimental Setup
47.4 Result Analysis
47.5 Conclusion
References
48 Comparative Analysis of State-Of-the-Art Classifier with CNN for Cancer Microarray Data Classification
Swati Sucharita, Barnali Sahu, and Tripti Swarnkar
48.1 Introduction
48.2 Related Work
48.3 Materials and Methodology
48.3.1 Dataset Description
48.3.2 Convolutional Neural Network
48.4 Proposed Work
48.5 Result and Analysis
48.6 Conclusion
References


49 Comparative Study of Machine Learning Algorithms for Breast Cancer Classification
Yashowardhan Shinde, Aryan Kenchappagol, and Sashikala Mishra
49.1 Introduction


49.2 Related Work
49.3 Overview of Machine Learning Models
49.4 Dataset Description
49.5 Proposed Model
49.5.1 Exploratory Data Analysis and Data Preprocessing
49.5.2 Model Evaluation and Results
49.6 Conclusion
References
50 miRNAs as Biomarkers for Breast Cancer Classification Using Machine Learning Techniques
Subhra Mohanty, Saswati Mahapatra, and Tripti Swarnkar
50.1 Introduction
50.2 Dataset Used
50.3 Proposed Model
50.3.1 Data Preparation
50.3.2 Feature Selection
50.3.3 Classifier Models
50.3.4 Evaluation Parameters
50.4 Results
50.4.1 Biological Relevance Analysis
50.5 Conclusion
References
51 A Computational Intelligence Approach for Cancer Detection Using Artificial Neural Network
Rasmita Dash, Rajashree Dash, and Rasmita Rautray
51.1 Introduction
51.2 Literature Survey
51.3 Proposed Model
51.4 Result Analysis
51.4.1 Neural Network Model for Proposed Work
51.4.2 Dataset Description and Result Analysis
51.5 Conclusion
References
52 Brain MRI Classification for Detection of Brain Tumors Using Hybrid Feature Extraction and SVM
G. Tesfaye Woldeyohannes and Sarada Prasanna Pati
52.1 Introduction
52.2 Proposed Model
52.3 Experimental Study
52.4 Result Analysis
52.5 Conclusion
References


53 Enhancing the Prediction of Breast Cancer Using Machine Learning and Deep Learning Techniques
M. Thangavel, Rahul Patnaik, Chandan Kumar Mishra, and Smruti Ranjan Sahoo
53.1 Introduction
53.2 Related Works
53.2.1 Machine Learning Techniques
53.2.2 Deep Learning Techniques
53.3 Proposed System
53.4 Experimental Results
53.4.1 Machine Learning
53.4.2 Deep Learning
53.5 Conclusion
References
54 Performance Analysis of Deep Learning Algorithms Toward Disease Detection: Tomato and Potato Plant as Use-Cases
Vijaya Eligar, Ujwala Patil, and Uma Mudenagudi
54.1 Introduction
54.1.1 Contributions
54.2 Performance Analysis of Deep Learning Algorithms
54.2.1 Dataset
54.2.2 Data Augmentation
54.2.3 Transfer Learning
54.3 Experimental Settings
54.4 Results and Discussion
54.5 Conclusions
References
55 Classification of Human Facial Portrait Using EEG Signal Processing and Deep Learning Algorithms
Jehangir Arshad, Saqib Salim, Amna Khokhar, Zanib Zulfiqar, Talha Younas, Ateeq Ur Rehman, Mohit Bajaj, and Subhashree Choudhury
55.1 Introduction and Related Studies
55.2 Methodology
55.3 Performance Analysis and Results
55.4 Conclusion and Future Works
References


56 A Semantic-Based Input Model for Patient Symptoms Elicitation for Breast Cancer Expert System
Chai Dakun, Naankang Garba, Salu George Thandekkattu, and Narasimha Rao Vajjhala
56.1 Introduction
56.2 Review of Literature


56.3 Materials and Method
56.4 Result and Discussion
56.4.1 Evaluation Metrics
56.4.2 Comparative Analysis of Symptoms Count
56.4.3 Comparative Analysis of Precision of Symptoms Generated on the Existing and Proposed Models
56.4.4 Comparative Analysis of Precision and Accuracy of Diagnosis Using Modified ST Algorithm
56.5 Conclusion
References


Appendix
Author Index

About the Editors

Prof. (Dr.) Debahuti Mishra received her B.E. degree in Computer Science and Engineering from Utkal University, Bhubaneswar, India, in 1994; her M.Tech. degree in Computer Science and Engineering from KIIT Deemed to be University, Bhubaneswar, India, in 2006; and her Ph.D. degree in Computer Science and Engineering from Siksha ‘O’ Anusandhan Deemed to be University, Bhubaneswar, India, in 2011. She is currently working as Professor and Head of the Department of Computer Science and Engineering at the same university.

Dr. Rajkumar Buyya is Redmond Barry Distinguished Professor and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also the founding CEO of Manjrasoft, a spinoff company that commercializes university innovations in cloud computing. He served as Future Fellow of the Australian Research Council from 2012 to 2016. He has authored over 625 publications and seven textbooks, including Mastering Cloud Computing, published by McGraw Hill, China Machine Press, and Morgan Kaufmann for the Indian, Chinese, and international markets, respectively.

Dr. Prasant Mohapatra is Vice-Chancellor for Research at the University of California, Davis. He is also Professor at the Department of Computer Science and served as the Dean and Vice-Provost of Graduate Studies at the University of California, Davis, from 2016 to 2018. He was also Associate Chancellor in 2014–2016 and Interim Vice-Provost and CIO of UC Davis in 2013–2014. Further, he was Department Chair of Computer Science from 2007 to 2013 and held the Tim Bucher Family Endowed Chair Professorship during that period. He has also been a member of the faculty at Iowa State University and Michigan State University.

Dr. Srikanta Patnaik is Professor at the Department of Computer Science and Engineering, Faculty of Engineering and Technology, SOA University, Bhubaneswar, India. Dr. Patnaik has published 100 research papers in international journals and conference proceedings. He is Editor-in-Chief of the International Journal of Information and Communication Technology and the International Journal of Computational Vision and Robotics, published by Inderscience Publishing House, England, and also Editor-in-Chief of a book series on Modeling and Optimization in Science and Technology, published by Springer, Germany.

Part I

Cloud Computing, Image Processing, and Software

Chapter 1

An Intelligent Online Attendance Tracking System Through Facial Recognition Technique Using Edge Computing

Manoranjan Parhi, Abhinandan Roul, and Bravish Ghosh

Abstract Recently, due to the COVID-19 pandemic, classes, seminars, and meetings have been scheduled on virtual platforms, and there is a need to keep track of the presence of attendees. Earlier, online attendance involved extracting the list of attendees, which was unreliable because many people mute themselves or leave the meeting altogether. Therefore, a tool is required that captures attendance through facial recognition and can effectively identify the attendees who remain online for the whole duration of the lecture. In this paper, a method is proposed to completely automate the attendance tracking system using the concept of edge computing. The tool runs alongside any video conferencing platform, captures the faces of attendees at random intervals, and, using a face recognition technique, identifies the people who remain present for the complete duration of the class. This novel method acts as a fail-proof way to monitor attendance and improve digital transparency.

1.1 Introduction

Since the COVID-19 pandemic, all social gatherings, academic classes, corporate meetings, and numerous seminars have been held online via video conferencing software. In the current situation, the new classrooms are Google Meet, Google Classroom, Cisco WebEx, Microsoft Teams, and Zoom. Students from all around the world network and learn together. The attendance system is one component of online teaching that is difficult to manage. A new strategy is proposed here to maintain academic integrity and keep students' engagement in classrooms at the same level as before. Earlier, smart attendance required employing a radio frequency identification (RFID)-based attendance system in the classroom with compatible student ID cards [1]. Although it was accurate, there was a larger risk of misuse because students could bring in the ID cards of friends who did not attend the session.


Biometric systems are also widely used because they are both accurate and convenient, with almost no false positives or negatives [2]. However, RFID and biometric technologies are not practical in online classrooms, since students attend from the comfort of home. Although numerous software tools track students' login and logout behavior to record attendance, there is still a need to verify that students are actually present in front of the laptop or PC. Obtaining a list of attendees is not a reliable method of validating a student's attendance. Furthermore, in online classrooms, leaving the webcam on all the time consumes a lot of data. Many students in India, where high-speed Internet adoption is still in its early stages, rely on cellphone networks for their lectures. Because mobile data plans are limited, sluggish, and expensive, students cannot keep web cameras switched on for the duration of the session; doing so results in excessive buffering and a large amount of data being consumed from the students' monthly plan. Furthermore, it is very hard for faculty members to devote the time and effort required to call out individuals at random intervals to confirm their attendance, as it causes a lot of disruption during lecture hours and lowers the quality of teaching.

Recently, several approaches have been proposed by researchers for online attendance tracking [3–5]. However, these strategies lack uniformity, and their accuracy is inconsistent. Moreover, more computational power, time, and space are required when the number of images stored in the database is large. Thus, there arises a necessity for a smart strategy to track online attendance using face recognition efficiently with the concept of edge computing.

A tool has been designed in this work to help instructors record students' attendance using facial recognition in a precise and consistent manner. This application can assist the administration in completely and easily monitoring the students' presence in a class. The entire procedure is based on continuous acquisition of students' photographs from web cameras using OpenCV, localization of the face in the obtained photograph, extraction of the face encodings, and conducting the verification procedure [6]. The video capture, image extraction, and facial recognition all take place at the client end, on the student's laptop or PC. For classes of any length, the anticipated data use will be less than 1 MB. The Internet is used only during login to verify the student's enrollment, for acquisition of the known image from the database, and for upload of attendance to the central database in real time. Furthermore, the Internet speed required for the tool to work properly is only 512 Kbps. This is a major advantage for students in areas with patchy and unreliable network connectivity.

The tool focuses on students' privacy by utilizing edge computing techniques [7]. Face recognition software is being scrutinized in an era when government policies are focusing on digital privacy. The fundamental goal of every application developer should be to prevent the abuse of facial data. It is enforced in this case by not uploading the students' face data to any cloud database. All of the data acquired from the students, such as facial photographs, are kept on local systems for analysis. To avoid abuse and conserve storage space, all gathered facial image data are immediately destroyed at the completion of the class.
This technology merely stores the most basic information on students in an online database, which includes their name, email address, registration ID, courses they are enrolled in, and a passport photograph.


1.1.1 Motivation and Contribution

Attendance capture in online classes has many flaws. Many students enter the classroom, switch off their camera, and mute themselves; many of them do not pay attention in class. Nonetheless, they are shown as present since they have logged in. Many students do not interact during class hours and then make up excuses such as network problems, which may or may not be true. Some participants even play a video loop during the class period, persuading professors of their presence and attention in class; these students are noted as present as well. The key objectives for developing this tool are:
1. To achieve a high-accuracy, fail-proof, and highly resilient method of attendance labeling with auto-verification of attendees' participation, necessitating little or no human intervention in marking the participation of a specified list of students.
2. To develop an intelligent system that is capable of detecting and recognizing faces in an online environment and tracking students' attention and mood, helping the host identify whether the content being delivered is being received by the audience.
3. To create a user-friendly way to register students and faculty members, along with students' face images, directly through a web component.
4. To minimize network utilization and application processing by classifying images at the client end and then submitting only the final attendance (present/absent) to the central archive.
5. To display to the student/user the number of active and failed facial recognition attempts.

The key contributions of the paper are described as follows:
1. A state-of-the-art tool is developed which helps instructors mark attendance during online classes over any platform, utilizing a biometric technique, facial recognition, to verify the students' presence in the session.
2. To protect privacy and avoid transmission of facial data over the Internet, the technique of edge computing is employed so that the recorded data stay on the client side and the facial recognition process is automated.
3. A tool with good UI/UX is created to make it easy and intuitive for students and faculty to use, along with an email authentication technique to prevent the creation of spam accounts.
4. The proposed technique, which uses dlib for face recognition, is compared with other models to verify the accuracy metrics.

1.1.2 Organization of the Paper

The rest of this paper is organized as follows. Section 1.2 discusses various online attendance systems from the recent period. Section 1.3 describes the significance of edge computing for the proposed online attendance tracking system. Section 1.4 outlines the proposed methodology step by step with a graphical representation of the model. The implementation of the proposed attendance capturing tool is explained with the necessary graphical user interface in Sect. 1.5. Section 1.6 emphasizes the results and discussion. Section 1.7 describes the conclusion and future scope of this work.

1.2 Related Works

In the literature, several approaches and techniques have been proposed by researchers to address the problem of attendance tracking with the use of face recognition. Some existing works are highlighted below in year-wise chronological order.

1.2.1 In the Year 2017

Arsenovic et al. [8] proposed a technique where a convolutional neural network (CNN) is used for face detection (CNN cascade) and face recognition. Due to the rapid advancement in deep learning, the accuracy of face recognition has been greatly improved by deep CNNs. CNN is effective for large datasets, as it provides more accurate mean results and smaller error margins and spots the outliers that skew the data in smaller datasets. A linear support vector machine (SVM) is used for classification, as the dataset is small. This was tested in an IT company where five people volunteered for the research. Photographs were taken from five different positions, and images augmented using OpenCV were used to train the model on partially noised data. The overall accuracy was 95.02%. The face recognition model was integrated as a web API.

1.2.2 In the Year 2018

Yusof et al. [9] proposed a technique where a Haar cascade classifier is used for the face detection part, combined with LBPH (local binary pattern histograms) for the face recognition part. The system requires a Raspberry Pi 3 Model B+ along with a 5MP camera module.
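For reference only, a minimal sketch of such a Haar-cascade-plus-LBPH pipeline using OpenCV (this is not the cited authors' code; the image file name and training data are illustrative placeholders, and the LBPH recognizer requires the opencv-contrib-python package):

import cv2
import numpy as np

# Haar cascade face detector bundled with OpenCV
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# LBPH face recognizer from the contrib module
recognizer = cv2.face.LBPHFaceRecognizer_create()

# Training data: grayscale face crops plus integer student IDs
# (random placeholder image; a real system trains on enrolled face crops)
train_faces = [np.random.randint(0, 255, (100, 100), dtype=np.uint8)]
train_labels = np.array([1])
recognizer.train(train_faces, train_labels)

# Detection and recognition on a new frame ("frame.jpg" is hypothetical)
gray = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2GRAY)
for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5):
    label, dist = recognizer.predict(gray[y:y + h, x:x + w])
    print(label, dist)  # a smaller distance means a more confident match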

1.2.3 In the Year 2019

Suresh et al. [3] proposed a technique where face recognition is done using the eigenfaces recognizer, trained using OpenCV. The system is limited to recognizing one person at a time. A MicroSD card is used to compensate for the volume of face data stored in the database.

1.2.4 In the Year 2020

Kakarla et al. [10] proposed a technique where face recognition is implemented using a CNN model developed with data collection and data augmentation. A twenty-layered CNN architecture is used, with data augmentation performed by manipulating the existing data to generate new samples. Under testing circumstances, the accuracy of the model is 99.86%, and the loss of the CNN model is recorded as 0.0057%. Patil et al. [5] proposed face detection using the Viola-Jones Haar cascade algorithm, with face recognition done using linear discriminant analysis (LDA) as well as KNN and SVM. Sharanya et al. [11] proposed face detection using the Haar cascade included in the OpenCV library, with a CNN built as a Keras sequential model used for image processing, face recognition, and liveness detection; local binary pattern (LBP) and Haar cascade (Haar classifier) are the two OpenCV classifiers employed.

Based on the above research works, it has been found that the strategies lack uniformity and the reported accuracy is inconsistent. Similarly, more computational power and time are required when the number of images stored in the database is large. Thus, there arises the necessity of an intelligent strategy for tracking online attendance using face recognition in an efficient way with the concept of edge computing.

1.3 Significance of Edge Computing for Online Attendance Tracking System

Edge computing refers to the process in which all possible computations take place at the client side, thus limiting data transfer, reducing the computational load on the server, and reducing bandwidth consumption [12]. This process utilizes the user's own device to compute the collected data. It values privacy by not transferring critical private information over the network, which removes the risk of data fraud and breach of privacy from personal images. Furthermore, speed and reliability are improved: edge computing prevents large chunks of data from leaving the system and focuses on processing them on the device itself. This increases the speed of the process, as network consumption and latency no longer come into play. The method is extremely beneficial to applications that rely on fast response times, and the services remain online even if the network speed drops to a critical level.

Fig. 1.1 Proposed attendance tracking system

The proposed tool, as shown in Fig. 1.1, uses edge computing to detect and recognize faces on the user's device. At initial login, a reference face image is automatically downloaded from the database. Then, all images captured during the class are processed on the client's device: each capture goes through face detection, retrieval of face encodings using a deep learning algorithm, and final verification of the student's face. The benefits obtained by using edge computing, illustrated by the sketch after this list, are as follows:
• Low network utilization: Images are not transmitted to the cloud.
• Data privacy: No photographs are transmitted over the Internet.
• Faster processing: No time is spent on sending the image to a server and awaiting the inference.
• Operational efficiency: Servers with low computational capacity can handle the tasks.
• Reduced power consumption: Since the servers do not need to process expensive operations, they consume less power.
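As a rough illustration of this division of labor, only two small network calls bracket an otherwise fully local process: one download of the reference photo at login and one upload of the final attendance flag. This is a minimal sketch, not the tool's actual code; the endpoint URL, routes, and payload fields are hypothetical.

import requests
import face_recognition

API = "https://example.edu/attendance"  # hypothetical server endpoint

def fetch_reference_encoding(reg_id, cache_path="known.jpg"):
    # One-time download at login; the photo is cached on the student's
    # device, and all further processing happens locally
    with open(cache_path, "wb") as f:
        f.write(requests.get(f"{API}/photo/{reg_id}").content)
    image = face_recognition.load_image_file(cache_path)
    return face_recognition.face_encodings(image)[0]

def upload_attendance(reg_id, course_id, present):
    # The only student data that ever leaves the device: a 0/1 flag
    requests.post(f"{API}/mark", json={"reg_id": reg_id,
                                       "course_id": course_id,
                                       "present": int(present)})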


1.4 Methodology

This section describes the main workflow of the proposed attendance tracking system. The operational flow of the complete process, as depicted in Fig. 1.2, is discussed in a step-by-step approach in Algorithm 1. Based on this mechanism, an intelligent online attendance capturing tool has been designed which includes the following seven modules.

1.4.1 Student Registration

The student registration module enables the students to register themselves in this system. Each student has to enter his/her name, email, and registration ID, set a password, and upload a passport photograph.

Fig. 1.2 Working procedure of proposed model


It is automatically checked whether the entered registration ID is unique; if it is, the student is registered successfully in the database. Email verification is performed simultaneously to prevent the use of fake email IDs.

1.4.2 Faculty Registration

The faculty registration module enables the faculty members who teach the various courses at the university to register themselves. Registering in the portal gives them access to an analytics platform where they can view the overall class attendance along with each individual student's attendance.

1.4.3 Image Capture

This module captures images for further processing from the live video feed at specified intervals (say, 10 s). It checks that camera permission has been granted by the user and keeps the capture device running until the end of the class.

1.4.4 Face Detection

After images are captured by the webcam at regular intervals, they are passed to the face recognition library, powered by dlib, to detect face locations. It uses a histogram of oriented gradients (HOG) to detect the face in the image.

1.4.5 Extraction of Face Encodings

After a face is successfully detected, it is passed through a residual network (ResNet) deep learning model to extract the face encodings as a 128-dimensional vector. The image passes through several convolutional layers that convolve over the image to extract features; this deep learning algorithm tries to find the facial pattern all over the given image. After a few convolutions, a pooling layer shrinks the information while retaining the most important parts. In ResNet, connections can skip layers having the same dimensions [13]. At the end, after the average pooling, the facial encodings are extracted.
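A minimal sketch of this step is shown below, assuming the popular face_recognition Python package (a wrapper around dlib) that matches the library described above; the input file name is illustrative.

import face_recognition

# Load one frame captured by the webcam (returned as an RGB array)
frame = face_recognition.load_image_file("captured_frame.jpg")

# HOG-based detection returns (top, right, bottom, left) boxes
boxes = face_recognition.face_locations(frame, model="hog")

# Map each detected face to a 128-dimensional encoding produced
# by the ResNet model bundled with dlib
encodings = face_recognition.face_encodings(frame, known_face_locations=boxes)

if encodings:
    print(len(boxes), "face(s) found; encoding length:", len(encodings[0]))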


1.4.6 Face Recognition

After the face encodings of the known image and of the unknown images captured from the camera are extracted, the two are compared using the Euclidean distance with the help of dlib's machine learning model. The model, derived from a ResNet-34 architecture, is trained on the Labeled Faces in the Wild benchmark with an accuracy of 99.38% [14, 15]. If the encodings are similar and the match is above a certain threshold, the face is recognized.
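The comparison itself reduces to a distance test, sketched below; the function name is our own, the encodings are assumed to be already extracted, and the 0.6 tolerance is the value used in our evaluation (Sect. 1.6).

import numpy as np

def is_same_person(known_encoding: np.ndarray,
                   unknown_encoding: np.ndarray,
                   tolerance: float = 0.6) -> bool:
    # Euclidean distance between the two 128-d face encodings;
    # a smaller distance means a closer facial match
    distance = np.linalg.norm(known_encoding - unknown_encoding)
    return distance <= tolerance

The face_recognition library exposes the same test as compare_faces(known_encodings, unknown_encoding, tolerance=0.6).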

1.4.7 Attendance Updation

Finally, after face recognition, the student's attendance is updated in the database, date wise, as either present (1) or absent (0).

Algorithm 1 Online Attendance Tracking System
Input: Class details such as Regd. No. of the student, Course ID, etc.
Output: Attendance for the current class (Present (1) or Absent (0))
function attendanceTracker(classdetails)
    Initialize FaceFound = 0
    Compute TotalPictures = (durationInMins * 60) / 10
    Compute Threshold = 0.75 * TotalPictures
    while (Is Class Not Over) do
        Take a photo through the WebCam
        Find face locations
        if (Face locations are found) then
            Extract the facial encodings
            Compare with encodings of the known image of the student
            if (Student is recognized) then
                if (the student was present in the previous class) then
                    Draw a bounding box in Green
                else
                    Draw a bounding box in Red
                end if
                Update the FaceFound counter by 1
            else
                Print "Another person found / You aren't recognized"
            end if
        else
            Print "Face Not Detected"
        end if
        Wait 10 seconds
    end while
    if (FaceFound > Threshold) then
        Attendance = Present
    else
        Attendance = Absent
    end if
    Update the attendance in the database
end function

1.5 Practical Implementation

The proposed approach has been implemented in Python 3.7 [16] using multiple libraries: Streamlit 0.79 [17] for the web GUI, OpenCV-Python 4.5.1.48 [18] for continuous video capture, and dlib 19.21.1 [19] for face detection and face recognition. The implementation has been carried out on the following hardware: an Intel Core i3 2 GHz processor (6th Gen) with 4 GB RAM running Windows 10 Professional. The tool has been built using the PyCharm IDE, with MySQL 8.0 [20] as the backend database for managing student registration and attendance. Figure 1.3 shows a screenshot of the proposed attendance capturing tool, which is the primary focus of this work. Students are required to log in at their class time, providing the course ID and registration ID. The tool then automatically captures images for analysis until the end of the class. If the student was absent in the previous class, the face is surrounded by a red boundary; otherwise, a green one.

1.6 Results and Discussion

To calculate detection accuracy, we used the face recognition dataset [21] provided by Vasuki Patel on Kaggle, which contains 2562 images of celebrity faces spanning 31 classes. We compared the face encodings of the known labeled images with the face encodings of the unknown images using a tolerance of 0.6 and achieved an accuracy of 98.05% for our proposed method. We considered the average accuracy percentage of each model; the performance of the eigenface, Fisherface, and LBPH models alongside our tool is plotted in Fig. 1.4.


Fig. 1.3 Attendance capture tool

Fig. 1.4 Comparing accuracy of different face recognition methods (our proposed tool, eigenface, Fisherface, and LBPH) on the celebrity face dataset (y-axis: accuracy, 94-100%)


1.7 Conclusion and Future Work

The proposed attendance tracking system is both student and faculty centric. We hope this tool will help maintain academic honesty regarding attendance: it allows institutes to enforce a strict attendance policy even when classes are held online and makes sure that all students who actually attended the class are marked present. As the face recognition tool runs locally, it does not use any network bandwidth during the class. It captures photographs continuously throughout the class and verifies the student's face, automating the attendance process and giving instructors more teaching time. We plan to extend the functionalities of the proposed tool with the following features in the future:
• Automatic SMS notification regarding attendance
• Improvements to the UI/UX design
• Support for mobile app users.

References 1. Lim, S., Sim, S.C., Mansor, M.M.: RFID based attendance system. In: IEEE Symposium on Industrial Electronics and Applications, pp. 778–782 (2009) 2. Ujan, A., Ismaili, I.A.: Biometric attendance system. In: IEEE/ICME International Conference on Complex Medical Engineering, pp. 499–501 (2011) 3. Suresh, V., Dumpa, S.C., Vankayala, C.D., Aduri, H., Rapa, J.: Facial recognition attendance system using python and OpenCv. Quest J. J. Softw. Eng. Simul. 5(2), 18–29 (2019) 4. Kakarla, S., Gangula, P., Rahul, M.S., Singh, C.S.C, Sarma, T.H.: Smart attendance management system based on face recognition using CNN. In: IEEE HYDCON, p. 15 (2020) 5. Patil, V., Narayan, A., Ausekar, V., Dinesh, A.: Automatic students attendance marking system using image processing and machine learning. In: IEEE International Conference on Smart Electronics and Communication, pp. 542–546 (2020) 6. Yang, H., Han, X.: Face recognition attendance system based on real-time video processing. IEEE Access 8, 159143–159150 (2020) 7. Khan, M.Z., Harous, S., Hassan, S.U., Khan, M.U.G., Iqbal, R., Mumtaz, S.: Deep unified model for face recognition based on convolution neural network and edge computing. IEEE Access 7, 72622–72633 (2019) 8. Arsenovic, M., Sladojevic, S., Anderla, A., Stefanovic, D.: Facetime-deep learning based face recognition attendance system. In: IEEE 15th International Symposium on Intelligent Systems and Informatics, pp. 53–57 (2017) 9. Yusof, Y.W.M., Nasir, M.A.M., Othman, K.A., Suliman, S.I., Shahbudin, S. Mohamad, R.: Real-time internet based attendance using face recognition. IJET Int. J. Eng. Technol. 7(3.15):174–178 (2018) 10. Kakarla, S., Gangula, P., Rahul, M.S., Singh, C.S.C, Sarma, T.H.: Smart attendance management system based on face recognition using CNN. In: IEEE-HYDCON, pp. 1–5 (2020) 11. Sharanya, T., Kasturi, U., Sucharith, P., Mahesh, T., Dhivya, V.: Online attendance using facial recognition. IJERT Int. J. Eng. Res. Technol. 9(6), 202–207 (2020) 12. Olaniyan, R., Fadahunsi, O., Maheswaran, M., Zhani, M.F.: Opportunistic edge computing: concepts, opportunities and research challenges. Future Gener. Comput. Syst. 89, 633–645 (2018)


13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770–778 (2016) 14. Sharma, S., Shanmugasundaram, K., Ramasamy, S.K.: FAREC-CNN based efficient face recognition technique using Dlib. In: International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), pp. 192–195 (2016) 15. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report, pp. 07–49. University of Massachusetts, Amherst (2007) 16. Python: https://www.python.org/downloads/release/python-370/. Last accessed on 25 Jan 2021 17. Streamlit: The fastest way to build and share data apps: https://streamlit.io/. Last accessed on 25 Jan 2021 18. OpenCV: Open-source computer vision library: https://opencv.org/. Last accessed on 25 Jan 2021 19. King, D.E.: Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009) 20. MySQL: https://www.mysql.com/. Last accessed on 25 Jan 2021 21. Face Recognition Dataset: https://www.kaggle.com/vasukipatel/face-recognition-dataset. Last accessed on 22 Feb 2021

Chapter 2

Cloud Analytics: An Outline of Tools and Practices Gunseerat Kaur , Tejashwa Kumar Tiwari, and Apoorva Tyagi

Abstract This paper presents an overview of how cloud-based analytical tools can serve varied purposes of data aggregation and analytics. Analytics-as-a-service is an emerging facility, extensively provided by multiple vendors, and the aim here is to elucidate which tools serve best for which sorts of data sources. As cloud computing has garnered much-needed growth in terms of data generation, there is a huge opportunity to explore the generated data. IT firms invest in creating, managing, and leasing storage to handle this volume, and they face the further challenge of analyzing the data relevantly and effectively. This has called for intelligent analysis of data and its storage in a sustainable and regulated environment. While various methods are employed to mine this heap of data, the industry is looking for newer technologies that can increase the rate at which the data can be analyzed.

2.1 Introduction

From the introduction of Elastic Compute Cloud (EC2) by Amazon through its subsidiary in August 2006 to the public rollout of Google Compute Engine in December 2013, the technology of cloud computing has come a long way in terms of computational capability and data security [1, 2]. Advancements in cloud computing such as serverless computing, hybrid cloud options, and excellent storage capacities have become a blessing for managing a company's business. According to Gartner's study, the global public cloud computing market was positioned to surpass $330 billion in 2020 alone, and another study states that about a third of a company's annual budget goes to cloud services, averaging around $2.2 million in 2018 [3]. This underlines the importance of cloud computing in the business environment. The reason for this abrupt rise in the market for shifting traditional server-based businesses to the cloud is the increased amount of data that businesses generate these days [4]. The data generation capability of the IT sector has increased exponentially owing to increased access to the Internet [5].

G. Kaur (B) · T. K. Tiwari · A. Tyagi Lovely Professional University, Punjab, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_2


This has led to an ever-increasing trend of data consumption and generation in the industry. According to a recent study, Internet users generate about 2.5 quintillion bytes of data each day, and by 2020 we would have 40 trillion gigabytes of data [3]. While this generation of data is a massive plus for companies, it puts much burden on the existing infrastructure; there is therefore a need to analyze these data loads and store only the essential parts [6]. This is where cloud analytics platforms step in, and they are mainly relevant for IT firms that spend heavily on storage options or scale them each year. Cloud analytics is a service model in which one or more analytics elements are executed in the cloud. The elements can be on-premise, part of a hybrid model, or entirely in the cloud [7]. The cloud model is helpful because the organization's analytics capabilities can grow as the organization itself grows; since management is shifted entirely to the cloud and is digital, physical management is significantly reduced. According to NIST, cloud analytics can be implemented on three types of clouds: private, public, and hybrid [8]. Public clouds offer companies applications as a service; services including storage, the creation and management of virtual machines, and data processing are examples of applications as a service. They are available to the public, and while the IT systems used in such a multi-tenant architecture are shared, the data is not [9]. Private clouds are allocated to a single organization. They are secured and can be accessed only by a group of company employees; firewalls protect the data in such clouds, and data privacy is a high priority [10]. Even though data security is high, implementing this type of cloud is costly. Hybrid clouds are a cross between public and private clouds. The main work of cloud analytics platforms is to divide the data stored or fed into their servers and to perform computations so that only the most essential part of the data is extracted and stored; the extracted data can help us predict various trends in forecasting [11, 12]. Blockchain is an added advantage in the cloud analytics approach, as it forms a decentralized digital ledger and supports the idea of data integrity [13]. As any predictive analysis requires data to be truthful, blockchain would eventually play a significant role in determining results from data. Cloud analytics is also known as analytics-as-a-service in some cases, a common practice of getting data analyzed through third-party resources [7, 14]. The difficulty arises over the authenticity of data sources and whether the data has been tampered with [15]. This paper aims to shed light on various tools that ease the performance of cloud-based analytics; these decision support systems have been inculcated with features from machine learning and knowledge-based, attribute-based learning that enhance the results and provide better insights. Section 2.2 reviews the relevant literature and discusses related works. Section 2.3 elaborates on using big data as a part of cloud analytics tools to enhance the efficiency of handling larger volumes of data. Section 2.4 provides insights on the tools, their facilities, and the features offered. Section 2.5 describes the advantages and emerging trends, and Sect. 2.6 concludes the paper.


2.2 Related Work

An initial example of using the cloud's potential to analyze data was the retrospective analysis of climate data provisioned through NASA's MERRA data collection using MapReduce, which showed a proclivity toward the cloud because of its ease of deployment, data transformation, agility, and adaptability [16]. Jayaraman et al. [17] have implemented a hierarchical data analytical model that considers multi-cloud environments and their capacity to integrate, demonstrating its feasibility through case studies on the Google Cloud and Microsoft Azure platforms; they observe that a great advantage of this model is its method of sharing information, as it eliminates the risks of sharing raw data and of redundancy. Another process, highlighted by Seol et al. [13], consists of downloading blockchain transactions from Etherscan, processing and analyzing them through Hadoop MapReduce, and controlling the analysis through bash shell scripting. Bhandari [12] has compared two widely used business analytics platforms, IBM's Watson Analytics and SAP's Lumira Cloud. That study shows that Watson Analytics has an intelligent data discovery strategy, which helps users apply the tool for predictive and cognitive analysis: one can explore and reshape the data effectively to predict, describe, or diagnose any data. Lumira Cloud, on the other hand, provides a more visually appealing interface with a better ability to prepare and compose the data post-analysis. As a template for comparison, they used data from the education sector, mainly graduate and undergraduate students. Anisetti et al. [18] have used data exploration techniques to perform retrospective analysis simulating public health policy drafts, creating an informative method to further enhance the design strategies of smart cities. Another case study, from Jaimahaprabhu et al. [19], shows the implementation of cloud-based analytics using IBM's Watson Analytics for predicting which crop would yield better according to data received from an IoT mechanism; they implemented support vector machines to analyze the incoming data.

2.3 Exploring the Ideology in Cloud-Based Analytics

Big data is a term used for extensive and complicated data sets that are difficult to store on traditional storage devices or too complex to process using traditional data processing mechanisms. The nature of such data is characterized by five different Vs: velocity, volume, veracity, variety, and value [14, 20]. These components are interrelated and influence how data is stored and processed. Volume points to the amount of data, which has been growing at a swift pace in today's world; the volume of data helps us select the size of the required data storage facilities. Velocity is the speed at which data is generated each day, which has received a significant spike with the advent of social networking sites such as Facebook, Twitter, and Instagram [21]; it is crucial to analyze the data as they arrive, as this gives us a time advantage. Variety refers to the different areas


of the industry from which the data comes. The different sources of data can be images, videos, sensor data, web history, etc. [22]; most of this data is unstructured, which creates problems in analysis. Veracity refers to uncertainty in the data due to inconsistency: values may be missing or inconsistent, which causes problems in computation. Value refers to how valuable the output generated from analyzing large chunks of data is; if the output does not yield profit for one's organization, the analysis is of no use. Machine learning helps improve the quality of the output by enabling forecasting of the data. Machine learning algorithms can be divided into three types (a small illustration appears at the end of this section):
• Supervised Learning: The data fed into the system contains both the questions and their solutions. A question asked by the user is compared with the earlier material, and the nearest solution is found. Examples: k-nearest neighbors, decision trees, Bayesian networks [21].
• Unsupervised Learning: Unlike supervised learning, only the inputs are given, without the corresponding outputs. The answer is obtained by grouping the given data. Examples: neural networks, k-means clustering [23].
• Reinforcement Learning: Neither the input data nor the output of the question is given. The machine tries to answer using random responses, which are then evaluated; once a correct answer is found, the machine remembers it [24]. It is correction-based training. Examples: genetic algorithms, simulation-based learning.
According to Gartner, there are six elements of analytics: data sources, data models, processing applications, raw computing power, analytic models, and sharing or storing of data [3]. Figure 2.1 lists the various tiers at each level of cloud-based services that can inculcate analytics services. Data sources refer to the different locations from which data is generated, such as images, videos, sensor data, and web history; most of this data is unstructured, which creates problems in analysis. Data models help us understand how different data points are linked to each other [9, 14]; they can be created once the unstructured data is converted to structured or semi-structured form. Processing applications refer to software that helps us process the data, which is processed by storing it in a data warehouse [25]. Computing power refers to how the data is stored, structured, cleaned, analyzed, and output. Analytic models are the mathematical functions that help us predict the data on a large scale, and data warehouses help us store the data and perform operations on it. Cloud analytics-based applications, or analytics-as-a-service, are primarily concerned with provisioning data processing applications via hosted cloud-based infrastructures, which includes collecting data from multiple resources. Profiling of data categories is offered in an on-demand scenario, and data warehousing has to be provided in secure data silos that can function across multiple cloud vendors or data owners.
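As a concrete illustration of the supervised/unsupervised distinction listed above (reinforcement learning is omitted for brevity), the following minimal Python sketch contrasts a k-nearest-neighbors classifier with k-means clustering on toy data; the scikit-learn library and the generated data set are our own illustrative choices, not part of the surveyed tools.

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy data: 100 two-dimensional points around three centers
X, y = make_blobs(n_samples=100, centers=3, random_state=0)

# Supervised: both questions (X) and solutions (y) are given
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("supervised prediction:", knn.predict(X[:1]))

# Unsupervised: labels are withheld; k-means groups the points itself
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("unsupervised cluster labels:", km.labels_[:5])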


Fig. 2.1 An architecture for cloud-based analytics [26, 27]

Supportive features of such services include machine-to-machine communication, interoperability via standard formats, and comprehensive data visualization. As unprocessed raw data is often bulky, sharing conclusive data and results instead is much more compact in size.

2.4 Cloud Analytics Platforms

It is not always feasible for everyone to use machine learning in extensive data analysis, and there are often trailing issues. However, multiple organizations understand this struggle and have deployed various vendor-based third-party solutions for customers, who can opt for the on-demand services.

2.4.1 Qlik Sense Enterprise

Qlik Sense is a cloud analytics platform that allows one to find insights easily using its various features. Hackathorn and Margolis [28] have shown an implementation of immersive analytics using Qlik Sense, working to build a system that can understand decision-making factors for human-based interactions in a virtual gaming environment; a descriptive analysis is aggregated, with unsupervised learning clusters analyzed at five levels. Qlik Sense and the Qlik associative engine can be used easily as an offsite or onsite offering. It supports a wide range of applications on


its multi-cloud platforms. These include self-service data visualizations, guided visualizations, embedded analysis to enhance websites and custom applications, and cognitive insights. The Qlik Analytics platform provides mobile-ready Qlik Sense visualizations and direct API access to the Qlik engine, with open and documented APIs, including mash-up and extension APIs [29]. Product limitations are that it does not provide built-in data security or drag-and-drop functionality, customization of visual objects is limited, and it does not support data governance capabilities [30].

2.4.2 Tableau

Tableau Online is the software-as-a-service (SaaS) form of Tableau Server, with maintenance, upgrades, and security fully managed by Tableau. On-premise IT staff are not required to maintain the server, support and updates are included in the price, central management of data sources and workbooks is provided, and analytics can be explored straightforwardly from any browser or mobile device. Senousy et al. [29] have explored a case study of its functions using insurance-based analytics to predict annual sales; it works efficiently with large data sets, including complex analyses that require high computation power. Product limitations include a steep learning curve for beginners due to its extensive and complex functionality, and a lack of support for embedded BI, data preparation, metadata management, and predictive analysis to uncover hidden data relationships [31].

2.4.3 IBM Watson Analytics

IBM Watson Analytics' features include uploading spreadsheets to obtain visualizations, building dashboards, support for relational databases, access to 19 data connectors, and full access to IBM Analytics Exchange data and other products. Chen et al. [32] have explored the potential of IBM Watson's cognitive technology for life sciences research on genomics and pharmacological data to determine drug effects and repurposing; it has shown exemplary results in predicting drug targets efficiently. Product limitations include reduced support for data governance functionality, no support for real-time streaming analytics, and high costs for add-ons [12].

2.4.4 Microsoft Power BI

Microsoft Power BI is a suite of business intelligence (BI), reporting, and data visualization products and services for individuals and teams.


Power BI stands out with streamlined publication and distribution capabilities and integration with other Microsoft products and services [33]. Features for large deployments include dedicated capacity control for allocation and scaling, distribution of content without a per-user license, and publishing reports using the on-premise Power BI server. Microsoft Power BI also has some product limitations: there is no support for direct integration with certain marketing services, it offers fewer visualizations than other leading analytics platforms, SQL queries cannot be performed within Power BI, and it provides reduced personalization of things like user views, reports, and notifications [34].

2.4.5 TIBCO Spotfire X

TIBCO Spotfire allows users to combine data in a single analysis and get a holistic view with interactive visualization. Spotfire software makes businesses smart, delivers AI-driven analytics, and makes it easier to plot interactive data on maps. The platform helps businesses transform their data into powerful insights with ease and in less time, speeding up data analysis across an organization for faster, more confident, and more accurate decision making. Sindhani et al. [35] have shown a promising use of Spotfire for analyzing sentiments obtained from Twitter to gauge customers' responses. In contrast, some limitations of the product are that it is more complicated for the user to look into minute data, as it is harder to pass through the layers; compared with Excel, it is hard to work with its custom fields; and SQL queries in Spotfire cannot be customized to the user's choice.

2.4.6 SAS Business Analytics

SAS Business Analytics encloses products such as SAS Visual Statistics, SAS Enterprise Guide, SAS Enterprise BI Server, SAS Visual Analytics, and SAS Office Analytics. The SAS offering has shown a tremendous capacity to be used in diverse fields of study; its visual analytics and thorough interface provide user-friendly options [36]. Although SAS holds popularity as one of the initial players in the field of analytics, product limitations include a lack of support for API features, high license costs, the absence of automated report-scheduling functionality, a steep learning curve for beginners, and a lack of predictive analysis or trend indicators [37]. The feature sets of cloud analytics services are often shaped by the market demands they serve. Table 2.1 lists the features available in these market-based tools.


Table 2.1 Comprehensive review of tools according to features

Features                                  | Qlik Sense  | Tableau   | IBM Watson Analytics | Microsoft Power BI | TIBCO Spotfire X           | SAS Business Analytics
Data integration                          | Yes         | Yes       | Yes                  | Yes                | Yes                        | Yes
Data linking                              | Yes         | Yes       | Yes                  | Yes                | Yes                        | No
Data literacy                             | Yes         | Yes       | Yes                  | Yes                | Yes                        | Yes
Data reporting scheduling                 | Yes         | No        | Yes                  | No                 | No                         | Yes
Data security                             | Low         | Low       | Medium               | High               | Medium                     | High
Multi-cloud support                       | Yes         | Yes       | Yes                  | Yes                | Yes                        | Yes
API extension and customization           | Yes         | Yes       | Yes                  | Yes                | No                         | Yes
Smart analytics                           | Yes         | Yes       | Yes                  | Yes                | No                         | No
Compatibility                             | Yes         | Yes       | Yes                  | Yes                | Yes                        | Yes
Bug fixes                                 | Low         | Low       | Low                  | High               | Low                        | Low
Memory-space consumption                  | High        | Very high | Medium               | Very high          | High                       | Medium
Interface                                 | Comfortable | Bulky     | Bulky and slow       | Bulky and crowded  | Crowded and less organized | Comfortable
Capacity for more extensive data volumes  | Low         | Medium    | Medium               | Low                | Low                        | High
Cloud ontology                            | No          | Yes       | Yes                  | Yes                | No                         | Yes

2.5 Discussions

2.5.1 Benefits of Cloud-Based Analytics

Organizations aim to make informed decisions, streamlining the overall analytics process and creating solutions from the data generated at their end. As data keeps accumulating via various modes of communication, a cloud-based environment allows one to perform real-time data consolidation and to track and manage resources and channels [16, 17]. Its benefits include optimized operations and real-time processing, alongside minimized costs, which vary with the cloud deployment model used.


• In contrast to on-premise analytics, cloud-based tools have low latency, scale according to demand, establish a direct connection, and provide agility [38].
• A principal benefit of cloud-based tools is that utilization can be measured and charged accordingly; services can therefore be used or discarded dynamically, according to the data collected.
• Cloud-based services ensure higher availability, flexible scheduling, elastic infrastructure, and adaptable self-learning models that can understand and shape better decisions [39].
• Reliability and security are enhanced, as vendor-side security is provided and data breaches can be detected.

2.5.2 Overview of Emerging Trends

Cloud-based applications, whether individual or enterprise offerings in the analytics-as-a-service market, have seen a tremendous rise. However, the more significant amounts of data to be handled have forced organizations to cope with redundant data alongside dirty data [40]. This has, in turn, created a sharp rise in mining data from blockchains. Akter et al. [41] and Pham et al. [42] have noted how important it is to maintain data integrity while performing analysis. Blockchain has been explored for multiple purposes, finding extensive application in cryptocurrencies, NFTs, and social media. By using data from blockchain ledgers, which uphold data integrity, it becomes easier to perform predictions and statistics without irrelevant data; since each ledger in a blockchain rules out any chance of tampering, this serves as a significant advantage [43]. Some advantages of this approach include:
• Data is decentralized and can be mined easily at various IoT edge nodes.
• Data is validated, structured, and immutable.
• Data transfers are protected, and transactions are protected from issues like double transactions [44].
• Data tracking is easy.

2.6 Conclusion

Cloud analytics is a model in which one or more analytics elements are executed in the cloud; the elements can be on-premise, part of a hybrid model, or entirely in the cloud. The cloud model is helpful because the organization's analytics capabilities can grow as the organization itself grows. Traditional analytics have been around in the industry for a long time and have depended heavily on traditional means, such as workforce management, to extract valuable information from the given data. This involves using statistics to plot graphs and find relations among the data sets; most of these relations are simple and do not entirely exploit the data


sets. Cloud computing analytics, however, does not stop there: the cloud exploits the complex relationships between the variables of the data sets and helps increase profit. These benefits are why various organizations have been adopting cloud analytics to improve the efficiency and profitability of their businesses. The benefits of cloud analytics are enterprise data consolidation; ease of access, sharing, and consolidation; reduced operating costs; and scalability.

References 1. Rajan, A.P., Shanmugapriyaa, S.: Evolution of cloud storage as cloud computing infrastructure service. IOSR J. Comput. Eng. 1, 38–45 (2013) 2. Kulkarni, G., Sutar, R.: Cloud computing-infrastructure as service-amazon EC2 3. Gartner, I.: Gartner forecasts worldwide public cloud end-user spending to grow 23% in 2021. https://www.gartner.com/en/newsroom/press-releases/2021-04-21-gartner-forecastsworldwide-public-cloud-end-user-spending-to-grow-23-percent-in-2021. Last accessed 07 June 2021 4. Kim, Y., Lin, J.: Serverless data analytics with flint. In: IEEE International Conference on Cloud Computing, CLOUD, pp. 451–455. IEEE Computer Society (2018). https://doi.org/10. 1109/CLOUD.2018.00063 5. Revathi, R., Devi Satya Sri, V., Philip, B.J., Narendra, G., Kumar, S., Ahammad, S.H.: Cloud analytics application of big data analytics for creation of precise measures. Turk. J. Phys. Rehabil. 32 6. Delen, D., Demirkan, H.: Data, Information and Analytics as Services. Elsevier B.V. (2013). https://doi.org/10.1016/j.dss.2012.05.044 7. Khan, S., Shakil, K.A., Alam, M.: Cloud-based big data analytics—a survey of current research and future directions. In: Advances in Intelligent Systems and Computing, pp. 595–604. Springer (2018). https://doi.org/10.1007/978-981-10-6620-7_57 8. Böhm, M., Krcmar, H.: Cloud Computing and Computing Evolution DSR Methodology View Project Einführungs-und Kommunikationsstrategien für IT Infrastrukturprojekte. View Project (2011) 9. Bashari Rad, B., Diaby, T.: Cloud computing: a review of the concepts and deployment models. Artic. Int. J. Inf. Technol. Comput. Sci. 6, 50–58 (2017). https://doi.org/10.5815/ijitcs.2017. 06.07 10. Lee, J., Azamfar, M., Singh, J.: A blockchain enabled cyber-physical system architecture for industry 4.0 manufacturing systems. Manuf. Lett. 20, 34–39 (2019). https://doi.org/10.1016/j. mfglet.2019.05.003 11. Pirzadeh, P., Carey, M., Westmann, T.: A performance study of big data analytics platforms. In: Proceedings—2017 IEEE International Conference on Big Data, Big Data 2017, pp. 2911– 2920. Institute of Electrical and Electronics Engineers Inc. (2017). https://doi.org/10.1109/Big Data.2017.8258260 12. Bhandari, G.: A tale of two cloud analytics platforms for education. Int. J. Cloud Comput. 7, 237–247 (2018). https://doi.org/10.1504/IJCC.2018.095394 13. Seol, J., Kancharla, A., Park, N.N., Park, N.N., Park, I.N.: The dependability of crypto linked off-chain file systems in backend blockchain analytics engine. Int. J. Networked Distrib. Comput. 6, 210–215 (2018). https://doi.org/10.2991/ijndc.2018.6.4.3 14. Osman, A., El-Refaey, M., Elnaggar, A.: Towards real-time analytics in the cloud. In: Proceedings—2013 IEEE 9th World Congress on Services, Services 2013, pp. 428–435 (2013). https:// doi.org/10.1109/SERVICES.2013.36

2 Cloud Analytics: An Outline of Tools and Practices

27

15. Islam, T., Manivannan, D., Zeadally, S.: A classification and characterization of security threats in cloud computing. Int. J. Next Gen. Comput. 7, 1–17 (2016). https://doi.org/10.1073/pnas. 1004982107 16. Schnase, J.L., Duffy, D.Q., Tamkin, G.S., Nadeau, D., Thompson, J.H., Grieg, C.M., McInerney, M.A., Webster, W.P.: MERRA analytic services: meeting the big data challenges of climate science through cloud-enabled climate analytics-as-a-service. Comput. Environ. Urban Syst. 61, 198–211 (2017). https://doi.org/10.1016/j.compenvurbsys.2013.12.003 17. Jayaraman, P.P., Perera, C., Georgakopoulos, D., Dustdar, S., Thakker, D., Ranjan, R.: Analytics-as-a-service in a multi-cloud environment through semantically-enabled hierarchical data processing. In: Software—Practice and Experience, pp. 1139–1156. Wiley (2017). https:// doi.org/10.1002/spe.2432 18. Anisetti, M., Ardagna, C., Bellandi, V., Cremonini, M., Frati, F., Damiani, E.: Privacy-aware big data analytics as a service for public health policies in smart cities. Sustain. Cities Soc. 39, 68–77 (2018). https://doi.org/10.1016/j.scs.2017.12.019 19. Jaimahaprabhu, A., Praveen Kumar, V., Gangadharan, P.S., Latha, B.: Cloud analytics based farming with predictive analytics using artificial intelligence. In: 5th International Conference on Science Technology Engineering and Mathematics, ICONSTEM 2019, pp. 65–68. Institute of Electrical and Electronics Engineers Inc. (2019). https://doi.org/10.1109/ICONSTEM.2019. 8918785 20. Jam, M.R., Khanli, L.M., Javan, M.S., Akbari, M.K.: A survey on security of Hadoop. In: Proceedings of the 4th International Conference on Computing and Knowledge Engineering. ICCKE 2014, pp. 716–721 (2014). https://doi.org/10.1109/ICCKE.2014.6993455 21. Akoglu, L., Tong, H., Koutra, D.: Graph based anomaly detection and description: a survey (2015). https://doi.org/10.1007/s10618-014-0365-y 22. Massimino, B.: Accessing online data: web-crawling and information-scraping techniques to automate the assembly of research data. J. Bus. Logist. 37, 34–42 (2016). https://doi.org/10. 1111/jbl.12120 23. Goldstein, M., Uchida, S.: A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE 11, e0152173 (2016). https://doi.org/10.1371/journal. pone.0152173 24. Wiering, M., van Martijn, O.: Reinforcement Learning: State-of-the-Art (2013). https://doi. org/10.1007/978-3-642-27645-3 25. Gupta, R., Gupta, H., Mohania, M.: Cloud computing and big data analytics: what is new from databases perspective? In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 42–61. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35542-4_5 26. Chen, Y.Y., Lin, Y.H., Kung, C.C., Chung, M.H., Yen, I.H.: Design and implementation of cloud analytics-assisted smart power meters considering advanced artificial intelligence as edge analytics in demand-side management for smart homes. Sensors (Switzerland) 19, 2047 (2019). https://doi.org/10.3390/s19092047 27. Zulkernine, F., Martin, P., Zou, Y., Bauer, M., Gwadry-Sridhar, F., Aboulnaga, A.: Towards cloud-based analytics-as-a-service (CLAaaS) for big data analytics in the cloud. In: Proceedings—2013 IEEE International Congress on Big Data, BigData 2013, pp. 62–69 (2013). https:// doi.org/10.1109/BigData.Congress.2013.18 28. Hackathorn, R., Margolis, T.: Immersive analytics: building virtual data worlds for collaborative decision support. 
In: 2016 Workshop on Immersive Analytics, IA 2016, pp. 44–47. Institute of Electrical and Electronics Engineers Inc. (2017). https://doi.org/10.1109/IMMERSIVE.2016. 7932382 29. Senousy, Y., El-Din, A., Riad, M., Mohamed, Y., Senousy, B., El-Khamisy, N., Ghitany, M.: Recent Trends in Big Data Analytics Towards More Enhanced Insurance Business Models Article in (2018) 30. Poeng, R., Cheikhali, N., McGrory, M., Dave, J.K.: On the feasibility of epic electronic medical record system for tracking patient radiation doses following interventional fluoroscopy procedures. Presented at the 16 Mar 2019. https://doi.org/10.1117/12.2512910


31. Nair, L., Nair, L., Shetty, S., Shetty, S.: Interactive visual analytics on big data: Tableau vs D3.js. J. e-Learn. Knowl. Soc. 12 (2016) 32. Chen, Y., Argentinis, E., Weber, G.: IBM Watson: How Cognitive Computing Can Be Applied to Big Data Challenges in Life Sciences Research (2016). https://doi.org/10.1016/j.clinthera. 2015.12.001 33. Becker, L.T., Gould, E.M.: Microsoft power BI: extending excel to manipulate, analyze, and visualize diverse data. Ser. Rev. 45, 184–188 (2019). https://doi.org/10.1080/00987913.2019. 1644891 34. Carlisle, S.: Software: Tableau and Microsoft Power BI (2018). https://doi.org/10.1080/247 51448.2018.1497381 35. Sindhani, M., Parameswar, N., Dhir, S., Ongsakul, V.: Twitter analysis of founders of top 25 Indian startups. J. Glob. Bus. Adv. 12, 117–144 (2019). https://doi.org/10.1504/JGBA.2019. 099918 36. Abousalh-Neto, N.A., Kazgan, S.: Big data exploration through visual analytics. In: IEEE Conference on Visual Analytics Science and Technology 2012, VAST 2012—Proceedings, pp. 285–286 (2012). https://doi.org/10.1109/VAST.2012.6400514 37. Pendergrass, J.: The Architecture of the SAS® Cloud Analytic Services in SAS® ViyaTM 38. Pelle, I., Czentye, J., Doka, J., Sonkoly, B.: Towards latency sensitive cloud native applications: a performance study on AWS. In: IEEE International Conference on Cloud Computing. CLOUD 2019, pp. 272–280, July 2019. https://doi.org/10.1109/CLOUD.2019.00054 39. Tang, H., Li, C., Bai, J., Tang, J.H., Luo, Y.: Dynamic resource allocation strategy for latencycritical and computation-intensive applications in cloud–edge environment. Comput. Commun. 134, 70–82 (2019). https://doi.org/10.1016/j.comcom.2018.11.011 40. Jesmeen, M.Z.H., Hossen, J., Sayeed, S., Ho, C.K., Tawsif, K., Rahman, A., Arif, E.M.H.: A survey on cleaning dirty data using machine learning paradigm for big data analytics. A neurofuzzy controller based solar panel tracking system view project design and implementation of a sustainable micro-grid system with high penetration of renewable energy which has interaction with PHEVs view project. Indones. J. Electr. Eng. Comput. Sci. 10, 1234–1243 (2018). https:// doi.org/10.11591/ijeecs.v10.i3.pp1234-1243 41. Akter, S., Michael, K., Uddin, M.R., McCarthy, G., Rahman, M.: Transforming business using digital innovations: the application of AI, blockchain, cloud and data analytics. Ann. Oper. Res., 1–33 (2020). https://doi.org/10.1007/s10479-020-03620-w 42. Pham, Q.-V., Nguyen, D.C., Bhattacharya, S., Prabadevi, B., Reddy Gadekallu, T., Kumar Reddy Maddikunta, P., Fang, F., Pathirana, P.N., Deepa, N., Pham, Q.-V., Nguyen, D.C., Bhattacharya, S., Prabadevi, B., Gadekallu, T.R., Maddikunta, P.K.R., Fang, F., Pathirana, P.N.: A Survey on Blockchain for Big Data: Approaches, Opportunities, and Future Directions (2020) 43. Dillenberger, D.N., Novotny, P., Zhang, Q., Jayachandran, P., Gupta, H., Hans, S., Verma, D., Chakraborty, S., Thomas, J.J., Walli, M.M., Vaculin, R., Sarpatwar, K.: Blockchain analytics and artificial intelligence. IBM J. Res. Dev. 63 (2019). https://doi.org/10.1147/JRD.2019.290 0638 44. Harshavardhan, A., Vijayakumar, T., Mugunthan, S.R.: Blockchain technology in cloud computing to overcome security vulnerabilities. In: Proceedings of the International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud), I-SMAC 2018, pp. 408–414. Institute of Electrical and Electronics Engineers Inc. (2019). https://doi.org/10.1109/I-SMAC. 2018.8653690

Chapter 3

An Enhanced Fault-Tolerant Load Balancing Process for a Distributed System Deepak Kumar Patel and Chitaranjan Tripathy

Abstract Fault-tolerant load balancing forms an important part of a distributed system, which typically consists of strongly varying and geographically distributed resources. From a customer's point of view, the aim is always to improve the performance of the distributed system by reducing the overall execution time of tasks. Any resource failure delays the overall execution of tasks; therefore, fault-tolerant scheduling is always required in a distributed system so that the impact of resource failures can be reduced. In this paper, the proposed scheme, Enhanced Load Balancing based on Resource Failure Cost (LBRF), presents a fault-tolerant load balancing method in which jobs are scheduled by considering the resource failure cost of the computational resources, so that every job can finish its execution within the allotted timeframe. Performance metrics such as throughput and turnaround time are used to measure the performance of the proposed scheme LBRF.

3.1 Introduction

A conventional computing system combines numerous heterogeneous resources from different places in the world to form a distributed computing environment [1]. Any distributed system's performance depends on the effectiveness of the scheduling and load balancing technique among its various computing resources [2, 3]. To achieve this goal, a resource should always be balanced and fault-tolerant so that any job's execution on that resource is time-efficient. The resources of distributed systems such as computational grids or clouds are failure-prone; because of this, a fault-tolerant system is necessary for all heterogeneous resources to minimize the impact of resource failures on grid performance and to maintain complete balance [4, 5].

D. K. Patel (B) Department of Computer Science and Information Technology, Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, Khordha, Odisha 751030, India C. Tripathy Biju Patnaik University of Technology, Rourkela, Odisha 769015, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_3


Here, a load balancing scheme, LBRF, is proposed in which the resource failure cost of the resources is taken into account as a major deciding factor in the selection method for fault-tolerant scheduling. The proposed scheme reduces the impact of resource failures on the load balancing procedure so that every assigned job is executed on the available resources within the assigned timeframe, increasing the throughput of the distributed system. In the remaining part of the paper, the background and related work are discussed in Sect. 3.2. Section 3.3 describes the load balancing scheme LBRF, Sect. 3.4 presents the simulated results, and Sect. 3.5 concludes the paper.

3.2 Background

Grid computing and cloud computing represent more advanced forms of distributed computing. Cloud computing incorporates distributed computing concepts to provide shared resources through its own datacenters, virtual machines, and servers. Figure 3.1 shows the distributed architecture of the cloud. The datacenter broker is the entity responsible for the successful scheduling of jobs among resources. A datacenter can have many hosts, and a single host can have many processing units, called Virtual Machines (VMs) in this cloud architecture. Here, however, we have considered a grid architecture for the simulation of our scheme in the distributed system [6, 7] (Fig. 3.2). In this grid architecture, the broker handles the entire operation by making the resources liable for the successful execution of all jobs; this execution happens using the processing elements of the machines present under each resource [8-10].

Fig. 3.1 A distributed cloud architecture


Fig. 3.2 A distributed grid architecture

3.2.1 Related Work

Many load balancing schemes have been developed for distributed systems [11-21]; the major schemes appear in the journal publications and conference proceedings cited here. Of particular relevance is the load balancing scheme LBFD [21], which successfully simulated its scheduling procedure by taking the resource failure rate into account for safe scheduling. In reality, as many publications suggest, every resource in a distributed system is failure-prone; therefore, to minimize the impact of resource failures on a distributed system's performance, a much more efficient fault-tolerant scheduling system is required. Our load balancing scheme, LBRF, instead considers the resource failure cost of the resources, rather than the resource fault index, as the major deciding factor in the selection method for fault-tolerant scheduling. The proposed scheme significantly reduces the impact of resource failures on the load balancing procedure.

3.3 Proposed Enhanced Load Balancing Scheme with Resource Failure Rate Consideration

A distributed system should be healthy, i.e., balanced, for time-efficient execution. Therefore, the balancing situation in the distributed system is predicted at specific intervals with the help of machine and resource load levels, and all the high-load and low-load resources are thereby highlighted. Some jobs coming from high-load resources can then be rescheduled on the low-load resources, under certain conditions. In the existing LBFD, all the jobs waiting to be rescheduled are sorted according to the lowest deadline, and the low-load resources according to the lowest resource fault index. To calculate the resource fault index, the fault occurrence history of a resource is taken into account: every time a resource is unable to complete a job within the assigned timeframe, its fault index is increased, and every time it completes a job within the assigned timeframe, its fault index is decreased (Fig. 3.3).


assigned timeframe, then the resource fault index is increased every time. When a resource is able to do its duty of job execution within the assigned timeframe, then the resource fault index is decreased every time (Fig. 3.3). But our load balancing scheme LBRF has been considered the resource failure cost of the resources as an alternative to resource fault index in deciding the resource selection factor. The resource failure cost parameter is more preferable to consider the resource failure history before deciding on job allocation. The resource having the lowest failure cost has a minimal chance to experience the failure (Fig. 3.4).

Fig. 3.3 The proposed model

Fig. 3.4 Model parameters


In our scheme, each job is assigned to the healthiest, i.e., most reliable, resource for execution. Here, the failure cost of a resource plays an important role in identifying that resource within the distributed system. The resource failure cost of any resource (NFR) is calculated as follows:

NFR = NRF / (NRC + NRF)    (3.1)

Here, NRF and NRC denote the number of times the resource has completed jobs unsuccessfully and successfully, respectively. If a resource completes a task unsuccessfully, the value of NRF is incremented by 1, and that job is rescheduled to another reliable resource in the grid with the lowest failure cost. If a resource completes a job successfully, the value of NRC is incremented by 1. The NFR value calculated above is taken into account when making rescheduling decisions: the most suitable resource for rescheduling is the one with the minimum NFR. According to the proposed model given in Fig. 3.3 and its parameters in Fig. 3.4, node A has a high load, nodes B and E have a normal load, and nodes C and D have a low load. The state of a node is determined as follows: State = Load Capacity (LC) − Current Capacity (CC). If the result is 0, positive, or negative, the node has a normal, low, or high load, respectively. So, the jobs from node A can be rescheduled to one of the low-load nodes C and D, ignoring the normal-load nodes B and E. The jobs will first choose node C, which has a lower NFR than node D and hence the lowest chance of facing failure. When node C reaches its Load Capacity (LC), it will not execute further jobs (otherwise it would become high-load), and the remaining jobs are given to the next node, D, for execution.
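A minimal Python sketch of this selection logic is given below; the class and function names are our own illustrative choices, and CC is treated as the node's current load, following the state rule above.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    lc: float      # Load Capacity (LC)
    cc: float      # Current Capacity (CC), the node's current load
    nrf: int = 0   # unsuccessful job completions
    nrc: int = 0   # successful job completions

    @property
    def nfr(self) -> float:
        # Resource failure cost, Eq. (3.1); 0 when there is no history yet
        total = self.nrc + self.nrf
        return self.nrf / total if total else 0.0

    @property
    def state(self) -> float:
        # 0: normal load, positive: low load, negative: high load
        return self.lc - self.cc

def pick_reschedule_target(nodes):
    """Among low-load nodes, prefer the one with the minimum NFR."""
    low_load = [n for n in nodes if n.state > 0]
    return min(low_load, key=lambda n: n.nfr, default=None)

For instance, with Node("C", lc=10, cc=4, nrf=1, nrc=9) and Node("D", lc=10, cc=5, nrf=3, nrc=7), node C (NFR = 0.1) is chosen over node D (NFR = 0.3).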

3.4 Simulated Results

See Table 3.1.

Table 3.1 Simulator parameters

Parameter        | Values
PEs per resource | 10
Bandwidth        | 0.5-10 Mbps
Job deadline     | 1-6 s
Job length       | 1-5 MIs
PE capacity      | 1-5 MIPS


Fig. 3.5 Turnaround time versus fault percentage for jobs = 50

3.4.1 Results

The simulator GridSim 5.0 [22, 23] is used for the simulation of the scheme LBRF. In general, turnaround time (TAT) is the amount of time taken to complete a process (Figs. 3.5, 3.6, 3.7 and 3.8):

TAT = Time of job completion − Time of job submission.

Throughput is defined as n/Tn, where n is the total number of jobs submitted and Tn is the total amount of time necessary to complete the n jobs. Here, we present two results comparing turnaround time and two comparing throughput against fault percentages ranging from 2 to 10%, with the number of jobs fixed at 50 and 200. In all cases, the throughput of the proposed LBRF is higher than that of LBFD, and the turnaround time of LBRF is lower than that of LBFD. As the resources become more fault-prone, i.e., the fault percentage increases, throughput decreases and turnaround time increases; in both cases, however, the lead of the proposed LBRF over LBFD is always maintained.
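As a toy numerical illustration of these two metrics (the job times below are invented for the example, not simulation output):

# Submission and completion times of three jobs, in seconds
submissions = [0, 2, 4]
completions = [5, 9, 12]

# Turnaround time per job: completion minus submission
tat = [c - s for s, c in zip(submissions, completions)]
print("TAT per job:", tat, "average:", sum(tat) / len(tat))  # [5, 7, 8], ~6.67

# Throughput n/Tn: three jobs finished after 12 s -> 0.25 jobs/s
print("throughput:", len(completions) / max(completions))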

3 An Enhanced Fault-Tolerant Load Balancing Process …

Fig. 3.6 Turnaround time versus fault percentage for jobs = 200

Fig. 3.7 Throughput versus fault percentage for jobs = 50


Fig. 3.8 Throughput versus fault percentage for jobs = 200

3.5 Conclusion

In this paper, the new load balancing scheme LBRF has been presented for distributed systems. To minimize the impact of resource failures on a distributed system's performance, a much more efficient fault-tolerant scheduling system has been proposed in which the resource failure cost of the resources is the major deciding factor for fault-tolerant scheduling. The proposed scheme significantly reduces the impact of resource failures on the load balancing procedure, so that every job can be executed within its assigned timeframe. The throughput and turnaround time performance metrics have been used to measure the performance of the proposed scheme LBRF on a distributed system.

References 1. Hao, Y., Liu, G., Wen, N.: An enhanced load balancing mechanism based on deadline control on GridSim. Futur. Gener. Comput. Syst. 28, 657–665 (2012) 2. Yagoubi, B., Slimani, Y.: Task load balancing strategy in grid environment. J. Comput. Sci. 3(3), 186–194 (2007) 3. Patel, D.K., Tripathy, D., Tripathy, C.R.: Survey of load balancing techniques for grid. J. Netw. Comput. Appl. 65, 103–119 (2016) 4. Bosilca, G., Delmas, R., Dongarra, J., Langou, J.: Algorithm-based fault tolerance applied to high performance computing. J. Parall. Distrib. Comput. 69(4), 410–416 (2009)


5. Abdullah, A.M., Ali, H.A., Haikal, A.Y.: Reliable and efficient hierarchical organization model for computational grid. J. Parall. Distrib. Comput. 104, 191–205 (2017) 6. CloudSim Homepage: http://www.cloudbus.org/cloudsim/. Last accessed 25 Aug 2021 7. Calheiros, R.N., Ranjan, R., Beloglazov, A., De Rose, C.A., Buyya, R.: CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw. Pract. Exp. 41(1), 23–50 (2011) 8. Sulistio, A., Poduval, G., Buyya, R., Tham, C.K.: On incorporating differentiated levels of network service into GridSim. Futur. Gener. Comput. Syst. 23, 606–615 (2007) 9. Patel, D.K., Tripathy, C.R.: An improved approach for load balancing among heterogeneous resources in computational grids. Eng. Comput. 31, 825–839 (2015) 10. Buyya, R., Murshed, M.: GridSim: a toolkit for the modeling and simulation of distributed management and scheduling for grid computing. J. Concurr. Comput. Pract. Exp. 14, 13–15 (2002) 11. Patel, D.K., Tripathy, C.R.: An efficient load balancing mechanism with cost estimation on GridSim. In: Proceedings of the 15th IEEE International Conference on Information Technology (ICIT’16), pp. 75–80. Bhubaneswar, India (2016) 12. Patel, D.K., Tripathy, C.R.: On the design of an efficient load balancing mechanism on GridSim adapted to the computing environment of heterogeneity in both resources and networks. IET Netw. (IET) 7, 406–413 (2018) 13. Shukla, A., Kumar, S., Singh, H.: Fault tolerance based load balancing approach for web resources. J. Chin. Inst. Eng. 42(7), 583–592 (2019) 14. Abdullah, A.M., Ali, H.A., Haikal, A.Y.: A reliable, TOPSIS-based multi-criteria, and hierarchical load balancing method for computational grid. Clust. Comput., 1–22 (2016) 15. Patel, D.K., Tripathy, C.R.: A new approach for grid load balancing among heterogeneous resources with bandwidth consideration. In: Proceedings of the 4th IEEE International Conference on Information, Communication and Embedded Systems (ICICES), pp. 1–6. Bhubaneswar, India (2014) 16. Patel, D.K., Tripathy, C.R.: An effective selection method for scheduling of gridlets among heterogeneous resources with load balancing on GridSim. In: Proceedings of the 3rd International Conference on Computational Intelligence and Networks (CINE), pp. 68–72. Bhubaneswar, India (2017) 17. Qureshi, K., Rehman, A., Manuel, P.: Enhanced GridSim architecture with load balancing. J. Supercomput., 1–11 (2010) 18. Patel, D.K., Tripathy, C.R.: An efficient selection procedure with an enhanced load balancing scheme for GridSim. In: Proceedings of the 3rd International Conference on Advanced Computing and Intelligent Engineering (ICACIE’18), pp. 485–494. Bhubaneswar, India (2010) 19. Patel, D.K., Tripathy, D., Tripathy, C.R.: An improved load balancing mechanism based on deadline failure recovery on GridSim. Eng. Comput. 32, 173–188 (2016) 20. Yagoubi, B., Slimani, Y.: Dynamic load balancing strategy for grid computing. World Acad. Sci. Eng. Technol., 90–95 (2006) 21. Patel, D.K., Sahoo, D.K.: An efficient load balancing system with resource failure consideration for a distributed system. In: Proceedings of the 2nd International Conference on Advances in Distributed Computing and Machine Learning (ICADCML), pp. 1–7. Bhubaneswar, India (2021) 22. GridSim Homepage: http://www.buyya.com/GridSim/. Last accessed 25 Aug 2021 23. Murshed, M., Buyya, R., Abramson, D.: GridSim: a toolkit for the modeling and simulation of global grids. 
Technical Report, Monash, CSSE (2001)

Chapter 4

HeartFog: Fog Computing Enabled Ensemble Deep Learning Framework for Automatic Heart Disease Diagnosis

Abhilash Pati, Manoranjan Parhi, and Binod Kumar Pattanayak

Abstract The Internet of Things (IoT) has recently gained importance in various fields. e-Healthcare is one IoT application through which different kinds of services can be developed, such as data storage, resource and power management, and the creation of computing services. Cloud Computing and Fog Computing play vital roles in critical-data and IoT implementations; Fog Computing, the extension of Cloud Computing, is well suited to real-world IoT deployments and was established to overcome the limitations of Cloud Computing and its implementations. In this paper, HeartFog, an intelligent real-time decision support system framework based on IoT and appropriate analysis, is proposed for the remote detection of cardiovascular disease, enhancing diagnostic accuracy on previously unseen data. The proposed work is evaluated in terms of training accuracy, test accuracy, arbitration time, latency, execution time, and power consumption. It can be employed in the development of next-generation IoT models and services and, from the experiments, is found to be a suitable framework for the instantaneous remote diagnosis of heart patients.

4.1 Introduction

The Internet, the first digital revolution and the global interconnection of different networks, is considered an all-time stunning innovation. The evolution continues, and the current trend is the IoT, the second digital revolution, which plays a vital role in remote communications [1]. IoT-enhanced smart devices play a vital role in recent healthcare systems. Cloud Computing (CC) and its extended versions, Edge Computing (EC) and Fog Computing (FC), now play significant roles in real-world implementations. The concept of CC has come along with IoT implementations. The primary objective of e-Healthcare is to provide services that make patients' lives smooth and convenient. The IoT and CC concepts have emerged as pillars of the Internet. FC, an advancement of CC, was introduced to overcome the shortcomings of CC and to act as a gateway connecting various terminals and cloud servers [2]. FC optimizes the communication between them and saves bandwidth; besides, FC can be employed in fields such as disease diagnosis to enhance prediction accuracy. The combination of IoT, Cloud, and Fog Computing layers is depicted in Fig. 4.1.

Fig. 4.1 The combination of IoT, Cloud, and Fog computing layers

Fog-based architectures deal efficiently with e-Healthcare problems such as scalability, reliability, adaptability, and energy awareness. To maintain a healthy lifestyle, individuals should be conscious of severe conditions such as stroke, heart attack, heart failure, and blindness. Diseases of the heart ("cardio") are termed cardiac diseases. Heart disease, caused by the improper functioning of the heart, is now treated as the second leading cause of death, which is why its proper and early detection is essential [3]. The various types of heart disease include coronary artery disease, congenital heart disease, arrhythmia, hypertrophic cardiomyopathy, dilated cardiomyopathy, heart failure, myocardial infarction, and mitral regurgitation. Deep Learning (DL), an extension of Machine Learning (ML), is gaining popularity for classifying data points, and Ensemble Learning (EL), an advancement over ML and DL, can be employed in predictive analysis to enhance performance and accuracy. In this paper, HeartFog, an intelligent real-time decision support system framework based on IoT and appropriate analysis, is proposed for the remote detection of cardiovascular disease, enhancing diagnostic accuracy on previously unseen data. The proposed work is evaluated in terms of training accuracy, test accuracy, arbitration time, latency, execution time, and power consumption. Its advantages are not limited to these: it can be employed in developing next-generation IoT models and services, and it can play an essential role in integrating data globally and making it automatically accessible to users.

The rest of the paper is organized as follows: Sect. 4.2 discusses the related work and compares it with the proposed work; Sect. 4.3 presents the architecture; Sect. 4.4 discusses the design; Sect. 4.5 presents the performance evaluation and analysis; and Sect. 4.6 concludes with the future scope of the work.

4.2 Related Works

Caliskan and Yuksel [4] proposed a deep neural network (DNN) classifier for coronary artery disease classification, evaluated on accuracy, sensitivity, and specificity over four heart disease datasets (HDDs) obtained from the UCI-ML repository (Cleveland, Hungarian, Switzerland, and Long Beach), and claimed that the trained model is an easily accessible and cost-effective alternative to existing diagnostic techniques. Gupta [5] proposed a hybrid technique combining k-nearest neighbors (KNN) and a genetic algorithm (GA), evaluated on accuracy over the Hungarian HDD from the UCI-ML repository, and found the suggested hybrid method more accurate and efficient than current approaches. Jan et al. [6] proposed an ensemble model combining five classifiers, namely support vector machine (SVM), artificial neural network (ANN), naïve Bayes (NB), random forest (RF), and regression analysis (RA), for predicting and diagnosing the recurrence of heart disease; considering accuracy, precision, F-measure, the receiver operating characteristic (ROC) curve, root mean squared error (RMSE), and kappa statistics on the Hungarian and Cleveland HDDs, they claimed high predictive accuracy and readability. Zhenya and Zhang [7] proposed a cost-sensitive ensemble of five classifiers, namely RF, logistic regression (LR), SVM, KNN, and extreme learning machine (ELM), to improve the efficiency of heart disease diagnosis; considering recall, precision, specificity, E, G-mean, MC, and the area under the ROC curve (AUC) on the Cleveland, Statlog, and Hungarian HDDs from the UCI-ML repository, they concluded that it yields significant results compared with the individual classifiers and can be a promising alternative. Ali et al. [8] introduced a smart healthcare system based on feature fusion and ensemble deep learning (EDL) for the prediction of cardiovascular disease; considering accuracy, recall, precision, F-measure, RMSE, and mean absolute error (MAE) on the Cleveland and Hungarian HDDs, they claimed more accurate results that lead to more efficient heart disease prediction. Moghadas et al. [9] proposed a technique for remotely monitoring a patient's health using an AD8232 sensor module and an Arduino board, evaluated on accuracy and macro-F1; they claimed that KNN was the best-suited classifier among NB, RF, KNN, and linear SVM for validating and classifying the type of cardiac arrhythmia. Baccouche et al. [10] introduced an EL framework combining a CNN with a BiGRU or BiLSTM for classifying heart disease data collected from the Medica Norte Hospital in Mexico; considering accuracy and F1-score, the proposed framework overcomes the problem of classifying an unbalanced HDD and provides more accurate results. Sun et al. [11] introduced a fog-based framework termed FogMed for the prediction of heart disease, evaluated on accuracy and time efficiency over an ECG dataset, and claimed better performance than CC techniques. Sharma and Parmar [12] introduced a neural-network-based DL technique for the prediction of cardiovascular disease, evaluated on accuracy over the Cleveland HDD, and claimed maximum accuracy compared with traditional approaches. Uddin and Halder [13] introduced MLDS, an ensemble-method-based multilayer dynamic system, with the feature selection techniques information gain attribute evaluator (IGAE), gain ratio attribute evaluator (GAIN), correlation attribute evaluator (CAE), lasso, and extra trees (ETs); evaluated on accuracy, precision, ROC, and AUC over a realistic dataset collected from Kaggle, they concluded that the technique can efficiently predict cardiovascular disease. A comparison of HeartFog with some considered existing works is presented in Table 4.1.

Table 4.1 Comparison of HeartFog with some considered existing works

Author | ML | DL | EL | IoT | Fog computing | Dataset used | Evaluative measures | Results
Caliskan and Yuksel [4] | No | Yes | No | No | No | Cleveland, Hungarian, Switzerland, and Long Beach HDDs | Accuracy, sensitivity, and specificity | The proposed classifier is easily accessible and cost-effective
Gupta [5] | Yes | No | No | No | No | Hungarian HDD | Accuracy | The proposed model provides more accurate and efficient results
Jan et al. [6] | Yes | No | Yes | No | No | Cleveland and Hungarian HDDs | Accuracy, TP rate, FP rate, precision, F-measure, ROC curve, kappa statistics, and RMSE | This ensemble method achieves high predictive accuracy and readability
Zhenya and Zhang [7] | Yes | No | Yes | No | No | Statlog, Cleveland, and Hungarian HDDs | E, precision, specificity, recall, MC, G-mean, and AUC | This ensemble method achieves significant results compared with the individual classifiers and can be a promising alternative
Ali et al. [8] | Yes | Yes | No | No | No | Cleveland and Hungarian HDDs | Accuracy, precision, recall, F-measure, MAE, and RMSE | This ensemble method achieves more accurate results that lead to more efficient heart disease prediction
Moghadas et al. [9] | Yes | No | No | Yes | Yes | Tele-ECG | Accuracy and macro-F1 | The system using the KNN classifier is best suited for classification and validation
Baccouche et al. [10] | No | Yes | Yes | No | No | Dataset collected from the Medica Norte Hospital in Mexico | Accuracy and F1-score | The proposed framework overcomes the problem of classifying an unbalanced HDD and provides more accurate results
Sun et al. [11] | No | Yes | No | Yes | Yes | ECG dataset | Accuracy and time efficiency | The proposed FogMed framework achieves enhanced performance compared to the cloud computing model
Sharma and Parmar [12] | Yes | Yes | No | No | No | Cleveland HDD | Accuracy | Achieved maximum accuracy compared with traditional approaches by the proposed DNN
Uddin and Halder [13] | Yes | No | Yes | No | No | Realistic dataset collected from Kaggle | Accuracy, precision, ROC, and AUC | The proposed MLDS can effectively predict heart disease
HeartFog (proposed work) | No | Yes | Yes | Yes | Yes | Hungarian HDD | Accuracy, plus the network parameters latency, arbitration time, execution time, and power consumption | The proposed HeartFog framework achieves enhanced and optimized performance using fog computing concepts compared to the cloud computing model

4.3 HeartFog Architecture

The architecture of HeartFog, depicted in Fig. 4.2, comprises the technologies, hardware components, and software components elaborated below.

Fig. 4.2 The architecture of HeartFog

4.3.1 Technology Used

FogBus is used in this work [14]. It is a framework that integrates the IoT, CC, and FC concepts and uses blockchain to ensure security, privacy, and data integrity in communications. Aneka is a cloud computing platform that provides developers with an easy API [15]; its core components are designed in a service-oriented manner. Aneka is used in this work to access Cloud resources, whereas FogBus is used to access Fog resources. An ANN, a DL approach, is used to train the model on the dataset, and ensemble techniques are used along with the ANN to distribute data to, and collect results from, the different worker nodes [16], as sketched below.
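The chapter gives no source code, so the following is only a rough, self-contained illustration of the aggregation step it describes: a broker combines the class-probability outputs returned by several fog worker nodes by averaging them. All names and values here are hypothetical, not from the paper.

```python
# Sketch: ensemble aggregation of per-worker ANN outputs at the broker.
import numpy as np

def aggregate_predictions(worker_outputs):
    """worker_outputs: list of (n_samples, n_classes) probability arrays,
    one per fog worker node, all scored on the same samples."""
    stacked = np.stack(worker_outputs)   # shape: (n_workers, n_samples, n_classes)
    avg = stacked.mean(axis=0)           # ensemble by averaging probabilities
    return avg.argmax(axis=1)            # final predicted class per sample

# Example with three hypothetical workers and two classes (0 = healthy, 1 = heart disease)
w1 = np.array([[0.2, 0.8], [0.7, 0.3]])
w2 = np.array([[0.4, 0.6], [0.6, 0.4]])
w3 = np.array([[0.1, 0.9], [0.8, 0.2]])
print(aggregate_predictions([w1, w2, w3]))   # -> [1 0]
```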

4.3.2 Hardware Components Used

HeartFog comprises the following hardware components: IoT sensors, gateways, a Broker/Master node, Worker nodes, and a Cloud data center (CDC). IoT sensors such as a pulse oximeter, a blood pressure sensor, and an ECG sensor sense data from cardiovascular patients and send it to the gateway devices. Gateway devices such as mobile phones, tablets, and laptops accept the patient data and send it to either the Broker or the Worker nodes. The Broker node receives job requests from the gateway devices and either processes them itself or assigns a Worker node for processing. A Worker node processes the data as requested by the gateway devices or the Broker node. When access to Cloud resources is needed, the CDC provides them.

4.3.3 Software Components Used

HeartFog comprises the following software components: data pre-processing, a workload manager, an arbitration module, and an ensemble DL module. The first step is data pre-processing, which filters and pre-processes the received data. The workload manager handles the job requests received from the different devices. The arbitration module is responsible for scheduling the various tasks on either fog or cloud resources. The ensemble DL module classifies the data and makes the corresponding predictions.

4.4 HeartFog Design

4.4.1 Dataset

The dataset used to train the model is the Hungarian HDD obtained from the UCI-ML repository, which consists of 76 attributes and 294 instances; 10 attributes are used to train the model: age, gender, painloc, painexer, reltest, resting blood pressure, smoke, famhist, max heart rate achieved, and heart disease. A short description of the HDD is given in Table 4.2, and samples from the dataset are shown in Table 4.3.

Table 4.2 Short description of the Hungarian HDD

Dataset used | Number of attributes | Number of instances
Hungarian heart disease dataset | 76 (10 considered in this paper) | 294

Table 4.3 Samples from the Hungarian heart disease dataset

Age | Gender | Painloc | Painexer | Reltest | Resting blood pressure | Smoke | Famhist | Max heart rate achieved | Heart disease
63 | 1 | 1 | 0 | 0 | 110 | 0 | 1 | 99 | 0
44 | 1 | 1 | 1 | 1 | 142 | 1 | 2 | 149 | 1
60 | 1 | 1 | 1 | 1 | 132 | 1 | 0 | 140 | 2
55 | 1 | 1 | 1 | 1 | 130 | 0 | 0 | 127 | 0
66 | 1 | 1 | 1 | 1 | 140 | 0 | 0 | 112 | 2

4.4.2 Heart Patient Data Pre-processing

The first stage is data pre-processing, which entails filtering and pre-processing the data received via gateway devices from IoT sensors such as pulse oximeters and blood pressure sensors. There are five outcomes in the raw dataset, 0, 1, 2, 3, and 4, which indicate the severity of the disease. For binary classification, the values 1, 2, 3, and 4 are treated as 1, indicating patients with heart disorders, while 0 indicates individuals without heart ailments, as sketched below.
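A minimal sketch of this binarization, assuming the labels sit in a pandas column named "heart disease" (the paper's actual pre-processing code is not given):

```python
# Sketch: collapse severity labels 1-4 into a single positive class.
import pandas as pd

df = pd.DataFrame({"heart disease": [0, 1, 2, 3, 4, 0]})   # toy stand-in for the dataset
df["heart disease"] = (df["heart disease"] > 0).astype(int)
print(df["heart disease"].tolist())                        # -> [0, 1, 1, 1, 1, 0]
```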

4.4.3 Ensemble DL Application

The ensemble DL application is the model used in this work: it is trained on the Hungarian heart disease dataset and performs predictive analysis on the data collected from patients to provide real-time results remotely. The ensemble techniques used in this research are boosting combined with averaging. The data is divided into training, testing, and validation sets in the ratio 14:3:3, as sketched below.
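A minimal sketch of that 14:3:3 split (i.e., 70%/15%/15%) using scikit-learn; X and y are assumed to hold the 10 selected attributes and the binarized labels, and the helper name is ours, not the paper's:

```python
# Sketch: 14:3:3 train/test/validation split via two calls to train_test_split.
from sklearn.model_selection import train_test_split

def split_14_3_3(X, y, seed=42):
    # Carve out the 70% training portion first, then halve the remainder
    # so test and validation each get 15% of the data.
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.70, random_state=seed)
    X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.50, random_state=seed)
    return X_train, X_test, X_val, y_train, y_test, y_val
```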

4.4.4 Android Interface and Communication

HeartDiseaseTest.apk, an Android interface developed for this work, runs on Android-supported gateway devices and collects the sensed data from patients. It acts as a bridge between the IoT sensors and the Broker or Worker nodes.

4.4.5 Experimental Set-Up

The experimental set-up uses the following hardware: a gateway device (Xiaomi A2 with Android version 10), a broker node (HP laptop with a Core i3 CPU, 64-bit Windows 10 OS, and 8 GB RAM), worker nodes (five Raspberry Pi 4 boards with 4 GB SDRAM each), and a public Cloud (Amazon Web Services with Windows Server).

4.4.6 Implementation

The ensemble DL and the data pre-processing are implemented in Python, the Android application is developed using MIT App Inventor, and the web communications are done in PHP. The data attributes received from patients via the various gateway devices are saved in a .csv file and sent to the Master node using an HTTP POST, as sketched below.
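A minimal sketch of that gateway-to-master transfer; the endpoint URL and form-field name are assumptions, since the paper only states that a .csv is sent by HTTP POST to a PHP-based web layer:

```python
# Sketch: forward the saved .csv of patient attributes to the Master node.
import requests

def send_to_master(csv_path, master_url="http://master-node/upload.php"):
    with open(csv_path, "rb") as f:
        resp = requests.post(master_url, files={"patient_data": f}, timeout=10)
    resp.raise_for_status()   # surface HTTP errors instead of failing silently
    return resp.text
```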


4.5 Results and Discussion

The evaluation and analysis of HeartFog cover the prediction accuracies (both training and test accuracy), the time characteristics (latency, arbitration time, and execution time), and the power consumption. Because of the rise in the quantity of data received, the training accuracy improves with the number of worker nodes, as shown in Fig. 4.3, whereas the test accuracy decreases gradually as worker nodes are added, as depicted in Fig. 4.4.

Fig. 4.3 The training accuracies of different nodes

Fig. 4.4 The test accuracies of different nodes

The arbitration time rises steadily with the number of nodes but is observed to be comparatively much lower when using the Cloud, as depicted in Fig. 4.5. However, latency, the main objective of Fog Computing, is comparatively much lower for any number of fog nodes than when using the Cloud, as depicted in Fig. 4.6 and as projected. The execution time, as depicted in Fig. 4.7, is comparatively very low in the Cloud set-up because of the Cloud's large resources; on the other hand, the power consumption, as depicted in Fig. 4.8, is, as expected, comparatively very high when using the Cloud set-up.

Fig. 4.5 The arbitration time of different nodes

Fig. 4.6 The latency of different nodes


Fig. 4.7 The execution time of different nodes

Fig. 4.8 Power consumption of different nodes

The HeartFog framework builds on the FogBus concept to provide a low-latency, high-accuracy application for remotely diagnosing and predicting patients' heart disease by combining the concepts of IoT, CC, and FC.


4.6 Conclusion and Future Scope

The concepts of FC and CC with IoT implementations play a vital role in e-Healthcare, making individuals' lives smooth and consistent. In this paper, HeartFog, an intelligent real-time decision support system framework based on IoT and appropriate analysis, was proposed for the remote detection of cardiovascular disease, enhancing diagnostic accuracy on previously unseen data. The proposed work was evaluated in terms of training accuracy, test accuracy, arbitration time, latency, execution time, and power consumption. The framework builds on FogBus to ensure low-latency, high-accuracy applications for remotely diagnosing and predicting heart disease by integrating the concepts of IoT, CC, and FC, and the experiments show it to be a suitable framework for the instantaneous diagnosis of heart patients. In future, the proposed work can be enhanced using other ensemble ML and DL concepts, and the model can be extended to other computing paradigms such as Mist Computing and Surge Computing. Individuals also need to be made aware of the importance of IoT, Cloud, Fog, and Edge Computing and their implementations worldwide.

References

1. Dubravac, S., Ratti, C.: The Internet of Things: Evolution or Revolution?, vol. 1. Wiley, Hoboken, NJ, USA (2015)
2. Rahmani, A.M., Gia, T.N., Nagesh, B., Anzanpour, A., Azimi, I., Jiang, M., Lijeberg, P.: Exploiting smart e-health gateways at the edge of healthcare Internet-of-things: a Fog computing approach. Future Gen. Comput. Syst. 78, 641–658 (2018)
3. Pati, A., Parhi, M., Pattanayak, B.K.: IDMS: an integrated decision making system for heart disease prediction. In: 2021 1st Odisha International Conference on Electrical Power Engineering, Communication and Computing Technology (ODICON), pp. 1–6. IEEE, Jan 2021
4. Caliskan, A., Yuksel, M.: Classification of coronary artery disease datasets by using a deep neural network. EuroBiotech. J. 1, 271–277 (2017)
5. Gupta, S.: Classification of heart disease Hungarian data using entropy, Knnga based classifier and optimizer. Int. J. Eng. Technol. 7(4.5), 292–296 (2018)
6. Jan, M., Awan, A., Khalid, M., Nisar, S.: Ensemble approach for developing a smart heart disease prediction system using classification algorithms. Res. Rep. Clin. Cardiol. 9, 33–45 (2018)
7. Zhenya, Q., Zhang, Z.: A hybrid cost-sensitive ensemble for heart disease prediction. BMC Med. Inform. Decis. Mak. 21(1), 73–94 (2021)
8. Ali, F., El-Sappagh, S., Islam, S.M.R., Kwak, D., Ali, A., Imran, M., Kwak, K.S.: A smart healthcare monitoring system for heart disease prediction based on ensemble deep learning and feature fusion. Inf. Fusion 63, 208–222 (2020)
9. Moghadas, E., Rezazadeh, J., Farahbakhsh, R.: An IoT patient monitoring based on fog computing and data mining: cardiac arrhythmia. Internet Things 11(100251), 1–11 (2020)
10. Baccouche, A., Zapirain, B.G., Olea, C.C., Elmaghraby, A.: Ensemble deep learning models for heart disease classification: a case study from Mexico. Information 11(207), 1–28 (2020)
11. Sun, L., Yu, Q., Peng, D., Subramani, S., Wang, X.: FogMed: a fog-based framework for disease prognosis based medical sensor data streams. Comput. Mater. Contin. 66, 603–619 (2020)
12. Sharma, S., Parmar, M.: Heart diseases prediction using deep learning neural network model. Int. J. Innov. Technol. Explor. Eng. 9(3), 2244–2248 (2020)
13. Uddin, M.N., Halder, R.K.: An ensemble method based multilayer dynamic system to predict cardiovascular disease using machine learning approach. Inform. Med. Unlocked 24(100584), 1–19 (2021)
14. Tuli, S., Mahmud, R., Tuli, S., Buyya, R.: FogBus: a blockchain-based lightweight framework for edge and fog computing. J. Syst. Softw. 154, 22–36 (2019)
15. Vecchiola, C., Chu, X., Buyya, R.: Aneka: a software platform for .NET-based cloud computing. High Speed Large Scale Sci. Comput. 18, 267–295 (2009)
16. Tuli, S., Basumatary, N., Gill, S.S., Kahani, M., Arya, R.C., Wander, G.S., Buyya, R.: HealthFog: an ensemble deep learning based smart healthcare system for automatic diagnosis of heart diseases in integrated IoT and Fog computing environments. Futur. Gener. Comput. Syst. 104, 187–200 (2020)

Chapter 5

A Cloud Native SOS Alert System Model Using Distributed Data Grid and Distributed Messaging Platform

Biswaranjan Jena, Sukant Kumar Sahoo, and Srikanta Kumar Mohapatra

Abstract Driver and passenger safety is a paramount concern today, backed by new safety laws and regulations introduced by many countries. Although much has been achieved in building safety support systems into modern vehicles, these are merely reactive approaches to life-threatening events. This paper proposes a model, with a basic implementation, that uses the Ignite distributed data grid and the Kafka event streaming platform to source both static and dynamic data from various data sources and to generate alerts and notifications that can be consumed by consumers such as mobile apps, producing timely push notifications that help avoid unwanted incidents. Law enforcement agencies can consume the real-time streaming data to build monitoring systems and to understand patterns and behavior so as to take proactive measures. Rules and data from the various Road Transport Offices (RTOs) can also be integrated to alert drivers about road conditions, traffic status, speed limits, traffic diversions, etc., which is not available as an integrated system at present. The model can also source data from social platforms such as Twitter, based on various search terms, and perform the required calculation and analysis on top of the streaming data using the capabilities of a distributed data streaming platform. Being cloud native, it can run on any cloud platform and provide anytime, anywhere accessibility while achieving scalability and real-time performance.

5.1 Introduction

Data has always been the driving factor for decision making, as seen in the increased usage of data storage and analysis ecosystems such as Hadoop. With the growing usage of the Internet and mobile devices, social platforms such as Twitter and Facebook are market leaders in connecting people and passing real-time information to targeted or subscribed users. Tweets on relevant topics are retweeted by followers, and the same holds for Facebook, where any message is automatically available to the subscribed groups. Followers, friends, and groups are concepts through which these social platforms implement an information chain that distributes information to millions of users in near real time. Collecting, analyzing, and monitoring social data using big data platforms is a hot topic. Connected automobiles, too, generate a large amount of information, which manufacturers use to provide customized services such as route information and vehicle status, with push notifications sent to the customer's mobile app, for example to announce the next service period.

This paper proposes an alert and notification framework based on Apache Kafka and Apache Ignite that generates proactive notifications to travelers from both the real-time data generated by social platforms such as Twitter and Facebook and by connected vehicles, and the static data available with road transport agencies about traffic on specific routes, sending alerts to travelers well ahead of time to avoid unwanted incidents. Real data is continuously monitored and analyzed: producers provide real-time data based on specific search terms, and consumers subscribed to the relevant topic automatically get notifications. Scalability and availability are achieved through cloud native application features, with the application deployable on a container platform hosted in the cloud. The proposed framework can also be used by law enforcement agencies to monitor activities with the help of real-time dashboards and to support the public in achieving safety and security.

5.2 Related Work

In the last few years, significant research has been carried out on data streaming frameworks using Kafka and on data grid frameworks using Ignite. The main research directions are forecasting on streaming data, messaging systems for log processing, connecting data streams with machine learning, architecture, and in-memory distributed computing. Kanavos et al. [1] propose a machine learning forecasting model for predicting qualitative weather information using big data frameworks such as Apache Kafka and Apache Spark, where information from various weather monitoring sensors is retrieved and processed in real time to improve forecasting performance. Torres et al. [2] describe the usage of the Internet of Things (IoT) and how edge computing brings IoT devices closer to the network, improving real-time performance when transmitting data to the cloud network; architectures based on Edge and Fog computing complement a cloud-based strategy by bringing devices and edge nodes closer together and can be used to process connected vehicles' real-time statistics. Kreps et al. [3] explain how to build a distributed messaging system using Kafka for collecting and delivering a high volume of log data with low latency, and how it performs much better than other messaging frameworks by processing gigabytes of data. Martín et al. [4] exploit container platforms such as Docker and Kubernetes, on which the application framework can be deployed and orchestrated; they use containerization to provide a highly scalable and fault-tolerant system based on TensorFlow, a widely used AI/ML framework that uses various data sources to train, improve, and make predictions with standard algorithms, and they explain in detail how raw information is used for predictions and recommendations. Processing streaming data in real time while handling scalability, speeding up failure detection across network devices, and overcoming low performance are some of the non-functional requirements of any such system. Sgambelluri et al. [5] achieved a performance gain using the Apache Kafka framework by processing 4000 messages per second with a very low latency of 50 ms and a reduced CPU load; they propose a reliable and scalable Kafka-based framework for optical network telemetry that provides much guidance for building a distributed messaging platform with Kafka. Tapekhin et al. [6] analyze distributed data storage requirements in terms of the consistency, availability, partitioning (CAP) theorem; Ignite is an in-memory data grid that can be configured as CP or AP and provides a key-value cache strategy with support for SQL-style queries. Bansod et al. [7] achieve low latency and high throughput for a trade surveillance system that monitors a large amount of transactional data in trading systems to reduce discrepancies and fraud; they discuss the design, implementation, and tuning of frameworks such as Kafka, Flink [8], and Ignite, and explain the effect of caching on streaming throughput, observing that disk-based processing leads to high latency compared with processing in Ignite's in-memory data grid. From the above literature it is inferred that Apache Kafka [9] is one of the best messaging frameworks for handling streaming data, and that Apache Ignite [10] has its own advantages and performance gains for analyzing data using an in-memory cluster.

5.3 Preliminaries

A single server or node, irrespective of how many computing resources it holds, can only scale up to a certain extent. Vertical scalability is limited by the maximum amount of resources a node can hold, and hence horizontal scalability comes to the rescue: multiple nodes share the load by distributing requests via a load balancer. Such nodes, once configured as part of a cluster, provide tremendous computing power with support for fail-over.


Fig. 5.1 Kafka and Ignite cluster architecture

5.3.1 ZooKeeper as Cluster Coordinator

Distributed computing is a paradigm shift in which a set of nodes runs as a single cluster, providing replication capabilities: data is distributed across nodes, and the failure of any node can easily be recovered by auto-switching, managed by a node coordinator such as ZooKeeper. ZooKeeper [11] is a distributed coordination service for distributed applications that manages the various nodes in the cluster and the replication existing across them. To provide high availability, ZooKeeper itself is replicated across a set of hosts. In a distributed environment there are multiple ZooKeeper instances with multiple server nodes; ZooKeeper acts as the master managing the available nodes within the cluster, and a client that connects to any node in the cluster is by default connected to all the other nodes. The scenario is depicted in Fig. 5.1.

5.3.2 Kafka as a Data Streaming Platform

Apache Kafka is a distributed messaging platform with event producer, event consumer, event processor, and event connector capabilities, all in a fault-tolerant way. It can process large streams of data as a consumer and achieves reliability with the help of its partition and replication strategy. Data can be filtered and stored in Kafka topics. Each topic is partitioned, allowing multiple consumers to read from the topic, each with its own offset in place. A consumer within a consumer group reads from a specific partition, and this parallelism in reading data concurrently from the topic results in high-performance data consumption. Kafka's replication strategy makes the system fault-tolerant, and it offers provisions for deriving new data streams from the existing producer streams. It achieves durability by committing messages to disk and aims for zero downtime and zero data loss. A minimal producer/consumer sketch is shown below.
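As a rough, self-contained illustration of these concepts (the chapter's own code, shown in the screenshots, uses the Java Kafka APIs), the sketch below uses the kafka-python client; the broker address and message fields are assumptions, and the topic name T_TWITTER follows Fig. 5.2.

```python
# Sketch: publish to and consume from a Kafka topic with kafka-python.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("T_TWITTER", {"user": "alice", "text": "#SOS accident on NH16", "geo": [20.29, 85.82]})
producer.flush()

# A consumer in a consumer group reads from its assigned partitions with its own offset.
consumer = KafkaConsumer(
    "T_TWITTER",
    bootstrap_servers="localhost:9092",
    group_id="sos-alert-group",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
```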


5.3.3 Ignite as the In-Memory Data Grid

Traditionally, relational databases are best suited for storing transactional data and are good for OLTP operations, but their efficiency decreases for OLAP, whereas NoSQL databases are scalable and provide much better performance than an RDBMS. Moreover, disk-based data processing is far less efficient than in-memory processing. Several NoSQL databases such as Cassandra [12], Redis [13], and MongoDB [14] provide in-memory storage, but often only in their enterprise versions. By contrast, Ignite is purely open source and can be used as an in-memory database. Redis, which is also available as open source, is schema-less, whereas Ignite provides both table schemas and distributed SQL queries, making it possible to fire SQL queries on top of the cache. Ignite supports a multi-tier storage mode where data can be stored completely in memory or together with an external database or the native file system. It supports the distributed joins required for advanced querying, with colocated joins, and stores both data and indexes in memory for better performance. It is a distributed key-value store in which data is distributed across cluster nodes and accessed via its APIs. Ignite can also work as a service grid, compute grid, or data grid: in the data grid scenario, data is distributed across the cluster with replication across nodes; in the compute grid scenario, each cluster node can participate in a computing task and the results are aggregated following the standard map-reduce pattern used in Hadoop [15]. Ignite can also act as an L2 cache and supports write-through caching. Both usage modes are sketched below.
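A rough illustration using the pyignite thin client (the chapter itself works with the Java/GridGain tooling); host, port, and all names are assumptions. It shows the two ways Ignite is used here: as a key-value cache and as a distributed SQL engine.

```python
# Sketch: key-value and SQL usage of Apache Ignite via the pyignite thin client.
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)

# Key-value usage: store an RTO record in a cache.
rto_cache = client.get_or_create_cache("RTO_CACHE")
rto_cache.put("NH16-KM120", {"speed_limit": 80, "status": "diversion"})
print(rto_cache.get("NH16-KM120"))

# SQL usage: DDL/DML statements run as distributed SQL over the cluster.
client.sql("CREATE TABLE IF NOT EXISTS location_master (id INT PRIMARY KEY, name VARCHAR, lat DOUBLE, lon DOUBLE)")
client.sql("INSERT INTO location_master (id, name, lat, lon) VALUES (?, ?, ?, ?)",
           query_args=[1, "Bhubaneswar", 20.29, 85.82])
for row in client.sql("SELECT name, lat, lon FROM location_master"):
    print(row)
```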

5.4 Proposed Architecture

This section gives a detailed description of the proposed SOS (signal of distress) alert platform, represented in Fig. 5.2, which exploits the features of a distributed messaging platform and a distributed in-memory data grid by leveraging the strengths of two of the best-suited frameworks: Apache Kafka as the data streaming platform and Apache Ignite as the in-memory data grid for real-time data analysis, per the studies mentioned in the related work section.

Fig. 5.2 Proposed architecture of SOS alert system

Streaming data is collected from data sources such as Twitter and filtered on search terms such as #SOS, #accident, #help, and #trafficalert, and the relevant details are stored in a Kafka topic using the Twitter APIs. These alerts or messages are generated by Twitter users and posted as alerts to their followers; similar messages can be streamed from Facebook, where users post SOS messages to friends or groups. Data is also fetched from static data sources such as Road Transport Offices (RTOs) via exposed APIs, written to Ignite, and stored in a cache, and SOS data provided by connected vehicles can likewise be fed into IOT_CACHE, as shown in Fig. 5.2. The Kafka client APIs have the required capabilities to stream incoming data from multiple data sources and write it to Kafka topics such as T_TWITTER and T_CONNECTED. Once data is populated in the various topics, it is written to the respective Ignite caches, such as IOT_CACHE, TWITTER_CACHE, and RTO_CACHE, while LOCATION_MASTER_CACHE is built from the data pulled from the LOCATION_MASTER table in the persistent database. All these caches are then joined by firing appropriate SQL queries from the Ignite client node to generate datasets of specific locations and the distress signals raised there by users, whether through social channels or by the appropriate authorities (a sketch of such a join follows). The resulting datasets are written to the respective Kafka target topics for consumption by various consumers. Travelers' mobile applications act as subscribers, and push notifications are generated on their phones based on their current location and the reference location data generated by the application. The data stored in the target topics can also be consumed by a data aggregation tool such as Elasticsearch [16], which indexes it for consumption by a visualization tool such as Kibana [17] to provide a real-time monitoring dashboard. The monitoring dashboard can be used by supporting agencies to mobilize resources and provide appropriate support to those in need, as well as to take corrective action; for example, a location generating accident messages at frequent intervals needs attention in terms of traffic management and infrastructure development.
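A rough sketch of that join step follows. The actual query appears only as a screenshot (Fig. 5.6), so the table and column names below are illustrative, and the sketch assumes the caches were created with SQL query entities so that Ignite can treat them as relational tables.

```python
# Sketch: join the streamed Twitter cache with the location master data
# on the Ignite client node to produce alert records for a Kafka target topic.
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)

alerts = client.sql(
    """
    SELECT t.text, l.name, l.lat, l.lon, t.reported_at
    FROM TWITTER_CACHE t JOIN LOCATION_MASTER_CACHE l
      ON t.location_id = l.id
    WHERE t.text LIKE '%#SOS%'
    """
)
for text, name, lat, lon, reported_at in alerts:
    # each row would be serialized and published to the Kafka target topic here
    print(name, lat, lon, reported_at, text)
```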

5.5 Implementation

As a full implementation needs substantial infrastructure and real-time data to generate actual alerts, a limited implementation was done using fictitious search terms to test the validity of the model. The implementation runs on a 4-core, 16 GB Docker machine, where the frameworks Kafka, Ignite, and Elasticsearch run as containers on top of Docker [18]. Kibana is used for visualization, running on top of the Elasticsearch indices to give a real-time monitoring dashboard. Kafka producers are written using both the Kafka client API and the Kafka connector [19] to pull details for various search terms from Twitter. The Kafka producer API that searches for a specific term on Twitter via the Twitter API is shown in the screenshot in Fig. 5.3 for reference.

Fig. 5.3 Kafka Twitter client API

The Kafka connector is used to connect to Twitter as a source and to Ignite as a sink to create the cache. As Ignite can treat a cache as a relational table, the following data model is built from the various source data retrieved. The required Twitter and Facebook API access is obtained using the appropriate access tokens and secrets for the respective APIs. The Twitter sink configuration contains the data presented in the screenshot in Fig. 5.4. Once the caches are populated in Ignite, they are joined to fetch specific data, which is written back to Kafka topics using sink connectors. The fictitious data model and the respective query are presented in the screenshots in Figs. 5.5 and 5.6. The above queries return the location coordinates along with the reporting time, which are then written into the relevant Kafka target topic so that they can be consumed by various consumers such as mobile apps, web applications, and Elasticsearch. The Elasticsearch consumer API used to fetch data from the Kafka topic is presented in the screenshot in Fig. 5.7, and sample screenshots of Elasticsearch and the Kibana dashboard are shown in Figs. 5.8 and 5.9 for reference.

Fig. 5.4 Twitter Kafka source connector configuration


Fig. 5.5 Ignite cache details using Gridgain control center plugin

Fig. 5.6 Join query on Ignite cache

Fig. 5.7 Elasticsearch client API

The above set-up was performed on a local machine with limited resources, with each of the given frameworks running as a Docker container. It briefly demonstrates how data is fetched from Twitter using the Kafka client API and written to the Ignite cache, how a SQL-compliant query is fired on top of the existing caches to fetch the relevant location and alert information and write it to the Kafka target topics based on the incident type, and how the Kafka topic is then consumed by the Elasticsearch API and visualized in a Kibana dashboard. A sketch of this last consumer step follows.
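A rough sketch of that consumer step, assuming the elasticsearch Python client (8.x signature) and illustrative topic/index names (the chapter's own consumer is shown only as a screenshot in Fig. 5.7):

```python
# Sketch: read alert records from the Kafka target topic and index them
# into Elasticsearch so that Kibana can visualize them on a dashboard.
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "T_SOS_ALERTS",
    bootstrap_servers="localhost:9092",
    group_id="es-indexer",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for msg in consumer:
    es.index(index="sos-alerts", document=msg.value)
```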


Fig. 5.8 Elasticsearch indices for Twitter and Kibana

Fig. 5.9 Kibana index and sample dashboard

5.6 Conclusion

The cloud native SOS alert system described in this paper is a forward-looking model validated by a proof-of-concept (POC) implementation over a limited data source and a sample static database. The proposed architecture can accommodate other streaming data sources such as Facebook and Instagram, as well as ready-to-use static data sources, in a full-grown large implementation on top of cloud infrastructure. The POC demonstrates a scalable architecture for building a complete SOS system that provides push notifications to travelers well ahead of time and acts as a preventive mechanism against future mishaps on the way. The implementation can further be extended and deployed on an orchestration platform such as Kubernetes [20] to achieve cloud native capabilities like auto-scaling and ease of deployment. Future work lies in the areas of performance optimization, security, and monitoring and tracing, to make the framework full-scale and production ready with automation capabilities, and in providing real implementations of application consumers that consume the processed data from the Kafka target topics.

References

1. Kanavos, A., Trigka, M., Dritsas, E., Vonitsanos, G., Mylonas, P.: A regularization-based big data framework for winter precipitation forecasting on streaming data. Electronics 10, 1872 (2021)
2. Torres, D.R., Martín, C., Rubio, B., Díaz, M.: An open source framework based on Kafka-ML for distributed DNN inference over the cloud-to-things continuum. J. Syst. Archit. 118, 102214 (2021)
3. Kreps, J., Narkhede, N., Rao, J.: Kafka: a distributed messaging system for log processing. In: NetDB Workshop (2011)
4. Martín, C., Langendoerfer, P., Zarrin, P.S., Díaz, M., Rubio, B.: Kafka-ML: connecting the data stream with ML/AI frameworks. arXiv, June 2020
5. Sgambelluri, A., Pacini, A., Paolucci, F., Castoldi, P., Valcarenghi, L.: Reliable and scalable Kafka-based framework for optical network telemetry. J. Opt. Commun. Netw. 13(10)
6. Tapekhin, A., Bogomolov, I., Velikanov, O.: Analysis of consistency for in memory data grid Apache Ignite. In: Ivannikov Memorial Workshop (IVMEM) (2019)
7. Bansod, R., Kadarkar, S., Virk, R., Raval, M., Rashinkar, R., Nambiar, M.: High performance distributed in-memory architectures for trade surveillance system. In: 17th International Symposium on Parallel and Distributed Computing (2018)
8. Flink: Available online: https://flink.apache.org/ (2021)
9. Apache Kafka: Available online: http://kafka.apache.org/ (2021)
10. Ignite: Available online: https://ignite.apache.org/ (2021)
11. Zookeeper: Available online: https://zookeeper.apache.org/ (2021)
12. Cassandra: Available online: https://cassandra.apache.org/ (2021)
13. Redis: Available online: https://redis.io/ (2021)
14. MongoDB: Available online: https://www.mongodb.com/ (2021)
15. Hadoop: Available online: https://hadoop.apache.org/ (2021)
16. Elasticsearch: Available online: https://www.elastic.co/ (2021)
17. Kibana: Available online: https://www.elastic.co/kibana/ (2021)
18. Docker: Available online: https://www.docker.com/ (2021)
19. Kafka Connector: Available online: https://github.com/lensesio/fast-data-dev (2021)
20. Kubernetes: Available online: https://kubernetes.io/ (2021)

Chapter 6

Efficient Peer-to-Peer Content Dispersal by Spontaneously Newly Combined Fingerprints

T. R. Saravanan, G. Nagarajan, R. I. Minu, Samarjeet Borah, and Debahuti Mishra

Abstract A peer-to-peer network is a distributed network design composed of participants that make a portion of their resources, such as processing power, disk storage, or network bandwidth, directly available to other network participants without the need for central coordination. Many existing mechanisms use time-consuming protocols and a unicast approach for dispersal that do not scale to a large number of consumers. To overcome these challenges, we combine a recombined-fingerprint approach with an efficient, scalable, privacy-preserving, P2P-based fingerprinting system that uses public-key encryption in distribution, together with a traitor-tracing technique based on binary fingerprints. The proposed method outperforms existing peer-to-peer content distribution mechanisms.

6.1 Introduction

In a peer-to-peer (P2P) network, a group of computers is connected together with equal permissions and responsibilities for sharing data. Unlike traditional client-server networking, no device in a P2P network is assigned solely to serve or to receive data. As the number of users of peer-to-peer systems has increased, so has the uncertainty in the distribution of content between dispatcher and recipient. In a peer-to-peer network, the fragments of a file are obtained by various users from the seller, and each such user then distributes the content to other users. Many mechanisms are available for the distribution of multimedia content, but the legitimate delivery of multimedia content [1] with copyright protection, while preserving the privacy of buyers, is a very challenging task. The prevailing anonymous fingerprinting protocols [2-4] are unfeasible for two main reasons: they use complex, time-consuming protocols with homomorphic encryption of the content, and they use a unicast approach for distribution that does not scale to a large number of buyers. On the other hand, the recombined-fingerprint approach [5] requires a complex graph search for traitor tracing, which demands the participation of other buyers and of honest intermediaries in its P2P distribution scenario. Various anonymous fingerprinting schemes [6] exploit the homomorphic property of public-key cryptography. These schemes permit embedding the fingerprint in the encrypted domain so that only the buyer obtains the decrypted fingerprinted content after using her private key. However, building a practical system on this idea appears troublesome, since public-key encryption expands the data and noticeably increases the communication bandwidth needed for transfers. A buyer-seller watermarking protocol [7] is used for secure data transmission. The proposed methodology uses two protocols, an anonymous communication protocol and an elementary traitor-tracing protocol, to accomplish efficient peer-to-peer content distribution.

6.2 Literature Review

Deng et al. [8] combined watermarking methods with cryptography for copyright protection, piracy tracing, and privacy protection. An efficient buyer-seller watermarking protocol based on a homomorphic public-key cryptosystem, operating in the encrypted domain, is developed to reduce both the computational overhead and the large communication bandwidth caused by the use of homomorphic public-key encryption schemes; the simulation results demonstrate improved efficiency in practical applications. Comisso et al. [9] proposed a coverage analysis for distributed networks by evaluating the coverage probability in two-dimensional and three-dimensional peer-to-peer millimeter-wave wireless links. The work adopts reasonable link-state models and realistic propagation conditions, including path-loss attenuation, angular dispersion, and mid- and small-scale fading, which follow recent channel measurements. Analytical expressions for the statistics of the received power and practical formulas for the coverage probability in the presence of interference and noise are derived; the accuracy of the obtained estimates is checked by independent Monte Carlo validations, the reliability of the noise-limited approximation is analyzed, and the expected link capacity is estimated. Camenisch [10] considered fingerprinting schemes that enable a merchant to identify the buyer of an illegally distributed digital good by providing each buyer with a slightly different version; asymmetric fingerprinting schemes further prevent the merchant from framing a buyer by making the fingerprinted version known to the buyer only. An efficient anonymous fingerprinting scheme that uses group signature schemes as a building block is proposed, and a by-product of independent interest is an asymmetric fingerprinting scheme that permits so-called two-party trials, which had been neglected so far. Devani et al. [11] proposed peer-to-peer content distribution using network coding in vehicular ad-hoc networks, where network dynamics and high-speed mobility make content distribution highly challenging and powerful systems are required for reliable, secure communication with faster communication times; changes to the homomorphic hash function used with network coding, together with changes to the AODV protocol, can improve the quality of communication. Domingo-Ferrer et al. [12] note that digital fingerprinting of multimedia content involves the generation of a fingerprint, the embedding operation, and the recognition of traceability from redistributed content, and that the asymmetry of the transaction between a buyer and a merchant must be achieved using a cryptographic protocol. A method for implementing the spread-spectrum watermarking technique in a fingerprinting protocol based on a homomorphic encryption scheme was presented, together with a rounding operation that converts real values into integers and its compensation, and the trade-off between robustness and communication overhead was analyzed to improve efficiency; experimental results show that the system can reproduce Cox's spread-spectrum watermarking method within an asymmetric fingerprinting protocol [13, 14]. Abudaqa et al. [15] proposed a content distribution framework for increasing content availability, accelerating the download process, and providing robustness against churn. Classical dense network coding is not practical for real systems because of its tremendous computational overhead; the proposed Marvelous Generation Network Coding (MGNC) enlarges the generation size so that it is as close as possible to the ideal size without increasing the computational overhead. MGNC outperforms classical and all previous coding-based schemes for P2P content distribution systems in terms of content availability, download time, overhead, and decodability, for all piece-scheduling policies [16, 17].

6.3 Proposed Work

6.3.1 Overview

Various anonymous fingerprinting schemes exploit the homomorphic property of public-key cryptography. These schemes permit embedding the unique fingerprint in the encrypted domain so that only the buyer obtains the decrypted fingerprinted content after using her private key. However, constructing a well-designed framework that exploits this idea appears troublesome, because public-key encryption expands the data and greatly increases the communication bandwidth needed for transfers. There are many mechanisms for implementing an asymmetric cryptographic protocol [18]. The tracing procedure may be cumbersome and require the participation of several innocent buyers, and the fraction of buyers examined in a search is not known in advance (although it is proven to diminish asymptotically to zero as the number of buyers increases, provided no backtracking occurs). Even when more than a single proxy is used for each download, the proxies could still collude to reconstruct the whole fingerprinted copy of a buyer and illegally redistribute that copy, making it possible to frame an honest buyer.

6.3.2 Framework Architecture

The merchant distributes the content to the seed buyers and generates the corresponding fingerprints during the distribution. A child buyer sends a download request to the merchant, and the merchant shows the child buyer the list of seed buyers holding the requested content. The child buyer then downloads the requested content from the seed buyers. If a child buyer attempts to redistribute the content, the merchant monitors and identifies that child buyer, and the identified child buyer is blocked. The whole process is illustrated in Fig. 6.1.

Fig. 6.1 Framework architecture

All uploaded multimedia content is split, and the merchant generates random fingerprints before distributing the content to the seed buyers. After distribution, illegal redistribution is identified using the anonymous communication and basic traitor-tracing protocols discussed below. The distribution of multimedia content is facilitated by fingerprint matching between the merchant and the seed buyers, and a pseudorandom binary sequence is used as a handle (primary database key) for the session key k (see Sect. 6.3.6).

6.3.3 Content Uploading and Splitting

Every buyer is known by their own pseudonym. Once all the peers have been created, the merchant distributes the multimedia content to the seed buyers. For distribution, the merchant uploads any multimedia content from a folder. The multimedia content is split depending on the content size, and the distributed content takes the form of frames. The split frames are stored in the back end of the program, and a seed buyer can view the distributed content frame by frame. The data flows from the merchant to the seed buyers.

6.3.4 Fingerprint Detection

Once the content has been split, it must be distributed to a number of seed buyers. The merchant generates a random fingerprint for every multimedia content item before distributing it; the fingerprint is generated automatically and stored in the database. The merchant creates a fingerprint for each particular seed buyer and monitors the fingerprints stored in the database, which is maintained with the help of MySQL. The fingerprints must be kept in the database to identify illegal redistribution, so the data flows from the merchant to the database whenever a fingerprint is generated. To handle any number of seed buyers, the merchant generates the fingerprint for the corresponding content while transferring it to each seed buyer: for every seed buyer that receives the content, a fingerprint is automatically created by the merchant and stored in the database. The database, monitored by the merchant, comprises the public key, the private key, and the path of the content, along with the fingerprints. A sketch of this bookkeeping follows.
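The paper stores fingerprints in MySQL; the sketch below uses Python's standard-library sqlite3 instead so that it stays self-contained, and the schema and all names are illustrative only, not the paper's.

```python
# Sketch: generate a random binary fingerprint per (content, seed buyer)
# pair and record it so that illegal redistribution can later be traced.
import secrets
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE fingerprints (content_id TEXT, buyer_pseudonym TEXT, fingerprint TEXT)")

def register_fingerprint(content_id, buyer_pseudonym, n_bits=128):
    fp = "".join(str(secrets.randbits(1)) for _ in range(n_bits))  # random binary fingerprint
    db.execute("INSERT INTO fingerprints VALUES (?, ?, ?)", (content_id, buyer_pseudonym, fp))
    return fp

fp = register_fingerprint("video42", "seed-buyer-07")
print(fp[:16], "...")
```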

6.3.5 Content Distribution

The child buyer selects the content to be downloaded and receives it from the seed buyers that are online. The content is split into several frames. During this step, the merchant compares the fingerprint of the content with that of the seed buyer. If no illegal transmission is detected, the content is sent to the requesting child buyers. The data flow starts from the merchant, which checks the status of the seed buyers and then distributes the content whose fingerprint matches.


6.3.6 Anonymous Communication Protocol

The protocol for anonymous communication is likewise based on using proxies between the parent and the child buyer. The content transferred over the network is again encrypted using symmetric cryptography, but the session key for encrypting the content is shared between parent and child using the transaction monitor as a temporary key database. The parent buyer picks a symmetric (session) key k and a pseudorandom value r to be used as a handle (primary database key) for k. The space for r should be large enough (for instance, 128 bits) to avoid collisions. The parent buyer sends (r, k) to the transaction monitor, which stores the pair in a database. The parent buyer sends r to the proxy, and the proxy forwards r to the child buyer. The child buyer sends the handle r to the transaction monitor, which replies with the symmetric key k. The transaction monitor keeps the record (r, k) only for a given period (a timer); when the timer expires, the transaction monitor deletes the record from the database. The parent buyer sends the requested segments, encrypted with k, to the proxy. The proxy forwards all segments to the child buyer, who can decrypt them using k.
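A minimal sketch of this handle-based key exchange, assuming Fernet from the `cryptography` package as a stand-in for the unspecified symmetric cipher; the TransactionMonitor class is an in-process stand-in for the transaction monitor's key database.

```python
import secrets
import time
from cryptography.fernet import Fernet

class TransactionMonitor:
    """Short-lived (r, k) store; entries expire after `ttl` seconds."""
    def __init__(self, ttl=60):
        self.ttl = ttl
        self.store = {}  # r -> (k, deadline)

    def put(self, r, k):
        self.store[r] = (k, time.time() + self.ttl)

    def get(self, r):
        k, deadline = self.store.get(r, (None, 0.0))
        if k is None or time.time() > deadline:
            self.store.pop(r, None)  # timer expired: delete the record
            return None
        return k

monitor = TransactionMonitor()

# Parent buyer: choose session key k and a 128-bit handle r, register (r, k).
k = Fernet.generate_key()
r = secrets.token_hex(16)          # 128-bit handle, large enough to avoid collisions
monitor.put(r, k)

# The proxy forwards only r to the child; the child redeems r for k at the monitor.
child_k = monitor.get(r)

# Parent sends content encrypted under k through the proxy; child decrypts with k.
ciphertext = Fernet(k).encrypt(b"content fragment")
assert Fernet(child_k).decrypt(ciphertext) == b"content fragment"
```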

6.3.7 Identification of Illegal Distribution

To detect illegal redistribution, the transaction monitor uses a traitor tracing protocol; with this protocol, redistribution can be recognized. For privacy preservation, the system maintains each buyer's fingerprint and a pseudonym for every buyer. If a seed buyer sends meaningless content to a child buyer, that seed buyer is an attacker. The child buyer reports the attack to the merchant, and the merchant traces the attacking seed buyer and stops the transmission. Illegal distribution is identified using the basic traitor tracing protocol discussed below.

Basic traitor tracing. The fingerprint f of the illegally redistributed content is extracted by the tracing authority using the extraction method and the extraction key provided by the merchant. The fingerprint's segments gj are encrypted using the public key of the transaction monitor: E_c(gj) = E(gj, Kc). The encrypted segments are grouped into sequences of m consecutive segments, which are encrypted using the public key of the authority. E_f is (efficiently) searched for in the database of the transaction monitor to recover the pseudonym of the illegal redistributor. The merchant then checks his database of customers and recovers the identity of the traitor corresponding to the pseudonym obtained in the previous steps.
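A minimal sketch of the final lookup step, under the simplifying assumption that the extracted fingerprint can be matched bit-for-bit; the public-key encryption of the segments, E(gj, Kc), is omitted here, and the two in-memory dictionaries stand in for the transaction monitor's and the merchant's databases.

```python
def trace_redistributor(extracted_fp: str, monitor_db: dict, merchant_db: dict):
    """
    monitor_db:  pseudonym -> fingerprint bit string (held by the transaction monitor)
    merchant_db: pseudonym -> real identity (held only by the merchant)
    """
    for pseudonym, fp in monitor_db.items():
        if fp == extracted_fp:  # exact match; a real system would tolerate some bit errors
            return merchant_db.get(pseudonym)
    return None  # no registered buyer matches the extracted fingerprint
```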


Fig. 6.2 Split image received by seed buyers

6.4 Experimental Results

6.4.1 Content Uploading and Splitting

The proposed content distribution mechanism and illegal distribution detection protocol are implemented using JavaFX and NetBeans. The merchant page offers content options such as audio, video, and images. These contents are distributed to the seed buyers as frames, and the distributed content appears on the seed buyer page, where the distributed images can be viewed as frames. Figure 6.2 shows a split image received by the seed buyers. The child buyer page consists of the child buyer ID and the list of child buyers, and the status of the seed buyers is also listed; the page provides request, download, and save options, where the request option is used to send a request to the merchant. The creation of the merchant, seed buyers, and child buyers is done here. The merchant splits the content and distributes it to the seed buyers, and the distributed content is in encrypted form.

6.4.2 Fingerprint Generation

The seed buyer receives the split images from the merchant. While distributing, the random fingerprint bits are generated; Fig. 6.3 illustrates a generated fingerprint. The fingerprint bits are stored in the database, and the corresponding fingerprint is transmitted along with the content. The database consists of the path of the content, the fingerprint bits, a hash code, and the seed buyer ID.


Fig. 6.3 Generated fingerprint

Fig. 6.4 Content distribution

6.4.3 Content Distribution

The seed buyer list shows the seed buyers who hold the requested content. The child buyer sends a request to the merchant and is shown the list of seed buyers that have the content; Fig. 6.4 illustrates the content distributed to child buyers. The child buyer then sends the download request to the merchant, and the requested content is distributed to the child buyer.

6.4.4 Identifying and Preventing Illegal Redistribution

When a child buyer tries to redistribute content to another child buyer, fingerprint matching is performed. If the fingerprints of the content and the buyer match, the transmission succeeds; if they do not match, an illegal redistributor has been identified. Figure 6.5 illustrates the identification of an illegal redistributor. Seed buyers can also act as attackers: an attacking seed buyer sends meaningless content to the child buyer. Figure 6.6 illustrates the identification of such seed buyers.


Fig. 6.5 Identification of illegal redistributor

Fig. 6.6 Identification of attacker seed buyers

The split images are shown as frames, and the merchant checks the fingerprint of each seed buyer. When illegal redistribution by a child buyer is detected, the identified child buyer is blocked by the merchant, so that the blocked child buyer can neither download content nor issue requests. If a seed buyer fails to distribute the content requested by a child buyer, that seed buyer is called an attacker. The child buyer can report the attacker, so the merchant can identify and block the attacker.


6.5 Conclusion

In this paper, a strategy for peer-to-peer content distribution based on spontaneously recombined fingerprints is implemented. Using this method, illegal redistribution of content can be identified and blocked. The scheme provides efficient and scalable distribution of multimedia content in peer-to-peer systems; efficient traitor tracing of illegal redistributors through a shared database; privacy protection and buyer frameproofness; mutual anonymity between merchant and buyers and between peer buyers; collusion resistance; and avoidance of fingerprint embedding except for a few seed buyers. Homomorphic encryption of the multimedia content is avoided: fragments of the content are encrypted by means of symmetric cryptography, which is far more efficient. Further research can focus on building a proof of concept of this proposal in a real distribution scenario. As future work for this project, the identification process can be improved further so that attackers are identified sooner.

References

1. Megías, D.: Improved privacy-preserving P2P multimedia distribution based on recombined fingerprints. IEEE Trans. Dependable Secure Comput. 12(2), 179–189 (2015)
2. Boneh, D., Shaw, J.: Collusion-secure fingerprinting for digital data. In: Advances in Cryptology (CRYPTO'95), LNCS 963, pp. 452–465. Springer (1995)
3. Bo, Y., Piyuan, L., Wenzheng, Z.: An efficient anonymous fingerprinting protocol. In: Computational Intelligence and Security, LNCS 4456, pp. 824–832. Springer, Berlin (2007)
4. Kuribayashi, M., Funabiki, N.: Decentralized tracing protocol for fingerprinting system. APSIPA Trans. Signal Inform. Process. 8, 813–819 (2019)
5. Megías, D., Qureshi, A.: Collusion-resistant and privacy-preserving P2P multimedia distribution based on recombined fingerprinting. Expert Syst. Appl. 71, 147–172 (2017)
6. Chang, C.-C., Tsai, H.-C., Hsieh, Y.-P.: An efficient and fair buyer-seller fingerprinting scheme for large scale networks. Comput. Secur. 29, 269–277 (2010)
7. Katzenbeisser, S., Lemma, A., Celik, M., van der Veen, M., Maas, M.: A buyer-seller watermarking protocol based on secure embedding. IEEE Trans. Inform. Forens. Secur. 3, 783–786 (2008)
8. Deng, M., Bianchi, T., Piva, A., Preneel, B.: Efficient implementation of a buyer-seller watermarking protocol using a composite signal representation. In: Proceedings of the 11th ACM Workshop on Multimedia and Security, pp. 9–18 (2009)
9. Comisso, M., Babich, F.: Coverage analysis for 2D/3D millimeter wave peer-to-peer networks. IEEE Trans. Wirel. Commun. 18(7), 3613–3627 (2019)
10. Camenisch, J.: Efficient anonymous fingerprinting with group signatures. In: Asiacrypt 2000, LNCS 1976, pp. 415–428. Springer, Berlin (2000)
11. Devan, M., Venkatesh, S.: Peer to peer content distribution using network coding in vehicular ad-hoc network. 1(10), 1–6 (2012)
12. Domingo-Ferrer, J., Megías, D.: Distributed multicast of fingerprinted content based on a rational peer-to-peer community. Comput. Commun. 36, 542–550 (2013)
13. Nagarajan, G., Minu, R.I.: Fuzzy ontology based multi-modal semantic information retrieval. Procedia Comput. Sci. 48, 101–106 (2015)
14. Nagarajan, G., Minu, R.I.: Wireless soil monitoring sensor for sprinkler irrigation automation system. Wirel. Pers. Commun. 98(2), 1835–1851 (2018)


15. Abudaqa, A.A., Mahmoud, A., Abu-Amara, M., Sheltami, T.R.: Super generation network coding for peer-to-peer content distribution networks. IEEE Access 8, 195240–195252 (2020)
16. Nagarajan, G., Minu, R.I., Jayanthila Devi, A.: Optimal nonparametric bayesian model-based multimodal BoVW creation using multilayer pLSA. Circ. Syst. Signal Process. 39(2), 1123–1132 (2020)
17. Nagarajan, G., Minu, R.I., Jayanthiladevi, A.: Brain computer interface for smart hardware device. Int. J. RF Technol. 10(3–4), 131–139 (2019)
18. Kuribayashi, M.: On the implementation of spread spectrum fingerprinting in asymmetric cryptographic protocol. EURASIP J. Inform. Secur. 2010, 1:1–1:11 (2010)

Chapter 7

An Integrated Facemask Detection with Face Recognition and Alert System Using MobileNetV2

Gopinath Pranav Bhargav, Kancharla Shridhar Reddy, Alekhya Viswanath, B. Abhi Teja, and Akshara Preethy Byju

Abstract The COVID-19 pandemic has impacted the lives of individuals, organizations, markets, and the whole world in a way that has changed the functioning of all systems. Many have tried to adapt by working online: children have started studying online and people have started ordering food online. Meanwhile, there are many people whose jobs demand physical presence at workplaces; they have no choice but to be exposed to the virus while keeping our society functioning. People need to adapt to the new "normal" by practicing social distancing and wearing masks, the most effective means of preventing Covid-19. To ensure this, we built a web application that aims at keeping people advised to wear masks constantly with the help of an integrated facemask detection and face-recognition system. The proposed system first detects whether the person in the real-time video feed is wearing a mask and then recognizes the face of the person if they are not wearing one. Finally, the proposed system alerts that specific violator to wear a mask through an auto-generated email to their personal email id. The application also allows the admin and the violators to log in and access the list of fines levied along with photo evidence.

7.1 Introduction

In 2020, Covid-19 was declared a global pandemic; it mainly spreads through physical contact by means of an infected person's saliva, respiratory droplets, etc. [1, 2]. A healthy person is infected when these droplets are inhaled or come in contact with the person's eyes, nose, or mouth. As a precautionary measure to contain the spread of the virus, lockdowns were imposed throughout the world. Due to this, educational institutions, universities, and workplaces had to shut down; students and employees could not attend their colleges or workplaces. This had a drastic effect on the country's economy, and people had a hard time working from home. The World Health Organization (WHO) suggested wearing masks at all times and practicing social distancing as the most effective measures of staying safe against infection.


Recently, vaccination drives have begun and several people are being vaccinated. Gradual efforts are being made worldwide to lift lockdowns, and very soon institutions will be allowed to function at their normal pace and capacity. This means many people will come in contact with each other in such places every day. There is still a need to follow safety protocols, since the risk of infection remains due to the advent of multiple variants of the Coronavirus [3]. Someone has to make sure that, even after places like universities, schools, and offices are reopened, the people in them still follow the safety measures. To ensure that everyone in a workplace or institution follows safety protocols at all times, a team of people might have to monitor them continuously; for these people to be present at all times to keep check of safety measures is very difficult, and it further puts them at risk of infection. To address this problem, our proposed system can be deployed in places where Closed-Circuit Television (CCTV) cameras are installed. Fortunately, most organizations have cameras installed in many different locations in their infrastructure. Our proposed system detects people without masks via a surveillance camera: if a person is found not wearing a facemask, their image is captured and stored in the database, and facial recognition is run on the image to identify the person. If the person's details are present in the organization's database, he or she is identified and an email is automatically sent to the person's registered email id, stating that they have violated the safety protocol by not wearing a mask. The proposed system consists of two models: (i) a Face Mask Detection Model and (ii) a Face-Recognition Model. These two models are integrated to identify people who are not wearing masks and alert them directly without human intervention. The Face Mask Detection Model checks for a facemask in an image or live video stream from surveillance cameras using transfer learning; it is trained using a Convolutional Neural Network (CNN). The dataset used for face mask training consists of 3918 images of real people's faces. The second model, the Face-Recognition Model, is used, as the name suggests, to identify faces with the help of a simple Python library called "face_recognition". This library is built on dlib's state-of-the-art face recognition, which again employs a CNN; dlib is a modern toolkit containing several machine learning algorithms, coded in C++. The face-recognition library is built with deep learning and has been trained to obtain an accuracy of 99.38% on the Labeled Faces in the Wild (LFW) benchmark. LFW [4] is a database created by the University of Massachusetts for studying unconstrained face recognition; it consists of 13,233 images of 5749 different people.


7.2 Related Works

In recent times, there has been a lot of Covid-19-related research in fields like biotechnology [5] and data science [6], suggesting various solutions to tackle this pandemic and to prepare for similar future adversities. Among them, one of the most researched areas is developing novel methodologies for facemask detection, since in several countries, as people adjust to the new normal after the lifting of lockdowns, wearing masks is mandated in public places. It is a herculean task for security personnel to physically monitor whether people are wearing masks, and this task also puts them at risk of infection. Hence, recent progressive studies and optimization techniques have been researched and published to help tackle this problem. In [7], a Face Mask Detection Model is proposed that uses MobileNetV2 and a single shot multibox detector (SSD) as the classification framework, achieving a good accuracy of 0.9264. This model also employs the OpenCV DNN module for face detection; their work also provides a contextual understanding of the convolution layer, pooling layer, and linear bottleneck. In [8], a mask detection system was proposed that runs in real time to detect faces and check whether they are wearing face masks using an optimistic CNN. This model detects faces in an image and determines whether each should be classified as "no_mask" or "mask" using Machine Learning (ML) libraries like Keras, TensorFlow, Scikit-Learn, and OpenCV. The model was trained using a Simulated Masked Face Dataset (SMFD), and it also makes use of the image augmentation technique in training and testing when data is limited. In [9], a hybrid model built on classical and deep machine learning consists of two components, one for feature extraction and the other for classification. For feature extraction, ResNet50 was used, and for mask classification three approaches were used: (i) support vector machine (SVM) algorithms, (ii) decision tree, and (iii) ensemble. The three datasets used were the Simulated Masked Face Dataset (SMFD), Labeled Faces in the Wild (LFW), and the Real-World Masked Face Dataset (RMFD). The SVM learning algorithm achieved 99.49% accuracy on SMFD, 99.64% accuracy on RMFD [10], and 100% testing accuracy on LFW [4]. In [11], the model was built by fine-tuning the pre-trained state-of-the-art deep learning model InceptionV3 [12]. A Simulated Masked Face Dataset (SMFD) was used to train and test the InceptionV3 model, and the image augmentation technique was also implemented to compensate for the limited availability of data. This model achieved an accuracy of 99.9% during training and 100% during testing, as claimed by the authors.

7.3 Methodology

In a broader abstraction, there are four steps in the implementation of our proposed system: (i) detecting faces in the live video feed, (ii) facemask detection, (iii) face recognition, and (iv) sending alerts to the violator and the admin, as shown in Fig. 7.1.


Fig. 7.1 The architecture of the proposed system

Firstly, the video feed is obtained using OpenCV modules. Next, faces need to be detected; a face detection model called faceNet was used for this purpose. The detected face locations are then used by the facemask detection model to determine whether each face has a mask or not. For the facemask detection model, we used libraries like TensorFlow and Keras and made use of a CNN. The dataset is split into a training set (80%) and a test set (validation set, 20%); TensorFlow and Keras were used for the preprocessing and training of the model. For training, we used MobileNetV2, a convolutional neural network architecture. MobileNetV2 [13] uses an inverted residual structure where the residual connections are between the bottleneck layers. After training the model, face detection and facemask detection are applied to the real-time feed, and based on the confidence, the probability of a mask being present in the frame is displayed. In the Face-Recognition model, we have adopted the face_recognition module from the Python library ecosystem. The library is built on dlib's state-of-the-art face recognition; though originally written in C++, it has easy-to-use Python bindings. The Face-Recognition model converts the training images of known faces into image encodings, which are stored in an encodings list. It then converts the image to be identified into an encoding and, using the face_recognition library, compares that encoding with the encodings in the list; by comparing encodings, we can identify the person in the image. When people in the live feed are detected without masks, i.e., the violators, it is the job of the Face-Recognition model to identify and recognize the violator's face. If the violator's image was used in the training set, the person's name is returned. This name is used to run a MySQL query to retrieve the person's contact information, such as the roll number (in a university) or employee ID (in a corporate setting) and the email id. A mail is sent to this email id, along with the photo as proof, to the violator. If the violator's face is unknown, a notification email is sent to the administrator's email id along with the photo.


These details can also be checked by the admin on the application's web page, through the admin login. As discussed, the proposed system consists of two main modules: (i) the Face Mask Detection Model and (ii) the Face-Recognition Model.
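Before detailing the two modules, a hedged sketch of the end-to-end loop they form. This is an outline rather than the authors' code: the model file name, the output order of the two-class mask model, the input normalization, and the helpers load_known_faces() and send_violation_email() (the MySQL lookup plus the email step) are assumptions.

```python
import cv2
import numpy as np
import face_recognition
from tensorflow.keras.models import load_model

mask_model = load_model("mask_detector.h5")          # assumed path to the trained model
known_encodings, known_names = load_known_faces()    # hypothetical helper

cap = cv2.VideoCapture(0)                            # live video feed
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    for box in face_recognition.face_locations(rgb):
        top, right, bottom, left = box
        face = cv2.resize(rgb[top:bottom, left:right], (224, 224)) / 255.0
        with_mask, without_mask = mask_model.predict(face[np.newaxis])[0]
        if without_mask > with_mask:                 # violator: recognize and alert
            enc = face_recognition.face_encodings(rgb, [box])[0]
            matches = face_recognition.compare_faces(known_encodings, enc)
            name = known_names[matches.index(True)] if True in matches else None
            send_violation_email(name, frame)        # hypothetical: MySQL lookup + email
```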

7.3.1 Facemask Detection Model

The facemask detection model (Fig. 7.2) has four main steps (a training sketch covering steps 2 and 3 appears after this list):

1. Dataset and preprocessing: The dataset, curated from sources such as Kaggle.com and Google Images, consists of two sub-datasets: "with_mask", containing 1915 images of people wearing masks, and "without_mask", containing 1918 images of people not wearing masks. A sample of both datasets is shown in Fig. 7.3. During the preprocessing phase, all the images are cropped to dimensions (224, 224) for uniformity. The images are then converted into numeric encodings and stored as NumPy arrays for ease of use and faster computation. Labels are also one-hot-encoded.

2. Data augmentation: Data augmentation is a very beneficial tool that can be used to generate additional training images from those already present. In this technique, transformations such as rotation, flipping, zooming, shifting, and shearing are applied to the images; in short, each image in the dataset is used to generate numerous similar variants. In this model, ImageDataGenerator from Keras is used for the data augmentation process. Samples of the augmented images from our dataset are shown in Fig. 7.4.

3. Training and testing the model: The dataset is split into training and testing sets in the ratio 80:20; the testing set is used for validation. With data augmentation, the total number of images is also increased, aiding in training the model better. Training is done in two steps: a base model and a head model. The base model is initialized with the weights of the "ImageNet" database instead of starting from scratch. ImageNet is a large dataset consisting of over 14 million labeled images belonging to more than 20,000 classes; hence, ImageNet weights are widely used in transfer learning models for image classification. A head model is then trained on top of the base model. The head model involves a few notable layers: a convolution layer, a pooling layer, a dropout layer, and a nonlinear layer. The convolution layer, the fundamental layer of a CNN, performs feature extraction and uses a sliding-window technique to generate feature maps. For the pooling layer, we used average pooling, which outputs the average of all values in the region under computation. A dropout layer is used to reduce overfitting of the model; we specify a dropout ratio that defines the likelihood of a neuron being dropped. A nonlinear Rectified Linear Unit (ReLU) activation is applied to a 128-unit hidden layer, and then a Softmax activation function is applied to the two output units.

4. Implementing the model: After successfully training, testing, and evaluating the accuracy of the model, we deploy it on live data, i.e., a video stream. We use OpenCV to capture live video frames and detect faces in each frame with the help of a face detection module. The frames of the detected faces are then fed to the facemask detection model to classify each face as either "with mask" or "without mask".

Fig. 7.2 The architecture of the Facemask detection model
Fig. 7.3 Sample images of "with_mask" and "without_mask" datasets
Fig. 7.4 A sample of augmented images from the dataset
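A hedged sketch of steps 2 and 3 using standard Keras APIs; the augmentation ranges, optimizer, and epoch count are illustrative values, not figures taken from the paper.

```python
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import AveragePooling2D, Dense, Dropout, Flatten, Input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation as described: rotation, zoom, shift, shear, flip.
aug = ImageDataGenerator(rotation_range=20, zoom_range=0.15,
                         width_shift_range=0.2, height_shift_range=0.2,
                         shear_range=0.15, horizontal_flip=True)

# Base model with ImageNet weights, frozen; a small head is trained on top.
base = MobileNetV2(weights="imagenet", include_top=False,
                   input_tensor=Input(shape=(224, 224, 3)))
base.trainable = False

x = AveragePooling2D(pool_size=(7, 7))(base.output)
x = Flatten()(x)
x = Dense(128, activation="relu")(x)   # 128-unit ReLU hidden layer
x = Dropout(0.5)(x)                    # dropout ratio is the drop likelihood
output = Dense(2, activation="softmax")(x)  # "with_mask" / "without_mask"

model = Model(inputs=base.input, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(aug.flow(train_x, train_y, batch_size=32),
#           validation_data=(test_x, test_y), epochs=20)   # arrays from preprocessing
```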

7.3.2 The Face-Recognition Model

The face-recognition model (Fig. 7.5) involves three steps (a short sketch using the face_recognition library follows the list):

1. Create the training dataset: If the model is to be trained to recognize all the people in an institution, one picture of each person, with the image file named after the person, is used to create the dataset, as shown in Fig. 7.6.

2. Convert the images to encodings: Once the image dataset is loaded, the images are converted to encodings using the "face_recognition" library. Each encoding is a 128-dimensional face encoding for each face in the image. These encodings are stored in an encode_list; with this, the training part is over.

3. Implement the model: To recognize a face from an image, the image is first converted to an encoding, and then "face_recognition.compare_faces" is used to identify the person. As claimed in the documentation of the face-recognition library, it has an accuracy of 99.38% on the Labeled Faces in the Wild (LFW) benchmark.

Fig. 7.5 Face-recognition architecture
Fig. 7.6 Image dataset for face recognition
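A minimal sketch of the three steps with the face_recognition library; the directory layout ("known_faces", one labeled image per person) is an assumption made for illustration.

```python
import os
import face_recognition

# 1. Build the encoding list: one labeled image per person.
encode_list, names = [], []
for fname in os.listdir("known_faces"):               # e.g. "alice.jpg" -> label "alice"
    image = face_recognition.load_image_file(os.path.join("known_faces", fname))
    encodings = face_recognition.face_encodings(image)  # 128-d vector per detected face
    if encodings:
        encode_list.append(encodings[0])
        names.append(os.path.splitext(fname)[0])

# 2.-3. Recognize a new face by comparing encodings.
unknown = face_recognition.load_image_file("capture.jpg")
for enc in face_recognition.face_encodings(unknown):
    matches = face_recognition.compare_faces(encode_list, enc, tolerance=0.6)
    if True in matches:
        print("Identified:", names[matches.index(True)])
```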

7.3.3 Database

The images captured during the face mask detection process need to be stored, so we created a database. The details of all the users also need to be maintained in order to send them an email when they are detected not wearing a mask. To update the database in real time, we used Python and PHP to connect to the database and add details.

7.3.4 Web Application

Our web application has a very intuitive design and provides login and sign-up facilities for users and the admin. This application helps admins monitor the violators and allows users to check their fine details.

7.4 Results

The MobileNetV2 model, initialized with pre-trained "ImageNet" weights and trained on the facemask dataset, is evaluated on the validation set. The evaluation metrics used are the accuracy and loss of the training and validation sets (Fig. 7.7), the classification report (Table 7.1), and the confusion matrix (Fig. 7.8). The graph in Fig. 7.7 shows the accuracies and losses of the training and validation sets with respect to epochs. During the first five epochs, training loss and validation loss were quite high, and they progressively reduced; after ten epochs, the values became more stable. Accuracy increased from 82.7 to 98% in the first four epochs, and after 12 epochs it climbed from 99.3 to 99.44% by the end of 20 epochs. The classification report in Table 7.1 shows the values of precision, recall, F1-score, accuracy, etc. Precision is the ratio of true positives to the sum of true positives and false positives; the model achieves a precision of 0.98 for "with_mask" and 1.00 for "without_mask". Similarly, recall is the ratio of true positives to the sum of true positives and false negatives; the model achieves a recall of 1.00 for "with_mask" and 0.98 for "without_mask". The F1-score is a weighted harmonic mean of recall and precision, used to check that real threats are identified correctly without being disturbed by false alarms; the F1-score for both "with_mask" and "without_mask" is 0.99, which is a very good score. The confusion matrix shown in Fig. 7.8 is plotted with the help of a heatmap from the seaborn library to represent the 2D matrix data.


Fig. 7.7 Training and validation losses and accuracies

Table 7.1 Classification report

                Precision   Recall   F1-score   Support
with_mask         0.98       1.00      0.99       383
without_mask      1.00       0.98      0.99       384
Accuracy                               0.99       767
Macro avg         0.99       0.99      0.99       767
Weighted avg      0.99       0.99      0.99       767

The model successfully identified 380+ true positives and 380+ true negatives; one image is a false positive and six images are false negatives.
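The per-class numbers in Table 7.1 can be checked against these counts. A small worked example, taking "without_mask" as the positive class, with the counts reconstructed from the text above (one false positive, six false negatives, class supports of 383 and 384):

```python
# Reconstructed confusion-matrix counts (positive class: "without_mask")
tp, fp, fn, tn = 378, 1, 6, 382

precision = tp / (tp + fp)                   # 0.997 -> rounds to 1.00
recall = tp / (tp + fn)                      # 0.984 -> rounds to 0.98
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)   # 760 / 767 ~ 0.99
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```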

7.5 Conclusion

The Covid-19 pandemic has really changed the way we approach our daily jobs and tasks, and wearing masks is a mandatory safety protocol. To ensure the safety of the people of organizations, universities, hospitals, and other workplaces that admit certain people on a regular basis, we have proposed a system that automatically detects people not wearing masks and notifies them, as well as the administrator of the organization, by email. We used computer vision, MobileNetV2, phpMyAdmin, and face recognition to monitor, detect, store, and alert people, to help ensure that they wear masks at all times.


Fig. 7.8 Confusion matrix

The performance of our proposed system, with 99.44% accuracy, clearly outperforms most existing works. The proposed system can also be implemented in public establishments, airports, railway stations, etc.

References

1. Rahmani, A., Mirmahaleh, S.: Coronavirus disease (COVID-19) prevention and treatment methods and effective parameters: a systematic literature review. Sustain. Cities Soc. 64, 102568 (2021)
2. Leung, N., Chu, D., Shiu, E., Chan, K., McDevitt, J., Hau, B., Yen, H., Li, Y., Ip, D., Peiris, J., Seto, W., Leung, G., Milton, D., Cowling, B.: Respiratory virus shedding in exhaled breath and efficacy of face masks. Nat. Med. 26(5), 676–680 (2020)
3. Koyama, T., Weeraratne, D., Snowdon, J., Parida, L.: Emergence of drift variants that may affect COVID-19 vaccine development and antibody treatment. Pathogens 9(5), 324 (2020)
4. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled Faces in the Wild: a database for studying face recognition in unconstrained environments. University of Massachusetts, Amherst, Technical Report 07-49, Oct 2007
5. Shohag, M., Khan, F., Tang, L., Wei, Y., He, Z., Yang, X.: COVID-19 crisis: how can plant biotechnology help? Plants 10(2), 352 (2021)
6. Alamo, T., Reina, D.G., Millán, P.: Data-driven methods to monitor, model, forecast and control Covid-19 pandemic: leveraging data science, epidemiology and control theory. arXiv:2006.01731 [q-bio.PE]
7. Nagrath, P., Jain, R., Madan, A., Arora, R., Kataria, P., Hemanth, J.: SSDMNV2: a real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2. Sustain. Cities Soc. 66, 102692 (2021)


8. Suresh, K., Palangappa, M., Bhuvan, S.: Face mask detection by using optimistic convolutional neural network. In: 2021 6th International Conference on Inventive Computation Technologies (ICICT), pp. 1084–1089 (2021). https://doi.org/10.1109/ICICT50816.2021.9358653
9. Loey, M., Manogaran, G., Taha, M.H.N., Khalifa, N.E.M.: A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic. Measurement 167, 108288 (2021). ISSN 0263-2241. https://doi.org/10.1016/j.measurement.2020.108288
10. Wang, Z., et al.: Masked face recognition dataset and application. arXiv preprint arXiv:2003.09093. RMFD (2020)
11. Jignesh Chowdary, G., Punn, N.S., Sonbhadra, S.K., Agarwal, S.: Face mask detection using transfer learning of InceptionV3. In: Bellatreche, L., Goyal, V., Fujita, H., Mondal, A., Reddy, P.K. (eds.) Big Data Analytics. BDA 2020. Lecture Notes in Computer Science, vol. 12581. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66665-1_6
12. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826 (2016)
13. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520 (2018)

Chapter 8

Test Case Generation Using Metamorphic Relationship

Sampa Chau Pattnaik, Mitrabinda Ray, and Yahya Daood

Abstract In software testing, there are two major problems that affect the fundamentals of testing: the oracle problem and the reliable test set problem. The oracle is the mechanism that checks the correctness of the output after running the program on the selected input data; its absence creates the oracle problem. No oracle exists for non-testable programs: although determining the correct output might be theoretically possible in some cases, it is too difficult in practice. Metamorphic testing is a method that has been proposed to reduce the intensity of the oracle problem. It uses properties of the target function or program, called metamorphic relations, to test it automatically without human intervention.

8.1 Introduction

Software testing is an important aspect of building any software: no matter how complete the developer thinks the work is, it still needs to be tested to discover any defects or bugs the developer might have overlooked or missed. Those defects and bugs can range from unimportant to expensive and dangerous, and in some cases can result in a system failure. Software testing is an essential procedure because it ensures quality, reliability, accurate results, and effective performance, reduces software maintenance costs, and, most importantly, helps ensure that the software does not encounter failures. It is therefore essential that the software be tested before delivery to the client, to avoid huge maintenance costs after delivery. The automation of software testing is as desirable as it is practical, making the testing process more reliable, less time-consuming, easier, and much cheaper to deploy. The typical way to do so is by using a test oracle.


Fig. 8.1 A test oracle

We need the test oracle, in other words, a procedure that distinguishes between the faulty and the correct outputs of the system or program by comparing the produced output with the desired output [1–3]. As shown in Fig. 8.1, a test oracle is a predicate that determines whether a given sequence is acceptable or not; an oracle responds with a pass or a fail verdict on the acceptability of any test sequence for which it is defined. However, testing using an oracle still faces some obstacles, one of them major: the oracle problem [4, 5]. The oracle problem is the lack of the desired output, due either to unavailability or to the amount of resources required, such as the time or the memory needed to produce the required output of the system [6–8]. In this case, when the oracle problem exists, it is not practical or functional to produce the correct required output for every Test Case (TC). The paper is organized as follows: Sect. 8.2 describes the usability of successful test cases, Sect. 8.3 describes the basics of metamorphic testing and creating metamorphic relations for test case generation, and Sect. 8.4 presents the conclusion.

8.2 Usability of Successful Test Cases

Software testing without one hundred percent coverage of all possible inputs and outputs can only show the presence of faults; it cannot prove their complete absence [3]. In almost every situation, successful TCs are considered useless because they do not detect failures, and those test results are therefore not examined further. The perception of successful TCs started to change 25 years ago. In summary, most TC generation strategies serve a particular objective, so each produced TC carries certain valuable material regarding the tested program [1, 4]. This revisiting of whether successful TCs are useless or not has resulted in the idea of generating new test cases from successful test cases [5–7]. First, a metamorphic property is required: a property of the program or function in terms of inputs and their expected outputs. Metamorphic relations use these metamorphic properties to create a relation on the target function or program, generating additional TCs (follow-up TCs) from existing successful TCs (source TCs) together with the outputs of those new cases.


Since the source TC determines the follow-up TC, the follow-up TC will carry some or all of this valuable testing material [4]. After applying the metamorphic relations, if we observe that the actual outputs of a follow-up TC and its source TC violate a metamorphic relation, then the program under test is faulty with respect to the property connected with that metamorphic relation. Although metamorphic testing was first proposed as a methodology to generate new TCs from existing successful TCs, it soon became clear that it can be used whether or not the source TC was successful. Moreover, it provides a lightweight but effective mechanism for the verification of test results, which is what has made metamorphic testing a promising approach to the oracle problem. Metamorphic testing is not the sole technique designed to use successful TCs: Adaptive Random Testing (ART) [2, 5] spreads TCs evenly across the input domain by using the locations of successful TCs to guide the selection of subsequent ones. The problem of categorizing images using seemingly hidden indicators termed Artcodes has been investigated, with MT used to verify and improve the trained classifiers [8]. Supervised classifiers have also been tested to evaluate the fault-detection effectiveness of MRs [9].

8.3 The Formalization and Insight of Metamorphic Testing

The insight behind metamorphic testing is to reduce the effect of the oracle problem in this way: if the correctness of the outputs cannot be determined for some inputs, we may still be able to deploy metamorphic relations among the expected outputs of several related inputs. We explain this by considering two examples, the Shortest Path problem and the cosine (cos) function.

Example 1 The first example is the Shortest Path (H, u, v) problem: we need to find the shortest path between two vertices u and v in an undirected graph H with positive edge weights, and output the length of that path. Let f be an algorithm that computes the shortest path between vertices u and v in H, and let P be the program implementation of f. The problem obviously gets harder as H grows larger. To verify whether P(H, u, v) really returns the shortest path between the two vertices u and v, one potential approach is to find all possible paths from u to v and check that P(H, u, v) is indeed the shortest among them. However, it may not be feasible to enumerate all paths from u to v, as their number grows dramatically with the number of vertices. Even though the oracle problem is present while testing the program P, we can use some properties to attest the result to some extent. For example, we can derive a metamorphic relation from this property: if we exchange the vertices u and v, the shortest path length must remain the same, that is, f(H, u, v) = f(H, v, u). Based on this metamorphic relation, the test needs two executions: the source TC (H, u, v) and the follow-up TC (H, v, u).


Table 8.1 TCs and metamorphic relations for the problem, shortest path

Condition (all rows): u, x1, x2, …, xk, v is a feasible path.

New TC inputs                                    Expected results
(H, v, u)                                        Returns the same shortest distance, though it may be a different path
(H, u, xi) and (H, xi, v), for any 1 ≤ i ≤ k     The sum of the two distances is equal to p
(H, v, xi) and (H, xi, u), for any 1 ≤ i ≤ k     The sum of the two distances is equal to p
(H, x1, xk)                                      This distance plus the lengths of (H, u, x1) and (H, xk, v) should equal p

Let p be the length of the shortest path from u to v in the graph H; Table 8.1 shows some examples of TCs and metamorphic relations for this problem. As an alternative to verifying the result of a single test execution against the desired output, the results of the multiple executions can be verified against the metamorphic relation: the relation P(H, u, v) = P(H, v, u) (f is simply replaced by P) is checked to be satisfied or violated. If it is not satisfied, we can conclude that P is faulty.

Example 2 The second example is computing the value of the cos(x) function. Consider a program that returns the value of cos(x) for any real x. Here are some metamorphic relations that we can use for testing and for generating new TCs:

cos(x) = cos(−x)            (8.1)
cos(x) = cos(x + 2π)        (8.2)
cos(x) = cos(x − 2π)        (8.3)
cos(2x) = 2cos²(x) − 1      (8.4)
cos(x + π) = −cos(x)        (8.5)
cos(x − π) = −cos(x)        (8.6)
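Relations (8.1)–(8.6) translate directly into an executable metamorphic test. A minimal sketch follows; a floating-point tolerance is needed because the equalities cannot hold exactly in floating-point arithmetic, and random source inputs are used since no oracle for cos(x) is required.

```python
import math
import random

MRS = [
    lambda f, x: math.isclose(f(x), f(-x), abs_tol=1e-9),                  # (8.1)
    lambda f, x: math.isclose(f(x), f(x + 2 * math.pi), abs_tol=1e-9),     # (8.2)
    lambda f, x: math.isclose(f(x), f(x - 2 * math.pi), abs_tol=1e-9),     # (8.3)
    lambda f, x: math.isclose(f(2 * x), 2 * f(x) ** 2 - 1, abs_tol=1e-9),  # (8.4)
    lambda f, x: math.isclose(f(x + math.pi), -f(x), abs_tol=1e-9),        # (8.5)
    lambda f, x: math.isclose(f(x - math.pi), -f(x), abs_tol=1e-9),        # (8.6)
]

def metamorphic_test(f, n_cases=100):
    for _ in range(n_cases):
        x = random.uniform(-10.0, 10.0)       # source test case
        for i, mr in enumerate(MRS, start=1):
            if not mr(f, x):
                return f"MR (8.{i}) violated at x={x}"
    return "all relations satisfied"

print(metamorphic_test(math.cos))
```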


As shown in Fig. 8.2, the cos() function satisfies a number of well-known identities that can be used to test it. Test data and the corresponding expected results are expressed in terms of (θ, cos(θ)). The relations invoke the cos function several times with different parameters; since each invocation with a different parameter could execute a different path, this provides a high chance of uncovering program faults. Table 8.2 shows the generation of new test cases based on a successful test case. Metamorphic testing is employed directly by definition: both the generation of TCs and the verification of test results are applied depending on metamorphic relations. Research papers on metamorphic testing [5, 6] and metamorphic relations [7] and their applications have identified a large number of metamorphic relations and have suggested that metamorphic relations are not very hard to identify, even though their identification cannot be automated. By using an extremely modest metamorphic relation, hundreds of real-life defects have been detected in two very widespread compilers [1], which is solid evidence that it is not hard at all to identify decent metamorphic relations.

Fig. 8.2 Generating new test cases from a given input

Table 8.2 Some possible TCs and metamorphic relations for the function cos(x)

Condition (all rows): the output of cos(x) is a real number.

New TC inputs           Expected result
cos(−x)                 Must return the same value
cos(2π + x)             Must return the same value
cos(−2π + x)            Must return the same value
cos(2x), cos(−2x)       Must return 2cos²(x) − 1
cos(x + π)              Must return the negative of the initial output: −cos(x)


The development process of metamorphic testing should also be easy and direct for users, meeting their own special demands [3, 7].

8.4 Conclusion

Metamorphic testing was invented in 1998 to generate new test cases (TCs). It uses necessary properties of the tested system, the metamorphic relations. Since then, metamorphic testing has been used to reduce and alleviate the oracle problem, because of its simplicity and high effectiveness in using metamorphic relations to verify test results. Metamorphic testing has been very successful in detecting different kinds of errors across various domains, advanced techniques, and software. Finding new test cases automatically and efficiently is still an ongoing research subject. Our next approach is to automate the process of finding new test cases by using artificial intelligence and machine learning techniques to extract the metamorphic relations and metamorphic properties from the program under test.

References

1. Cao, Y., Zhou, Z.Q., Chen, T.Y.: On the correlation between the effectiveness of metamorphic relations and dissimilarities of test case executions. In: 2013 13th International Conference on Quality Software, pp. 153–162. IEEE, China (2013)
2. Chan, W.K., Chen, T.Y., Lu, H., Tse, T.H., Yau, S.S.: Integration testing of context-sensitive middleware-based applications: a metamorphic approach. Int. J. Software Eng. Knowl. Eng. 16(05), 677–703 (2006)
3. Chan, W.K., Tse, T.H.: Oracles are hardly attained, and hardly understood: confessions of software testing researchers. In: 2013 13th International Conference on Quality Software, pp. 245–252. IEEE, China (2013)
4. Chen, L., Cai, L., Liu, J., Liu, Z., Wei, S., Liu, P.: An optimized method for generating cases of metamorphic testing. In: 2012 6th International Conference on New Trends in Information Science, Service Science and Data Mining (ISSDM2012), pp. 439–443. IEEE, Taiwan (2012)
5. Chen, T.Y., Cheung, S.C., Yiu, S.M.: Metamorphic testing: a new approach for generating next test cases. arXiv preprint arXiv:2002.12543 (2020)
6. Chen, T.Y., Ho, J.W.K., Liu, H., Xie, X.: An innovative approach for testing bioinformatics programs using metamorphic testing. BMC Bioinform. 10(1), 1–12 (2009)
7. Chen, T.Y., Poon, P.-L., Xie, X.: Metric: metamorphic relation identification based on the category-choice framework. J. Syst. Software 116, 177–190 (2016)
8. Xu, L., Towey, D., French, A.P., Benford, S., Zhou, Z.Q., Chen, T.Y.: Using metamorphic relations to verify and enhance Artcode classification. J. Syst. Software, 111060 (2021)
9. Saha, P., Kanewala, U.: Fault detection effectiveness of metamorphic relations developed for testing supervised classifiers. In: 2019 IEEE International Conference on Artificial Intelligence Testing (AITest), pp. 157–164. IEEE (2019)

Part II

IoT/Network

Chapter 9

IRHMP: IoT-Based Remote Health Monitoring and Prescriber System

Rahul Chakraborty, Asheesh Balotra, Sanika Agrawal, Tanishq Ige, and Sashikala Mishra

Abstract The Novel Coronavirus Disease 2019 (COVID-19), which created this pandemic, makes us realize the importance of universal social and health care systems. Frontline workers worked tirelessly during the pandemic, and some of them lost their lives. There is a need for a remote IoT health monitoring system that takes care of the health of infected patients, conducts regular health checks, and reduces contact between an infected person and health workers. This especially helps patients with mild symptoms who are quarantined at home. The IoT system monitors a person 24/7, and a report can be generated and sent to the doctor at the same time. However, such a procedure produces a large amount of data. A major research challenge addressed in this paper is to effectively transfer healthcare data over existing network infrastructure to the cloud. In this paper, we identify the key network and infrastructure requirements for a standard health monitoring system based on real-time event updates, bandwidth requirements, data collection, and data analysis. We then propose IRHMP, an IoT-based remote healthcare system that delivers healthcare data efficiently to the cloud and the web portal. Finally, we propose a machine-learning algorithm to predict future health risks with the help of the recorded data.

9.1 Introduction

The COVID-19 pandemic has taken a toll on all of us, but the worst affected parts of our society are the ones with a dense population. These areas are affected the most because of the active spread of the virus and the lack of enough intensive care units. Elderly people can easily get help through our system. This growing population of people who are not able to take care of themselves calls for a smart solution that automates the process of daily health checks and removes the hassle of going to a clinic on a daily basis.

9.1 Introduction The COVID-19 pandemic has taken a toll on all of us, but the worst affected parts of our society are the ones with a dense population. These areas get affected the most because of the active spread of the virus and lack of enough intensive care units. Old age people will be easily getting the help through our system. This growing population of people who are not able to take care of themselves calls for a smart R. Chakraborty (B) · A. Balotra · S. Agrawal · T. Ige International Institute of Information Technology, Pune, India S. Mishra Symbiosis Institute of Technology, Symbiosis International (Deemed) University Pune, Pune, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_9

97

98

R. Chakraborty et al.

In conditions such as Alzheimer's disease, a remote health monitoring system can be very beneficial: if integrated with an IoT-based smart home system, it enables a caretaker to monitor a patient 24/7 and also monitor home conditions, such as which door has been opened or whether the stove has been left on. It can be deployed with a smart home system with door and motion sensors, and it can also be integrated with a live video feed of particular rooms for better and more intensive care. This can be paired with any smart device to monitor the vital signs of a patient and also their physical state using the video feed. For such people, who require constant care and are not able to take care of themselves, including people infected with contagious and deadly viruses, an efficient system is required that can perform continuous monitoring with integrated machine-learning solutions for future health risk detection. The primary features of the expected system are to alert the guardian or the doctor when certain parameters break a threshold, to give a detailed analysis of the recorded data to the doctor, and to predict any future health risk. The parameters to be measured are heart rate, body temperature, blood SpO2 level, ECG, and blood pressure, and the parameters to be entered manually are blood cholesterol, age, and chest pain (if any). Our work includes:

• Analyzing the network communication requirements of the system and choosing the best option for our needs.
• Proposing a cloud-based remote health monitoring architecture which can further be used to create a prototype.
• Proposing a machine-learning model for predicting future health risks.

The paper focuses specifically on the IRHMP model.

9.2 Previous Work

9.2.1 Architecture

In [1–22], several types of IoT-based healthcare systems have been proposed, with several types of implementations. Generally, three viewpoints help to construct an IoT model [4]: network-centric IoT, cloud-centric IoT, and data-centric IoT. Although the implementation domains may differ, they share some common basic components, such as communication layers and a data-flow strategy. For our system, we have chosen a cloud-centric IoT system with some inspiration from the data-centric IoT viewpoint.


9.2.2 Communication Protocol

For the application layer of our system, many network communication protocols are available for communication between IoT devices and between IoT devices and the cloud, such as HTTP [5], or protocols dedicated to the IoT field like MQTT [6] or CoAP [7]. Comparisons between these protocols are discussed in recent literature. In [8], it was determined that MQTT may be better when special functionalities are required, such as different levels of QoS and multicast; however, CoAP performs better than MQTT in terms of bandwidth and speed most of the time. The choice may also depend on the particular scenario: MQTT gives better results when communication between devices is very frequent but the changes are mostly negligible, while CoAP may perform better for large messages, as it fragments the data. We have also studied the comparison between CoAP and HTTP in [9]: CoAP generated 85% less data than HTTP, which makes it storage-efficient, and it turns out to be much more power-efficient than HTTP. In [10], the authors gave a qualitative analysis of the state-of-the-art IoT protocols (HTTP, MQTT, CoAP, AMQP). The study shows that every protocol comes with its pros and cons; it depends entirely on the use case, and it is up to the user to choose the most suitable protocol. Further, it is possible to use multiple protocols in a single complex system. These conclusions were drawn after considering many factors, such as QoS schemes and communication type.

9.3 IRHMP—IoT-Based Remote Health Monitoring and Prescriber System

9.3.1 Architecture

Below we present the IoT-based Remote Health Monitoring and Prescriber (IRHMP) architecture, which consists of four layers, namely the sensing layer, gateway layer, cloud layer, and application layer, as described in Fig. 9.1. The application layer is not represented in the diagram but is discussed below.

The sensing layer contains all the sensors required for remote health monitoring. With the progress of fabrication technologies, many biosensors available on today's market are very small and can easily be integrated with our system, such as a blood pressure sensor, a heart rate sensor, and a temperature sensor. If we integrate our system with smartphones, we can create another source of healthcare data, which may include the number of steps per day, screen on–off time, and probable sleep hours. The data collected in this layer is forwarded to the gateway layer for further processing.

In the gateway layer, the data collected from the sensors is filtered, preprocessed, and encrypted using algorithms such as RSA and AVT statistical filtering, and then transferred to the cloud via the Wi-Fi module present in the system. All the sensors send their data into a local database which, after processing, is transferred to the cloud. These algorithms prepare the data for the next stage of efficient data transmission: the bandwidth and data volume to be transferred are reduced, and client-side encryption using the RSA algorithm provides transmission security. Encryption and decryption are carried out in this layer, as health data is sensitive and needs to be protected. The gateway layer thus acts as a mediator between the two main layers, the sensing layer and the cloud, and is responsible for efficient and secure transmission between them. The data is transferred from the device using the ISP Wi-Fi module and uploaded to the cloud for further processing; for efficient bandwidth management of the network, the CoAP protocol is preferred for transmission.
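A minimal sketch of the client-side encryption step in the gateway layer, using RSA-OAEP from the `cryptography` package. Key management is simplified here for self-containment: in a deployment, only the cloud's public key would reside on the gateway, and the CoAP transfer itself is indicated only in a comment.

```python
import json
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Key pair for the cloud endpoint; in practice only the public key lives on the gateway.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

def encrypt_reading(reading: dict) -> bytes:
    """Client-side RSA-OAEP encryption of one filtered sensor sample."""
    payload = json.dumps(reading).encode()
    return public_key.encrypt(
        payload,
        padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None),
    )

sample = {"spo2": 97, "heart_rate": 72, "temp_c": 36.8}
ciphertext = encrypt_reading(sample)
# The ciphertext would then be sent to the cloud over CoAP (e.g. with the aiocoap
# library); the cloud decrypts with the matching private key.
```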


Fig. 9.1 The IRHMP Architecture

The cloud receives the data from the gateway layer and first stores it in the database. Using different processing algorithms and plotting techniques, the raw data is converted into graphs and meaningful tables, which are shown to a guardian or a doctor on an online portal for self-analysis and can be used by the doctor when prescribing medicines. For processing the data and for prediction, the XG-Boost classification algorithm is used; the machine-learning model was trained on a dataset from https://archive.ics.uci.edu. The prediction can be reviewed by the doctor and further action taken. On receiving permission from a doctor, our system is able to auto-prescribe FDA-approved medicine to a patient.
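A hedged sketch of training the XG-Boost classifier on the UCI heart-disease data for the prediction step just described; the file name "heart.csv" and the column layout are assumptions about how the downloaded dataset is stored.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Columns follow the UCI heart-disease dataset; the exact file layout is assumed.
df = pd.read_csv("heart.csv")
X = df.drop(columns=["target"])   # age, sex, chest pain type, cholesterol, ...
y = df["target"]                  # 1 = heart disease present

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# For a monitored patient, the recorded vitals plus the manually entered fields
# are assembled into one feature row and scored:
risk = model.predict_proba(X_test.iloc[[0]])[0, 1]
```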



customized by the doctor and the guardian to set alerts for medication reminders and alarms in case of abnormal readings. The application layer holds the web application/portal designed for interaction with the recorded data and contains a visualized version of the data for easier perception. This application can have many features such as medication reminders, viewing health-risk predictions, interactive data representation, etc. Our system further allows a guardian/doctor to set medication alerts as well as alerts for a sudden drop or rise in any of the recorded parameters.
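The gateway-layer encryption described above can be illustrated with a small sketch. The snippet below uses Python's cryptography package to RSA-encrypt a sensor reading before upload; the key size, padding choice, and sample payload are our assumptions for illustration, not details specified by the system.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Key pair; in practice the cloud side would hold the private key (assumption).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# A hypothetical filtered sensor reading, encrypted at the gateway...
reading = b'{"patient": 42, "spo2": 97, "heart_rate": 72}'
ciphertext = public_key.encrypt(reading, oaep)

# ...and decrypted on the cloud side before storage.
assert private_key.decrypt(ciphertext, oaep) == reading
```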

9.3.2 Functionalities

Doctor:
• Login/logout: Login credentials are provided to the doctors.
• Prescribe medication: The doctor can prescribe medication for a specific patient, specify the dose and time for each medicine, and set reminders for patients.
• Monitor patients: A dashboard is provided to monitor multiple patients at a time and to set customizable alerts on the readings for every patient, so that the doctor is notified of any abnormal reading.
• Check input from the ML model: The system constantly monitors the health of a patient and records the data, which can be used to predict future heart risks so that doctors can prescribe medication as necessary.

Patient:
• Login/logout: After registration, patients are accepted by the admin and a doctor is allotted to the patient. Their credentials are stored in a local SQL database server, which is used for verification the next time the patient wants to log in.
• Register: Patients first need to register through our portal, after which the admin allots a doctor to the patient. Patients can also be registered directly by the admin in the case of intra-hospital patients.
• Dashboard: Patients have a dashboard for monitoring, with medicine reminders shown on it. Guardians can also monitor these readings using the patient's credentials.
• Book appointment: Patients are given an option to book an online or clinical appointment with the doctor at both parties' convenience. This feature can further be used to book hospital beds in case of an emergency.

Admin:



• Manage doctors: The admin has access to add, remove, and edit the credentials of all the doctors. Doctors can send a password-change request to the admin through our portal in case they forget their password.
• Manage patients: The admin has access to add, remove, and edit the credentials of all the patients. Further, the admin is responsible for allotting a doctor to every patient and registering new patients into the system. Patients also need to raise a password-change request to the admin in case they forget their password.
• Update availability: The admin manages the appointment schedules of doctors and the availability of beds in the hospital. The admin has to update these appointment schedules and the bed availability regularly for the system to work effectively.

9.3.3 Implementation Working

Front end: For building the online portal for the doctor and the patient, we use the MERN stack. The manual input expected from the user or doctor is: age, gender, smoking habits, serum cholesterol (in mg/dl), blood sugar level, and chest pain type (if any). These inputs are further processed and fed into our model to obtain possible future health risks. In addition, our portal shows a real-time feed from the sensors, which includes body temperature, SpO2 level, and a real-time ECG plot. These readings are used for manual and automatic monitoring by the doctor/guardian.

Hardware: Our system depends heavily on accurate and reliable readings from the sensors. The sensors chosen for our system are therefore state-of-the-art sensors which are widely available in the market, have been in use for a long time, and are thus dependable [20]. The hardware components and sensors used are:
• LM35: The LM35 is a very precise IC whose output voltage varies with temperature. It is well suited for our system, as it does not require any external calibration and typically provides accuracies of ±¼ °C at room temperature and ±¾ °C over the full −55 to 150 °C temperature range (Fig. 9.2).
• MAX30100: The MAX30100 serves a dual purpose in our project, as it is both an oximeter and a heart rate sensor. It operates from a 1.8 to 3.3 V power supply, making it a power-efficient device that does two jobs at a time: recording the SpO2 level and the heart rate (Fig. 9.3).
• AD8232: A fully integrated single-lead ECG front-end block, which is very important in our project, as it generates the ECG plot. It is a single-supply sensor operating from 2.0 to 3.5 V (Fig. 9.4).
• Arduino Uno: The most important part of our device, the microcontroller board based on the ATmega328P (datasheet) that we use in our project. It is well suited to small-scale projects like ours, which do not require much power and do not have too many components (Fig. 9.5).

Fig. 9.2 LM35—temperature sensor

Fig. 9.3 MAX30100—heart rate and pulse oximeter sensor

Fig. 9.4 AD8232-ECG




Fig. 9.5 Arduino Uno

• ESP8266: The Wi-Fi module of our system; it is the component used by the microcontroller to connect to the internet and upload readings to the cloud or local storage (Fig. 9.6).

For our prototype, instead of using complex cloud services or platforms, we decided to build our own cloud, since we can make our own storage dynamic and globally accessible with ease. Using the NGROK service, we turned our computer's storage into a cloud, i.e., a globally accessible server, with secure tunneling and port forwarding. All the data fetched through the sensors is stored on the local machine, and the NGROK connection to the database enables secure tunneling to all users efficiently. It robustly performs the cloud functionality, and all the data received from the microcontroller through the various sensors is available to the respective user on the data dashboard of our site. As a result, we gain cost reduction and efficiency, since purchasing external complex cloud services is expensive.

Fig. 9.6 ESP8266 Wi-Fi module



This is feasible for the first prototype and the initial stages of our project; we will opt for professional cloud services if we scale in the future.

Dataset: The dataset we have used has the following attributes for estimating CAD risk: age, sex, chest pain type, resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, and resting electrocardiographic results (Fig. 9.7). We have used this dataset to predict the chances of getting CAD (coronary artery disease) in the next ten years. It can be found at https://www.kaggle.com/ronitf/heart-disease-uci and is authored by the Hungarian Institute of Cardiology (Fig. 9.8).

Model Training: We tested 12 different machine-learning classification algorithms to find the best one for our use case. We found that the Bernoulli Naive Bayes algorithm gives the maximum accuracy, 84%, among the 12 algorithms. However, after grid searching and cross-validation, the XGBoost classifier gives an accuracy of more than 86% once its hyperparameters are tuned with the values given by the grid search (Fig. 9.9).
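A minimal sketch of this grid-search workflow is shown below, assuming the usual scikit-learn and xgboost APIs. The parameter grid is an illustrative subset, and the feature/label arrays are synthetic stand-ins for the prepared dataset, not the authors' data.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared heart-disease features/labels (placeholder).
rng = np.random.default_rng(0)
X, y = rng.random((303, 13)), rng.integers(0, 2, 303)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Illustrative grid; the authors tune more hyperparameters than shown here.
param_grid = {"max_depth": [3, 5, 7],
              "learning_rate": [0.05, 0.1, 0.3],
              "n_estimators": [100, 200]}

grid = GridSearchCV(xgb.XGBClassifier(objective="binary:logistic"),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_score_, grid.best_params_)          # best cross-validated score
print(grid.best_estimator_.score(X_test, y_test))   # held-out accuracy
```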

Fig. 9.7 Dataset description



Fig. 9.8 Correlation matrix for exploratory data analysis

We use the XGBoost classifier in our project, as it gives an accuracy of more than 86% with the following hyperparameters: base_score = 0.5, booster = "gbtree", colsample_bylevel = 1, colsample_bynode = 1, colsample_bytree = 1, gamma = 0, learning_rate = 0.1, max_delta_step = 0, max_depth = 3, min_child_weight = 1, missing = None, n_estimators = 100, n_jobs = 1, nthread = None, objective = "binary:logistic", random_state = 0, reg_alpha = 0, reg_lambda = 1, scale_pos_weight = 1, seed = None, silent = None, subsample = 1, verbosity = 1. These parameters were obtained using the GridSearchCV() function of the sklearn.model_selection library, which generates candidate grids of hyperparameters; the best score is given by the grid.best_score_ attribute. As the n_estimators attribute shows, the model has 100 decision trees; showing every one of them in this paper is not possible, so the first tree of the model is shown instead (Fig. 9.10).

Strong aspects of our system: We have done meticulous research on the technology currently available in the market for remote health analysis and have found that even the state-of-the-art systems available right now have not caught up with today's technology. With no emphasis on AI, and with outdated interfaces, these systems need serious improvement, especially in these days of a pandemic.



Fig. 9.9 Model comparison

Fig. 9.10 Model tree

We have provided a smart system to monitor multiple patients at a time, with the functionality of alerting doctors and guardians in case of abnormal readings from our sensors. We have also integrated an ML algorithm to predict CAD over the next ten years, with the help of which doctors can suggest preventive measures. Our system also uses the faster and less bandwidth-heavy CoAP protocol [14] for IoT communication, which



Fig. 9.11 SHAP feature analysis

is very power-efficient [21] compared with the previously used MQTT [22] and HTTP [25].

SHAP analysis: We have used SHAP values to plot the importance of every attribute in the model's decision-making process. As can be seen, the model gives ca (the number of major vessels colored by fluoroscopy) the highest priority (Fig. 9.11).
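A short sketch of this analysis, assuming the standard shap package API and reusing the trained model and held-out features from the grid-search sketch above (both placeholders, not the authors' exact pipeline):

```python
import shap

model = grid.best_estimator_          # trained XGBoost model from the earlier sketch

# Tree-based explainer computes per-feature SHAP contributions for each sample.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot ranking attributes (e.g., ca, thal, cp) by mean |SHAP| impact.
shap.summary_plot(shap_values, X_test)
```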

9.4 Conclusion

This paper analyzed previous work on network protocols and determined that CoAP is the best fit for our system; the volume of data it generates is small, which makes it ideal for a country like India, where the network infrastructure is not yet strong. Our paper also proposes a four-tier architecture called IRHMP. IRHMP is capable of incorporating several types of healthcare sensors in the sensing layer and can also be integrated with smart home systems for better results. An IRHMP implementation using CoAP will significantly reduce bandwidth requirements; due to the low bandwidth, it can easily be deployed in rural areas and used to monitor patients regularly. Future improvements to our project include using data mining to predict disease outbreaks, faster and better health diagnosis using a larger dataset, and location tracking for emergency ambulance services.



References

1. Rohokale, V.M., Prasad, N.R., Prasad, R.: A cooperative internet of things (IoT) for rural healthcare monitoring and control. In: Second International Conference on Wireless Communication, Vehicular Technology, Information Theory and Aerospace & Electronics Systems Technology (Wireless VITAE), pp. 1–6. IEEE (2011)
2. Doukas, C., Maglogiannis, I.: Bringing IoT and cloud computing towards pervasive healthcare. In: Sixth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), pp. 922–926. IEEE (2012)
3. Amendola, S., Lodato, R., Manzari, S., Occhiuzzi, C., Marrocco, G.: RFID technology for IoT-based personal healthcare in smart spaces. IEEE Internet Things J. 1(2), 144–152 (2014)
4. Jin, J., Gubbi, J., Marusic, S., Palaniswami, M.: An information framework for creating a smart city through internet of things. IEEE Internet Things J. 1(2), 112–121 (2014)
5. Janosi, A., Steinbrunn, W., Pfisterer, M., Detrano, R.: Heart disease. UCI. Available: https://www.kaggle.com/ronitf/heart-disease-uci
6. Mansor, H., Meskam, S.S., Zamery, N.S., Mohd Rusle, N.Q.A.: Portable heart rate measurement for remote health monitoring system (IIUM) (2015)
7. Dias, D., Silva Cunha, J.P.: Wearable health devices—vital sign monitoring, systems and technologies. Biomedical Research and Innovation (BRAIN) (2018)
8. Asaduzzaman Miah, Md., Kabir, M.H., Siddiqur Rahman Tanveer, Md., Akhand, M.A.H.: Continuous heart rate and body temperature monitoring system using Arduino UNO and Android device (2015)
9. Parihar, V.R., Tonge, A.Y., Ganorkar, P.D.: Heartbeat and temperature monitoring system for remote patients using Arduino (2017)
10. Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., Berners-Lee, T.: Hypertext transfer protocol (1999)
11. Stanford-Clark, A.N.A.: MQTT Version 3.1.1. OASIS Std., Oct 2014. [Online]. Available: http://docs.oasis-open.org/mqtt/mqtt/v3.1.1/mqtt-v3.1.1.html
12. Patierno, P.: IoT protocols landscape (June 2014). [Online]. Available: http://www.slideshare.net/paolopat/io-t-protocols-landscape
13. Abdelwahab, S., Hamdaoui, B., Guizani, M., Rayes, A.: Enabling smart cloud services through remote sensing: an internet of everything enabler. IEEE Internet Things J. 1(3), 276–288 (2014)
14. Hui, J., Kelsey, R., Levis, P., Pister, K., Struik, R., Vasseur, J.P., Alexander, R.: RFC 6550—RPL: routing protocol for low power and lossy networks. IETF Trust (2012)
15. Aazam, M., Huh, E.-N.: Fog computing and smart gateway based communication for cloud of things. In: 2014 International Conference on Future Internet of Things and Cloud (FiCloud), pp. 464–470 (2014)
16. Khanna, A., Misra, P.: White paper life sciences, the internet of things for medical devices—prospects, challenges and the way forward. Tata Consultancy Services, 1 July 2014
17. Shelby, Z.: Embedded web services. IEEE Wireless Commun. (2010)
18. Montenegro, G., Kushalnagar, N., Culler, D.: RFC 4944—transmission of IPv6 packets over IEEE 802.15.4 networks. IETF Trust (2007)
19. Shelby, Z., Vial, M.: CoRE interfaces: draft-shelby-core-interfaces-3. IETF Trust (2013)
20. Shelby, Z., Hartke, K., Bormann, C.: The constrained application protocol (CoAP) (2014)
21. Stanford-Clark, A., Truong, H.L.: MQTT for Sensor Networks (MQTT-S) Protocol Specification Version 1.1. IBM Corporation, Armonk (2008)
22. Li, S.T., Hoebeke, J., Jara, A.J.: Conditional observe in CoAP: draft-li-core-conditional-observe-03. IETF Trust (2012)
23. UN: World Population Ageing 2019. https://un.org/en/development/desa/population/publications/pdf/aegeing/WorldPopulationAgeing2019-Highlights.pdf

Chapter 10

On Boundary-Effects at Cellular Automata-Based Road-Traffic Model Towards Uses in Smart City

Arnab Mitra

Abstract Infrastructure support capability is one of the key parameters for success in a Smart City. Past research has presented several intelligent transportation systems (ITS) to efficiently manage traffic congestion and enhance system efficiency. Among several others, researchers have presented simple solutions with Cellular Automata (CA) to simulate and study road-traffic scenarios. In this regard, we present a systematic study of CA-based road-traffic modelling targeting truly cost-effective physical modelling in view of Smart City applications. The presented study examines several CA rules with respect to their dynamics under different CA boundary conditions and efficiently identifies the CA rules compatible with modelling a truly cost-effective road-traffic simulation in a Smart City scenario.

10.1 Introduction

Information and communication technology (ICT) has triggered the transformation of earlier cities into Smart Cities [1–4]. ICT has reformed the way "in which cities organize policymaking and urban growth" [1]; they (Smart Cities) "base their strategy on the use of information and communication technologies in several fields such as economy, environment, mobility and governance to transform the city infrastructure and services" [1]. For this reason, numerous researchers have concentrated on the development and advancement of Smart City applications. State-of-the-art literature targeting different application areas in Smart Cities, e.g., urban mapping, traffic congestion, smart lighting, waste management, smart logistics, domestic and home automation, etc., may be found in several scientific databases. We find that researchers have also focused on performance evaluation for several smart applications. In this regard, a triple-helix-model-based performance evaluation for Smart Cities was presented in [4]. As it is impractical to discuss all the available literature




Fig. 10.1 Few important applications towards smart cities

on Smart Cities, we choose to briefly present a few important areas. A typical diagram follows (refer to Fig. 10.1), which presents a few important applications targeting Smart Cities. Several important areas that are in focus among researchers with reference to Smart Cities are presented in Fig. 10.1. Among those areas, we are particularly interested in the modelling of smart traffic in view of Smart Cities. In this regard, we find that ISVN (infrastructure supporting vehicular network) was described as an important factor in the successful implementation of ITS (Intelligent Transport Systems) targeting Smart City applications [5]. For this reason, several important research works on the modelling of smart traffic have been reviewed and are briefly presented in Sect. 2. The rest of the paper is organized as follows: a review of the state-of-the-art literature is presented in Sect. 2; the proposed empirical investigation is presented in Sect. 3; detailed discussions are presented in Sect. 4; conclusive remarks are presented in Sect. 5.

10.2 Review of the State-of-the-Art Available Works

It has already been noted in [4] that the design of ISVN is considered one of the important factors in the enhancement of Smart City applications. We found that the modelling and control of traffic play a significant role in enhanced transport efficiency in ITS [4]. To facilitate real-time control of urban traffic, the design of a DSS (Decision Support System) was presented in [6]. In another research work,



a WSN (Wireless Sensor Network)-based low-cost intelligent traffic system was presented in [7] that ensured privacy preservation for users. Several other low-cost modelling efforts have also been presented. Among others, Cellular Automata (CA)-based models of car traffic in street networks were presented by several researchers. We find that CA-based modelling is a widespread choice among scholars, as CA are considered an effective dynamic mathematical tool for several complex problems. Additionally, CA-based models support inherently parallel computing, VLSI (Very-Large-Scale Integration) compatibility at the price of D flip-flops [8], and low power consumption [9]. The simplest CA configuration is known as the ECA (Elementary CA) configuration, having either a fixed-boundary or a periodic-boundary scenario on a one-dimensional grid structure. CA (including ECA) dynamics progress over discrete time and discrete space. Further, it is important to state that ECA progress depends on the binary values [i.e., zero (0) and one (1)] of the cells, and the corresponding transition functions are referred to as ECA rules (alternatively known as Wolfram CA rules; note that there are 256 rules in total in the set of one-dimensional ECA rules). The next state of the ith cell at time (t + 1) in an ECA configuration may be expressed as a function (i.e., a CA rule) of the present states of the (i − 1)th, ith, and (i + 1)th cells at time t, that is, $x_i^{t+1} = f(x_{i-1}^{t}, x_i^{t}, x_{i+1}^{t})$. We request the reader(s) of this manuscript to look into [8, 9] to learn more about the basics of ECA and their characteristics. Among the several CA-based traffic models presented, we find a road-traffic CA model in [10] for a Manhattan-like urban city environment. In [10], car displacement in four possible directions (e.g., east, south, west, and north) was considered at naturally implemented rotary junctions (i.e., crossroads), which exhibited non-uniform complex dynamics on networks. In another research work, a CA-based microscopic simulation for urban traffic modelling was presented in [11]. In yet another, CA-based city traffic modelling and its dynamics were presented in [12], noting that "the transition from free to congested flow, lane inversion and platoon formation can be accurately reproduced using cellular automata" [12]. In 2005, a detailed review of different traffic CA models was presented in [13] by Maerivoet et al., who described CA as a "computationally efficient microscopic traffic flow model" [13]. Another CA-based traffic model was presented in [14], which used a randomization probability with reference to the "stopped time of the vehicle" [14] to effectively simulate and predict drivers' sensitivity. In a different research work, CA-based cost-effective query processing and data servicing in mobile networks was presented in [15], which we think is important and believe may have further scope for use in the monitoring of traffic congestion. Besides, the mechanical and physical characteristics of several vehicles in heterogeneous and complex traffic were modelled with CA in [16]; CA-based control of cycle duration, green split, and coordination of traffic lights in traffic modelling was presented in [17]; and heavy-traffic-based modelling with CA for congested cities was presented in [18]. Many other efforts with CA toward traffic simulation may also be observed in the literature [19–28].
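To make the rule encoding above concrete, the following short Python sketch (ours, for illustration; the chapter's own simulations were written in C [39]) derives the transition function f from a Wolfram rule number by reading the bit indexed by the three-cell neighborhood:

```python
def rule_table(rule):
    """Truth table of a Wolfram ECA rule: (left, center, right) -> next state."""
    return {(l, c, r): (rule >> (4 * l + 2 * c + r)) & 1
            for l in (0, 1) for c in (0, 1) for r in (0, 1)}

# Rule 184, the classic traffic rule: a car (1) advances iff the cell ahead is empty.
print(rule_table(184))
```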
In recent times, we find that a CA-based traffic model involving vehicle-to-infrastructure communication technology was introduced in [29], a CA-based traffic



model with spatial variation in cell width was presented in [30], and another CA-based traffic model using an array DBMS was presented in [31]. Besides, CA-based models have also been used for enhanced prediction of driving behavior [32]. From our studies, we observed that ECA rules 136, 184, 226, and 252 were used by researchers in [18–24] to model city-traffic dynamics. With ECA rules 136, 184, and 252, a road-traffic simulation model was presented for a single intersection, considering velocity-density and flux-density relationships and traffic-light coordination for very large systems [22]. It has already been discussed that CA-based modelling is a popular choice among researchers for several complex problems, as an enriched set of dynamics is found with CA-based models. Thus, several researchers have engaged themselves in exploring the dynamics of CA [33–38]. Investigations of the dynamics of ECA-based models at different fixed boundaries were presented in [33] and further continued in [34–38], which explored enriched ECA dynamics in various scenarios. The various fixed-boundary situations of [34–38] were 0…0 (aka the null-boundary scenario), 0…1, 1…0, and 1…1. Unfortunately, we found that investigations of the boundary effects on the said ECA rules [18–24] were not available. We believe such an investigation may reveal the true potential of these ECA rules for use in road-traffic simulation in Smart City applications. For this reason, the proposed boundary-based investigations (i.e., consideration of 0…0, 0…1, 1…0, and 1…1 as the fixed-boundary situations) are discussed next. We restricted our choice to ECA dynamics at various fixed-boundary situations, as past research explored potentially enriched ECA dynamics at different fixed-boundary conditions, which may further facilitate truly cost-effective implementation [34–38].

10.3 Proposed Empirical Investigation

Though several CA-based traffic models have been introduced by researchers, we preferred elementary CA (ECA)-based modelling, as it is the simplest and may be implemented at the cost of D flip-flops. It has already been shown in Sect. 2 that ECA rules 136, 184, 226, and 252 exhibit potential for use in road-traffic simulation. We believe more ECA rules may exhibit such potential. For this reason, a systematic investigation of the set of ECA rules is presented first (refer to Sect. 3.1), and thereafter, investigations of the dynamics at several boundaries are presented (refer to Sect. 3.2). It may be noted that, as traffic modelling with ECA rules 136, 184, 226, and 252 has already been presented by past researchers, we are not repeating it in the present manuscript. We have simply tried to examine the entire set of ECA rules to find more rules capable of supporting traffic simulation.



10.3.1 Search for More ECA Rules Toward Uses in Road-Traffic Simulation

Initially, we extended our investigation toward the selection of a potentially capable larger set of ECA rules by incorporating the complementary ECA pairs of the explored rules, i.e., 136, 184, 226, and 252. Thus, we obtained four complementary pairs of ECA rules, (136, 119), (184, 71), (226, 29), and (252, 3), which could potentially be applied to model road traffic. Detailed modelling for ECA-based road-traffic simulation is already presented in Sect. 2 (refer to [22]). For this reason, we continued with a systematic investigation of the explored set of rules. The space–time diagrams for the said rules (i.e., ECA rules 3, 29, 71, and 119), as explored from https://plato.stanford.edu/entries/cellular-automata/supplement.html (accessed on February 01, 2021), helped us consider those complemented ECA rules for potential selection in the proposed modelling. We proceeded to obtain the space–time diagrams for all four potential complementary pairs of ECA rules, generating them at cell size 40 (arbitrarily chosen) with fixed seed initialization at 1 (arbitrarily chosen) in a uniform (homogeneous) null-boundary ECA configuration. For the simulation, we used the C-Free IDE (Integrated Development Environment) version 4.0-standard on an Intel® Celeron® x64-based CPU N2840 @ 2.16 GHz with 2.00 GB installed memory in a 64-bit Windows® 10 Home environment. A sample simulation code for one such investigation, with ECA rules 102 and 153 at the 0…0 (null) boundary condition, is presented in [39]. We found that "cells using rules 252 and 136 could be simplified to depend only on two cells, as their state is not affected by the intersection cell" [22] and that "to model flow in different directions, we could mirror rules (e.g., 226 is equivalent to rule 184 by reflection, with vehicles flowing to the left), but it is simpler to invert neighborhood" [22]. Thus, we observed identical space–time diagrams for ECA rules 136, 184, and 252 in Fig. 10.2. Additionally, we found in [22] that the ECA "(rules 3, 17, 63 and 119) do not conserve density, so they are not useful for traffic modelling. This is also the case for other two rules in the fully asymmetric cluster of rules 184 and 226 (rules 71 and 29)" [22]. For this reason, we did not consider ECA rules 3, 29, 71, and 119 for further investigation of the dynamics at the several fixed-boundary scenarios.
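As an illustration only, an equivalent space–time computation can be sketched in a few lines of Python (the chapter's simulations were written in C [39]). Here the boundary cells are held fixed at the chosen values, and we read "fixed seed initialization at 1" as a single cell set to 1, which is our assumption about the set-up:

```python
def eca_step(cells, rule, left=0, right=0):
    """One synchronous ECA update with fixed boundary values left/right."""
    padded = [left] + cells + [right]
    return [(rule >> (4 * padded[i - 1] + 2 * padded[i] + padded[i + 1])) & 1
            for i in range(1, len(padded) - 1)]

def space_time(rule, n=40, steps=40, left=0, right=0):
    row = [0] * n
    row[0] = 1                      # single-cell seed (our assumption)
    rows = [row]
    for _ in range(steps):
        row = eca_step(row, rule, left, right)
        rows.append(row)
    return rows

# Space-time diagrams for the candidate rules under the four fixed boundaries.
for rule in (136, 184, 252):
    for lb, rb in ((0, 0), (0, 1), (1, 0), (1, 1)):
        diagram = space_time(rule, left=lb, right=rb)
```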

10.3.2 Investigations on ECA Dynamics Toward Road-Traffic Simulation at Various Fixed-Boundary Situations

We continued investigating the ECA dynamics for rules 136, 184, and 252. We followed the same notation for the various fixed boundaries, i.e., 0…0, 0…1, 1…0, and 1…1, as presented in [33, 35–38]. The dynamics for ECA rule 136 in a



Fig. 10.2 Observed ECA dynamics at 0…0 fixed boundary

uniform (homogeneous) CA configuration under the several fixed-boundary conditions are presented in Fig. 10.3. It is observed from Fig. 10.3 that identical state-space diagrams are obtained for the space–time diagram of ECA rule 136. We further continued our investigations

Fig. 10.3 Observed dynamics for ECA rule 136 at several fixed boundary



related to the dynamics of the other two ECA rules (i.e., rules 184 and 252). The dynamics for these rules are presented in Figs. 10.4 and 10.5. It is further observed from Figs. 10.4 and 10.5 that identical state-space diagrams are also obtained for the other two ECA rules, i.e., rules 184 and 252, with respect to the space–time diagram of ECA rule 136. Please note that the same cell size 40 (arbitrarily chosen) and fixed seed initialization at 1 (arbitrarily chosen) were used for each different fixed-boundary condition and each different ECA rule.

Fig. 10.4 Observed dynamics for ECA rule 184 at several fixed boundary

Fig. 10.5 Observed dynamics for ECA rule 252 at several fixed boundary



10.4 Detailed Discussions

It is observed from Figs. 10.3, 10.4, and 10.5 that identical space–time diagrams are obtained for each ECA rule at the different fixed-boundary conditions. We found that homogeneous ECA configurations with rule 136, 184, or 252 do not depend on the left and right boundary values of the configuration. For this reason, the respective boundary values, i.e., 0…0, 0…1, 1…0, or 1…1, may be "fixed, periodical or randomly" [37] assigned. Thus, it may be concluded that "there is no need for assigning program memory for software implementations with these rules" [37]. Additionally, we found that a low-cost design (with D flip-flops only) is feasible with CA, requiring very low power consumption (between 1.17E−07 and 1.20E−05 W) [9, 35]. Thus, we examined and established the true cost-effectiveness of these ECA rules for selection as a cost-efficient modelling tool for road-traffic simulation in view of Smart City applications.

10.5 Conclusive Remarks

The presented study effectively examines a set of ECA rules with respect to their dynamics for possible selection in the modelling of road-traffic simulation. It may be concluded from our investigations that ECA rules 136, 184, or 252 may be effectively used to model road-traffic simulation in view of Smart City applications, which further ensures true cost-effectiveness in both hardware and software implementation.

Acknowledgements The author sincerely acknowledges Prof. H.-N. Teodorescu for his guidance on the presented research design and his active supervision in advancing research skills during an EU-sponsored Erasmus Mundus "cLink" research mobility at his laboratory at the "Gheorghe Asachi" Technical University of Iasi, Romania, in 2015–2016. The author also acknowledges the anonymous reviewers, whose useful comments have further improved the quality of the present manuscript.

References

1. Bakıcı, T., Almirall, E., Wareham, J.: A smart city initiative: the case of Barcelona. J. Knowl. Econ. 4(2), 135–148 (2013)
2. Hall, R.E., Bowerman, B., Braverman, J., Taylor, J., Todosow, H., Von Wimmersperg, U.: The vision of a smart city (No. BNL-67902; 04042). Brookhaven National Laboratory, Upton, NY (US) (2000)
3. Su, K., Li, J., Fu, H.: Smart city and the applications. In: Proceedings of the 2011 International Conference on Electronics, Communications and Control (ICECC), September 2011, pp. 1028–1031. IEEE (2011)
4. Lombardi, P., Giordano, S., Farouh, H., Yousef, W.: Modelling the smart city performance. Innov. Eur. J. Social Sci. Res. 25(2), 137–149 (2012)
5. Lee, W.H., Chiu, C.Y.: Design and implementation of a smart traffic signal control system for smart city applications. Sensors 20(2), 508 (2020)



6. Chan, F., Dridi, M., Mesghouni, K., Borne, P.: Traffic control in transportation systems. J. Manuf. Technol. Manage. (2005)
7. Handscombe, J., Yu, H.Q.: Low-cost and data anonymised city traffic flow data collection to support intelligent traffic system. Sensors 19(2), 347 (2019)
8. Chaudhuri, P.P., Chowdhury, D.R., Nandi, S., Chattopadhyay, S.: Additive Cellular Automata: Theory and Applications, vol. 1. Wiley, New York (1997)
9. Mitra, A., Kundu, A.: Energy efficient CA based page rank validation model: a green approach in cloud. Int. J. Green Comput. (IJGC) 8(2), 59–76 (2017)
10. Chopard, B., Luthi, P.O., Queloz, P.A.: Cellular automata model of car traffic in a two-dimensional street network. J. Phys. A Math. Gen. 29(10), 2325 (1996)
11. Esser, J., Schreckenberg, M.: Microscopic simulation of urban traffic based on cellular automata. Int. J. Mod. Phys. C 8(05), 1025–1036 (1997)
12. Wolf, D.E.: Cellular automata for traffic simulations. Physica A 263(1–4), 438–451 (1999)
13. Maerivoet, S., De Moor, B.: Cellular automata models of road traffic. Phys. Rep. 419(1), 1–64 (2005)
14. Jiang, R., Wu, Q.S.: A stopped time dependent randomization cellular automata model for traffic flow controlled by traffic light. Physica A 364, 493–496 (2006)
15. Das, S., Sipra, D., Sikdar, B.K.: CA based data servicing in cellular mobile network. In: Proceedings of the ICWN, pp. 280–286 (2007)
16. Mallikarjuna, C., Rao, K.R.: Cellular automata model for heterogeneous traffic. J. Adv. Transp. 43(3), 321–345 (2009)
17. Tonguz, O.K., Viriyasitavat, W., Bai, F.: Modeling urban traffic: a cellular automata approach. IEEE Commun. Mag. 47(5), 142–150 (2009)
18. Das, S., Saha, M., Sikdar, B.K.: A cellular automata based model for traffic in congested city. In: Proceedings of the 2009 IEEE International Conference on Systems, Man and Cybernetics, October 2009, pp. 2397–2402. IEEE (2009)
19. Yuen, A., Kay, R.: Applications of Cellular Automata, pp. 1–20 (2009). Available at http://www.cs.bham.ac.uk/~rjh/courses/NatureInspiredDesign/2009-10/StudentWork/Group2/design-report.pdf (accessed on 17 Aug 2019)
20. Lo, S.C., Hsu, C.H.: Cellular automata simulation for mixed manual and automated control traffic. Math. Comput. Model. 51(7–8), 1000–1007 (2010)
21. Lan, L.W., Chiou, Y.C., Lin, Z.S., Hsu, C.C.: Cellular automaton simulations for mixed traffic with erratic motorcycles' behaviours. Physica A Statistical Mech. Its Appl. 389(10), 2077–2089 (2010)
22. Rosenblueth, D.A., Gershenson, C.: A model of city traffic based on elementary cellular automata. Complex Syst. 19(4), 305–321 (2011)
23. Das, S.: Cellular automata based traffic model that allows the cars to move with a small velocity during congestion. Chaos Solitons Fractals 44(4–5), 185–190 (2011)
24. Meng, Q., Weng, J.: An improved cellular automata model for heterogeneous work zone traffic. Transportation Research Part C: Emerging Technologies 19(6), 1263–1275 (2011)
25. Han, Y.S., Ko, S.K.: Analysis of a cellular automaton model for car traffic with a junction. Theoret. Comput. Sci. 450, 54–67 (2012)
26. Vasic, J., Ruskin, H.J.: Cellular automata simulation of traffic including cars and bicycles. Physica A 391(8), 2720–2729 (2012)
27. Zhao, H.T., Yang, S., Chen, X.X.: Cellular automata model for urban road traffic flow considering pedestrian crossing street. Physica A 462, 1301–1313 (2016)
28. Zapotecatl, J.L., Rosenblueth, D.A., Gershenson, C.: Deliberative self-organizing traffic lights with elementary cellular automata. Complexity (2017)
29. Jiang, Y., Wang, S., Yao, Z., Zhao, B., Wang, Y.: A cellular automata model for mixed traffic flow considering the driving behavior of connected automated vehicle platoons. Physica A Statistical Mech. Appl. 126262 (2021)
30. Rodriges Zalipynis, R.A.: Convergence of array DBMS and cellular automata: a road traffic simulation case. In: Proceedings of the 2021 International Conference on Management of Data, June 2021, pp. 2399–2403. ACM (2021)



31. Hua, W., Yue, Y., Wei, Z., Chen, J., Wang, W.: A cellular automata traffic flow model with spatial variation in the cell width. Physica A Statistical Mech. Its Appl. 556, 124777 (2020)
32. Małecki, K., Gabryś, M.: The computer simulation of cellular automata traffic model with the consideration of vehicle-to-infrastructure communication technology. SIMULATION 96(11), 911–923 (2020)
33. Aguiar, I., Severino, R.: Two elementary cellular automata with a new kind of dynamic. Complex Syst. 24(2), 113–125 (2015)
34. Mitra, A., Saha, S.: An investigation for cellular automata-based lightweight data security model towards possible uses in fog networks. In: Examining the Impact of Deep Learning and IoT on Multi-Industry Applications, pp. 209–226. IGI Global (2021)
35. Mitra, A.: On investigating energy stability for cellular automata based pagerank validation model in green cloud. Int. J. Cloud Appl. Comput. (IJCAC) 9(4), 66–85 (2019)
36. Mitra, A.: An investigation for CA-based pagerank validation in view of power-law distribution of web data to enhance trustworthiness and safety for green cloud. In: Advanced Models and Tools for Effective Decision Making Under Uncertainty and Risk Contexts, pp. 368–378. IGI Global (2021)
37. Mitra, A.: On the selection of cellular automata based PRNG in code division multiple access communications. Stud. Inform. Control 25(2), 217–226 (2016)
38. Mitra, A., Teodorescu, H.N.: Detailed analysis of equal length cellular automata with fixed boundaries. J. Cell. Autom. 11(5–6), 425–448 (2016)
39. Mitra, A.: Annex to "On Type-D Fuzzy Cellular Automata-based MapReduce model in Industry 4.0". Mendeley Data, V1 (2021). https://doi.org/10.17632/wkkttw7mgk.1

Chapter 11

An Optimized Planning Model for Management of Distributed Microgrid Systems

Jagdeep Kaur, Simerpreet Singh, Manpreet Singh Manna, Inderpreet Kaur, and Debahuti Mishra

Abstract As energy consumption rises, so do greenhouse gas emissions, which harm our environment. With the integration of renewable energy sources into power networks, monitoring and managing the increasing load demand is crucial. Microgrids have been identified as a possible structure for integrating the rapidly expanding distributed power generation paradigm. Various academics have already offered a significant number of strategies to fulfil the rising load requirement. The major disadvantage of these models was the scheduling of devices and their power generation capabilities, which make traditional schemes complex and time-consuming. In order to overcome these problems, this paper proposes an effective approach for three power generating systems: PV systems, wind systems, and fuel cell systems. The Grey Wolf Optimization (GWO) algorithm optimizes the scheduling process by analyzing and selecting the optimal fitness value for various device combinations. The fitness value that is closest to the load requirement is picked as the best. In terms of power generating capacity, the performance of the suggested GWO model is determined and compared to the standard model in the MATLAB simulation environment. The simulated experiments demonstrated that the proposed approach is much more productive and efficient in meeting cost and load demands, with reduced processing time and complexity.





11.1 Introduction

Due to a variety of circumstances, electricity demand has risen dramatically in recent years [1]. To date, nuclear, hydro, and thermal power facilities have been the primary suppliers of electricity. However, due to the scarcity of resources such as coal and nuclear fuel, attention has shifted to renewable energy resources like sunlight and hydropower. Although there has been fast growth in photovoltaic and wind energy technologies to ensure the optimal utilization of renewable resources and boost system reliability, these time- and environment-dependent resources come with a number of issues. Because of fluctuations in characteristics like sunshine quality, the energy collected from these resources fluctuates as well, posing several problems for the reliability of the mains power system: imbalances between power output and load demand, distortions, frequency variations, and voltage level variations, to name only a few. Microgrids came into existence to solve these issues. MGs might be a promising architecture for addressing the technological difficulties associated with the expanding utilization of renewable energy sources (RES) via distributed power generation (DG) techniques, in order to fulfil the rising demands for power quality and reliability (PQR). Since the prices of RES systems have declined dramatically in the last thirty years, migration to RES-based power systems has become more attractive and feasible [2]. The three "E" requirements of energy, environment, and economic development have created new possibilities for on-site energy production using renewable energy conversion techniques [3]. A microgrid is a network of RES and energy storage systems that are managed by a monitoring system to supply electricity to the loads for which it was built [4]. A microgrid includes control, distribution, and power generation units for voltage management. The fundamental distinction between a traditional grid and an MG is the close proximity of energy production to end customers. Microgrids have received huge interest in recent times owing to developments in green power technology. When there are problems, an MG can go into islanded mode, which means it is cut off from the main grid [5]. The MG's sources are known as micro sources, and they can include diesel generators, solar, wind, solid oxide fuel cells, battery packs, etc. Every source can be connected to the distribution system in its own way. Loads are attached to the distribution system, with micro sources and the primary grid meeting the energy needs. Diesel generators may be employed as a standby energy resource or as a constant energy supplier in conjunction with RES. The mentioned control system is applied to route energy from the different resources to the load (Fig. 11.1).

11.1.1 Components of a Micro Grid

An MG is made up of a control system, power electronics, energy storage devices, distributed generators, and the main grid source for controlling the generators' power supply [6]. To achieve versatile power monitoring, microgrids typically



Fig. 11.1 Simple microgrid diagram: concept of a microgrid

comprise energy management systems, controllers, transmission systems, power conversion equipment, and distributed energy resources [7]. Another important factor in promoting and implementing microgrids is the consumer [8].
• Distributed energy resources include distributed storage and distributed generators (DG), and they distribute power to fulfil needs.
• MG controllers are required to impose requirements on the distribution system and regulate characteristics including power quality, voltage, and frequency.
• To determine the MG's operational condition, power conversion devices such as current and voltage converters are used.
• In MGs, a communication system is used to transmit surveillance and control data.
• The energy management system is utilized for power system reliability analysis, state estimation, system regulation, and information collection. It may also be used for energy management, load forecasting, and renewable energy power forecasting.
• Consumers, who may also be suppliers, may have an impact on the MG's functioning, load control, and technology selection due to price and performance considerations. MGs may be used in customer-driven demand response [9].

11.1.2 Types of Micro Grid

Based on the previously mentioned concepts, MGs may be classified as AC, DC, and hybrid microgrids.

AC microgrids: The term AC microgrid denotes MGs that use alternating current. AC MGs are able to operate effectively owing to increased grid stability and improved PE converters, which can regulate active and reactive power.

DC microgrids: PV systems, a large part of the resources among distributed generators, produce power in direct current. Direct-current MGs may be realized by the use of direct-current generators in micro-wind and micro-hydro systems.



Hybrid microgrids: Both alternating- and direct-current loads and resources may be found in hybrid microgrids. The capacity at the point of coupling plays a key part in power flow control. Smart home management systems or microgrid control centers utilize WANs ("Wide Area Networks") and HANs ("Home Area Networks") as examples of advanced MG architectures. Because of the additional factors involved, such as frequency, phase correction, and reactive power, the control of hybrid grids is more complicated than that of a simple DC or AC microgrid [9–16]. Various researchers have investigated the hybrid microgrid's real-time functioning and evaluated the control systems, as discussed in the next section.

11.2 Literature Survey

A large number of methods have been proposed by different researchers in order to increase the power generation capability of the system; some of them are discussed here. The work in [17] proposes a unique multi-agent network-based model for the optimum management of an MG combined with RERs at a distributed level using the MAS methodology. Different control methods based on DMPC were presented in [18] to provide an ideal process for the economical scheduling of a system of interconnected MGs with hybrid ESSs. A novel strategy for the sizing of DG to reduce power costs in MGs was proposed in [19], where the optimization is done using PSO. A method was further proposed in [20] based on an MG architecture with two DGs, one of which was a solar energy system with ESS and the other a diesel synchronous generator. Different strategies were proposed in [21–25] for restoring a distribution network as several MGs while considering three-phase demand-side management (T-DSM) to resolve these difficulties. An efficient method was proposed for the monitoring of DERs in a residential building operated as an MG, employing DGs such as PV, WT, fuel cells, and batteries, with the help of a cuckoo search algorithm [26], while a novel multi-objective solution for microgrid energy management (MGEM) [50] was formulated as a MILP. A unique distributed event-based method for optimal MGEM was suggested to decrease the capacity required for data interchange in MGs [27]. Liang et al. [28] investigated the typical droop control approach used in MGs, which has shortcomings, and proposed a distributed optimal droop control algorithm to achieve MG reliability, with strong benefits near the upper limit of an overload condition. Ghofrani et al. [29] provided a novel technique for bridging the gap between DSM and MG siting, sizing, and location optimization.

From the literature survey conducted, it is analyzed that currently the whole world is facing an energy crisis, which leads to escalating electrical energy demands and growing emissions of hazardous gases like carbon dioxide. In order to reduce this demand, RERs have been adopted, which are environment-friendly and inexhaustible in meeting energy demands while also reducing hazardous gas emissions. A large number of methods have already been proposed by different researchers in



order to enhance the power generation capacity of these grids. However, due to the dispersed nature of RERs, the management of these microgrids is difficult and complex. Furthermore, scheduling in these systems was not optimal and needs improvement. In traditional systems, the scheduling of devices was done manually, which makes them inefficient and time-consuming. Hence, it becomes extremely important to develop an approach that can effectively control and manage these applications. Inspired by these findings, a system is developed in order to provide a solution that can schedule the power generation process dynamically and quickly while fulfilling the demand.

11.3 Proposed Model

In order to overcome the limitations of the traditional model, this paper proposes a model for three power generating systems, i.e., a PV array system, a wind system, and fuel cells, whose main focus is to solve these problems. In conventional models, the scheduling was done manually to track demand and load variation, whereas in the proposed model the scheduling is done through the GWO algorithm in order to reduce processing time and complexity. The main motive for using an optimization algorithm is that it can solve NP-hard problems efficiently and in less time. The GWO algorithm solves the scheduling issue with an iterative process whose objective is to schedule the generation units in such a way that both the cost- and load-related requirements are met. The working mechanism of the proposed GWO model is given in this section.

11.3.1 Methodology

The process adopted in the proposed GWO model in order to increase power generation capacity is discussed briefly below:

1. The first step is to define the information set of the microgrid, which consists of solar photovoltaic (PV), wind turbine (WT), and fuel cell (FC) sources. This comprises information on how much power can be obtained from these sources during the 24 h of a single day.
2. Once this is defined, the next step is to initialize the optimization algorithm, in which various parameters, such as the population (amounts of power within the lower and upper power limits of the grid components), the number of iterations, and the dimensions, are taken into consideration. The exact values of these configuration parameters in the proposed GWO-based system are given in Table 11.1.

126

J. Kaur et al.

Table 11.1 Configuration parameters for the proposed GWO model

S. No.  Performance parameter        Value
1       Population                   20
2       Iterations                   50
3       Dimensions                   3
4       a coefficient                2 to 0
5       PV power limits              [0 5.35]
6       Wind power limits            [0 11]
7       Fuel cell power limits       [0 52.88]

Fig. 11.2 Graphical representation of load demand

In addition to these parameters, the load demand over 24 h is analyzed and represented in graphical form, as shown in Fig. 11.2.

3. Once the network is initialized, the next step is to randomly generate the population for the three power generation systems, i.e., PV generation, wind power generation, and fuel cell generation, on the basis of their upper and lower bound values.
4. After generating the population, the fitness value of all three power generation systems is calculated. The output produced by the three power generation systems is then compared with the demanded load in order to select the best fitness value. The best fitness value is the one that is closest to the demanded load value, in other words, the one whose fitness value is minimum.
5. Once the fitness values are calculated, the next step is to perform the optimization. For this, the proposed model utilizes the GWO algorithm, which performs optimization by updating the iterations and selecting the best fitness value among all the available fitness values (see the sketch after this list).


6. Finally, the performance of the proposed GWO-based system is evaluated and compared with the traditional models to evaluate its efficiency.
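The following is a minimal Python sketch of the GWO scheduling loop described in steps 3–5, under stated assumptions: the fitness is taken as the absolute gap between total generation and the demanded load for one hour, and the bounds reuse the limits of Table 11.1. It is an illustration, not the authors' MATLAB implementation.

```python
import numpy as np

def gwo(fitness, lb, ub, wolves=20, iters=50):
    """Minimal Grey Wolf Optimizer: minimizes `fitness` within [lb, ub]."""
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    X = lb + np.random.rand(wolves, dim) * (ub - lb)    # random initial population
    scores = np.array([fitness(x) for x in X])
    for t in range(iters):
        alpha, beta, delta = X[np.argsort(scores)[:3]]  # three best wolves
        a = 2 - 2 * t / iters                           # 'a' decreases from 2 to 0
        for i in range(wolves):
            pos = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = np.random.rand(dim), np.random.rand(dim)
                A, C = 2 * a * r1 - a, 2 * r2
                pos += (leader - A * np.abs(C * leader - X[i])) / 3.0
            X[i] = np.clip(pos, lb, ub)
            scores[i] = fitness(X[i])
    best = np.argmin(scores)
    return X[best], scores[best]

# Illustrative one-hour dispatch: PV, wind, and fuel cell outputs tracking demand.
demand = 8.0                                            # MW, hypothetical hour
schedule, gap = gwo(lambda p: abs(p.sum() - demand),
                    lb=[0, 0, 0], ub=[5.35, 11, 52.88])
```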

11.4 Results and Discussion

The performance of the proposed GWO model is analyzed and evaluated for the three power generation systems in MATLAB. The simulated outcomes were obtained and compared with the conventional model, in which scheduling is done manually, in terms of the total power generated by each system. A detailed description of the simulated results is given in this section; the proposed GWO-based algorithm is compared with the traditional optimal model for the three power generation systems in terms of their power generation capability.

11.4.1 For the PV System

The performance of the proposed GWO model is evaluated and compared with the standard model for the PV power generation system in terms of generation capacity, as shown in Fig. 11.3. Figure 11.3 illustrates the comparison between the traditional PV system, in which the optimization is done manually, and the proposed system, in which the optimization is done by the GWO algorithm. From the graph, it is observed that the power generation ability of the standard optimal system is only about 5.2 MW at peak hours, whereas the power generation capability of the proposed GWO model is around 6 MW during peak hours. Moreover, the power generation capability of the proposed model lasts up to 20 h, which makes it more efficient and reliable.

Fig. 11.3 Comparison graph for PV array power generation (Optimal-Management vs. GWO-Management; power in MW over 24 h)


Fig. 11.4 Comparison graph for wind power generation (Optimal-Management vs. GWO-Management; power in MW over 24 h)

11.4.2 For the Wind System

Similarly, the performance of the proposed system is evaluated and compared with the traditional optimal model for the wind generation system in terms of power generation ability, as shown in Fig. 11.4. Figure 11.4 presents the comparison of power generation capability between the standard optimal system and the proposed GWO wind power generation system. From the graph, it is observed that the power generated by the standard optimal wind model is lower than that of the proposed GWO wind model: the maximum power produced by the standard optimal model came out to be only 11 MW, whereas the maximum power produced by the proposed GWO system is 12 MW. This increase in generated power makes the proposed model more efficient and reliable.

11.4.3 For the Fuel Cell Power Generating System

Furthermore, the performance of the proposed GWO and the traditional optimal model for the fuel cell generation system is evaluated, as shown in Fig. 11.5. Figure 11.5 depicts the performance of the proposed GWO-optimized fuel cell system and the traditional optimal fuel cell system in terms of power production ability. From the graph, it is observed that the power generated by the conventional optimal system is 4 MW at maximum, whereas the maximum power generated by the proposed GWO fuel cell system is 12 MW, and it works efficiently even after 24 h.



Fig. 11.5 Comparison graph for fuel cell power generation (Optimal-Management vs. GWO-Management; power in MW over 24 h)

In addition to this, the performance of the proposed GWO system is analyzed and compared with the traditional optimal system in terms of total power generation capability, as depicted in Fig. 11.6. Figure 11.6 shows the comparison of power generation capability between the conventional optimal system and the proposed GWO system. From the graph, it is observed that the power generated by the conventional optimal system is considerably lower than that generated by the proposed GWO-optimized system: the maximum power generated by the conventional optimal system reaches up to 16 MW, whereas the power generated by the proposed GWO-optimized system reaches up to 25 MW.

Fig. 11.6 Comparison graph for total power generation (Optimal-Management vs. GWO-Management; power in MW over 24 h)


Fig. 11.7 Average actual load in comparison to managed load (daily average load; actual vs. managed, power in MW over 24 h)

This proves that the proposed GWO-optimized system is more efficient and generates electricity more effectively to meet the 24-h load demand. Figure 11.7 presents the average demanded load on an hourly basis in comparison to the load managed by the proposed scheme. It shows that the power achieved from the proposed model is above the demanded load, which will fulfil the users' load requirements. As the proposed scheme's output power is slightly more than the demand, it can be concluded that the proposed scheme is capable of handling the load, with some margin, even if the demanded load fluctuates. Moreover, the extra generated power can be sold for financial advantage. From the graphs, it is analyzed that the total power generated by the GWO-optimized system is very high and can satisfy the increasing load demands more effectively.

11.5 Conclusions

This paper presents an effective method based on the GWO optimization algorithm, so that the complexity of PV, wind, and fuel cell systems can be reduced while their power generation ability is enhanced. The performance of the proposed GWO model is analyzed and compared with the conventional systems in the MATLAB simulation software in terms of power generation ability. From the results, it is observed that the maximum power produced by the traditional PV system came out to be just 5.5 MW during peak hours, whereas the maximum power generated by the proposed GWO-optimized PV system came out to be 6 MW. Similarly, the power generated by the traditional wind generation system and the proposed GWO-optimized wind generation system is also evaluated, whose


values came out to be 11 MW and 12 MW, respectively, an improvement of 1 MW for the proposed GWO system. Moreover, the traditional fuel cell system was able to produce only 4 MW of power during peak hours, whereas the maximum power generated by the suggested GWO-optimized fuel cell system came out to be higher by around 8 MW. In addition, the proposed GWO-optimized model is able to generate a maximum total power of 25 MW. This proves that the proposed GWO-optimized approach is more effective in fulfilling users' load demands with reduced cost, processing time, and complexity.


Chapter 12

Hard and Soft Fault Detection Using Cloud Based VANET Biswa Ranjan Senapati, Rakesh Ranjan Swain, and Pabitra Mohan Khilar

Abstract Vehicular Ad hoc NETwork (VANET), popularly called the network on wheels, is used successfully for various applications. The successful implementation of any VANET application depends on quick and correct message dissemination between the source and destination. If the sensor unit of a vehicle is faulty, then the use of VANET for any application cannot succeed. Considering the fast-growing interest of researchers in academia and industry and the successful implementation of VANET for numerous applications, this paper proposes hard and soft fault detection for the sensor unit of the vehicle using the vehicular cloud. Hard fault detection is carried out at the static nodes of the VANET: in the hard fault detection phase, the permanent hard fault is identified using a time-out mechanism. Soft fault detection is accomplished by the observation center, to which the status of vehicular sensor information is transmitted using the vehicular cloud; in the soft fault diagnosis, the permanent soft fault is detected using a comparative-based approach. The proposed approach is validated using two performance metrics.

12.1 Introduction

Any instrument or equipment widely used for numerous applications is subject to failure because of excessive usage. The component may be utterly faulty, in which case it does not work at all; this situation refers to a hard fault. Alternatively, the component may be partially faulty: the device will work, but not correctly. This


category of fault is called a soft fault. Whether the device has a hard fault or a soft fault, detecting the fault as quickly as possible is essential; otherwise, the overall performance will degrade. VANET, a network formed by connected vehicles, is used for a wide range of applications because of successful communication among the nodes [1]. By the year 2025, around 116 million cars are expected to be connected on the road, communicating around 25 gigabytes of data to cloud servers per hour [2]. Broadly, the vehicular network is formed by two components, which are distributed and asynchronous:

(a) Vehicles: in VANET, vehicles are the mobile nodes whose motion is restricted by the structure of the road.
(b) Road side unit (RSU): static nodes whose storage capacity and transmission range exceed those of vehicles.

The communication in VANET is also called V2X (vehicle to everything). Figure 12.1 presents the V2X communication of VANET. The V2X communication of VANET, the availability of 5G and 6G for quick data transmission, vehicular communication standards such as Wireless Access in Vehicular Environment (WAVE) and Dedicated Short Range Communication (DSRC), the easy availability of a large number of sensors at affordable cost for measuring various physical parameters, and the performance enhancement of VANET obtained by optimizing vehicular parameters all help VANET serve a wide range of applications [3]. However, the successful implementation of these applications is not possible if the sensor unit is faulty. This motivates the automatic

Fig. 12.1 V2X communication of VANET


detection of a faulty OBU so that various applications can be implemented successfully in VANET. The significant contributions of this article are as follows:

(a) Hard fault detection using a time-out window mechanism.
(b) Soft fault detection using a comparative-based approach.
(c) Transmission of vehicular sensor data to the observation center for soft fault diagnosis using the vehicular cloud.

12.2 Literature Study

This section discusses two aspects of VANET. The first aspect focuses on the various applications of VANET; the second mentions work done by researchers on fault detection in VANET. Safety applications of VANET include broadcasting road accident information and automatic fire monitoring service [4]. Convenience applications include automatic toll tax collection [5] and automatic parking service [6]. Productive applications include automatic environmental parameter monitoring [7]. Using cloud services, various further applications are also feasible [8]. Thus, a wide range of application categories is possible using VANET. However, implementation of these applications is not feasible if the communication unit is faulty. Various researchers have worked on fault detection of the communication unit in VANET [9]. Fault detection of the OBU is done at the observation center using the vehicular cloud [10]; this approach detects a faulty OBU but cannot distinguish different categories of fault. Bhoi et al. proposed a self-fault detection approach for the OBU [11], but it has a high false alarm rate when the majority of the neighboring vehicles are faulty. Composite fault detection for VANET detects various categories of fault [12], but the computation is performed at the RSU, which increases the overhead on the RSU. Based on the above constraints, a fault detection approach for hard and soft faults is proposed in the next section.

12.3 Proposed Approach

In this section, hard and soft permanent faults are detected in turn. The details are presented in the following subsections.

12.3.1 Hard Permanent Fault Detection

The permanent hard fault is identified using a time-out window mechanism [13–15]. The time-out window contains the results of repeated time-out messages according


Fig. 12.2 Overview of the time-out window for each time iteration

to the observation times. Figure 12.2 shows an overview of the time-out window values over time iterations 1 to T. The RSU maintains a status table that contains the probably-faulty value of each vehicular sensor unit VS_i ∈ VS. The faulty value of vehicular sensor unit VS_i ∈ VS at time iteration t is denoted VS_i(t) and takes the value zero (0) or one (1), as defined in Eq. 12.1:

$$VS_i(t) = \begin{cases} 0, & \text{fault free} \\ 1, & \text{probably faulty} \end{cases} \qquad (12.1)$$

Initially, all VS_i(t) values from time iteration 1 to T are set to 0. The RSU sends periodic messages over observation times 1 to T and waits a time-out period T_0 for the acknowledgment message. Equation 12.2 shows the computation of the time-out period T_0:

$$T_0 = 2T_{\text{propagation}} + T_{\text{transmission}} + T_{\text{processing}} + T_{\text{queuing}} \qquad (12.2)$$

After the observation times t = 1, 2, 3, …, T, the status of each vehicular sensor unit VS_i ∈ VS is computed by the RSU in its communication range. The fault status of each vehicular sensor unit is denoted f(VS_i) and is computed in Eq. 12.3:

$$f(VS_i) = \sum_{t=1}^{T} VS_i(t) \qquad (12.3)$$

The actual faulty sensor unit is identified by the condition check in Eq. 12.4:

$$f(VS_i) = \sum_{t=1}^{T} VS_i(t) \ge \left\lceil \frac{T}{2} \right\rceil \qquad (12.4)$$

If the fault status f(VS_i) of a vehicular sensor unit is greater than or equal to ⌈T/2⌉, i.e., the sensor unit fails to respond in at least ⌈T/2⌉ of the T observation rounds, then the vehicular sensor unit VS_i ∈ VS satisfies the condition in Eq. 12.4 and is


declared a hard-faulty sensor unit in the vehicular network. Algorithm 1 shows the hard permanent fault diagnosis steps.

12.3.2 Soft Fault Diagnosis

The permanent soft fault is detected using a comparative-based approach [16, 17]. A vehicular sensor unit VS_i ∈ VS communicates with its one-hop neighbors VS_j ∈ VS in its communication range. The sensor data {s_1, s_2, s_3, …, s_t} of VS_i are compared with the sensor data of each one-hop neighbor VS_j ∈ VS. If |s_i − s_j| > θ, a fault status fs_ij is generated and set to one; otherwise, fs_ij is set to zero. The threshold value θ is set as per the vehicular sensor value ranges. Each sensor unit VS_i ∈ VS in the network is compared with its one-hop neighbor sensors VS_j ∈ VS, generating the fault status vector fs_ij = {fs_i1, fs_i2, fs_i3, …, fs_im}, j = 1, 2, 3, …, m, for m one-hop neighbor sensors in the vehicular network. Each sensor unit VS_i ∈ VS sends its fault status result fs_ij = {fs_i1, fs_i2, fs_i3, …, fs_im} to the vehicular cloud for further computation.

Algorithm 1: Hard Permanent Fault Diagnosis Algorithm
1. The RSU initializes a fault status table for each vehicular sensor unit VS_i ∈ VS.
2. Initialize the fault status VS_i(t) = 0 for each vehicular sensor unit VS_i ∈ VS at each time iteration t.
3. For each time iteration t = 1, 2, 3, …, T do
   i. Each RSU sends a periodic message within its range;
   ii. After receiving the periodic message, the vehicular sensor replies with an ACK within the T_0 time period;
   iii. If T_0 expires without an ACK, set VS_i(t) = 1 for time iteration t;
4. End For
5. Compute the actual fault status f(VS_i) of each vehicular sensor unit VS_i ∈ VS using Eq. 12.3;
6. If f(VS_i) ≥ ⌈T/2⌉ then
   i. The vehicular sensor VS_i is declared hard faulty;
7. Else
   i. The vehicular sensor VS_i is considered fault free;
8. End If
9. Broadcast the actual fault status in the vehicular network;
10. STOP.
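To make the counting rule concrete, here is a minimal C++ sketch of the decision in steps 5–8 of Algorithm 1, assuming the per-iteration time-out outcomes VS_i(t) have already been collected at the RSU; the function and variable names are illustrative, not part of the original design.

```cpp
#include <vector>

// Decide whether a sensor unit is hard faulty from its time-out window.
// timeoutFlags[t] holds VS_i(t): 1 if no ACK arrived within T0 at iteration t.
bool isHardFaulty(const std::vector<int>& timeoutFlags) {
    const int T = static_cast<int>(timeoutFlags.size());
    int missed = 0;                      // f(VS_i) from Eq. 12.3
    for (int flag : timeoutFlags) {
        missed += flag;
    }
    const int threshold = (T + 1) / 2;   // integer form of ceil(T/2)
    return missed >= threshold;          // condition of Eq. 12.4
}
```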


In the vehicular cloud, the ith sensor unit's fault status fs_ij = {fs_i1, fs_i2, fs_i3, …, fs_im} and the jth sensor unit's fault status fs_ji = {fs_j1, fs_j2, fs_j3, …, fs_jm} are compared to generate a test result ω(i, j). The test result between sensor units VS_i, VS_j ∈ VS is denoted ω(i, j) and computed in Eq. 12.5:

$$\omega(i, j) = \left| \{fs_{i1}, fs_{i2}, \ldots, fs_{im}\} \cap \{fs_{j1}, fs_{j2}, \ldots, fs_{jm}\} \right| \qquad (12.5)$$

In Eq. 12.5, {fs_i1, …, fs_im} is the fault status of vehicular sensor unit VS_i ∈ VS and {fs_j1, …, fs_jm} is the fault status of vehicular sensor unit VS_j ∈ VS. Where the fault statuses fs_ij and fs_ji are similar, the test result ω(i, j) is computed as the cardinality of the similarity. The degree of similarity Ω(i, j) is then defined as the ratio between the test result ω(i, j) and the number of test comparisons, as computed in Eq. 12.6:

$$\Omega(i, j) = \frac{\omega(i, j)}{\text{number of comparisons}} \qquad (12.6)$$

The degree of similarity takes a value Ω(i, j) ∈ [0, 1].
Case 1: If the fault statuses of the vehicular sensor units are identical, the degree of similarity Ω(i, j) is one.
Case 2: If the fault statuses are entirely dissimilar, Ω(i, j) is zero.
Case 3: If the fault statuses are partially similar, Ω(i, j) lies strictly between 0 and 1, i.e., Ω(i, j) ∈ (0, 1).

After the calculation of the degree of similarity Ω(i, j), the likely faulty set LF(i, j) is fetched. Then, for each likely faulty set LF(i, j), the summation of the degree of similarity ΣΩ(i, j) is calculated and treated as the factual computed fault status. The likely faulty set with the highest summation value of the degree of similarity is considered the factual faulty module. Thus, the faulty and fault-free modules are identified effectively using the factual faulty set. The vehicular cloud then broadcasts the actual fault status in the vehicular network. Algorithm 2 shows the soft fault diagnosis steps.


Algorithm 2: Soft Fault Diagnosis Algorithm
1. For each vehicular sensor unit i = 1 to n do
   a. For each neighbor vehicular sensor j = 1 to m do
      i. Compare the vehicular sensor values;
      ii. If (|s_i − s_j| > θ) then fs_ij = 1;
      iii. Else fs_ij = 0;
      iv. End If
      v. Fault status fs_ij ← {fs_i1, fs_i2, fs_i3, …, fs_im};
   b. End For
2. End For
3. Each sensor VS_i ∈ VS sends its fault status to the vehicular cloud;
4. For each vehicular sensor unit i = 1 to n do
   a. For each neighbor vehicular sensor j = 1 to m do
      i. Compare the fault statuses fs_ij and fs_ji;
      ii. Generate the test result ω(i, j) = |{fs_i1, fs_i2, …, fs_im} ∩ {fs_j1, fs_j2, …, fs_jm}|;
      iii. Calculate the degree of similarity Ω(i, j) using Eq. 12.6;
   b. End For
5. End For
6. For each degree of similarity Ω(i, j)
   a. Generate a likely faulty set LF(i, j);
7. End For
8. For each likely faulty set LF(i, j)
   a. Calculate the sum of the degree of similarity ΣΩ(i, j);
9. End For
10. Sort the likely faulty sets LF(i, j) according to ΣΩ(i, j) in increasing order;
11. The LF(i, j) with the highest ΣΩ(i, j) value is the actual faulty set;
12. The faulty and fault-free sensors are identified using the actual faulty set LF(i, j);
13. STOP.
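As an illustration of the pairwise tests in Algorithm 2 and Eqs. 12.5–12.6, here is a short C++ sketch; it assumes the intersection in Eq. 12.5 counts the positions where the two fault-status vectors agree, and all names and values are placeholders rather than the authors' implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Step 1 of Algorithm 2: pairwise fault status from one-hop sensor readings.
int faultStatus(double si, double sj, double theta) {
    return (std::fabs(si - sj) > theta) ? 1 : 0;   // fs_ij
}

// Eq. 12.5: test result omega(i,j), read here as the number of positions
// where the two fault-status vectors fs_i and fs_j agree.
int testResult(const std::vector<int>& fsi, const std::vector<int>& fsj) {
    const std::size_t m = std::min(fsi.size(), fsj.size());
    int omega = 0;
    for (std::size_t k = 0; k < m; ++k) {
        if (fsi[k] == fsj[k]) ++omega;
    }
    return omega;
}

// Eq. 12.6: degree of similarity = omega(i,j) / number of comparisons.
double degreeOfSimilarity(const std::vector<int>& fsi,
                          const std::vector<int>& fsj) {
    const std::size_t m = std::min(fsi.size(), fsj.size());
    return m == 0 ? 0.0 : static_cast<double>(testResult(fsi, fsj)) / m;
}
```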

In this fault diagnosis phase, we identify both the hard and the soft permanent faulty sensor units in the vehicular network, so it is called a complete diagnosis [18] of faulty vehicular sensor units in a vehicular network.

12.3.3 Fault Status Transmission Through Vehicular Cloud

Comparing the fault status of a faulty vehicle with that of its one-hop neighbors requires transmitting the fault status to the observation center. As the vehicular cloud server has ample storage and quick transmission capability, the transmission is


Fig. 12.3 Overall transmission scenario

done using the cloud server. Certain assumptions are made for the transmission of data through the vehicular cloud:

(A1) All the vehicular sensors are considered fault free.
(A2) Only the communication unit is checked for being faulty or fault free.
(A3) Transmission of vehicular data through the vehicular cloud to the observation center is considered fault free.
(A4) There is no temporary disconnection between the cloud server and the OBU of the vehicle during transmission.

Transmission Through Vehicular Cloud
To check whether the OBU of a vehicle is faulty or not, its fault status along with the fault status of its one-hop neighbors is transmitted to the observation center. For this transmission, cloud storage as a service is used. The transmission of the fault status of a vehicle and its one-hop neighbors is possible when the vehicle is at a traffic point or in a parking lot. The overall transmission scenario is presented in Fig. 12.3.

12.4 Simulation Experiments and Discussions

The simulation experiments are discussed in this section. The performance metrics used for the simulation results are as follows.

i. Fault detection accuracy (FDA), presented in Eq. 12.7:

$$\mathrm{FDA} = \frac{\text{Number of faulty nodes correctly identified}}{\text{Actual number of faulty nodes present}} \qquad (12.7)$$

ii. False alarm rate (FAR), presented in Eq. 12.8:

$$\mathrm{FAR} = \frac{\text{Number of fault-free nodes detected as faulty}}{\text{Actual number of fault-free nodes present}} \qquad (12.8)$$

Simulation Results
For the transmission of vehicular sensor data through cloud servers, ThingSpeak is used. The data on the cloud server can be accessed from another location in JSON, XML, or CSV format. Using a DHT11 sensor, temperature data are transferred to the ThingSpeak cloud server, as presented in Fig. 12.4. The hard and soft fault diagnosis is evaluated by FDA and FAR, simulated using the MATLAB R2019a simulator. The FDA for the hard permanent fault is presented in Fig. 12.5: as the percentage of faulty vehicles increases, the FDA decreases. The FAR is not significant for the hard permanent fault, owing to the window-based time-out mechanism.

Fig. 12.4 DHT11 temperature sensor data in ThingSpeak
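For reference, a hypothetical Arduino-style C++ sketch of the DHT11-to-ThingSpeak path follows; the Wi-Fi credentials, write API key, and pin assignment are placeholders, and the sketch uses ThingSpeak's public HTTP update endpoint rather than any project-specific interface.

```cpp
#include <ESP8266WiFi.h>
#include <ESP8266HTTPClient.h>
#include <DHT.h>

DHT dht(D4, DHT11);                       // DHT11 on pin D4 (assumed wiring)

void setup() {
  WiFi.begin("SSID", "PASSWORD");         // placeholder credentials
  while (WiFi.status() != WL_CONNECTED) delay(500);
  dht.begin();
}

void loop() {
  float tempC = dht.readTemperature();    // degrees Celsius
  if (!isnan(tempC)) {
    WiFiClient client;
    HTTPClient http;
    // ThingSpeak channel update endpoint; API_KEY is a placeholder.
    String url = "http://api.thingspeak.com/update?api_key=API_KEY&field1="
                 + String(tempC);
    http.begin(client, url);
    http.GET();                           // fire-and-forget update
    http.end();
  }
  delay(20000);                           // respect ThingSpeak's rate limit
}
```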

Fig. 12.5 Fault detection accuracy for hard permanent fault


Fig. 12.6 FDA and FAR for soft permanent fault

From Fig. 12.6, as the percentage of faulty vehicles increases, the FDA decreases and the FAR increases.

12.5 Conclusion

In this paper, hard and soft permanent faults of the OBU are detected. The performance of the proposed work is evaluated using FDA and FAR. After the detection of a faulty OBU, the OBU is isolated from the routing process. In future work, the isolation process and a real-life implementation of the proposed approach will be carried out.

References

1. Lee, M., Atkison, T.: VANET applications: past, present, and future. Veh. Commun. 28, 100310 (2021)
2. Union, I.: IMT traffic estimates for the years 2020 to 2030. In: Report ITU, p. 2370 (2015)
3. Senapati, B.R., Khilar, P.M.: Optimization of performance parameter for vehicular ad-hoc network (VANET) using swarm intelligence. In: Nature Inspired Computing for Data Science, pp. 83–107. Springer, Cham (2020)
4. Senapati, B.R., Khilar, P.M., Swain, R.R.: Fire controlling under uncertainty in urban region using smart vehicular ad hoc network. Wireless Pers. Commun. 116(3), 2049–2069 (2021)
5. Senapati, B.R., Khilar, P.M., Sabat, N.K.: An automated toll gate system using VANET. In: 2019 IEEE 1st International Conference on Energy, Systems and Information Processing (ICESIP), pp. 1–5. IEEE (2019)
6. Senapati, B.R., Khilar, P.M.: Automatic parking service through VANET: a convenience application. In: Progress in Computing, Analytics and Networking, pp. 151–159. Springer, Singapore (2020)
7. Senapati, B.R., Swain, R.R., Khilar, P.M.: Environmental monitoring under uncertainty using smart vehicular ad hoc network. In: Smart Intelligent Computing and Applications, pp. 229–238. Springer, Singapore (2020)


8. Sabat, N.K., Pati, U.C., Senapati, B.R., Das, S.K.: An IoT concept for region based human detection using PIR sensors and FRED cloud. In: 2019 IEEE 1st International Conference on Energy, Systems and Information Processing (ICESIP), pp. 1–4. IEEE (2019)
9. Ferreira: A survey on fault tolerance techniques for wireless vehicular networks. Electronics 8(11), 1358 (2019)
10. Senapati, B.R., Mohapatra, S., Khilar, P.M.: Fault detection for VANET using vehicular cloud. In: Intelligent and Cloud Computing, pp. 87–95. Springer, Singapore (2021)
11. Bhoi, S.K., Khilar, P.M.: Self soft fault detection based routing protocol for vehicular ad hoc network in city environment. Wireless Netw. 22(1), 285–305 (2016)
12. Senapati, B.R., Khilar, P.M., Swain, R.R.: Composite fault diagnosis methodology for urban vehicular ad hoc network. Veh. Commun. 29, 100337 (2021)
13. Swain, R.R., Khilar, P.M., Bhoi, S.K.: Heterogeneous fault diagnosis for wireless sensor networks. Ad Hoc Netw. 69, 15–37 (2018)
14. Swain, R.R., Dash, T., Khilar, P.M.: An effective graph-theoretic approach towards simultaneous detection of fault(s) and cut(s) in wireless sensor networks. Int. J. Commun. Syst. 30(13), e3273 (2017)
15. Swain, R.R., Khilar, P.M., Dash, T.: Fault diagnosis and its prediction in wireless sensor networks using regressional learning to achieve fault tolerance. Int. J. Commun. Syst. 31(14), e3769 (2018)
16. Swain, R.R., Khilar, P.M., Bhoi, S.K.: Underlying and persistence fault diagnosis in wireless sensor networks using majority neighbors co-ordination approach. Wireless Pers. Commun. 111(2), 763–798 (2020)
17. Mohapatra, S., Khilar, P.M., Swain, R.R.: Fault diagnosis in wireless sensor network using clonal selection principle and probabilistic neural network approach. Int. J. Commun. Syst. 32(16), e4138 (2019)
18. Swain, R.R., Dash, T., Khilar, P.M.: A complete diagnosis of faulty sensor modules in a wireless sensor network. Ad Hoc Netw. 93, 101924 (2019)

Chapter 13

A Novel Intelligent Street Light Control System Using IoT Saumendra Pattnaik, Sayan Banerjee, Suprava Ranjan Laha, Binod Kumar Pattanayak, and Gouri Prasad Sahu

Abstract In the twenty-first century, the intelligent street light control system has become one of the simplest and most powerful IoT techniques. Whenever a vehicle or pedestrian passes the motion detector (PIR sensor), it triggers a signal to the NodeMCU, which in turn triggers the 4-channel relay and sends a binary signal to the mobile application. To optimize cost and energy efficiency, we introduce a motion sensor between every alternate electric pole. When the motion detector detects the motion of an object, it switches on two street lights at a time, which increases energy efficiency and reduces the installation charge as well. The light remains turned on while the detector detects the motion of vehicles or pedestrians. To increase energy conservation, we use LED lights instead of conventional bulbs. To further optimize the solution, instead of using motion detectors after every alternate pole, we use a single motion detector after every intersection or crossing. On highways, vehicles must keep up a minimum speed, and we set up the street lights according to this minimum speed. With this data, along with the range of the detector and the time taken for a vehicle to cross the detector's range, we estimate the distance between the street lights. The experimental results show that the proposed system is energy efficient and cost-effective.



13.1 Introduction

Light and electricity can be considered critical aspects of modern human society, so their conservation and proper use are naturally of utmost importance. Most often, street lights are completely mechanical: they remain OFF during the day and ON during the night. There are many stretches of our highways with little or no vehicular movement, so a loss of both electricity and light is imminent. Also, many street lights still use fluorescent bulbs, which have high wattage and comparatively lower light intensity than their modern counterparts. Since the year 2010, IoT has become one of the fastest emerging technologies [1]. IoT is a network of physical devices that are embedded with sensors and software to connect and exchange information among each other [2, 3]. Automation of street lights is an approach in which street lights are automated using motion sensors, light sensors, and an IoT platform (i.e., NodeMCU). In this paper, we narrow our focus to street lighting systems on national highways, a crucial yet often overlooked area that consumes a huge amount of electricity and installation charges. One of the major areas of electricity misuse is the long highway stretches and less-traveled roads. A mechanism to handle this problem is to automate the street lights, which can be done using a technology that has gained massive popularity in the past decade: IoT. The thinking behind automating street lights is broadly similar across the model presentations in the literature: motion sensors, light sensors, and LED lights are the main components around which the models are built. The first major difference lies in the source of electricity. Many of the proposed models have taken solar energy as their electricity source, which is one of the best forms of electricity possible. We have not used this source and have kept to the traditional sources still prevalent in India: our model is designed to be implemented on Indian highways, and using solar panels for an entire highway stretch would require a huge budget. This can change when the energy ecosystem in the country changes as a result of ongoing research. The second difference relates to impaired lamps. Many models propose a direct messaging system that is initiated whenever a street light is out of order; this is very helpful and drastically reduces the time needed to identify damaged lights. We are yet to implement this system in our research. The last major difference in our model is the total count of motion and light sensors. Many of the proposed models, if not all, use sensors at each street post; our model, as stated earlier, groups many lamps into a single unit served by a single motion sensor. This not only reduces the number of sensors but also makes it easy to replace a defective piece. The system will be very useful for long highways and expressways. The amount of hardware used has also been reduced to improve cost-effectiveness: our architecture uses far fewer sensors than prevailing systems, which helps decrease the maintenance cost of the overall system. Because the number of


sensors is smaller, the system also saves an ample amount of electricity, making it eco-friendly, cost-effective, and easy to implement at large scale. The main purpose of the research was to reduce the total cost of implementation and maintenance, as the prevailing systems were not cost-effective. Since our system uses a very small number of sensors and LEDs, it is very cost-effective and easy to implement. The rest of the paper is structured as follows: various technologies used for smart street lights are described in Sect. 2, followed by a summary table; materials used for the development of the smart street lighting system are described in Sect. 3 as the state of the art; the proposed system is elaborated in Sect. 4; Sect. 5 outlines the features of the smart street light system; Sect. 6 describes the results and discussion; and finally Sect. 7 presents the conclusion.

13.2 Related Works

We detail here various technologies used for smart street lights as reported in the literature.

GSM, LDR, and RTC: Revathy et al. [4] proposed a GSM-, LDR-, and RTC-sensor-based automated street light for smart cities. In this model, defects in the street lights are detected faster, which enables quicker repair. However, the street lights are never switched off completely, and because of this, energy consumption in metropolitan cities keeps increasing; a large amount of energy is wasted without being used.

Solar lights and PIR sensor: Sarma et al. [5], in the year 2017, developed solar lights with a PIR sensor. It is very common these days to see street lights powered by solar panels. As fossil fuels deplete and, at the same time, pollute the environment, the use of efficient power systems is essential. This paper presents a remote-sensing-based street light system in which the high-intensity discharge lamp is replaced by an LED. Where timely control is needed, the system can easily be implemented widely.

LDR sensor, RTC, and GSM module: Parekar et al. [6] note that most existing street lighting systems have cables that are difficult to install and not very flexible. To overcome this, a wireless system is required. In this model, the authors used GSM technology, which uses energy efficiently by remotely monitoring and controlling the system.

TEMT ambient light sensors, LEDs, PIC18F microcontroller: Lokhande et al. [7] observe that street lights are one of the basic amenities provided by the government in every town or city. In our country, ON-OFF control of street lights is manually operated, and fixed timings are not suitable in all seasons, as the lights either glow for a very long time or do not switch on when required. The work done in this paper, with an LED-based efficient system, automated operation, and dimming of light, improves the life of the lamps.


ZigBee wireless network and devices, control panels: Deo et al. [8] proposed a scheme for street lighting control with the aim of reducing human error, easing street light maintenance, and decreasing the energy consumption of the system. These objectives are achieved by creating a wireless ZigBee network of street lights that can be monitored from a base station. A new method is incorporated, namely an automatic mode of operation that uses an LDR sensor to automate the street lights depending on the light intensity.

PIR sensor, packet transfer technique: Yussoff et al. [9] implemented a low-cost sensor node for smart lighting of streets based on criteria needed by the industry. The authors mainly focused on developing a cost-effective sensor node that saves power by preventing the light from staying on all night.

Microcontroller, real-time clock, MOSFET circuit, LEDs: Joshi et al. [10] proposed time-based intensity control for street lights using an energy optimization system. The proposed system changes the intensity of light in accordance with the natural sunlight pattern. The MOSFET is vulnerable to static electricity, and the system takes one year to recognize unknown variations in the atmosphere.

Microcontroller, clock, light sensor, rain sensor, and laser sensor: Based on a low-cost microcontroller, Syed Abdul et al. [11] designed an energy-efficient intelligent street lighting system. The main advantage of the system is that the street lamps have five different levels of brightness but are never turned off; however, maintenance of all the sensors may become a bottleneck.

ARM processor, GSM module, solar panels, and sensors: Sandeep et al. [12] developed a more efficient street lighting management system that also contains a good fault detection system. Its main drawback is that switching on the successive lights requires sensors to be placed under all lights. With increased maintainability and performance, this system shows substantial energy conservation.

PLC (Programmable Logic Controller), LED lights: Sudha et al. [13] proposed a PLC-based intelligent street lighting control system in which this technology is used to change the intensity of the street lights. The purpose of the work is to meet the demand for public lighting systems using a Programmable Logic Controller (PLC); much emphasis is given to the technology rather than to the problem of efficient street lights.

The systems described above are summarized in Table 13.1.


Table 13.1 Summary of various technologies used for smart street light

1. Automated street light for smart city [4]
   Technology used: GSM technology, LDR and RTC sensors
   Advantages: Defects in the street lights are detected faster, which enables quicker repairs
   Disadvantages: The street lights are never switched off completely; when there is no traffic, they work at 20% of their capacity

2. Using solar panel and microcontroller-based street light power reduction system [5]
   Technology used: Solar lights, IR sensors, LDR sensors
   Advantages: Usage of solar panel batteries for powering the street lights
   Disadvantages: Installation cost of the solar system is high; the IR sensors used for detection may detect many unwanted objects

3. An intelligent system for controlling and monitoring of street light using GSM technology [6]
   Technology used: LDR and RTC sensors, GSM module
   Advantages: The system uses a combination of sensors for vehicle detection, which is said to give high efficiency
   Disadvantages: Apart from the very high overall cost of inputs, the system has fixed times for switching the lights on and off

4. Adaptive street light controlling for smart cities [7]
   Technology used: TEMT 6000 X01 ambient light sensor, LEDs, PIC18F microcontroller
   Advantages: The lights are triggered based on sunrise and sunset timings, taking the ambient light into account
   Disadvantages: The lights depend completely on time and environment; vehicle movements are not considered, which leads to losses when vehicles are absent

5. Smart street lighting system using ZigBee [8]
   Technology used: ZigBee wireless network and devices, control panels
   Advantages: The system uses low-cost technology and low-power-consumption wireless devices
   Disadvantages: The presence of two modes makes it a somewhat complex system

6. Development of street lighting monitoring system using sensor node [9]
   Technology used: PIR sensor, packet transfer technique
   Advantages: Lights are either in an ON state or an OFF state, which gives good conservation of electricity
   Disadvantages: Many packets are transmitted across the poles for vehicle detection, which may cause packet losses

7. Time-based intensity control for street lights using energy optimization [10]
   Technology used: Microcontroller, real-time clock, MOSFET circuit, LEDs
   Advantages: Changes the intensity of light in accordance with the natural sunlight pattern
   Disadvantages: The MOSFET is vulnerable to static electricity; the system also takes 1 year to recognize unknown variations in the atmosphere

8. Automatic street lighting system based on low-cost microcontroller for energy efficiency [11]
   Technology used: Microcontroller, clock, light sensor, rain sensor, laser sensor
   Advantages: The street lamps have 5 different levels of brightness, but are never turned off
   Disadvantages: Maintenance of all the sensors can be a problem

9. ARM-based street lighting system with fault detection [12]
   Technology used: ARM processor, GSM module, solar panels, and sensors
   Advantages: The system contains a good fault detection system
   Disadvantages: Switching ON the successive lights needs sensors to be placed under all lights

10. PLC-based intelligent street lighting control [13]
    Technology used: PLC (programmable logic controller), LED lights
    Advantages: The technology is used to change the intensity of street lights
    Disadvantages: Much emphasis is given to the technology rather than to the problem of efficient street lights

13.3 State of Art

Nowadays, the Internet is the main medium through which people communicate with each other using different applications. The Internet of Things (IoT) helps things communicate with each other using an IoT module. IoT can be defined as the network of physical objects, or "things", embedded with software, sensors, and network connectivity enabling these objects to collect and exchange data [14]. The methods used for the development of the smart street light system are described as follows.

13.3.1 NodeMCU

NodeMCU is open source firmware together with an open source board layout. The name "NodeMCU" combines "node" and "MCU" (microcontroller unit) [15]. In our research, the NodeMCU is the core of the system: it is responsible for controlling every other piece of equipment.

13.3.2 PIR Sensor

A passive infrared sensor, also known as a PIR sensor, is an electronic sensor that measures infrared (IR) light emitted by objects [16]. PIR sensors are mainly used in motion detectors: they do not provide information about who or what moved, but they detect general movement. In our system, the sensor detects movement and feeds this input to the NodeMCU.


13.3.3 Channel Relay

The 4-channel relay module contains the corresponding switches, four 5 V relays, and isolation components. This simplifies the connection to the microcontroller or sensor with a minimum number of components and connections [17]. Each relay contact is rated at 10 A, at 30 VDC or 250 VAC.

13.3.4 LDR

The LDR, also known as a photoresistor, is a passive component whose resistance increases or decreases as the ambient light decreases or increases [18]. It helps the system save electricity by keeping the complete system switched off during the daytime.

13.4 Proposed System

Here, we achieve the goal of automating the street lights while reducing the hardware cost: instead of using motion detectors after every alternate pole, we use a single motion detector after every intersection or crossing (Fig. 13.1). On highways, vehicles must keep up a minimum speed, and we set up the street lights according to this minimum speed. With this data, together with the range of the detector and the time taken for a vehicle to cross the range of the detector, we estimate the distance between the street lights, as shown in Fig. 13.2. The use-case diagram and the complete flow of the system are shown in Fig. 13.3a, b. The architecture of an automatic street light control system provides numerous benefits, such as efficient and effective management of resources, intelligent management, and knowledge development [19]. The architecture is divided into two layers:

1. Physical layer
2. IoT layer

The function of the proposed system can be elaborated in the following steps.

Step 1: All devices are initialized.

Fig. 13.1 Pole-sensor positioning


Fig. 13.2 Working model of smart street light

Fig. 13.3 a Use-Case diagram, b complete flow of smart street light system

Step 2: The light sensor checks the ambient light and turns the system on.
Step 3: The motion sensor senses the presence of any movement and gives a 0/1 output based on the situation.
Step 4: The NodeMCU processes the data from the light sensor, the motion sensors, and the Blynk app.
Step 5: The NodeMCU triggers the 4-channel relay.
Step 6: The 4-channel relay toggles the LED street lights on/off.
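A minimal Arduino-style C++ sketch of this flow is given below, assuming an ESP8266 NodeMCU; the pin assignments, light threshold, and hold time are illustrative, and the Blynk app integration of Step 4 is omitted for brevity.

```cpp
const int PIR_PIN   = D1;   // PIR motion sensor output (assumed wiring)
const int RELAY_PIN = D2;   // relay channel driving the LED lamps
const int LDR_PIN   = A0;   // LDR on the analog input

const int DARK_THRESHOLD = 300;        // assumed ambient-light cut-off
const unsigned long HOLD_MS = 30000UL; // assumed lamp hold time after motion

unsigned long lastMotionMs = 0;
bool motionSeen = false;

void setup() {
  pinMode(PIR_PIN, INPUT);
  pinMode(RELAY_PIN, OUTPUT);
  digitalWrite(RELAY_PIN, LOW);        // Step 1: lamps off at start
}

void loop() {
  bool isDark = analogRead(LDR_PIN) < DARK_THRESHOLD;  // Step 2
  if (isDark && digitalRead(PIR_PIN) == HIGH) {        // Step 3
    motionSeen = true;
    lastMotionMs = millis();
  }
  // Steps 5-6: the relay keeps the lamps on for HOLD_MS after the last motion.
  bool lampsOn = isDark && motionSeen &&
                 (millis() - lastMotionMs < HOLD_MS);
  digitalWrite(RELAY_PIN, lampsOn ? HIGH : LOW);
}
```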

13.5 Features of Smart Street Light System

1. The system automates the street lighting system.
2. The system uses a smaller number of sensors compared with others.
3. The system is energy efficient.
4. The system can be controlled using an Android application.
5. The system can be converted to solar power easily.
6. The system is very user and eco-friendly.

13.6 Results and Discussion

The User Interface (UI) is the space where a human interacts with a computer. As IoT devices are hardware based, the user needs a UI to interact with and control the IoT system, and the UI should always be easy to use. Figure 13.4 shows a snapshot of the UI of the intelligent street light control system: we can see whether motion is detected, and we can also power the whole system on/off using wireless control. Figures 13.5 and 13.6 show the circuit used for running the entire model. The indication of success is given by the LEDs attached to the board; the lights glow in succession, in accordance with the speed of the object that passes the PIR sensor. One of the main challenges is the overhead cost of fused (burnt-out) street lamps: switching the lamps on and off on a regular basis will take a toll on their longevity. If the cost of electricity saved, even after taking into account the overhead cost of

Fig. 13.4 Blynk application user interface

Fig. 13.5 Relay circuit


Fig. 13.6 Node MCU circuit

fused lamps, is at least twice the cost incurred by the current system, then we have a good model in hand. Let us calculate a "Theoretical Cost", which gives an idea of the energy efficacy of our proposed system. The cost of a street light varies according to the region and the type used. The majority of Indian street lights were fluorescent, which not only had high wattage and higher electricity cost but also produced less light than their brilliant counterpart, the LED. Just by making this switch, the cost of electricity came down drastically. Reportedly, India has around 3.5 crore street lights as per a 2015 report, the energy demand of which was around 3400 MW when fluorescent lights were used. With LEDs, the demand has come down to 1400 MW, saving a total of 900 crore kWh of electricity annually and creating total savings of Rs. 5950 crores. These numbers are from the current system, where, on average, an electric lamp stays on for about 12 h every day. Returning to our problem of fused street lights versus electricity saved, the power of a single electric lamp would be around 40 W, taking the above report into consideration, with an electricity cost of Rs. 85/W. Considering the traffic in different places, there will be many areas where the lamps are on most of the night and areas where the opposite happens; let us carry on with the assumption that, on average, each lamp will be turned on for 6 h. This means the total electricity charge for one lamp per year would be around Rs. 74 lakhs, roughly half the previous average of Rs. 1.4 crore per street light. On a greater scale, the electricity consumed by all the street lights put together will cost the nation around Rs. 2 lakh crore, a huge decrease from the current model, which most likely costs Rs. 5 lakh crore. LEDs are high-quality light sources with a magnificent life span of around 50,000 h, which means that if used properly, one may never change an LED light in a lifetime. However, our usage of these street lights is not normal, which may affect their life span; for that reason, we assume a very low life span of 1 year for each of our LED lights. The cost of a single lamp is around Rs. 10,000, which equates to a cost of Rs. 350 crores/year for 3.5 crore lamps across Indian highways.
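A small worked computation of the proportional argument follows; it is a sketch, not the authors' calculation, using the lamp count and wattage quoted above and an assumed illustrative tariff.

```cpp
#include <cstdio>

int main() {
    const double lamps        = 3.5e7;   // street lamps nationwide (from text)
    const double wattage      = 40.0;    // W per LED lamp (from text)
    const double tariffPerKwh = 8.0;     // Rs/kWh (assumed, for illustration)

    // Annual electricity cost in rupees for a given daily on-time.
    auto annualCost = [&](double hoursPerDay) {
        double kwhPerYear = lamps * wattage * hoursPerDay * 365.0 / 1000.0;
        return kwhPerYear * tariffPerKwh;
    };

    double always12h = annualCost(12.0); // current always-on-at-night model
    double motion6h  = annualCost(6.0);  // motion-gated average (assumed)
    std::printf("12 h/day: Rs %.0f crore\n", always12h / 1e7);
    std::printf(" 6 h/day: Rs %.0f crore\n", motion6h / 1e7);
    // Halving the average on-time halves the national energy bill.
    return 0;
}
```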


The above discussion shows that, even with a very low assumed longevity for the street lamps, we have a model whose "Theoretical Cost" is about half that of the existing model.

13.7 Conclusion and Future Scope

The proposed system uses fewer sensors than others because a single sensor covers a group of around 20 street lamps. The system is very energy and cost efficient and can be controlled by a mobile app. There are three things we would like to improve in our system: full implementation of the LDR sensor; a mechanism that reports damaged lights back through messages; and the setting of the PIR sensor's delay time according to the minimum speed of vehicles on the highway. In this last approach, we consider the time taken by a vehicle to cross the range of the detector and use this data to set the delay time for the street lights; thus, if the speed of the vehicle increases or decreases, the delay time of the street lights changes accordingly. This makes the system completely automated and dynamic, which is feasible for large-scale environments. The project helps reduce e-waste by reducing the total number of components used and saves a huge amount of electricity by automating the lights, thus helping to make the environment safer for society. Using fewer sensors also curtails power consumption, which is highly beneficial to society.

References

1. Laha, S.R., Mahapatra, S.K., Pattnaik, S., Pattanayak, B.K., Pati, B.: U-INS: an android-based navigation system. In: Cognitive Informatics and Soft Computing, pp. 125–132. Springer, Singapore (2021)
2. Biswal, A.K., Singh, D., Pattanayak, B.K., Samanta, D., Yang, M.H.: IoT-based smart alert system for drowsy driver detection. Wirel. Commun. Mobile Comput. (2021)
3. Ramlowat, D.D., Pattanayak, B.K.: Exploring the internet of things (IoT) in education: a review. Inf. Syst. Des. Intell. Appl. 245–255 (2019)
4. Sarma, A., Verma, G., Banarwal, S., Verma, H.: Street light power reduction system using microcontroller and solar panel. In: 3rd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, pp. 2008–2010 (2016)
5. Revathy, M., Remaya, R., Sathiyavathi, R., Bharathi, B., Maria Anu, V.: Automation for street light in smart city. In: International Conference on Communication and Signal Processing, 6–8 Apr 2017, India
6. Yussoff, Y.M., Samad, M.: Sensor node development for street lighting monitoring system. In: IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, pp. 26–29 (2016)
7. Deo, S., Prakash, S., Patil, A.: ZigBee-based intelligent street lighting system. In: 2nd International Conference on Devices, Circuits and Systems (ICDCS), Coimbatore, pp. 1–4 (2014)


8. Lokhande, H.N., Markende, S.D.: Adaptive street light controlling for smart cities. Int. J. Appl. Eng. Res. 13(10), 7719–7723 (2018). ISSN 0973-4562
9. Lau, S.P., et al.: A traffic-aware street lighting scheme for smart cities using autonomous networked sensors. Comput. Electr. Eng. (2015)
10. Joshi, M., Madri, R., Sonawane, S., Gunjal, A., Sonawane, D.N.: Time based intensity control for energy optimization used for street lighting. In: India Educators' Conference (TIIEC), 2013 Texas Instruments, Bangalore, pp. 211–215 (2013)
11. Al Junid, S.A.M., Rohaida, H., Othman, Z., Saari, M.F.: Automatic street lighting system for energy efficiency based on low-cost microcontroller. Int. J. Simul. Syst. Sci. Technol. (IJSSST) 13, 43–48 (2012)
12. Sumathi, V., Sandeep, A., Kumar, B.T.: ARM based street lighting system with fault detection. Int. J. Eng. Technol. (IJET) 4141–4144
13. Latha, D.V.P., Sudha, K., Devabhaktuni, S.: PLC based smart street lighting control. Int. J. Intell. Syst. Appl. 64–72 (2013). Published Online in MECS 2013
14. Hosenkhan, M.R., Pattanayak, B.K.: Security issues in internet of things (IoT): a comprehensive review. New Paradigm Decis. Sci. Manage. 359–369 (2020)
15. Rath, M., Swain, J., Pati, B., Pattanayak, B.K.: Network security: attacks and control in MANET. In: Handbook of Research on Network Forensics and Analysis Techniques, pp. 19–37 (2018)
16. Rath, M., Pattanayak, B.K.: SCICS: a soft computing based intelligent communication system in VANET. In: International Conference on Intelligent Information Technologies, pp. 255–261. Springer, Singapore (2017)
17. Anguraj, D.K., Balasubramaniyan, S., Kumar, E.S., Rani, J.V., Ashwin, M.: Internet of things (IoT)-based unmanned intelligent street light using renewable energy. Int. J. Intell. Unmanned Syst. (2021)
18. Archibong, E.I., Ozuomba, S., Ekott, E.: Internet of things (IoT)-based, solar powered street light system with anti-vandalisation mechanism. In: International Conference in Mathematics, Computer Engineering and Computer Science (ICMCECS), pp. 1–6. IEEE (2020)
19. Kumar, P.: IoT based automatic street light control and fault detection. Turk. J. Comput. Math. Educ. (TURCOMAT) 12(12), 2309–2314 (2021)

Chapter 14

Enhancing QoS of Wireless Edge Video Distribution using Friend-Nodes Khder Essa, Sachin Umrao, Kaberi Das, Shatarupa Dash, and Bharat J. R. Sahu

Abstract In this paper, we extend a novel video dissemination framework at the edge cloud using Device-to-Device (D2D) communications. We propose a caching scheme within an Information Centric Network framework, referred to as Collaborative Caching (CC), which exploits the direct connectivity and cooperative communication offered by D2D. In CC, popular video contents (video-on-demand) are cached in three different entities located geographically close (in proximity) to the clients or users: the video contents are cached and distributed in edge devices such as the Cloud Radio Access Network (CRAN), distributed antennas, and smart devices labelled Friend Nodes (FNs). By placing popular contents close to the clients, the proposed scheme allows clients to download contents from the nearest FN, thus improving the Quality of Service in terms of delay. Our simulation results show that CC can reduce the video transmission delay by up to 60%.

14.1 Introduction

Device-to-Device (D2D) communication is seen as one of the cutting-edge paradigms for improving wireless networks' Quality of Service (QoS) at the edge [1]. In a D2D edge, the cellular area under one evolved NodeB (eNB) (or a distributed antenna in the case of 5G) is divided into multiple groups, and the user equipments (UEs) in a cluster are close enough to communicate directly. Cellular network performance can be remarkably improved by using this clustering concept. Koskela et al. [2] investigate several improvement aspects such as shorter delay, improved aggregate throughput, higher spectral efficiency, and low energy consumption.


Common solutions to cope with multimedia data, mainly videos, are: (i) using additional bandwidth to increase the link capacity in the physical layer, and (ii) shrinking the cell size to enhance bandwidth utilization. These conventional methods are not adequate and may even cause other problems, such as increased cost, scarcity of spectrum, and interference. Later, the large storage space and improved processing capacity of mobile devices encouraged researchers to analyze a novel architecture based on D2D communication. Golrezaei et al. [3] examined the scaling behavior of interference-free direct (D2D) links with consideration of the collaboration distance, caching statistics, and users' request pattern. Furthermore, Wang et al. [4] propose two data transaction models using an ACA-A auction model as well as a Stackelberg game in D2D communications, analyzing users' behavior in data offloading while selling or buying videos in their proximity from an economic point of view. In this work, we extend a video delivery architecture using high-end smartphones called Friend Nodes (FNs) [1]. An FN can serve the video demands of other nodes within its coverage and cluster. All FNs maintain a cached copy of popular data using a popularity index. Our proposed collaborative caching with an Information Centric Network (ICN) for the FN model includes Cloud Radio Access Network (CRAN) caching in addition to the previously proposed EPC model. Experimental results prove the usefulness of the proposed work, which reduces transmission delay by about 60% compared with earlier work.

14.2 System Model A cellular network coverage area is considered in which every eNB is linked to a remote server through a CRAN. There are n users inside the coverage area. Usually, the eNBs are the transmitter and receiver devices in the CRAN scenario. Consider that in a particular cluster $C_i$ of UEs, a UE receives files from an FN that uses transmission power $P_i$. Two neighboring D2D clusters' centers are separated by a distance $\aleph_{ij}$, and the distance between $UE_i$ and $FN_i$ is denoted $\mu_i$. The received power $P_r$ at the UE from the FN is given as:

$$ P_r = \frac{P_i H_i}{\sqrt{\mu_i}} \qquad (14.1) $$

where $H_i$ represents the channel gain. We consider that the Additive White Gaussian Noise (AWGN) power is $\sigma^2$. A UE has $M$ neighboring D2D clusters in the mobile network. $P_{I;j}$ is the interference power of neighboring cluster $C_j$ ($j = 1, 2, \ldots, M$), and $H_{I;j}$ is the corresponding channel gain. The Signal to Interference plus Noise Ratio (SINR) for cluster $C_i$ is expressed by:

$$ \mathrm{SINR} = \frac{P_i^r}{\sigma^2 + \sum_{j=1}^{M} \frac{P_{I;j} H_{I;j}}{\sqrt{\aleph_{ij}}}} \qquad (14.2) $$

When the total bandwidth is $W$, the maximum achievable data rate between the user and the FN is

$$ R_i = W \log_2\!\left(1 + \frac{P_i^r}{\sigma^2 + \sum_{j=1}^{M} \frac{P_{I;j} H_{I;j}}{\sqrt{\aleph_{ij}}}}\right) \qquad (14.3) $$

Each communication link, whether wired or wireless, incurs transmission delay. The transmission delays of the Remote Server (RS), CRAN, eNB, and FN are $d_{RS}$, $d_{CRAN}$, $d_{eNB}$, and $d_{FN}$, respectively. Transmission delay varies depending on the channel condition and traffic load. $\lambda_i$ denotes the popularity of the requested video $v_a$. If a video content is accessed from the RS, CRAN, eNB, or FN, the total delay incurred is denoted $D_{RS}$, $D_{CRAN}$, $D_{eNB}$, or $D_{FN}$, respectively. The objective of our proposal is to minimize the total delay, which naturally translates to the efficiency of caching and the availability of FNs. The users requesting videos are designated $u_i \in U$, with the set of users $U = \{u_1, u_2, \ldots, u_k\}$, and the requested videos $v_i \in V$. A subset of videos is cached in the friend nodes, base stations, and CRAN. A requested video that is present at the server may also be present at any of these nodes. The caching at each level follows a Zipf distribution with different exponent values:

$$ \lambda_i = \frac{1/i^{\gamma}}{\sum_{j=1}^{n} 1/j^{\gamma}} \qquad (14.4) $$

The frequency of the $i$th most popular video among $n$ videos is given by $\lambda_i$ in Eq. 14.4.
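To make the caching model concrete, the short sketch below evaluates the Zipf popularity of Eq. 14.4 and picks the most popular videos to cache at an FN. The cache fraction and exponent values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def zipf_popularity(n_videos: int, gamma: float) -> np.ndarray:
    """Eq. 14.4: popularity of the i-th most popular video among n videos."""
    ranks = np.arange(1, n_videos + 1)
    weights = 1.0 / ranks**gamma
    return weights / weights.sum()

# Cache the most popular videos at a Friend Node (illustrative 20% cache size).
n, gamma = 100, 0.8
popularity = zipf_popularity(n, gamma)
fn_cache = list(np.argsort(popularity)[::-1][: n // 5])
print(f"FN caches the top {len(fn_cache)} videos, covering "
      f"{popularity[fn_cache].sum():.2%} of requests")
```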

14.3 Video Dissemination Architecture The popularity of different videos may vary at different locations. For a friend node, the amount of video to be cached depends on the actual user's choice; however, the network operator decides the cache size in base stations and the CRAN. Which videos to cache in a device depends on each stakeholder's interest and the algorithm they use. The Zipf distribution is a good model for estimating the popularity of the videos accessed at any node, and in this work we have assumed a Zipf distribution at each level. Umrao et al. [1] considered that each level (FN, eNB, EPC) is equipped with its own cache storage; in this work, the EPC is replaced with a CRAN core. As shown in Fig. 14.1, the requested videos are searched from the bottom to the top. Whenever a user seeks a desired video, the corresponding mobile node generates the request to the FN, i.e.,


Fig. 14.1 Video dissemination architecture

edge devices. The FN searches for the requested video in its own cache (VFN). The video v is transmitted to the requesting mobile user if it exists in the FN cache; otherwise, the video request is forwarded to the eNB at the edge. Subsequently, if the requested video exists in the eNB/CRAN cache, it is delivered to the respective user through the reverse path. In case the video is not available in the eNB cache, the request is further forwarded to the CRAN. If the demand is not met at any of these hops, the content provider ultimately satisfies it.

Algorithm 1 Video dissemination and cache update
for u ∈ U with video request v ∈ V do
  if v ∈ VFN then
    Disseminate the requested video to requester (u)
  else if v ∈ VeNB or v ∈ VCRAN then
    Update(VFN)
    Serve the video to requester (u)
  else
    Recalculate popularity (VCRAN)
    Recalculate popularity (VeNB)
    Recalculate popularity (VFN)
    Video is served by the content provider
  end if
end for
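A minimal Python sketch of Algorithm 1's lookup chain, assuming simple set-based caches; the update shown (inserting into the FN cache on a hit at a higher tier) is a simplification of the ICN-based update the chapter describes.

```python
def disseminate(video, fn_cache, enb_cache, cran_cache):
    """Serve a request by searching FN -> eNB -> CRAN -> content provider."""
    if video in fn_cache:
        return "FN"                      # served directly at the D2D edge
    if video in enb_cache or video in cran_cache:
        fn_cache.add(video)              # Update(VFN) during transmission
        return "eNB" if video in enb_cache else "CRAN"
    # Miss at every hop: the content provider serves the request, and the
    # caches would recalculate their popularity indices at this point.
    return "content provider"

fn, enb, cran = {"v1"}, {"v1", "v2"}, {"v1", "v2", "v3"}
print(disseminate("v2", fn, enb, cran))  # eNB; v2 is now also cached at the FN
```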


In parallel, the caches at the FN, eNB, and CRAN are also updated during the video dissemination process. Simultaneous cache updating improves network efficiency; in this work, we use an ICN model for this task. The cost calculation for obtaining a video from an FN follows the auction methods proposed by Umrao et al. [1]. As delineated in Algorithm 1, if the requested video is found in the eNB but not in the FN cache, the FN updates its own cache (VFN) during the transmission. Likewise, the eNB, CRAN, and FN update their respective caches during transmission (Update(VeNB), Update(VCRAN), and Update(VFN)) whenever the requested video is found anywhere but the remote server. The algorithm is similar to the one proposed by Umrao et al. [1]; however, in our work the video dissemination algorithm is naturally followed by an ICN model.

14.4 Simulation and Results We use the ns3 simulator with ndnSIM to evaluate our proposal. We consider N = 100 mobile users and M = 10 FNs uniformly distributed over a circular area of radius 500 m to emulate the D2D network topology. The path loss of the cellular link is computed using the COST-231 Hata model for the eNB. The small-cell radio antennas with a CRAN model are implemented in parallel, with parameters following [5]. The D2D communication's path loss follows a modified short-range communication model as in [6]. The simulation parameters are listed in Table 14.1.

Table 14.1 Simulation parameters
Path loss for LTE: 36.7 + 35 log2 d [7]
Path loss for D2D link: 31.54 + 40 log2 d [6]
LTE transmit power: 23 dBm [7]
D2D transmit power: 10 dBm [6]
Backhaul link rate: 10 Mbps
Core delay: 7 ms
Backhaul delay: 10 ms
Stream interval time: 0.1 s
Stream frame size: 100 bytes
5G channel bandwidth: 520 MHz
LTE channel bandwidth: 20 MHz
Device antenna gain: 4 dBi
eNB antenna gain: 9 dBi
Penetration loss: 20 dB
Path loss compensation: 3.8

We have compared our proposed work with the previous work of Umrao et al. [1]. The simulation result is shown in Fig. 14.2. Depending on the cache size used at the FN nodes and the subsequent edge devices, the access delay of the video contents could be improved. We have considered the D2D edge (FN caches) to cache 10-40% of the total videos from set V, while the eNB and CRAN cache 15% of the total requested videos. As shown in Fig. 14.2, the content access delay can be improved by up to 60% using the ICN model with the FN cache at the D2D edge.

Fig. 14.2 Simulation result: access delay (ms) versus percentage of cached data in the D2D edge (FN cache), comparing FN Caching with ICN against FN Caching

14.5 Conclusion In this paper, we added and tested a collaborative Information Centric Network based caching scheme on top of the existing Friend Node (FN) caching algorithm to enhance the video dissemination process in the edge network. Our simulation was extended to a 5G network with a CRAN cache. The simulation results show that the content access delay of video content can be improved by up to 60% when 40% of the requested videos are available at the edge network. However, in an actual scenario, the improvement in access delay also depends on the efficiency of the caching scheme used at the different nodes.

References
1. Umrao, S., Roy, A., Saxena, N.: Device to device communication from control and frequency perspective: a composite review. IETE Tech. Rev. 1–12 (2016)
2. Koskela, T., Hakola, S., Chen, T., Lehtomaki, J.: Clustering concept using device-to-device communication in cellular system. In: IEEE WCNC, pp. 1–6 (2010)
3. Golrezaei, N., Dimakis, A.G., Molisch, A.F.: Device-to-device collaboration through distributed storage. In: IEEE GLOBECOM, pp. 2397–2402 (2012)
4. Wang, J., Jiang, C., Bie, Z., Quek, T.Q.S., Ren, Y.: Mobile data transactions in device-to-device communication networks: pricing and auction. IEEE Wirel. Commun. Lett. 5(3), 300–303 (2016)


5. Saxena, N., Roy, A., Sahu, B.J.R., Kim, H.: Efficient IoT gateway over 5G wireless: a new design with prototype and implementation results. IEEE Commun. Mag. 55(2), 97–105 (2017)
6. Wen, S., Zhu, X., Zhang, X., Yang, D.: QoS-aware mode selection and resource allocation scheme for Device-to-Device (D2D) communication in cellular networks. In: IEEE ICC, pp. 101–105 (2013)
7. Jo, M., Maksymyuk, T., Strykhalyuk, B., Cho, C.H.: Device-to-device-based heterogeneous radio access network architecture for mobile cloud computing. IEEE Wirel. Commun. 22(3), 50–58 (2015)

Chapter 15

Groundwater Monitoring and Irrigation Scheduling Using WSN
S. Gilbert Rozario and V. Vasanthi

Abstract Water resources are an important factor in agriculture and food production, and groundwater is a critical component of water resources in sub-humid and arid climates. Nowadays, aquifer sustainability has become a big challenge because of climate change, population growth, and industrial development, which drive the pumping of more groundwater out of these reservoirs and lead to their slow desiccation. Groundwater monitoring is therefore critical for the development of the economy and ecology. Crop water demand varies with crop requirements and growth stages, yet groundwater is often pumped in excess of the crop requirement. In this study, we combined simulated crop water demand and sensor-based groundwater level data with a wireless sensor network and a Web-based dashboard for decision-making. An automated and user-friendly groundwater monitoring and decision support system was developed for capturing real-time data on crop water demand, the irrigation schedule, and groundwater discharge and recharge; the system helps maintain the groundwater reservoirs and reduces electrical power consumption, with the added advantages of open-source data availability, data cleaning, and visualization. The crop water demand is estimated and cross-checked against the water available in the aquifer; whenever there is a demand, the decision support system notifies the users. Irrigation scheduling depends on the groundwater level.

15.1 Introduction Groundwater management depends heavily on data relating to changes in groundwater levels and capacity. However, groundwater pumping and recharging data have traditionally been collected monthly or yearly; therefore, there is a lack of hourly or daily groundwater


discharge and recharge data [1]. Groundwater level monitoring shows how the aquifer level changes when groundwater is pumped compared with static conditions, and how surface development influences the groundwater level and the aquifer. It also gives a better understanding of how nearby springs, surface water, and groundwater interact. Excessive groundwater pumping can lower the water table and deplete aquifer stores; as the water table sinks lower, more energy is needed to pump groundwater to the surface, which means that as groundwater levels drop, the cost of water increases. Knowing the groundwater levels helps decision-makers determine how much groundwater can be pumped safely, without destructive effects on the aquifer. A water level sensor installed in a bore well or canal can measure high-frequency groundwater level data at 30-min intervals and retrieve monthly or yearly groundwater level data if required. These data are used for more detailed scientific analysis of aquifer characteristics, the impact of neighboring pumping wells, and changes in aquifer storage [2]. The groundwater level sensors are connected via a telemetry optic cable to the data logger, which transmits the data to the base station with the help of a wireless sensor network; the base station then transmits the data to the server, where they are stored in a database. In-person data access is not required, as the data are transferred to the server automatically, and end-users can access them anytime, from anywhere [3]. Such systems are widely used for groundwater level monitoring, timing of recharge and discharge, energy consumption, crop management, ecology, etc. [4]. In this study, we designed and completed a solution for optimal utilization of groundwater. Groundwater level sensors monitor the real-time groundwater level, and the system calculates the amount of water pumped from the well, the time taken for recharge and for discharge, and the water available in the aquifer. Alongside the groundwater data monitoring, the system also calculates the crop water demand. At the time of irrigation, the system checks the water availability in the aquifer: if the water in the aquifer matches the crop requirement, the system initiates the irrigation; otherwise, it triggers an alarm to the end-user that water is not available in the aquifer. This study also enables a detailed study of the aquifer, crop requirements, energy consumption, etc.

15.2 Materials and Method 15.2.1 System Design The real-time groundwater monitoring and management system involved the use of sensor technology, wireless network communication, crop water demand estimation, and information technology. A groundwater level sensor was installed in the well to observe groundwater level data. The observed data first underwent a basic validation process in the data logger, and the validated data were then transmitted to the Web server through the


wireless sensor network architecture. Basic weather parameters such as temperature, relative humidity, rainfall, and solar radiation were collected from the weather station sensors and transmitted to the Web server. The data were stored in the database for further analysis and groundwater management, and the land and crop information were loaded into the database. The Web server estimated the crop water demand using the Penman-Monteith method, while the groundwater level data were used to understand the aquifer storage and the groundwater discharge and recharge. Depending on the groundwater level and the crop water demand, the system intelligently transmitted the instructions for the next irrigation schedule.

15.2.2 Data Collection The groundwater level sensor was installed in the bore well at a depth of 121 m from the ground surface. The sensor could record groundwater temperature, barometric pressure, and sensor drive pressure, which were used to calculate groundwater levels. The sensor recorded data at 30-min intervals and transmitted them to the data logger connected to the sensor node. The groundwater level sensor collected around 2 years of data, from October 2017 to October 2019.

15.2.3 Hardware The groundwater level monitoring used an In-Situ Inc. Rugged 100 pressure transducer sensor to measure the water pressure in the groundwater, with the observed data transmitted to the base station over a wireless sensor network. The In-Situ Inc. Rugged 100 data logger has a storage capacity of 120,000 records; it measures water pressure in kPa, bar, mbar, and mmHg, and water level in inches, feet, and millimeters. A Raspberry Pi 4 Model B (Broadcom BCM2711, quad-core Cortex-A72 (ARM v8) microprocessor) mini computer was used as the base station. A Zigbee XBee S2C 802.15.4 module was used for transmitting data from the data logger to the base station, and a Campbell Scientific CR300 data logger was used for collecting weather data and transmitting it to the base station. All the data collected at the base station were transmitted to the Web server using a SIM 80001 GSM GPRS module.

15.2.4 Software and Database The software used in the data logger is a C program for reading the groundwater level sensor nodes. Python is used in the Raspberry Pi base station to collect data from the sensor nodes and validate them. An SQLite database is


used for data storage in the base station. PHP is used for the Web server, and MySQL is used for the server database.

15.2.5 Groundwater Level Measurement and Pumping Time A groundwater level sensor is a pressure transducer that converts mechanical pressure into an electrical signal. The suspended sensor measures the water pressure in the well, from which the water height is calculated: the hydrostatic pressure acting on the transducer at its zero point corresponds to the water level above the sensor (L), the pressure acting on the water surface is the barometric or atmospheric pressure (B), and t is the time of observation. The suspended sensor records the raw data. An increase in atmospheric pressure exerts a downward force on the water in the well, and a decrease exerts an upward force; the atmospheric pressure is therefore measured separately and deducted from the pressure of the water column above the sensor to give the compensated pressure Z [1] (Table 15.1):

$$ Z(t) = L(t) - B(t) \qquad (15.1) $$

The water level height (GWL) is measured from the surface of the groundwater to the bottom of the well [5].

Table 15.1 Symbols used in the equations and their definitions
t: Time of observation
t − 1: Previous observation
g: Local gravity (e.g., standard = 9.80665 m s−2)
ρ: Density of liquid
ρ0: Density of fresh water (1000 kg m−3 at 4 °C)
SG: Specific gravity of liquid (e.g., fresh water = 1)
HP: Horse power
T: Pumping duration (hrs)
ET0: Reference evapotranspiration (mm day−1)
Rn: Solar radiation (MJ m−2 day−1)
G: Soil temperature (MJ m−2 day−1)
Tem: Temperature (°C)
u2: Wind speed (m s−1)
RH: Relative humidity
Δ: Slope of the vapor pressure curve (kPa °C−1)
γ: Psychrometric constant (kPa °C−1)


$$ \mathrm{GWL}(t) = \frac{Z(t)}{\rho \times g} \qquad (15.2) $$

$$ \rho = \rho_0 \times SG \qquad (15.3) $$

The depth of water pumping (H) is the distance between the water level height and the bottom of the well:

$$ H(t) = \text{total well depth} - \mathrm{GWL}(t) \qquad (15.4) $$

Pumping water from the well affects aquifer storage; during pumping, a decrease in aquifer water storage is generally observed. The pumping discharge (Q) is the rate of water pumping from the well, measured in litres per second (HL denotes the head loss):

$$ Q(t) = \frac{HP \times 75}{H(t) + HL} \qquad (15.5) $$

$$ \text{Volume of water pumped (m}^3\text{)} = \frac{Q(t) \times T \times 3600}{1000} \qquad (15.6) $$

Groundwater monitoring is an important factor in maintaining groundwater resources. During pumping, water is removed from the well, so water surrounding the well slowly moves toward the well and refills it. When the rate of water pumping from the well (discharge) is greater than the natural refilling of groundwater, it may cause depression or low pressure in the well. To maintain the groundwater resource, the water discharge time and water recharge time have to be tracked properly, using indicators derived from successive depth readings:

$$ \text{Water discharge time: } D = 1 \text{ if } H(t) > H(t-1), \text{ else } D = 0 \qquad (15.7) $$

$$ \text{Water recharge time: } R = 1 \text{ if } H(t) < H(t-1), \text{ else } R = 0 \qquad (15.8) $$

In India, over 60% of the total cultivable area depends on groundwater for irrigation. Irrigation and rainfall are usually measured in millimeters (mm): a 1 mm depth of water equals 1 L applied to 1 square meter, so applying 1 mm of irrigation over one hectare (10,000 m²) of agricultural land requires 10,000 L of water. The pumping time is calculated from the irrigated area (A) in square meters and the irrigation depth, i.e., the crop water demand (I), in millimeters:

$$ \text{Pumping time (minutes)} = \frac{A \times I}{Q(t) \times 60} \qquad (15.9) $$
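The measurement chain of Eqs. 15.1-15.6 and 15.9 reduces to a few arithmetic steps. Below is a minimal sketch under those equations; the sensor readings, pump horsepower, and head-loss value are illustrative assumptions, not values from the study.

```python
G_STANDARD = 9.80665        # local gravity, m/s^2 (Table 15.1)
RHO_FRESH = 1000.0          # density of fresh water, kg/m^3

def water_level(L_kpa: float, B_kpa: float, sg: float = 1.0) -> float:
    """Eqs. 15.1-15.3: height of water above the sensor, in metres."""
    z_pa = (L_kpa - B_kpa) * 1000.0          # barometric compensation, kPa -> Pa
    return z_pa / (RHO_FRESH * sg * G_STANDARD)

def pumping_depth(total_depth_m: float, gwl_m: float) -> float:
    """Eq. 15.4: depth from which water is pumped."""
    return total_depth_m - gwl_m

def discharge_lps(hp: float, h_m: float, hl_m: float) -> float:
    """Eq. 15.5: pumping discharge in litres per second."""
    return (hp * 75.0) / (h_m + hl_m)

def pumping_minutes(area_m2: float, irrigation_mm: float, q_lps: float) -> float:
    """Eq. 15.9: pumping time needed to apply a given irrigation depth."""
    return (area_m2 * irrigation_mm) / (q_lps * 60.0)

gwl = water_level(L_kpa=250.0, B_kpa=101.3)            # illustrative readings
h = pumping_depth(121.0, gwl)
q = discharge_lps(hp=5.0, h_m=h, hl_m=2.0)
print(f"GWL {gwl:.1f} m, pump {pumping_minutes(1000, 10, q):.0f} min for 10 mm")
```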


15.2.6 Crop Water Demand The crop water demand (CWR) is the amount of water consumed by the crop through evapotranspiration (ETc) over the various crop growth stages [6], expressed as a depth of water in millimeters. In this study, we used the Penman-Monteith equation to calculate evapotranspiration [6]:

$$ ET_c = k_c \times ET_0 \qquad (15.10) $$

$$ ET_0 = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\frac{900}{T_{em} + 273}\,u_2\,(RH/100)}{\Delta + \gamma\,(1 + 0.34\,u_2)} \qquad (15.11) $$

The reference crop evapotranspiration (ET0) is calculated from meteorological data [6] provided by the weather station, and kc is the crop coefficient. All the weather parameters are recorded by the weather sensors and transmitted to the base station, which in turn transmits them to the Web server, where they are stored in the database along with the crop and land information. A crop water demand model in the Web server calculates the CWR using the weather parameters from the weather sensors [6] and the crop and land information from the database.
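A sketch of Eqs. 15.10-15.11 as reconstructed above; note that the (RH/100) humidity term follows this chapter's variant of the Penman-Monteith equation rather than the standard FAO-56 vapour-pressure-deficit form, and all sample inputs are illustrative assumptions.

```python
def reference_et0(delta, rn, g, gamma, tem, u2, rh):
    """Eq. 15.11 (chapter's Penman-Monteith variant), mm/day."""
    numerator = (0.408 * delta * (rn - g)
                 + gamma * (900.0 / (tem + 273.0)) * u2 * (rh / 100.0))
    denominator = delta + gamma * (1.0 + 0.34 * u2)
    return numerator / denominator

def crop_et(kc: float, et0: float) -> float:
    """Eq. 15.10: crop evapotranspiration via the crop coefficient."""
    return kc * et0

et0 = reference_et0(delta=0.12, rn=18.0, g=0.5, gamma=0.066, tem=28.0, u2=2.0, rh=65.0)
print(f"ET0 = {et0:.2f} mm/day, ETc (kc=1.05) = {crop_et(1.05, et0):.2f} mm/day")
```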

15.2.7 Irrigation Water Requirements The soil water balance is calculated through the CWR for the effective root depth. The soil water deficit (SWD) is measured as the millimeters of water required to bring the soil water content of the root zone up to field capacity. Irrigation (I) and rainfall (R) are the sources that increase the soil water content. Water loss (WL) is the amount of water that cannot be consumed by the crop [6]: at the time of irrigation, some water flows below the crop root depth (deep percolation) and some flows from the soil surface to lowland areas. The soil water deficit and water loss are calculated from the planting date to the harvest date [6]:

$$ I(i) = \mathrm{SWD}(i-1) + \mathrm{ET}(i) - R(i) \qquad (15.12) $$

$$ \mathrm{WL}(i) = R(i) + I(i) - \mathrm{SWD}(i-1) - \mathrm{ET}(i) \qquad (15.13) $$

$$ \mathrm{SWD}(i) = \mathrm{SWD}(i-1) + \mathrm{ET}(i) + \mathrm{WL}(i) - R(i) - I(i) \qquad (15.14) $$
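Eqs. 15.12-15.14 amount to a daily bookkeeping loop. The sketch below runs the balance over illustrative daily ET and rainfall series (assumed values, not from the case study), irrigating only when the deficit calls for it.

```python
def run_water_balance(et_mm, rain_mm):
    """Daily soil water balance per Eqs. 15.12-15.14 (all values in mm)."""
    swd, schedule = 0.0, []
    for et, rain in zip(et_mm, rain_mm):
        irrigation = max(swd + et - rain, 0.0)          # Eq. 15.12
        loss = max(rain + irrigation - swd - et, 0.0)   # Eq. 15.13: percolation/runoff
        swd = swd + et + loss - rain - irrigation       # Eq. 15.14
        schedule.append(irrigation)
    return schedule

et = [4.8, 5.1, 4.9, 5.0]       # illustrative daily crop ET
rain = [0.0, 12.0, 0.0, 0.0]    # illustrative daily rainfall
print(run_water_balance(et, rain))  # irrigate on dry days, skip after rain
```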


15.2.8 Energy Consumption The pump is the most important element of an irrigation system; if it is not managed effectively, it will increase irrigation costs and waste electricity. The pump has to be chosen according to the water requirement and the structure of the well. When the groundwater level decreases, energy consumption increases, so the pump requires continuous maintenance and monitoring.

$$ \text{Electrical energy required for pump} = \text{Power (1 HP = 746 W)} \times \text{Time} \qquad (15.15) $$

$$ \text{Energy consumed (kWh)} = 0.746 \times HP \times T \qquad (15.16) $$

$$ \text{Cost of water pumped} = \frac{\text{Energy consumed}}{\text{Volume of water pumped (m}^3)} \qquad (15.17) $$
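Eqs. 15.15-15.17 in code form, as a minimal sketch with illustrative pump parameters (assumed, not from the study).

```python
def energy_kwh(hp: float, hours: float) -> float:
    """Eqs. 15.15-15.16: electrical energy consumed by the pump."""
    return 0.746 * hp * hours

def cost_per_m3(hp: float, hours: float, q_lps: float) -> float:
    """Eq. 15.17: energy consumed per cubic metre of water pumped."""
    volume_m3 = (q_lps * hours * 3600.0) / 1000.0   # Eq. 15.6
    return energy_kwh(hp, hours) / volume_m3

print(f"{energy_kwh(5.0, 2.3):.2f} kWh, {cost_per_m3(5.0, 2.3, 3.5):.3f} kWh/m^3")
```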

15.2.9 Process Flow Diagram See Fig. 15.1.

Fig. 15.1 Process flow diagram explaining the communication between the sensor nodes and the server, data transmission, and the receipt of instructions from the server


15.2.10 Dashboard The dashboard ran on an Apache Web server; free and open-source libraries extended the PHP scripts to develop a user-friendly and attractive dashboard. The Dygraph JavaScript framework was used for interactive bar charts and line graphs.

15.3 Results The real-time groundwater level monitoring system, with readings every 30 min, enabled a complete study of aquifer storage, discharge, and recharge. The fully automated groundwater pressure transducer installed in the well allowed the focus to be on data analysis rather than on data collection and management. The wireless sensor network carried the collected sensor data onward for visualization in the dashboard (Fig. 15.2).

15.3.1 Discharge and Recharge of Groundwater Groundwater discharge and recharge were monitored frequently for a complete study of the aquifer, and the discharge-recharge ratio was maintained properly: if groundwater discharge exceeds groundwater recharge, it disturbs aquifer storage and leads to exploitation of groundwater resources (Table 15.2). In this study, the groundwater level was monitored for 2 years, with the level recorded and the discharge-recharge ratio calculated every 30 min.

Fig. 15.2 Groundwater level sensor data collected over 2 years: barometric pressure in green, sensor pressure in blue, and water level in metallic seaweed color


Table 15.2 Groundwater discharge and recharge

           Average time (hrs) | Maximum time (hrs) | Minimum time (hrs) | Number of times | Days
Discharge: 2.3 | 6.5 | 0.5 | 104 | 692
Recharge:  3.5 | 9.5 | 0.5 | 102 | 692

Fig. 15.3 Groundwater level, time taken for discharge, and time taken for recharge

Groundwater discharge and recharge were monitored for 692 days. The ratio between discharge and recharge was well maintained, and so was the groundwater level (Fig. 15.3).

15.3.2 Electricity Consumption Electricity consumption depends on the volume of water pumped from the groundwater, and it influences the cost of production in agriculture. When the groundwater level goes down, more electrical energy is required to pump the water from the well; when it goes up, less energy is required. In the present study, an average of 4.54 kWh of electricity was required per day, with a maximum of 14.54 kWh and a minimum of 1.11 kWh consumed in a day. A total of 472.21 kWh of electricity was required over the 2 years.


Fig. 15.4 Balancing crop water demand with rainfall and irrigation

15.3.3 Crop Water Demand Crop water demand, the amount of water required to replace the water lost by the crop through ET, is an important factor in irrigation scheduling. Balancing the total crop water demand against rainfall and irrigation shows how much water must be supplied through irrigation to satisfy the crop's needs. We calculated the crop water demand for soybean from June to October. Based on the weather parameters, around 479 mm of water was required for evapotranspiration. This particular season received around 1164 mm of rainfall in 154 days, 126 mm of water was applied as irrigation to meet the crop water demand, and 801 mm was lost due to heavy rainfall (Fig. 15.4). Water that cannot be consumed by the crop is counted as water loss. In this study, we analyzed irrigation scheduling so as to minimize water loss: irrigation is based on the amount of water required by the crop and the amount of water the soil can hold, and this type of irrigation reduces water loss (Fig. 15.5). We also examined the relationship of water loss with rainfall and irrigation and analyzed how to reduce water loss at the time of irrigation.

15.4 Conclusion Groundwater is a major source of good-quality fresh water, and groundwater levels can decline day by day because of poor groundwater management. In the existing approach, there is no proper groundwater monitoring or understanding of crop water demand. In the present study, an automated system providing a complete solution against groundwater exploitation was developed. The wireless sensor network enabled a complete study of aquifer storage, the discharge ratio, the recharge ratio, energy consumption, and weather conditions; it recorded all sensor data and transmitted them to an automated system on the server for analysis. The automated system calculated the amount of water in aquifer storage and the crop water demand. If the required water was available in the aquifer, instructions were sent to


Fig. 15.5 Water loss compared with rainfall and irrigation

the base station specifying the volume of water to be pumped from the well for irrigation without exploiting the groundwater. The system also focused on reducing water loss and energy consumption. By calculating the evapotranspiration, the system can estimate the exact water requirement of the crop depending on the crop growth stage and weather conditions, and it instructs the irrigation system to irrigate based on crop water demand, thereby reducing water loss. Irrigation also depends on the groundwater level: if the groundwater level is very low, pumping consumes more electrical energy, so the system waits for groundwater recharge to save electricity.

References
1. Calderwood, A.J., Pauloo, R.A., Yoder, A.M., Fogg, G.E.: Low-cost, open source wireless sensor network for real-time, scalable groundwater monitoring. Water 12, 1066 (2020)
2. Porter, J., Arzberger, P., Braun, H.-W., Bryant, P., Gage, S., Hansen, T., Hanson, P., Lin, C.-C., Lin, F.-P., Kratz, T.: Wireless sensor networks for ecology. Bioscience 55, 561–572 (2005)
3. Department of Water Resources California Data Exchange Center-Reservoirs, https://cdec.water.ca.gov/reservoir.html
4. Smith, S.: Data services field engineer, In-Situ Inc., manual level mode correction for vented sensors
5. Emiliawati, A.: A study of water pump efficiency for household water demand at Lubuklinggau. In: AIP Conference Proceedings, vol. 1903, pp. 100003-1–100003-10. https://doi.org/10.1063/1.5011613
6. Pereira, L.S., Alves, I.: Crop water demands. Elsevier Ltd. (2005)

Chapter 16

A Soft Coalition Algorithm for Interference Alignment Under Massive MIMO
Wanying Guo, Hyungi Jeong, Nawab Muhammad Faseeh Qureshi, and Dong Ryeol Shin

Abstract With the development of wireless communication technologies such as multiple-input multiple-output (MIMO), transmission rates are becoming ever faster. Interference alignment (IA) (Olwal et al. in IEEE Commun Surv Tutorials 18:1656–1686 [1]), an important interference management strategy, is widely used to eliminate interference, since interference is the main barrier limiting system performance. Considering the advantages and disadvantages of the graph partitioning algorithm and the coalition clustering algorithm, we propose a new algorithm named the Soft Coalition Algorithm. It redefines the edge weights in the graph, designs system performance indicators using the cluster-size soft constraint clustering algorithm as the heuristic, and balances the pre-clustering algorithm with a greedy strategy, so that the feasibility conditions of IA are more easily satisfied within each cluster.

16.1 Introduction In the 5G era, with the trend toward high-speed data transmission and dense regional base stations, the multiple-input multiple-output (MIMO) technique is used to transmit signals independently over a large number of antennas at the receiving terminal. Data throughput and transmission distance can thereby be greatly increased without increasing the bandwidth or the total transmission power. However, applying plain interference alignment under massive MIMO interference channels can no longer maximize system performance.


The existing clustered interference alignment algorithms can be mainly divided into graph partitioning methods and coalition-based clustering methods, both of which have their own problems. The clustering algorithms based on graph partitioning have good complexity and system performance, but they do not consider the channel state information (CSI) overhead; as we know, besides throughput, CSI is also an important indicator for evaluating channel quality. Meanwhile, the coalition-game clustering algorithm is comprehensive, but its model is highly complex and converges with difficulty. In practice, it is unnecessary to eliminate all the interference, and clustered interference alignment was first proposed and applied to ad hoc networks in [2]. There are two classic clustering IA techniques: the graph partitioning model and the coalition model. The soft constraint method is a good graph partitioning method with low computational complexity and reasonable throughput, though it ignores CSI, while the coalition clustering algorithm reaches the best clustering results but at high computational complexity, making it unsuitable for fast clustering scenarios. Therefore, this paper targets both the system performance and the CSI overhead of the clustered IA algorithm: by optimizing the clustering algorithm, we balance system performance and CSI overhead. This paper is organized as follows. Section 16.2 presents the basic IA system and the performance estimation parameters considered. Section 16.3 gives the model of the proposed algorithm and its specific steps, followed by the experimental results in Sect. 16.4; a conclusion is drawn in Sect. 16.5.

16.2 System Model An illustration of the system model is shown in Fig. 16.1. We consider the K single-user-cell MIMO interference channel. The number of transmit antennas at each base station is M, and the number of receive antennas at each user is N. For ease of description, we assume the number of transmitted data streams is d = 1, so the whole system can be written as (M × N, 1)^K. In this system, the number of clusters N_A can be obtained from Eqs. (16.1) and (16.2) [3]:

$$ L_{\max} = M + N - 1 \qquad (16.1) $$

$$ N_A = \frac{K}{L_{\max}} \qquad (16.2) $$

If the transmit power is $P_j$, the precoding matrix is $v_j$ and the receiving filter matrix at the receiver is $u_j$. Each element of the channel $H_{ji}$ follows a zero-mean circularly symmetric Gaussian distribution [3], and the Additive White Gaussian Noise power is $\sigma^2$. To ensure fairness, $\|v_j\| = 1$ and $\|u_j\| = 1$ [4]. Then the system throughput $t$ can be summarized as:

$$ t = \sum_{j=1}^{K} \log_2\!\left( 1 + \frac{P_j \left| u_j^{H} H_{jj} v_j \right|^2}{\sigma^2 + \sum_{i \neq j} P_i \left| u_j^{H} H_{ji} v_i \right|^2} \right) \qquad (16.3) $$

Fig. 16.1 Illustration of the system model
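A sketch evaluating Eq. 16.3 for randomly drawn channels, assuming d = 1 streams so that precoders and filters are unit-norm vectors; the power and size values are illustrative. This only computes the sum rate, it does not design the IA solution.

```python
import numpy as np

def system_throughput(H, v, u, P, sigma2=1.0):
    """Eq. 16.3: sum rate over K users. H[j][i] is the N x M channel
    from transmitter i to receiver j; v, u are unit-norm vectors."""
    K = len(v)
    rate = 0.0
    for j in range(K):
        signal = P[j] * abs(u[j].conj() @ H[j][j] @ v[j]) ** 2
        interference = sum(P[i] * abs(u[j].conj() @ H[j][i] @ v[i]) ** 2
                           for i in range(K) if i != j)
        rate += np.log2(1.0 + signal / (sigma2 + interference))
    return rate

rng = np.random.default_rng(0)
K, M, N = 4, 2, 2
H = [[(rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
      for _ in range(K)] for _ in range(K)]
unit = lambda x: x / np.linalg.norm(x)
v = [unit(rng.standard_normal(M) + 1j * rng.standard_normal(M)) for _ in range(K)]
u = [unit(rng.standard_normal(N) + 1j * rng.standard_normal(N)) for _ in range(K)]
print(f"sum rate: {system_throughput(H, v, u, P=[10.0] * K):.2f} bit/s/Hz")
```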

Global channel information, namely the global CSI, is required for designing both the precoding and the receiving filter matrices, regardless of channel reciprocity. Since each sender needs to align its transmitted signal with the remaining K − 1 interference spaces, the CSI overhead is proportional to the square of the number of users K.

16.3 Proposed Model In this section, we use the soft constraint algorithm [5] as the heuristic for our new algorithm. On this basis, we borrow the CSI overhead quantization method from the coalition-game cluster-dividing algorithm, so that our soft coalition algorithm not only inherits the low complexity of the soft constraint algorithm but also takes the CSI overhead into account. Through the soft constraint algorithm of [6], we can obtain a cluster-dividing scheme $A = \{A(1), A(2), \ldots, A(N)\}$ in which $|A(m)| > L_{\max}$ may hold for some $A(m) \subseteq A$.

Definition 1 The set of oversized clusters is $S^+ = \{A(m) : |A(m)| > L_{\max}\}$; if every cluster meets the size requirement imposed by $L_{\max}$, then $|S^+| = 0$. The set of feasible clusters is $S^- = \{A(m) : |A(m)| \leq L_{\max}\}$.


Cluster dividing is a system-level operation that can only use statistical CSI. If the user operations occur at the millisecond level, the cluster division will hardly change over such a time interval [7, 8]. Thus, we define the effective throughput as follows to measure system performance during clustering:

$$ t_{\mathrm{effec}} = \sum_{j=1}^{K} \log_2\!\left( 1 + \frac{\rho_{jj} P_j}{1 + F(A(m_j)) + \sum_{k \notin A(m_j)} \rho_{jk} P_k} \right) \qquad (16.4) $$

$$ F(A(m_j)) = \begin{cases} \sum_{i \in A(m_j)} \rho_{ji} P_i, & |A(m_j)| > L_{\max} \\ 0, & |A(m_j)| \leq L_{\max} \end{cases} \qquad (16.5) $$

From the above, the necessary and sufficient conditions for the soft coalition algorithm to act are $|S^+| > 0$ and $t'_{\mathrm{effec}} > t_{\mathrm{effec}}$. Therefore, the problem of maximizing the effective throughput can be translated to:

$$ \max t_{\mathrm{effec}} \quad \text{s.t.} \; |A(m)| > 0, \; m = 1, \ldots, K \qquad (16.6) $$

The resulting procedure is summarized in Table 16.1.

Table 16.1 Soft coalition algorithm
1. Through the soft constraint algorithm, obtain the pre-cluster scheme A
2. Obtain the collections S+ and S− from scheme A, the collection F of cells that can be split, and the current effective throughput t_effec
3. if |S+| = 0
4.   Print the current cluster scheme A
5. else
6.   Move a cell from F to the cluster in S− with the nearest center; the split cell is removed from its original cluster and added to the new target cluster. Calculate the resulting effective throughput t′_effec
7. if t′_effec > t_effec
8.   Update the cluster scheme A, the collections S+ and S−, the collection F, and the current effective throughput t_effec
9. else
10.  Print the cluster scheme A as it was before the soft coalition operation
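A condensed Python sketch of the Table 16.1 loop, assuming a pre-clustering result is supplied and that `effec_throughput` is a caller-provided function implementing Eq. 16.4; cluster centres and distances are simplified to 2-D coordinates. All names here are hypothetical helpers, not the authors' code.

```python
import numpy as np

def soft_coalition(clusters, positions, l_max, effec_throughput):
    """Split oversized clusters (|S+| > 0) into the nearest feasible cluster,
    keeping a move only if it raises the effective throughput (Table 16.1)."""
    t_best = effec_throughput(clusters)
    while True:
        s_plus = [c for c in clusters if len(c) > l_max]   # oversized clusters
        s_minus = [c for c in clusters if len(c) < l_max]  # clusters with room
        if not s_plus or not s_minus:
            return clusters, t_best
        cell = s_plus[0][-1]                               # candidate cell to move
        # Choose the feasible cluster whose centre is closest to the cell.
        target = min(s_minus, key=lambda c: np.linalg.norm(
            np.mean([positions[x] for x in c], axis=0) - positions[cell]))
        trial = [[x for x in c if x != cell] for c in clusters]
        trial[clusters.index(target)].append(cell)
        t_trial = effec_throughput(trial)
        if t_trial <= t_best:
            return clusters, t_best        # keep the pre-move scheme
        clusters, t_best = trial, t_trial  # accept the move and continue
```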


16.4 Experiment Results The algorithm we propose is an optimization built on the soft constraint algorithm; therefore, we use the cluster-size soft constraint algorithm as the comparison baseline. All of the following experiments are run under MATLAB. We assume K single-user cells located within a (30 km, 30 km) area. To ensure the validity of the experiments, the positions of all base stations follow a uniform random distribution along the X and Y axes. The cell user corresponding to each base station is uniformly randomly distributed over a circle of radius 1 km centered on the base station. For the wireless links, each base station is assumed to have M transmit antennas and to transmit d = 1 data stream; the large-scale fading factor is α = 3.76 according to IEEE 802.16m. The number of antennas at each user is N. For the backhaul, all base stations are connected to the same central dispatcher and the core network via their respective backhaul links. We performed simulations of the (2 × 2, 1)^10, (4 × 4, 1)^20, and (4 × 4, 1)^50 systems, where the channel elements follow a zero-mean circularly symmetric Gaussian distribution. Normalizing the noise power to σ² = 1, the transmit power P is 10³ W and the SNR is 30 dB. Following [9, 10], we let the cache capacity parameter f0 equal 2. According to Fig. 16.2, our proposed algorithm yields visibly more balanced results than the soft constraint algorithm. For the (2 × 2, 1)^10 system, Eq. (16.1) gives L_max = 3; the soft constraint algorithm exceeds the maximum cluster size, while the soft coalition algorithm does not. For the (4 × 4, 1)^20 and (4 × 4, 1)^50 systems, the percentage of oversized clusters under the soft constraint algorithm is likewise larger than under the soft coalition algorithm. The results in Fig. 16.3 show that: (1) the soft coalition algorithm is a superior cluster-splitting algorithm compared to the soft constraint algorithm under any SNR condition; and (2) as the SNR grows, the advantage of the proposed algorithm over the soft constraint algorithm becomes increasingly evident.

16.5 Conclusion In this paper, we proposed a new algorithm that combines the advantages of the soft constraint algorithm and the coalition clustering algorithm, not only taking CSI into consideration but also reducing the system complexity. The experimental results show that the clustering scheme obtained by the new algorithm is more balanced than that of the soft constraint algorithm, and that the proposed algorithm also has an advantage over the soft constraint algorithm in long-term throughput.


Fig. 16.2 Results of two cluster algorithms under different systems


Fig. 16.3 Effective throughput of the two algorithms in the three systems

References
1. Olwal, T.O., Djouani, K., Kurien, A.M.: A survey of resource management toward 5G radio access networks. IEEE Commun. Surv. Tutorials 18(3), 1656–1686 (2016)
2. Wen, C.-K., Shih, W.-T., Jin, S.: Deep learning for massive MIMO CSI feedback. IEEE Wireless Commun. Lett. 7(5), 748–751 (2018)
3. Brandt, R., Mochaourab, R., Bengtsson, M.: Distributed long-term base station clustering in cellular networks using coalition formation. IEEE Trans. Signal Inf. Process. Over Netw. 2(3), 362–375 (2016)
4. Tresch, R., Guillaud, M.: Performance of interference alignment in clustered wireless ad hoc networks. In: 2010 IEEE International Symposium on Information Theory, pp. 1703–1707. IEEE (2010)
5. Interference alignment: a new look at signal dimensions in a communication network; introduces the concept via a series of simple linear algebraic examples that convey the central idea of interference alignment
6. Chen, S., Cheng, R.S.: Clustering for interference alignment in multiuser interference network. IEEE Trans. Veh. Technol. 63(6), 2613–2624 (2013)
7. Brandt, R., Mochaourab, R., Bengtsson, M.: Interference alignment-aided base station clustering using coalition formation. In: 2015 49th Asilomar Conference on Signals, Systems and Computers, pp. 1087–1091. IEEE (2015)
8. Mochaourab, R., Brandt, R., Ghauch, H., Bengtsson, M.: Overhead-aware distributed CSI selection in the MIMO interference channel. In: 2015 23rd European Signal Processing Conference (EUSIPCO), pp. 1038–1042. IEEE (2015)
9. Lee, J., Han, J.-K., Zhang, J.: MIMO technologies in 3GPP LTE and LTE-advanced. EURASIP J. Wirel. Commun. Netw. 2009, 1–10 (2009)
10. Srinivasan, R.: IEEE 802.16 m evaluation methodology document (EMD). IEEE 802.16 m08/004r3 (2008)

Part III

Optimization and Nature Inspired Methods

Chapter 17

Improved Grasshopper Optimization Algorithm Using Crazy Factor
Paulos Bekana, Archana Sarangi, Debahuti Mishra, and Shubhendu Kumar Sarangi

Abstract This paper presents a new improved grasshopper optimization algorithm using a crazy factor (crazy-GOA). The crazy factor technique introduces sudden direction changes in order to enhance diversity while exploring the entire search space; this method has benefited many modified versions of algorithms by reaching the optimal solution in a smaller number of iterations. This paper shows how the technique can be used to reach the global optimum very quickly, which is a prime requirement of most applications. The experimental analysis is performed using unimodal as well as multimodal standard benchmark functions, and for further verification and validation the new improved algorithm is compared with other popular intelligent algorithms. The test results produced by the modified algorithm are superior to those of the other intelligent algorithms as well as the classical GOA.

17.1 Introduction Optimization is the approach of finding the best solution for complex and large problems. The field of optimization can also simulate the cooperative behaviors of various organisms for solving several engineering problems, typically using a population of simple agents that act together in a given environment with some random mechanisms to accomplish a global objective. An optimization technique is generally verified [1] by the maximization or minimization of several benchmark functions. This task is very essential, as the design variables to be optimized from the possible combinations of inputs have to be evaluated efficiently. The parameter evaluation is also subject to a number of design constraints, and the desired solution has to be established within predefined solution limits. So the effectiveness of an algorithm is generally decided by testing with several


standardized functions so that the proposed technique can overcome a number of constraints while discovering the optimal parameters for difficult and complex engineering problems. Nature-inspired optimization algorithms are a set of solution-searching methodologies derived from natural processes, and they are highly efficient for finding optimal solutions to multi-dimensional and multimodal problems. Many such algorithms have been developed in the last two decades, e.g., artificial bee colony (ABC) [2], particle swarm optimization (PSO) [3], ant colony optimization (ACO) [4], and the firefly algorithm (FA) [5], and there are quite a few differences among them. The grasshopper optimization algorithm is a nature-inspired algorithm governed by the natural behavior of swarms, proposed by Saremi et al. in 2017 [6]. The algorithm makes progress in accordance with the information from a number of available solutions. It has two important technical phases during the searching process: exploration and exploitation [7]. Exploration is the capability to consider various areas in order to find the promising regions that contain the global optimum. Exploitation is the capacity to concentrate the search around a previously located optimum position. The major objective of the proposed improved grasshopper algorithm is to introduce random diversity during exploration through the crazy factor, which has already been utilized with some previous algorithms; the crazy factor helps improve both exploration and exploitation. The rest of this paper is organized as follows: Sect. 17.2 presents the improved grasshopper optimization algorithm using the crazy factor (crazy-GOA); Sect. 17.3 presents the analysis of the test-function results in comparison with other swarm intelligence algorithms; Sect. 17.4 presents the conclusion.

17.2 Grasshopper Optimization Algorithms This section provides a brief description of the basic GOA along with the improved GOA obtained by adding the crazy factor (crazy-GOA). In crazy-GOA, the improvement is made by incorporating the crazy concept into the normal GOA.

17.2.1 Basic Grasshopper Optimization Algorithm (GOA) Grasshoppers are a very powerful group of insects that can damage agricultural production on a large scale; they are sometimes considered a nightmare for farmers. GOA utilizes the communication patterns of a swarm of individual grasshoppers. There are two categories of grasshoppers, adult and larval, based on the movement they make: the adult phase makes abrupt, long-range movements, while larvae


move slowly in small steps. GOA mathematically models this natural movement of grasshoppers. Three things influence the movement of a grasshopper: the social interaction, denoted $S_i$; the gravity force, denoted $G_i$; and the wind advection, denoted $A_i$. The movement of the $i$th grasshopper is described as follows [6]:

$$ X_i = S_i + G_i + A_i \qquad (17.1) $$

Here, $X_i$ represents the position of the $i$th grasshopper. The social interaction $S_i$ is given by

$$ S_i = \sum_{\substack{j=1 \\ j \neq i}}^{N} s(d_{ij}) \, \frac{x_j - x_i}{d_{ij}} \qquad (17.2) $$

Here, $s$ is the social force function, calculated as

$$ s(r) = f e^{-r/l} - e^{-r} \qquad (17.3) $$

where $d_{ij} = |x_j - x_i|$ is the Euclidean distance between the $i$th and $j$th grasshoppers, and $l = 1.5$ and $f = 0.5$ are the parameters that regulate the social forces. The gravity force is given by

$$ G_i = -g \, \hat{e}_g \qquad (17.4) $$

Here, $g$ is a constant and $\hat{e}_g$ is a unit vector. The wind advection $A_i$ is given by

$$ A_i = u \, \hat{e}_w \qquad (17.5) $$

where $u$ is a constant and $\hat{e}_w$ is a unit vector. Substituting Eqs. 17.2-17.5 into Eq. 17.1, the extended mathematical model is

$$ X_i = \sum_{\substack{j=1 \\ j \neq i}}^{N} s\!\left(\left|x_j - x_i\right|\right) \frac{x_j - x_i}{d_{ij}} - g \, \hat{e}_g + u \, \hat{e}_w \qquad (17.6) $$

However, the above mathematical model is modified since it prevents the algorithm from exploring as well as exploiting around the search spaces for solution. The mathematical model in modified form is as follows [6].

$$ X_i^d = c \left( \sum_{\substack{j=1 \\ j \neq i}}^{N} c \, \frac{ub_d - lb_d}{2} \, s\!\left(\left|x_j^d - x_i^d\right|\right) \frac{x_j - x_i}{d_{ij}} \right) + \hat{T}_d \qquad (17.7) $$

In this case, $ub_d$ and $lb_d$ denote the upper and lower bounds of the $d$th dimension, $\hat{T}_d$ represents the best position obtained so far, and $c$ is the important decreasing coefficient used to shrink the comfort zone, calculated as follows:

$$ c = c_{\max} - t \, \frac{c_{\max} - c_{\min}}{L} \qquad (17.8) $$

Here, $c_{\max}$ denotes the highest value 1, $c_{\min}$ denotes the lowest value 0.0001, $t$ is the current iteration number, and $L$ is the maximum number of iterations.

17.2.2 Improved GOA with Crazy Factor (Crazy-GOA) In crazy-GOA, GOA is modified using a crazy factor. The crazy factor is an improvement method implemented in bird flocking or fish schooling [8]. It is used to make sudden changes in direction and to enhance movement diversity in GOA, which is why the craziness operation is added to the GOA algorithm. The newly proposed integration of the crazy factor with GOA can be expressed as

$$ X_i^d = c \left( \sum_{\substack{j=1 \\ j \neq i}}^{N} c \, \frac{ub_d - lb_d}{2} \, s\!\left(\left|x_j^d - x_i^d\right|\right) \frac{x_j - x_i}{d_{ij}} \right) \mathrm{sign}(r) + \hat{T}_d \,(1 - r) \qquad (17.9) $$

j =1

where r is random number uniformly taken from between 0 and 1.sign(r ) is defined as −1, r ≤ 0.05 sign(r ) = (17.10) 1, r > 0.05 The Fig. 17.1 describes the simple steps for addition of crazy concept in the GOA. The purpose of sign(r) in Eq. 17.10 is used to introduce the change direction of grasshopper position suddenly. The integration of the craziness factor to GOA algorithm produces better result as well as brings good searching ability and reduces time of execution. The improved technique is mostly used many research papers and it improves also the performance of intelligent optimization algorithms. The improved grasshopper optimization algorithm with crazy factor (crazy-GOA) procedure can be described as follows. 1.

Initialization of positions

17 Improved Grasshopper Optimization Algorithm Using Crazy Factor

191

Fig. 17.1 Schema diagram of proposed crazy-GOA

2. 3. 4. 5. 6. 7.

Finding target fitness and respective position For condition termination not end Update the location and normalize the distance Update each position using crazy factor Terminate condition End.

17.3 Result Analysis The experimental analysis used five normal benchmark functions in this paper to evaluate the performance of crazy-GOA. The proposed modified version is compared with its own classical base, i.e., standard GOA along with some of the traditional swarm intelligent algorithms likes ABC, PSO, and FA. The set of parameters used to perform testing is 100 iteration, 100 agents, and 30 dimensions and the same parameters used for the other algorithm. The iteration number is taken as 100 to verify the effectiveness of algorithms in less iteration and it is one of the prime requirements in modern world applications. The test experiment utilized unimodal functions such as fa, f b , f c , and f d as well as multimodal functions are f e and f f . All

192

P. Bekana et al.

Table 17.1 The list used standard benchmark test functions Function Fa (x) = Fb (x) = Fc (x) =

n

2 i−1 x i

n

i−1 |x i | +

n i−1

n

i−1 |x i | 2 i j−1 x j



Fd (d) = maxi {|xi |, 1 ≤ i ≤ n}     −0.2 1 n 2 Fe (x) = −20 exp i−1 x i n    1 n − exp cos(2π x + 20 + e ) i i−1 n   n x1 1 n 2 √ +1 F f (x) = 4000 i−1 x i − i−1 cos 1

Dimension

Range

30

[−100, 100]

30

[−10, 10]

30

[−100,100]

30

[−100, 100]

30

[−32, 32]

30

[−600, 600]

the standard test functions run independently for 30 times. The individual parameters for each algorithm are referred from various standard literatures [4–8]. The result of experiment is presented by means of table and figures. The table includes the most excellent value, worst value in addition to mean value. Sample simulation convergence graph is also presented in different figures to exhibit the improvement in the searching process. Convergences accurateness as well as searching ability of crazy-GOA is better than the other intelligent algorithms for all used test functions in Table 17.1. The searching accomplishment of crazy-GOA is apparently higher than other intelligent algorithms for both unimodal as well as multimodal test functions. It can be seen clearly from the last column of Table 17.2 that the result values exhibited by crazy-GOA are best with the least in cost for all test functions in comparison with other intelligent algorithms. This means that the newly proposed modified version of GOA is cost effective to handle the real problems as well as machine learning applications. The list of Figs. 17.2, 17.3, 17.4, 17.5, 17.6 and 17.7 plotted convergence graph of crazy-GOA which exhibits superior reduced cost when compared to ABC, PSO, FA, and classical GOA itself. So it can be concluded that the new technique crazy-GOA has better convergence accuracy, strong searching ability in addition to precision that extremely significant. In all these convergence profiles, the initial slope of the newly improved algorithm, i.e., crazy-GOA is less but as iterations increases the profile improved at a faster rate which is mainly the advantage of utilization of the crazy concept.

17 Improved Grasshopper Optimization Algorithm Using Crazy Factor

193

Table 17.2 Test result of standard benchmark function to proposed algorithms Functions

Result

ABC

PSO

FA

GOA

Crazy-GOA

Fa

Best

0.00078175

0.00000603

0.0017947

3.3134E-09

3.2156E-11

Fb

Fc

Fd

Fe

Ff

Worst

0.0020168

0.0053852

0.0059972

6.9219E-08

9.2093E-10

Mean

0.00089927

0.00283949

0.00389595

5.11765E-09

6.21245E-11

Best

0.00083382

0.28134

0.0054195

3.9911E-06

2.738E-07

Worst

0.0051558

8.7105

0.061840

1.1212E-04

1.7966E-06

Mean

0.00149481

3.49592

0.00880175

7.60155E-06

1.0352E-06

Best

0.81962

0.00052277

0.0020475

4.211E-06

1.7281E-12

Worst

4.8261

0.021993

0.057797

8.9175E-05

8.9665E-10

Mean

2.82286

0.01125788

0.0039136

6.56425E-06

5.3473E-10

Best

0.29344

0.0058842

0.039322

0.0004249

1.7112E-06

Worst

0.58198

0.24287

0.097233

0.001888

4.2273E-05

Mean

0.33771

0.0150856

0.0432775

0.0011564

2.19921E-05

Best

0.029659

1.6473

0.025932

0.00007399

2.8887E-06

Worst

0.088692

5.7905

0.096412

0.00067228

2.5817E-05

Mean

0.0441755

1.7189

0.041172

0.00037313

1.43529E-05

Best

0.19654

0.27069

0.074749

0.051765

6.5036E-10

Worst

0.74303

0.80046

0.74843

0.22294

7.1136E-09

Mean

0.219785

0.285575

0.111589

0.137352

6.8086E-10

Fig. 17.2 Profile for F a


Fig. 17.3 Profile for F b

Fig. 17.4 Profile for F c

17.4 Conclusion The crazy grasshopper optimization algorithm (crazy-GOA) is an integration of the grasshopper optimization algorithm and the crazy factor. The crazy factor has also been used


Fig. 17.5 Profile for F d

Fig. 17.6 Profile for F e


Fig. 17.7 Profile for F f

in many research papers and has improved the performance of many intelligent optimization algorithms. The newly proposed algorithm is tested on both unimodal and multimodal standard benchmark functions. The simulation of the crazy-GOA algorithm against the other algorithms, along with normal GOA, proved the effectiveness of the newly proposed modified version. Since time is a vital parameter in modern machine learning applications, the algorithms are tested for a small number of iterations, and crazy-GOA proved its utility by giving the best output in fewer iterations; faster convergence to the optimal solution with fewer iterations helps reduce execution time. Therefore, the suggested version of the algorithm is well suited for handling complex engineering, scientific, and a variety of real-world applications.

References
1. Saremi, S., Mirjalili, S., Mirjalili, S., Dong, J.S.: Grasshopper optimization algorithm: theory, literature review, and application in hand posture estimation. In: Nature-Inspired Optimizers, pp. 107–122. Springer (2020)
2. Karaboga, D., Akay, B.: A comparative study of artificial bee colony algorithm. Appl. Math. Comput. 214(1), 108–132 (2009)
3. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of ICNN'95 - International Conference on Neural Networks, vol. 4, pp. 1942–1948. IEEE (1995)


4. Dorigo, M., Gambardella, L.M.: Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Trans. Evol. Comput. 1(1), 53–66 (1997)
5. Sarangi, A., Priyadarshini, S., Sarangi, S.K.: A MLP equalizer trained by variable step size firefly algorithm for channel equalization. In: 2016 IEEE 1st International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES), pp. 1–5. IEEE (2016)
6. Saremi, S., Mirjalili, S., Lewis, A.: Grasshopper optimisation algorithm: theory and application. Adv. Eng. Softw. 105, 30–47 (2017)
7. Abualigah, L., Diabat, A.: A comprehensive survey of the grasshopper optimization algorithm: results, variants, and applications. Neural Comput. Appl. 1–24 (2020)
8. Mondal, S., Chakraborty, D., Kar, R., Mandal, D., Ghoshal, S.P.: Novel particle swarm optimization for high pass FIR filter design. In: 2012 IEEE Symposium on Humanities, Science and Engineering Research, pp. 413–418. IEEE (2012)

Chapter 18

Reliability Estimation Using Fuzzy Failure Rate

Sampa ChauPattnaik, Mitrabinda Ray, and Mitali Madhusmita Nayak

Abstract Software design is the process of creating software using computer engineering practices. Predicting a software system's reliability has recently become a significant topic in software engineering research, and various methodologies have been used to evaluate and estimate system reliability. Component-based software engineering (CBSE) minimizes the effort needed to design new software by using factors such as component reliability, average execution time, and component interaction. Because these parameters are uncertain during the design phase of the software, we apply fuzzy theory to evaluate the system reliability; in real-world problems, system reliability is a multi-objective problem, so fuzzy logic is considered to optimize it.

18.1 Introduction

Component-based software engineering (CBSE) focuses on component interfaces, dependencies, and reuse. Reliability is defined as the capacity of a reused component to provide new results with minimal errors while meeting customer requirements with minimal changes. To predict reliability, the coupling between components and the reliability of the individual components are evaluated. There are several approaches for predicting reliability [6, 8, 11, 13]. For reliability assessment during the design phase, architecture-based models have been proposed; the Gokhale model [5], Laprie model [12], Shooman model [17], Yacoub model [20], and Everett model [3] are some of the architecture-based software reliability models. The models are


described by state, path, or additive approaches. The common parameters used in these models are bugs in mathematical algorithms, mean time to repair, component reliability, transition probabilities, component dependency, operational profile [15], constant probability of failure, number of failures, average execution time, etc. For technical products and systems, failures are almost inevitable, and reliability quantifies how probable failure is. In traditional reliability theory, a system's reliability is defined as the probability that it will accurately execute its stated functions. This paper presents a fuzzy logic technique for reliability estimation. Under the fuzzy hypothesis, the failure behavior of the system is fully characterized by fuzzy measurement, and system success and failure are also represented by fuzzy states: the system can be thought of as being in one of two fuzzy states (success or failure) at any given time. In other words, system failure is defined not precisely but vaguely. The component reliability, constant failure rate, transition probability, etc., are relevant in making reliability assessment decisions, and a rule-based approach is employed to characterize the implications of these parameters for the total system reliability. It is very difficult to quantify component-based software reliability estimation: because of uncertainty in factors like component reliability, transition probability between components, and failure rate, and the presence of fuzziness in human perception, soft computing-based models are used to handle this. Several models have been proposed for component-based software reliability estimation [2, 9, 14, 16, 19]. Tyagi et al. [19] present a rule-based method to estimate the reliability, using four factors to check it. Diwaker et al. [2] used fuzzy logic as well as particle swarm optimization (PSO) to achieve high reliability. Malhotra et al. [14] argued that it is extremely difficult to estimate the parameters accurately without using a hybrid model. In this paper, we consider fuzzifying the failure parameter using a triangular membership function to quantify the system reliability. The rest of the paper is organized as follows: Sect. 18.2 describes the fuzzy reliability system at the architectural level; Sect. 18.3 describes the proposed method for fuzzy reliability estimation for component-based systems; Sect. 18.4 presents a case study of a stock trading system; and Sect. 18.5 presents the conclusion.

18.2 Reliability Using Fuzzy System

For evaluating the reliability of component-based applications, several reliability models and estimation methodologies have been suggested. Fuzzy logic is a condition-based problem-solving method that considers the grade of membership instead of categorizing a task as simply successful or unsuccessful. Many studies [2, 9, 16, 19] have proposed different approaches to assess system reliability for component-based systems. The fuzzifier, the fuzzy inference system (FIS), the rules, and the defuzzifier are the four components of a fuzzy system; Fig. 18.1 depicts its general design. The fuzzy system works as follows: the fuzzifier takes classical input and represents it in serial order; the results are evaluated using a number of rules, forming a computational intelligence system. Relying on the evolutionary computation


Fig. 18.1 Schematic diagram of fuzzy logic system

and parameters used, the defuzzifier minimizes the results and finds the optimal solution.

18.3 Discussion

Using fuzzy theory, we propose a method for estimating reliability.

Input: Component reliability (CRe), the component's average execution time, and the transition probability between two components (TPR_{i,j}).
Output: Overall system reliability.

Step-1: Evaluate the system reliability using the path-based method of Eq. (18.1). When the architecture is represented by an absorbing DTMC and the component reliabilities are known, the application's reliability is

Re = \prod_{j=1}^{n} Re_j^{\theta_j}    (18.1)

where Re = system reliability, Re_j = component reliability, n = number of components, and θ_j = expected number of visits to a component, i.e., the product of the number of visits and the probability of reaching the component.
Step-2: Design the fuzzy system.
Step-3: Make a list of the linguistic variables (ranges).
Step-4: Evaluate the input values with fuzzy membership functions (such as triangular, quasi, or trapezoidal membership functions).
Step-5: Using linguistic inputs, design the fuzzy rules.


Step-6: To obtain the classical (crisp) value and assess the system reliability, use a defuzzification method such as the centroid, max-membership, or weighted average method.
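To make Step-1 concrete, the sketch below evaluates Eq. (18.1) for a small hypothetical system; the component reliabilities and expected visit counts are illustrative inputs, not values from the paper's case study.

```python
from math import prod

def system_reliability(component_rel, expected_visits):
    """Path-based estimate of Eq. (18.1): Re = prod_j Re_j ** theta_j."""
    return prod(r ** theta for r, theta in zip(component_rel, expected_visits))

# Hypothetical component reliabilities Re_j and expected visits theta_j
# (theta_j would normally come from solving the absorbing DTMC):
print(system_reliability([0.99, 0.97, 0.98], [1.0, 2.5, 1.2]))  # roughly 0.9
```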

18.4 Illustration

In this part, we offer an overview of a stock trading [18] system to show how our concept is used. The system has 11 components, as indicated in Fig. 18.2, which also gives the component dependency graph (CDG).

Fig. 18.2 Stock trading system architecture

Table 18.1 Component reliability and execution time

Component # | Reliability | Execution time
COR1  | 1     | 1
COR2  | 0.974 | 2
COR3  | 0.970 | 1
COR4  | 0.982 | 2
COR5  | 0.960 | 2
COR6  | 0.999 | 1
COR7  | 0.999 | 1
COR8  | 0.999 | 2
COR9  | 0.975 | 2
COR10 | 0.964 | 4
COR11 | 1     | 1

Table 18.2 Transition probabilities (TRP(i, j)) between two components

TRP1,2 = 0.489 | TRP1,3 = 0.511
TRP2,4 = 0.5   | TRP2,5 = 0.5
TRP3,4 = 0.5   | TRP3,5 = 0.5
TRP4,6 = 0.333 | TRP4,7 = 0.333 | TRP4,8 = 0.334
TRP5,6 = 0.333 | TRP5,7 = 0.333 | TRP5,8 = 0.334
TRP6,9 = 0.7   | TRP6,10 = 0.3
TRP7,9 = 0.7   | TRP7,10 = 0.3
TRP8,9 = 0.7   | TRP8,10 = 0.3
TRP9,11 = 1
TRP10,6 = 0.333 | TRP10,7 = 0.333 | TRP10,8 = 0.334

Table 18.1 describes the component reliability and mean execution time of each component, whereas Table 18.2 gives the transition probability between two components. To estimate the system reliability, we consider a path-based approach [7, 10, 20]. The system reliability using Cheung's model [1] is 0.90, or 90%; similarly, using Gokhale's model [5], we get an overall system reliability of 0.89, or 89%. We apply the fuzzy triangular function to the failure rate to determine the reliability. The reliability in terms of the failure rate [4] is expressed as follows:

Rel = \prod_{m=1}^{n} v_m \, e^{-\mu_m \, ET(m)}    (18.2)

where Rel = reliability, n = number of components, v_m = number of visits to each component, ET(m) = execution time of component m, and μ_m = constant failure rate of component m. The constant failure rate is calculated as

\mu_m = \frac{1}{ET(m)}    (18.3)

The fuzzy triangular function is described by a lower limit cr1, an upper limit cr2, and a peak value cr3, where cr1 < cr3 < cr2. The function is represented as follows:

\mu_{CR}(r) =
\begin{cases}
0, & r \le cr1 \text{ or } r \ge cr2 \\
\frac{r - cr1}{cr3 - cr1}, & cr1 < r \le cr3 \\
\frac{cr2 - r}{cr2 - cr3}, & cr3 < r < cr2
\end{cases}    (18.4)
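A minimal sketch of Eqs. (18.2)-(18.4) follows; the visit counts, execution times, and failure rates are placeholders for the kind of values listed in Tables 18.1 and 18.3, and the function names are ours.

```python
import math

def reliability_with_failure_rate(visits, exec_times, failure_rates):
    # Eq. (18.2): Rel = prod_m v_m * exp(-mu_m * ET(m))
    rel = 1.0
    for v, et, mu in zip(visits, exec_times, failure_rates):
        rel *= v * math.exp(-mu * et)
    return rel

def triangular_membership(r, cr1, cr3, cr2):
    # Eq. (18.4): triangular function with lower limit cr1, peak cr3, upper limit cr2
    if r <= cr1 or r >= cr2:
        return 0.0
    if r <= cr3:
        return (r - cr1) / (cr3 - cr1)
    return (cr2 - r) / (cr2 - cr3)

# Illustrative single-component call using the failure rate of component 2
# from Table 18.3 (not a reproduction of the paper's full computation):
print(reliability_with_failure_rate([1], [2], [0.013171988]))  # ~0.974
print(triangular_membership(0.5, 0.0, 0.5, 1.0))               # 1.0 at the peak
```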

Table 18.3 shows the failure rate of each component together with the expected reliability of each component. Considering the failure rate of each component to be fuzzy, the reliability of the components was calculated and found to be almost the same as the reliability obtained without the fuzzy failure rate (Fig. 18.3).

Table 18.3 Calculated failure rate and estimated reliability of each component

Component # | Failure rate | Reliability with failure rate
1  | 0           | 1
2  | 0.013171988 | 0.995215
3  | 0.030459207 | 0.99447
4  | 0.009081985 | 0.996688
5  | 0.020410997 | 0.989158
6  | 0.0010005   | 0.999817
7  | 0.0010005   | 0.9998
8  | 0.00050025  | 0.999818
9  | 0.012658904 | 0.995405
10 | 0.009165996 | 0.990257
11 | 0           | 1

Fig. 18.3 Original failure rate vs fuzzy failure rate

18.5 Conclusion

Many reliability models have been developed based on various parameters, and soft computing is becoming increasingly prominent in software reliability estimation and prediction. A novel fuzzy system is proposed that uses parameters such as component reliability and failure rate to calculate the reliability of component-based software (CBS). Using the fuzzy technique on the failure rate of each component, we obtain an estimated system reliability of 0.95, whereas without fuzzy logic the system reliability is 0.90. Other component-related and application-related parameters may also be considered to estimate the reliability. We may extend the work to compare the results with other soft computing techniques, such as artificial bee colony and particle swarm optimization, and find which is more suitable for real-world applications with minimum cost.


References

1. Cheung, R.C.: A user-oriented software reliability model. IEEE Trans. Softw. Eng. 6(2), 118–125 (1980)
2. Diwaker, C., Tomar, P., Solanki, A., Nayyar, A., Jhanjhi, N.Z., Abdullah, A., Supramaniam, M.: A new model for predicting component-based software reliability using soft computing. IEEE Access 7, 147191–147203 (2019)
3. Everett, W.W.: Software component reliability analysis. In: Proceedings 1999 IEEE Symposium on Application-Specific Systems and Software Engineering and Technology. ASSET'99 (Cat. No. PR00122), pp. 204–211. IEEE (1999)
4. Gokhale, S.S., Trivedi, K.S.: Analytical models for architecture-based software reliability prediction: a unification framework. IEEE Trans. Reliab. 55(4), 578–590 (2006)
5. Gokhale, S.S., Trivedi, K.S.: Reliability prediction and sensitivity analysis based on software architecture. In: 13th International Symposium on Software Reliability Engineering, Proceedings, pp. 64–75. IEEE (2002)
6. Goševa-Popstojanova, K., Trivedi, K.S.: Architecture-based approach to reliability assessment of software systems. Perform. Eval. 45(2–3), 179–204 (2001)
7. Hsu, C.J., Huang, C.Y.: An adaptive reliability analysis using path testing for complex component-based software systems. IEEE Trans. Reliab. 60(1), 158–170 (2011)
8. Inoue, T.: Reliability analysis for disjoint paths. IEEE Trans. Reliab. 68(3), 985–998 (2019)
9. Jaiswal, G.P., Giri, R.N.: Software reliability estimation of component based software system using fuzzy logic. Int. J. Comput. Sci. Inf. Secur. 13(7), 66 (2015)
10. Krishnamurthy, S., Mathur, A.P.: On the estimation of reliability of a software system using reliabilities of its components. In: Proceedings of the Eighth International Symposium on Software Reliability Engineering, pp. 146–155. IEEE (1997)
11. Kubat, P.: Assessing reliability of modular software. Oper. Res. Lett. 35–41 (1989)
12. Laprie, J.C.: Dependability evaluation of software systems in operation. IEEE Trans. Softw. Eng. 10(6), 701–714 (1984)
13. Littlewood, B.: Software reliability model for modular program structure. IEEE Trans. Reliab. 241–246 (1979)
14. Malhotra, S., Dhawan, S.: Parameter estimation of software reliability using soft computing techniques. In: Proceedings of International Conference on Machine Intelligence and Data Science Applications, pp. 329–343. Springer, Singapore (2021)
15. Musa, J.D.: Operational profiles in software-reliability engineering. IEEE Softw. 10(2), 14–32 (1993)
16. Ritu, O.P.: Software quality prediction method using fuzzy logic. Turkish J. Comput. Math. Education 12(11), 807–817 (2021)
17. Shooman, M.L.: Structural models for software reliability prediction. In: Proceedings of the 2nd International Conference on Software Engineering, pp. 268–280. IEEE Computer Society Press (1976)
18. Si, Y., Yang, X., Wang, X., Huang, C., Kavs, A.J.: An architecture-based reliability estimation framework through component composition mechanisms. In: 2010 2nd International Conference on Computer Engineering and Technology, vol. 2, pp. V2-165. IEEE (2010)
19. Tyagi, K., Sharma, A.: A rule-based approach for estimating the reliability of component-based systems. Adv. Eng. Softw. 54, 24–29 (2012)
20. Yacoub, S., Cukic, B., Ammar, H.H.: A scenario-based reliability analysis approach for component-based software. IEEE Trans. Reliab. 53(4), 465–480 (2004)

Chapter 19

A Sine Cosine Learning Algorithm for Performance Improvement of a CPNN Based CCFD Model

Rajashree Dash, Rasmita Rautray, and Rasmita Dash

Abstract Nowadays, almost everyone uses credit cards for different modes of monetary transactions. This swift acceleration of e-commerce and transaction digitization has given fraudsters a comfortable path to enact deviant frauds. Credit card fraud alone is a serious cause of declining revenue for the country, and one way to stem this loss is to promote efficient credit card fraud detection (CCFD) models. In this article, a CCFD model is built using a Chebyshev polynomial neural network (CPNN), and its performance is enhanced by applying a sine cosine algorithm (SCA) in its training phase. The proposed CCFD model is also compared with two other models on two credit card datasets.

19.1 Introduction

Day by day, we get news of financial frauds, especially those related to credit cards. The loss due to credit card fraud alone causes immense damage to the revenue of the country, so this serious issue has become an eye-catching research area, and the main motivation is to develop more efficient fraud detection models. This domain has already witnessed several CCFD models for revealing fraudulent activities. In [1], an extensive application of data mining models is presented for four categories of monetary extortion. In [2], a random forest (RF)-based CCFD model is found to surpass support vector machine (SVM) and logistic regression (LR)-based models. An ANN-based CCFD model outperforms an LR-based model in [3], and another LENN-based CCFD model is suggested in [4]. Though the CCFD literature comprises a variety of models, it remains challenging to detect fraud efficiently in enormous real-world data. In this article, a recent meta-heuristic, the SCA, designed around the trigonometric sine and cosine functions, is applied in the training stage of a CPNN to build an efficient SCA-CPNN-based CCFD model. This network is built


up with an expansion unit that uses a set of Chebyshev polynomials instead of the traditional hidden layers of an MLP [5–7]. In [5], the network is compared with an MLP and a decision tree (DT) for the credit card fraud detection problem. To overcome the slow convergence and local minima problems of BP learning, in this study the SCA technique is proposed as the learning algorithm for the CPNN-based CCFD model. SCA is a recently developed optimization algorithm that utilizes the sine and cosine functions to search for the required solution vector among randomly initialized candidate solutions. Motivated by the success of SCA compared with other state-of-the-art optimization approaches in various real-life problems [8–10], we use it to tune the weights of the expansion unit of the CPNN-based CCFD model. Empirical validation of the proposed CCFD model is presented, with detailed comparative results against two other state-of-the-art techniques, particle swarm optimization (PSO) and shuffled frog leaping (SFL), on two credit card datasets. The next section details the SCA-CPNN-based CCFD model, followed by the experimental result analysis in Sect. 19.3 and the conclusion in Sect. 19.4.
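As an illustration of the expansion unit described above, the sketch below expands input features with first-kind Chebyshev polynomials via the recurrence T_0(x) = 1, T_1(x) = x, T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x); the expansion order and the [-1, 1] input scaling are our assumptions, not settings reported by the paper.

```python
import numpy as np

def chebyshev_expand(x, order):
    """Expand each scaled input feature with Chebyshev polynomials T_0..T_order.

    Inputs are assumed normalized to [-1, 1]; the expanded block replaces
    the hidden layers of a conventional MLP.
    """
    terms = [np.ones_like(x), x]
    for _ in range(2, order + 1):
        terms.append(2 * x * terms[-1] - terms[-2])
    return np.concatenate(terms, axis=1)

X = np.random.uniform(-1, 1, size=(5, 14))  # e.g., 14 features as in the Australian data
print(chebyshev_expand(X, order=3).shape)   # (5, 56): 14 features x 4 polynomial terms
```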

19.2 Proposed SCA-CPNN-Based CCFD Model

Figure 19.1 depicts the overall working principle of the SCA-CPNN-based CCFD


Fig. 19.1 SCA-CPNN-based CCFD framework


framework. In the first stage of the model, the collected data are normalized and separated into training and test sets. Then, the expansion order of the CPNN is fixed and the network is created accordingly. The CPNN resembles a feed-forward NN; in its architecture, the nonlinear input-output relation is captured by spreading the input pattern from a lower-dimensional space to a higher dimension with the help of Chebyshev polynomials. The network includes an input expansion block (IEB) using Chebyshev polynomials. Further, the expanded input values are passed through the learning component in the output layer to furnish the output, and the resulting network error is adopted to estimate the weights of the network in the training phase. Next, the training phase of the CPNN is promoted by randomly initializing the solution vectors of SCA, where each solution vector represents a feasible set of network weights. The fitness value of each vector is calculated, and the vector with the best fitness value is assigned as the destination vector. The positions of the solution vectors are then updated using sine and cosine functions based on their distance from the destination vector as follows:

sv_{i,j}^{t+1} = sv_{i,j}^{t} + r1 \cdot \sin(r2) \cdot \left| r3 \cdot dv_j^{t} - sv_{i,j}^{t} \right|, \quad \text{if } r4 < 0.5    (19.1)

sv_{i,j}^{t+1} = sv_{i,j}^{t} + r1 \cdot \cos(r2) \cdot \left| r3 \cdot dv_j^{t} - sv_{i,j}^{t} \right|, \quad \text{if } r4 \ge 0.5    (19.2)

where sv represents a solution vector and dv the destination vector. The behavior of SCA is tuned by four random variables r1, r2, r3, and r4. The variable r1 adaptively decreases from 2 to 0, keeping a balance between exploitation and exploration; r2 is a random number within [0, 2π] used to control the movement of the solution vector toward or away from the destination vector; r3 adds a random weight to the destination vector so as to stochastically emphasize or deemphasize the effect of dv on the distance; and r4 switches equally between improving the solution vector through the sine or the cosine function. The steps of fitness calculation and solution vector improvement continue until the termination condition is reached. Finally, the values of the destination vector are preserved as the final weights of the CPNN, and the network is used for testing. The pseudocode of the proposed SCA-CPNN-based CCFD model is depicted in Fig. 19.2.
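A minimal vectorized sketch of the update in Eqs. (19.1) and (19.2) follows; the linear decay of r1 from 2 to 0 follows the SCA of [8], while the array shapes and the function name are ours.

```python
import numpy as np

def sca_update(solutions, destination, t, max_iter, a=2.0, rng=np.random.default_rng()):
    """One SCA position update following Eqs. (19.1)-(19.2).

    solutions: (pop_size, dim) candidate weight vectors
    destination: (dim,) best-so-far (destination) vector dv
    """
    r1 = a - t * (a / max_iter)                       # decays linearly from a to 0
    r2 = rng.uniform(0, 2 * np.pi, solutions.shape)   # movement angle
    r3 = rng.uniform(0, 2, solutions.shape)           # random weight on dv
    r4 = rng.uniform(0, 1, solutions.shape)           # sine/cosine switch
    dist = np.abs(r3 * destination - solutions)
    step = np.where(r4 < 0.5, r1 * np.sin(r2) * dist, r1 * np.cos(r2) * dist)
    return solutions + step
```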

19.2.1 Evaluation Criteria

The assessment method is an important factor in evaluating different models. In this study, two popular scalar classification criteria, the correct classification rate (CCR) and the misclassification rate (MCR), are used to validate the performance of the proposed CCFD model. Let TP, TN, FP, and FN represent the true positive, true negative, false positive, and false negative values


Fig. 19.2 Pseudo code of SCA-CPNN-based CCFD model

in the confusion matrix. The evaluation criteria can be derived from the confusion matrix using Eqs. (19.3) and (19.4):

CCR = \frac{TP + TN}{TP + TN + FP + FN}    (19.3)

MCR = \frac{FP + FN}{TP + TN + FP + FN}    (19.4)
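Eqs. (19.3) and (19.4) reduce to a few lines of code; the confusion-matrix counts in the example are hypothetical.

```python
def ccr_mcr(tp, tn, fp, fn):
    """Correct and misclassification rates from Eqs. (19.3) and (19.4)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total, (fp + fn) / total

ccr, mcr = ccr_mcr(tp=180, tn=120, fp=30, fn=15)  # hypothetical counts
print(f"CCR={ccr:.3f}, MCR={mcr:.3f}")
```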


19.3 Experimental Result Discussion

This study emphasizes improving the performance of a CPNN-based CCFD model by applying SCA as the learning algorithm of the CPNN. Experimental outcomes observed through simulations on two credit card datasets, the Australian and German datasets collected from the UCI machine learning repository, are taken into account for validation of the proposed model. The Australian dataset includes 690 samples, each with 14 features, and the German dataset has 1000 samples with 20 features. The data have been normalized and split into training and testing sets comprising 70% and 30% of the samples, respectively. After specifying a suitable order for the CPNN and creating it, the network is trained with three different stochastic learning techniques. For all three techniques, the same population size of 20 and the same number of iterations, i.e., 200, are fixed after a number of simulations. From the convergence graphs depicted in Figs. 19.3 and 19.4, it is clearly visible that SCA-based learning has better convergence and correct classification accuracy than the PSO and SFL techniques for both datasets. The classification metrics observed by all three models on the training samples are depicted in Fig. 19.5, where the SCA-CPNN-based CCFD model produces around 89.78% CCR and 10.217% MCR for the Australian data and 76.46% CCR and 23.53% MCR for the German data. Similarly, the outcomes of the different CCFD models on the testing samples, depicted in Fig. 19.6, clearly illustrate that the SCA-CPNN-based CCFD model produces around 83.91% CCR and 16.09% MCR for the Australian data and 78.08% CCR and 21.92% MCR for the German data. Overall, the empirical results highlight the potential of the SCA-CPNN model in detecting credit card fraud and demonstrate its superiority over the PSO-CPNN and SFL-CPNN-based CCFD models for both datasets.

Fig. 19.3 Convergence analysis on Australian dataset


Fig. 19.4 Convergence analysis on German dataset

Fig. 19.5 Training outcome of different CCFD models

Fig. 19.6 Testing outcome of different CCFD models



19.4 Conclusion

With the unraveling tactics of fraudsters, developing an efficient fraud detection model has always been a challenge for the financial industry. This paper has explored the performance of a CPNN-based classifier trained using SCA for CCFD. A performance comparison of the suggested model with the PSO-CPNN and SFL-CPNN-based CCFD models clearly reveals better CCR and MCR values in the training and testing phases for both datasets. In the future, we plan to improve the performance of SCA by integrating its features with other search techniques and to apply it in the feature selection and network optimization phases of the CPNN-based CCFD model.

References

1. Ngai, E.W., Hu, Y., Wong, Y.H., Chen, Y., Sun, X.: The application of data mining techniques in financial fraud detection: a classification framework and an academic review of literature. Decis. Support Syst. 50(3), 559–569 (2011)
2. Bhattacharyya, S., Jha, S., Tharakunnel, K., Westland, J.C.: Data mining for credit card fraud: a comparative study. Decis. Support Syst. 50(3), 602–613 (2011)
3. Sahin, Y., Duman, E.: Detecting credit card fraud by ANN and logistic regression. In: IEEE International Symposium on Innovations in Intelligent Systems and Applications, pp. 315–319 (2011)
4. Dash, R., Rautray, R., Dash, R.: A Legendre neural network for credit card fraud detection. In: Intelligent and Cloud Computing, pp. 411–418 (2021)
5. Mishra, M.K., Dash, R.: A comparative study of Chebyshev functional link artificial neural network, multi-layer perceptron and decision tree for credit card fraud detection. In: IEEE Conference on Information Technology, pp. 228–233 (2014)
6. Dash, R.: DECPNN: a hybrid stock predictor model using differential evolution and Chebyshev polynomial neural network. Intell. Decis. Technol. 12, 93–104 (2018)
7. Mohanty, S., Dash, R.: Predicting the price of gold: a CSPNN-DE model. In: Intelligent and Cloud Computing, pp. 289–297 (2021)
8. Mirjalili, S.: SCA: a sine cosine algorithm for solving optimization problems. Knowl.-Based Syst. 96, 120–133 (2016)
9. Abd Elaziz, M., Oliva, D., Xiong, S.: An improved opposition-based sine cosine algorithm for global optimization. Expert Syst. Appl. 90, 484–500 (2017)
10. Belazzoug, M., Touahria, M., Nouioua, F., Brahimi, M.: An improved sine cosine algorithm to select features for text categorization. J. King Saud Univ.-Comput. Inf. Sci. 32(4), 454–464 (2020)

Chapter 20

Quality Control Pipeline for Next Generation Sequencing Data Analysis

Debasish Swapnesh Kumar Nayak, Jayashankar Das, and Tripti Swarnkar

Abstract Current trends in biomedical research, especially genome sequence analysis, are enriched by the introduction of high-throughput sequencing techniques. High-throughput sequencing such as next generation sequencing (NGS) plays a crucial role in whole genome sequencing (WGS), epigenomics, transcriptomics, and many more areas. NGS data give deeper insight into the results of various microbial and cancer genomics analyses and offer greater quality and diversity than traditional biological datasets. NGS data make classification, clustering, and analysis more reliable with the help of advanced computational intelligence techniques such as machine learning (ML) and deep learning (DL). The accuracy of the results depends largely on the standard of the dataset; similarly, the analysis of the data is more effective when it is in normalized form. In this paper, we prepare a pipeline for preprocessing NGS data of oral cancer and breast cancer into a normalized, analysis-ready form. The obtained result is in a standard format with normalized values; the normalized NGS values give the analysis process an extra edge and help to obtain more accurate results.

20.1 Introduction

High-throughput sequencing such as NGS is the emerging trend in the field of biomedical research. An NGS dataset contains a huge amount of data, which leads to


complexity in analysis and result interpretation [1]. For an efficient clinical research laboratory set-up, data management plays a key role in a Biomedical Laboratory Information System (BLIS). To deal with a dataset as huge and complex as NGS, it is important to keep all the data processing pipelines in a proper structure with relevant metadata information. NGS data have more potential than traditional data such as Sanger sequencing; a limitation of Sanger sequencing is that minority resistance variants that are clinically significant but below the detection threshold are not detected [2]. NGS can detect such minority variants and, most importantly, does so at lower cost, in less time, and in a scalable mode [3]. Features like informative data in less time, low cost, and minimal computation time make NGS popular; at the same time, because of the huge amount of data, there are more sequencing errors. So the main objective while dealing with NGS data is to find a solution that reduces the issues associated with it, and data preprocessing is crucial for analyzing NGS datasets and interpreting the results. In research areas such as whole genome sequencing (WGS), disease prediction, drug resistance, and antimicrobial resistance, the use of NGS datasets has increased exponentially, mainly because of the significant improvement in the accuracy of the results. Data processing runs from data collection to giving the data a standard shape for analysis. In several works of literature, the data are processed using the traditional tools available, thus losing accuracy because of the complex (vast) nature of NGS data. Numerous experimental methods, tools, and computational methods are available to work with NGS datasets, but only a very limited number of computational pipelines are available to deal with raw data [4]. A foolproof data processing pipeline therefore needs to be developed that can provide efficiencies in the different stages of the research life cycle; the pipeline must include features like evaluation of controls, quality control, and relevant information for future analysis. NGS datasets for various gene expression data must undergo a common data processing pipeline to maintain robustness, and the aim of such a pipeline should be to provide a unique solution for the same type of NGS data.

20.2 Background and Related Work

The Illumina sequencing platforms (HiSeq4000, NextSeq500) are widely used in various sequencing steps; a sequencer like the MiSeq can produce ~30 million paired-end reads in a single day with extremely fast computational processing time. Transcriptome analysis needs a huge number of cell profiles and hence increases the sequencing cost [5]. To deal with this, Islam et al. [6] considered that a single cell contains 10^5–10^6 mRNA molecules and more than 10,000 expressed genes. The researchers integrated unique molecular identifiers (UMIs), or barcodes, with reverse transcription; in this technique the PCR bias is removed, and each read can be mapped to its original cell with more accuracy. By the use of


RPKM/FPKM (reads/fragments per kilobase per million mapped reads) sequencing read-based terminology, the barcode approach provides better reproducibility than the molecular counting method [7]. Quality control (QC) comes into the picture once all the reads are obtained from well-designed sequencing experiments. The FastQC tool (Babraham Institute, http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is used by Byungjin et al. [1, 8] to examine the quality distribution of all reads; in the preprocessing step, low-quality sequence reads are removed by this tool. The next step is read alignment, which includes methods such as the Burrows-Wheeler Aligner (BWA) and STAR [9, 10]. The result of this pipeline provides a summary of the post-alignment status, such as mapped reads, annotated reads, and coverage patterns; the problem associated with this tool is that no step is involved in trimming the raw sequence reads. The ClinQC pipeline, developed using Python 2.7.9 (https://www.python.org), has been used for preprocessing the data; it consists of nine different steps, starting with raw sequencing reads and finishing with a QC report [11]. The steps of this pipeline are base calling, format conversion, demultiplexing, adapter and primer trimming, duplicate and contamination filtering, quality trimming, read filtering, GC content assessment, and result generation [5]. This is a complete data processing pipeline for researchers working in the wet lab. The major limitation of this pipeline is that no step carries out dimension reduction of the NGS data, which may increase the turnaround time of the pipeline.

20.3 Datasets

To run the datasets in the pipeline, we focus on two different types of datasets: oral cancer and breast cancer. The breast cancer NGS data are taken from a 79-PDX model (patient-derived xenografts, i.e., immuno-compromised mice with human tumors). Similarly, the oral cancer gene expression data have been taken from NCBI, with 100 tumor and 44 normal samples.

20.4 Methods and Material

The raw paired-end sequence reads of breast cancer (21 × 22,635) and oral cancer (144 × 10,376) are taken for the pipeline implementation. Paired-end sequencing provides alignable sequence data of high quality and allows the user to sequence both ends of any fragment; it facilitates the detection of genomic rearrangements and repetitive sequence elements, as well as gene fusions and novel transcripts. The raw sequence reads are then fed to the data preparation block, which in our pipeline has six (06) steps to process the raw reads (Fig. 20.1).


Fig. 20.1 Blueprint of NGS data quality control pipeline

20.4.1 Duplicate Removal and Trimming

Before dealing with sequencing reads, it is very important to preprocess the raw reads and eradicate unnecessary technical artifacts. The raw read data should be converted into a format that can be used efficiently by subsequent algorithms; thus, it is necessary to clean the sequencing adapters from the reads. The Trimmomatic algorithm cleans the technical sequences from the raw sequencing data, and the PCR clean-up step removes all duplicate reads from the sequencing data. In this step, raw data preprocessing, error correction, and simulation are done.

20.4.2 NGS Dataset

In this step, the raw reads are aligned to a genome sequence of a similar species, chosen as the reference genome. If a WGS reference is not available, then genome contigs (a contig


is a set of overlapping DNA segments that represents a consensus region of DNA) can be used. The procedure also helps the user quantify reads associated with expressed genes. The final raw reads are saved in FASTQ format (a text-based file format that stores biological nucleotide sequences and quality scores); FASTQ files carry different fields such as the read name, the sequence, and the base qualities. The mapped and well-structured gene expression table is obtained after running a specific pipeline; the method we choose results in a table that contains a numeric value for each gene in each sample. The obtained gene expression table may be stored in .txt, .csv, or .xlsx file format, which is the input for the subsequent steps of our pipeline.

20.4.3 NULL Value Removal

NULL values present in the dataset affect the performance of any subsequent steps (models) analyzing the data. This step deals with the eradication of the NULL values present in the NGS gene expression dataset. The null values present in each sample are removed; thus, the dimension of the dataset is reduced so that it contains only the significant values.

20.4.4 Normalization

The need for technical standardization and normalization of data is crucial in the course of biological data analysis, and most analysis models expect a normal distribution in the samples [12, 13]. In the normalization step, the gene expression data are normalized to fit most of the models associated with the analysis of large datasets. In general, the normal distribution has equal tails on both sides and a higher probability around the mean. There are several normalization methods, but it is important to choose an appropriate method when there is a question of how to bring normalized data back to their original values if needed. Methods like log scale transformation are very useful: they modify the data properties and enable parametric tests. Thus, in our pipeline, a log scale transformation is implemented to obtain the normalized values, as sketched below.
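The sketch assumes the gene expression table is loaded as a pandas DataFrame; the pseudo-count of 1 is a common convention and our assumption, since the paper does not state one.

```python
import numpy as np
import pandas as pd

def log2_normalize(expr: pd.DataFrame) -> pd.DataFrame:
    """Log-scale transformation of a genes-x-samples expression table.

    A pseudo-count of 1 keeps zero counts defined and makes the
    transformation invertible (2**y - 1 recovers the original values).
    """
    return np.log2(expr + 1)

# Hypothetical toy table; the real input would be the gene expression .csv/.xlsx
expr = pd.DataFrame({"sample_1": [0, 8, 1023], "sample_2": [3, 15, 255]})
print(log2_normalize(expr))
```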

20.4.5 Dimensionality Reduction

Biological datasets, especially NGS datasets, have a large number of genes with a very small number of samples; this curse of dimensionality is a major threat to researchers dealing with NGS data. This pipeline provides a solution by adopting a dimension reduction method, principal component analysis (PCA), as sketched below.
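The PCA step can be sketched with scikit-learn as follows; the random matrix stands in for the real samples-by-genes table, and standardization before PCA is our assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(144, 10376)        # hypothetical samples-x-genes matrix (oral cancer shape)
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)             # keep PC1-PC3 as reported in Table 20.2
scores = pca.fit_transform(X_std)     # per-sample coordinates for the scatter plots
print(pca.explained_variance_ratio_)  # should stay below the 0.8 overfitting guard
```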


Table 20.1 Summary of the datasets used in the pipeline

S. No. | Dataset used | Samples | Genes | No. of feature groups
1 | Breast cancer | 20 | 22,635 | 4
2 | Oral cancer | 144 | 10,376 | 2


20.5 Result Discussion

The results of the proposed NGS data quality control pipeline are discussed in terms of two different aspects:

(i) the differentially expressed genes and outlier detection visualization, and
(ii) the differentially expressed gene selection using dimension reduction.

20.5.1 The Differentially Expressed Genes and Outlier Detection Visualization

The pipeline has been implemented on the two datasets shown in Table 20.1, and the Log2 transformation is applied to their raw data. For the breast cancer data, normalization changes the result significantly: Fig. 20.2a, b shows box plots of the raw data and the Log2-transformed data, respectively, and the distribution of the gene expression values for the different studied samples is visualized very well in Fig. 20.2b, where the Log2 transformation has been applied to the expression values. In the case of oral cancer, the changes between the two box plots (raw data versus Log2-transformed data) were not very significant, as the oral cancer raw data were already in a normalized form. Most importantly, the outliers are detected very well, as seen in Fig. 20.2b.

20.5.2 The Differentially Expressed Gene Selection Using Dimension Reduction

The curse of dimensionality for both datasets has been handled by adopting principal component analysis (PCA). The number of principal components (PCs) for the breast cancer and oral cancer datasets is obtained according to the number of samples considered for each dataset. Breast cancer has four different sample groups, namely basal,


Fig. 20.2 Breast cancer raw dataset and dataset after Log2 transformation a breast cancer samples before Log2 transform, b breast cancer samples after Log2 transform

claudin-low, luminal, and normal-like, whereas oral cancer has two sample groups, tumor and normal. It is observed that the best dimension reduction results were found when the threshold values for the breast cancer and oral cancer data were taken as 5 and 2, respectively. The PC values for each sample in the dataset are considered, and the top three PC values (PC1, PC2, and PC3) of the different samples are retained for the 15 most significant genes for future analysis.


It is observed that the four classes in the breast cancer samples are very clearly differentiated from each other when plotting PC1 against PC2 (Fig. 20.3) and PC1 against PC3 (Fig. 20.4). Similarly, the two classes in oral cancer are efficiently distinguished from each other by plotting PC1 against PC2 and PC1 against PC3, as seen in Figs. 20.5 and 20.6, respectively. The resulting principal component (PC) values for all the samples in the used datasets are summarized in Table 20.2. The variance ratio helps to prevent overfitting while dealing with a huge dataset like NGS data and should be limited to 0.8 by adding each component's variance ratio [14, 15].

Fig. 20.3 Scatter plot for PC1 and PC2 of different samples in breast cancer

Fig. 20.4 Scatter plot for PC1 and PC3 of different samples in breast cancer


Fig. 20.5 Scatter plot for PC1 and PC2 of different samples in oral cancer

Fig. 20.6 Scatter plot for PC1 and PC3 of the different samples in oral cancer

Table 20.2 Principal component values and variance ratios of breast cancer and oral cancer (all samples, 15 genes)

S. No. | Dataset | PC1 | PC2 | PC3 | Variance ratio PC1 | Variance ratio PC2 | Variance ratio PC3
1 | Breast cancer | 0.340 | 0.300 | 0.341 | 0.34140752 | 0.15966272 | 0.12024952
2 | Oral cancer | 0.500 | 0.130 | 0.065 | 0.5035812 | 0.13498335 | 0.06459973


each component’s variance ratio [14, 15]. It is seen that the variance ratio obtained for breast cancer and oral cancer for 15 significant genes is less than 0.8 which makes the selected genes to be the best fit for any prediction model.

20.6 Conclusion and Future Work

The biggest challenge in dealing with biomedical (sequencing/expression) data is processing the raw format and making it analysis-ready. The huge number of genes and the few samples/features pose a real threat to researchers handling the processing stage of biomedical data, especially NGS. Several methods are available for the analysis of NGS data, whereas data processing methods are very few, and even some of the existing methods/tools do not include complete data processing packages. The pipeline used in this paper is self-sufficient for normalizing complex NGS datasets. The result obtained for breast cancer data normalization is significant, which shows that this model may achieve extremely good results when processing raw NGS data. The obtained results are very significant and suitable for any subsequent data analysis methods/models, and the accuracy of the analysis results can improve because the normalization result obtained from the pipeline is significant. The implementation of further methods, especially quantile normalization, classification of the resulting significant genes, and annotation of differentially expressed genes, is included in our future work.

References

1. Capina, R., Li, K., Kearney, L., Vandamme, A.M., Harrigan, P.R., Van Laethem, K.: Quality control of next-generation sequencing-based HIV-1 drug resistance data in clinical laboratory information systems framework. Viruses 12, 1–16 (2020). https://doi.org/10.3390/v12060645
2. Lee, E.R., Parkin, N., Jennings, C., Brumme, C.J., Enns, E., Casadellà, M., Howison, M., Coetzer, M., Avila-Rios, S., Capina, R., Marinier, E., Van Domselaar, G., Noguera-Julian, M., Kirkby, D., Knaggs, J., Harrigan, R., Quiñones-Mateu, M., Paredes, R., Kantor, R., Sandstrom, P., Ji, H.: Performance comparison of next generation sequencing analysis pipelines for HIV-1 drug resistance testing. Sci. Rep. 10, 1–10 (2020). https://doi.org/10.1038/s41598-020-58544-z
3. Ji, H., Enns, E., Brumme, C.J., Parkin, N., Howison, M., Lee, E.R., Capina, R., Marinier, E., Avila-Rios, S., Sandstrom, P., Van Domselaar, G., Harrigan, R., Paredes, R., Kantor, R., Noguera-Julian, M.: Bioinformatic data processing pipelines in support of next-generation sequencing-based HIV drug resistance testing: the Winnipeg Consensus. J. Int. AIDS Soc. 21, 1–14 (2018). https://doi.org/10.1002/jia2.25193
4. Hwang, B., Lee, J.H., Bang, D.: Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 50, 1–14 (2018). https://doi.org/10.1038/s12276-018-0071-8
5. Pandey, R.V., Pabinger, S., Kriegner, A., Weinhäusel, A.: ClinQC: a tool for quality control and cleaning of Sanger and NGS data in clinical research. BMC Bioinform. 17 (2016). https://doi.org/10.1186/s12859-016-0915-y
6. Islam, S., Zeisel, A., Joost, S., La Manno, G., Zajac, P., Kasper, M., Lönnerberg, P., Linnarsson, S.: Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014). https://doi.org/10.1038/nmeth.2772
7. Hashimshony, T., Wagner, F., Sher, N., Yanai, I.: CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification. Cell Rep. 2, 666–673 (2012). https://doi.org/10.1016/j.celrep.2012.08.003
8. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Last accessed on 04 July 2021
9. Li, H., Durbin, R.: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010). https://doi.org/10.1093/bioinformatics/btp698
10. Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., Gingeras, T.R.: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013). https://doi.org/10.1093/bioinformatics/bts635
11. https://www.python.org. Last accessed on 04 July 2021
12. Mohapatra, S., Swarnkar, T., Das, J.: Deep convolutional neural network in medical image processing. In: Handbook of Deep Learning in Biomedical Engineering, pp. 25–60. Academic Press (2021). https://doi.org/10.1016/B978-0-12-823014-5.00006-5
13. Nayak, D.S.K., Mahapatra, S., Swarnkar, T.: Gene selection and enrichment for microarray data: a comparative network based approach. Prog. Adv. Comput. Intell. Eng. 417–427 (2018). https://doi.org/10.1007/978-981-10-6875-1_41
14. https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c. Last accessed on 04 July 2021
15. Tripathy, J., Dash, R., Pattanayak, B.K., Mohanty, B.: Automated phrase mining using POST: the best approach. In: 2021 1st Odisha International Conference on Electrical Power Engineering, Communication and Computing Technology (ODICON) (2021). https://doi.org/10.1109/ODICON50556.2021.9429014

Chapter 21

Fittest Secret Key Selection Using Genetic Algorithm in Modern Cryptosystem

Chukhu Chunka, Avinash Maurya, and Parashjyoti Borah

Abstract In today’s world, sharing a piece of secret information between the intended users has become one of the most challenging issues in computer networks due to the wireless communication medium and the possibilities of information leakage increase rapidly. Therefore, it is necessary to maintain a strong session key for secure communication. The objective of this research work is to generate a strong session key using a Genetic Algorithm (GA) which is non-repeatable, unpredictable, and random. However, in this paper, the key is generated using a GA and the dynamic keys are generated using the standard mechanism Automatic Variable Key (AVK). To proof, the effectiveness of the proposed scheme, a comparison is done with existing interrelated schemes and to verify the pseudo-randomness among the auto-generated keys, National Institute of Standards Technology (NIST) statistical test has been carried out.

21.1 Introduction

With the continuous acceleration of internet technologies, e-commerce, e-business, social networks, and online transactions are integral parts of everyday life; in addition, internet technology is expanding quickly to fulfill the requirements of the public. The information exchanged over public networks must be reliable and correctly acknowledged by the recipient. Regrettably, an attacker can modify, masquerade, and eavesdrop on the information exchanged over the insecure channel. In maintaining the secrecy of the information, the session secret keys play a very


significant role. In this regard, a Genetic Algorithm can be one of the best alternatives for establishing a secret key, and for different blocks of data, different secret keys are produced by adopting AVK. The idea of the GA is to obtain an optimized solution from all possible solutions; therefore, from all possible key solutions, one random key is selected for encryption. The core objective of this scheme is to select the initial symmetric key by applying a GA fitness function on a pre-shared population. Thereafter, the rest of the keys are generated by the key generator, which produces a fresh key for each new block of information transmitted, based on the AVK mechanism. In this paper, the dynamic key is generated from the combination of the previous key, the transmitted information block, and the threshold value. Further, the key and plaintext are made the same size, which ensures that an attacker will not be able to decrypt the original plaintext. The rest of the paper is arranged as follows: Sect. 21.2 presents the preliminaries, Sect. 21.3 the proposed work, Sect. 21.4 examples of the initial key selection procedure, Sect. 21.5 the experimental results, and Sect. 21.6 the performance analysis; the randomness of the auto-generated keys is verified using the NIST statistical test suite in Sect. 21.7, and Sect. 21.8 gives the conclusion followed by future work.

21.1.1 Background

With the advancement of internet technology, many people use the internet for communication. To achieve confidentiality, integrity, authenticity, and legitimacy of information over an insecure network, many researchers have developed schemes [1–3]. According to Shannon's fundamental research on perfect security [4, 5], the confidentiality of a message can be maintained if and only if the size of the key is equal to or larger than the length of the message. Later, Bhunia et al. [6] proposed a scheme to realize perfect security. In continuation of this, many researchers, such as Singh et al. [7], Chakrabarty et al. [8], Goswami et al. [9, 10], and Banerjee et al. [11], developed new techniques to generate secure keys and claimed that their schemes have higher randomness than other related standard AVK schemes. However, in their proposed schemes they fixed the key size to 8 or 16 bits; as a result, if the key code is compromised, a brute force attack becomes possible. To improve security, Dutta et al. [12] worked on AVK to achieve more security by employing different variable key sizes for different sessions. Later, Chukhu et al. [13] proposed a scheme where the size of the auto-generated key is 128 bits, and many researchers have worked on AVK to achieve high secrecy over insecure public networks [4–12]. Henceforth, the best alternative method to achieve a strong key is to incorporate a GA with the AVK mechanism. In Aarti et al. [14], the initial key is selected from the GA population by applying central rules, and the key size is maintained at 128 bits. To enhance security further, Sonia [15] developed a new approach to generate the best-fit key in cryptography using a GA; to validate the proposed scheme, they used the DES cipher. Later, Ankit et al. [16] focused on a key size of 228 bits, where the best key is selected using a user-defined fitness value.

Table 21.1 Key sizes used in different schemes

Schemes | Size of the key
Goswami et al. [9] | 8 bits
Banerjee et al. [11] | 8 bits
Dutta et al. [12] | 16 to 128 bits
Aarti et al. [14] | 128, 192, 256 bits
Sonia et al. [15] | 128 bits
Kalaiselvi et al. [17] | 128 bits

Recently, Kalaiselvi et al. [17] proposed a scheme to generate the fittest key using a GA to enhance the performance of AES-128; the inputs considered for encryption are text and audio. To generate a strong key for encryption, many researchers have worked on GAs [14–26]. Table 21.1 shows the key sizes used in the existing schemes.

21.1.2 Motivation

In a wireless communication medium, if a session lasts for a long period of time, leakage of the secret key may occur. For that reason, the secret key must be made random in every session. Cryptography with a GA helps to generate random keys for different sessions and thus avoids such a situation.

21.1.3 Contribution

The contributions of the proposed scheme are mentioned below:

i. The initial key is generated at both sides using a pre-defined GA fitness function.
ii. Once the secret key is generated, the rest of the keys (the dynamic keys) are generated using the automatic variable key.
iii. The NIST statistical test is done to check the randomness of the keys.

21.2 Preliminaries

The proposed scheme is based on the automatic variable key [2] and the genetic algorithm from Artificial Intelligence, which are described in Sects. 21.2.1 and 21.2.2, respectively.


Table 21.2 Illustration of AVK [2] under XOR encryption

Session slot | Sender (S) sends | Receiver (R) receives | Receiver sends | Sender receives | Remark
1 | Secret key = 2 | 2 | Secret key = 6 | 6 | For the new session, Sender will use 6 as key and Receiver 2 as key for transmitting data
2 | Firstly, S sends first information as 3 ⊕ 6 | R gets back the original information as (3 ⊕ 6 ⊕ 6) = 3 | R sends first information as 7 ⊕ 2 | S gets back the original information as (7 ⊕ 2 ⊕ 2) = 7 | S will produce a new key (6 ⊕ 7) for the next session; R will produce a new key (2 ⊕ 3)
3 | S sends next information as 4 ⊕ 6 ⊕ 7 | R gets back the original information as (4 ⊕ 6 ⊕ 7 ⊕ 6 ⊕ 7) = 4 | R sends next information as 8 ⊕ 2 ⊕ 3 | S recovers the information as (8 ⊕ 2 ⊕ 3 ⊕ 2 ⊕ 3) = 8 | In this manner, S and R exchange the information 34 and 78

21.2.1 Illustration of AVK

In the standard AVK, fresh keys are continuously constructed from the past key and the previously used chunk of information or data [6]. A complete illustration of AVK is given in Table 21.2.
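A one-directional toy sketch of this key evolution follows (a simplification of the two-party exchange in Table 21.2): each new key is the old key XORed with the previously transmitted block. The function name and the single-direction view are ours.

```python
def avk_exchange(initial_key: int, blocks: list) -> list:
    """Replay the XOR-based AVK idea: encrypt each block with the current
    key, then derive the next key from the old key and the sent block."""
    key, recovered = initial_key, []
    for block in blocks:
        cipher = block ^ key             # sender encrypts with the current key
        recovered.append(cipher ^ key)   # receiver decrypts with the same key
        key ^= block                     # both sides derive the next key
    return recovered

print(avk_exchange(6, [3, 4]))  # -> [3, 4], mirroring the sender's column of Table 21.2
```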

21.2.2 Genetic Algorithm

The GA is a strategy for solving both constrained and unconstrained optimization problems based on natural selection, a concept from Artificial Intelligence (AI) [27]. In a GA, the population changes repeatedly at each stage, and individual chromosomes are selected for the next generation using a fitness function. Figure 21.1 illustrates the genetic algorithm process. Over successive generations, the population evolves toward the best solution. To form the succeeding generation from the existing population, the following standard operators are performed.


Fig. 21.1 Basic genetic algorithm process

• Selection: individual parents are selected from the current population for the next generation.
• Crossover: two parents combine to form children for the next generation, for example:
  Parental chromosome 1: CIKSEFGH → Offspring chromosome 1: CIKSMNOP
  Parental chromosome 2: IOFTMNOP → Offspring chromosome 2: IOFTEFGH
• Mutation: applied to maintain genetic diversity among generations [27]; small variations in the parents produce new offspring, for example:
  Parental chromosome 1: CIKSMNOP → Offspring chromosome 1: CRKSMVOP
  Parental chromosome 2: IOFTEFGH → Offspring chromosome 2: SOFTEFAH
A small code sketch of these two operators follows below.
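The sketch below illustrates one-point crossover and one-point mutation on string chromosomes, mirroring the examples above. It is an illustrative sketch, not the authors' code; the fixed middle cut point and the random replacement letter are our assumptions.

```java
import java.util.Random;

// Illustrative one-point crossover and mutation on string chromosomes.
public class GeneticOperators {
    private static final Random RAND = new Random();

    // Swap the tails of two parents at a fixed cut point (here the middle).
    static String[] onePointCrossover(String p1, String p2) {
        int cut = p1.length() / 2;
        String c1 = p1.substring(0, cut) + p2.substring(cut);
        String c2 = p2.substring(0, cut) + p1.substring(cut);
        return new String[] { c1, c2 };
    }

    // Replace one randomly chosen gene with a random letter.
    static String onePointMutation(String chromosome) {
        char[] genes = chromosome.toCharArray();
        int pos = RAND.nextInt(genes.length);
        genes[pos] = (char) ('A' + RAND.nextInt(26));
        return new String(genes);
    }

    public static void main(String[] args) {
        String[] offspring = onePointCrossover("CIKSEFGH", "IOFTMNOP");
        System.out.println(offspring[0] + " " + offspring[1]); // CIKSMNOP IOFTEFGH
        System.out.println(onePointMutation("CIKSMNOP"));
    }
}
```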

21.3 Proposed Work The scheme is named Fittest Secret Key (FSK). The fittest key is the initial session key, selected using the GA fitness function. To generate the initial session key for encryption, the initial key is selected, through a user-defined GA fitness function, from a predefined population known to both communicating parties; a dynamic key is then generated using the AVK mechanism, i.e., a different key is generated for each session. In this paper, experiments illustrate the randomness of the proposed scheme; the results are based on a key length of 128 bits, a population size of 50, and a single iteration of the GA cycle to select the initial session key. A step-wise explanation of FSK is given below, and a flow chart is illustrated in Fig. 21.2.
Step 1 The GA population is shared between the two entities before communication starts.
Step 2 In the mating pool, perform the AND operation on individual chromosomes.


Fig. 21.2 Flow chart for working process of FSK

Step 3 Perform one-point crossover and one-point mutation on individual chromosomes.
Step 4 Apply the fitness function to individual chromosomes to choose the initial secret key. The secret (initial) key is the chromosome with the highest value of R_i; the remaining chromosomes with values less than R_i are kept for the XOR operation used while generating dynamic keys.

FF_i = \frac{\text{total no. of 1s in } K_i}{n}

R_i \leftarrow \frac{FF_i}{\sum_{j=1}^{m} FF_j}

Step 5 The dynamic key is generated as: initial (previous) key ⊕ data block ⊕ threshold-value key.

Algorithm to generate the initial key using GA
1. Initial_key_selection(ip, n, m)
2. for (i = 0; i < n; i++)
3. for (j = 0; j < m; j++)
4. In the mating pool, perform the AND operation: K_i * K_{i+1}, K_{i+1} * K_{i-1};
5. Perform one-point crossover at the middle position n/2 of keys K_i and K_{i+1};
6. Perform one-point mutation at the middle position n/2 of keys K_i and K_{i+1};
7. Circularly right-shift each 8-bit key K_i, K_{i+1}, ..., K_{i-1};
8. Apply the fitness function to each chromosome: FF_i = \frac{\text{total no. of 1s in } K_i}{n};
9. Apply R_i \leftarrow \frac{FF_i}{\sum_{j=1}^{m} FF_j}; if R_i is the highest, take that chromosome as the initial session key for encryption.
10. After the selection of the R_i value, the dynamic keys are generated using the AVK approach.
11. Next key = Initial key ⊕ Dataset block ⊕ key under the threshold value.

where ip ← initial population, n ← key size, m ← population size, and R_i ← roulette wheel selection. Threshold: a chromosome key is retained if its R_i ≥ 0.2. The threshold value is computed after the selection of the initial session key; a chromosome key with value ≥ 0.2 is kept for further use, otherwise it is discarded. For every new session a different key is generated as: past key ⊕ past chunk of the dataset ⊕ key under the threshold. Dynamic keys are generated until the information and the threshold-value keys are exhausted.
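For illustration, a minimal Java sketch of this initial-key selection on 8-bit keys is given below (the paper uses 128-bit keys and a population of 50). Crossover, mutation, and the shift are simplified to fixed positions, and all names are ours; this is a sketch under those assumptions, not the authors' implementation.

```java
// A minimal sketch of the FSK initial-key selection on 8-bit keys.
public class FskInitialKey {
    static double ff(int key, int n) {                 // FF_i = ones(K_i) / n
        return Integer.bitCount(key & 0xFF) / (double) n;
    }

    static int circularShift8(int key, int s) {        // 8-bit circular shift
        return ((key << s) | (key >>> (8 - s))) & 0xFF;
    }

    public static void main(String[] args) {
        int[] pool = { 0b10101111, 0b00101101, 0b10101010, 0b10101011 };
        int n = 8;
        // Step: AND each key with its successor in the mating pool.
        int[] keys = new int[pool.length];
        for (int i = 0; i < pool.length; i++)
            keys[i] = pool[i] & pool[(i + 1) % pool.length];
        // Step: one-point crossover at the middle -> swap low nibbles pairwise.
        for (int i = 0; i + 1 < keys.length; i += 2) {
            int lo = keys[i] & 0x0F;
            keys[i] = (keys[i] & 0xF0) | (keys[i + 1] & 0x0F);
            keys[i + 1] = (keys[i + 1] & 0xF0) | lo;
        }
        // Step: one-point mutation at the middle bit, then a 3-bit circular shift.
        for (int i = 0; i < keys.length; i++)
            keys[i] = circularShift8(keys[i] ^ 0b00010000, 3);
        // Step: roulette ratio R_i = FF_i / sum(FF_j); keep the largest.
        double sum = 0;
        for (int k : keys) sum += ff(k, n);
        int initialKey = keys[0];
        for (int k : keys)
            if (ff(k, n) / sum > ff(initialKey, n) / sum) initialKey = k;
        System.out.println("initial key = " + Integer.toBinaryString(initialKey));
        // Dynamic key: previous key XOR data block XOR a threshold key.
        int next = initialKey ^ 0b10110110 ^ 0b11000001;
        System.out.println("next key    = " + Integer.toBinaryString(next));
    }
}
```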


21.4 Experiment Examples Let K_1, K_2, K_3, and K_4 be keys selected randomly from the mating pool for further operation: K_1 = 10101111, K_2 = 00101101, K_3 = 10101010, and K_4 = 10101011.
Step_1: AND operation: the AND operation is performed on K_1, K_2, K_3, and K_4. K_1 = 10101111 * K_2 = 00101101 gives K_1' = 00101101. Similarly, K_2 = 00101101 * K_3 = 10101010 gives K_2' = 00101000, and K_3 = 10101010 * K_4 = 10101011 gives K_3' = 10101010, and so on.
Step_2: Crossover: after the AND operation, one-point crossover is applied at the 4th bit position of K_1' = 00101101 and K_2' = 00101000, giving K_1'' = 00101000 and K_2'' = 00101101. Similarly, one-point crossover is performed on all keys.
Step_3: Mutation: one-point mutation is performed at the n/2 position of every chromosome key in the population, giving K_1''' = 00111000 and K_2''' = 00111101.
Step_4: Shift operation: a circular shift of 3 bits is performed on every chromosome key in the population, giving K_1'''' = 11000001 and K_2'''' = 01101001.
Step_5: Fitness function: apply FF_i = \frac{\text{total no. of 1s in } K_i}{n} to each chromosome key:
FF(K_1'''') = 3/8 = 0.375
FF(K_2'''') = 4/8 = 0.5
Step_6: Initial symmetric key: select the initial key for encryption using R_i \leftarrow \frac{FF_i}{\sum_{j=1}^{m} FF_j}, where \sum_j FF_j = 0.875:
R(K_1'''') = 0.375 / 0.875 = 0.43
R(K_2'''') = 0.5 / 0.875 = 0.57


Fig. 21.3 HD among the successive key pairs for Experiment_1

The key with the highest R_i value is selected as the initial key I_k for encryption, i.e., K_2'''' = 01101001.
Step_7: Key generation for subsequent sessions: Dynamic key = K_2'''' (past key) ⊕ dataset block ⊕ (key under the threshold value ≥ 0.2). A different key is generated for every new session in this way; for example, 01101001 ⊕ 10110110 ⊕ 11000001 gives the new session key K_3' = 00011110. Auto-generated keys are produced until the dataset and the threshold-value keys are exhausted.

21.5 Experiment Results The experiments reported in this section measure the Hamming distance (HD) between successive auto-generated keys. The generated keys depend on the initial key, the dataset pairs, and the threshold value, and the HD is measured on various sets. For the experiments we consider a secondary dataset, a single iteration, an initial population of 50 keys, and a key length of 128 bits. To measure the difference between successive auto-generated keys, we take the Hamming distance of two consecutive keys, i.e., the number of corresponding bits that differ. The resulting variation between successive keys is depicted in Figs. 21.3, 21.4 and 21.5.

21.6 Performance Analysis In this section, the average Hamming distance and standard deviation are computed to assess the superiority and efficacy of FSK against previous related work: AVK [6], Vertical Horizontal Automatic Variable Key (VHAVK) [13], and Computing and Shifting Automatic Variable Key (CSAVK) [28]. The Hamming distance of each key pair depends on the initial key and the dataset pairs.


Fig. 21.4 HD among the successive key pairs for Experiment_2

Fig. 21.5 HD among the successive key pairs for Experiment_3

AHD = \frac{\sum HD_{KP}}{N_{HDKP}}  (21.1)

SD = \sqrt{\frac{\sum (HD_{KP} - AHD)^2}{N_{HDKP}}}  (21.2)

where AHD = average Hamming distance, SD = standard deviation, HD_{KP} = Hamming distance of a key pair, and N_{HDKP} = total number of HD pairs.
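As an illustration of Eqs. (21.1) and (21.2), the following Java sketch computes the Hamming distances of consecutive keys and their AHD and SD. The sample keys and class name are ours, chosen only for demonstration.

```java
// A sketch of the AHD and SD computations over a sequence of generated keys.
public class KeyDistanceStats {
    // Hamming distance: positions at which two equal-length bit strings differ.
    static int hamming(String a, String b) {
        int d = 0;
        for (int i = 0; i < a.length(); i++)
            if (a.charAt(i) != b.charAt(i)) d++;
        return d;
    }

    public static void main(String[] args) {
        String[] keys = { "01101001", "00011110", "10110100" }; // sample keys
        int pairs = keys.length - 1;
        double[] hd = new double[pairs];
        double ahd = 0;
        for (int i = 0; i < pairs; i++) {            // HD of consecutive keys
            hd[i] = hamming(keys[i], keys[i + 1]);
            ahd += hd[i];
        }
        ahd /= pairs;                                 // Eq. (21.1)
        double var = 0;
        for (double d : hd) var += (d - ahd) * (d - ahd);
        double sd = Math.sqrt(var / pairs);           // Eq. (21.2)
        System.out.printf("AHD = %.2f, SD = %.2f%n", ahd, sd);
    }
}
```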


Fig. 21.6 AHD comparison of various schemes and FSK

Fig. 21.7 SD comparison of various schemes and FSK

The results of experiments 1, 2 and 3 are shown in Fig. 21.6 (AHD) and Fig. 21.7 (SD). Experiment 2 yields an AHD value of 65.9 and experiment 3 an AHD value of 65.14, both higher than the other related existing schemes. From Figs. 21.6 and 21.7 it can be concluded that our scheme has a higher AHD, and an SD at par with the other existing AVK schemes.

21.7 Randomness Verification In this section, the randomness of the proposed scheme is measured using the NIST statistical test suite [29]. According to the NIST criterion, a computed P-value ≥ 0.01 indicates that the sequence of auto-generated keys is random; otherwise it is not. Tables 21.3, 21.4 and 21.5 list the P-values for experiments 1, 2 and 3. The NIST suite contains many tests for checking the randomness of keys; here we consider the following five.


Table 21.3 Random test Experiment_1

Sl. No   Test   P-value    Remarks
1        FT     0.437274   Random
2        BFT    0.162606   Random
3        RT     0.012650   Random
4        CSF    0.090936   Random
5        CSB    0.012650   Random

Table 21.4 Random test Experiment_2

Sl. No   Test   P-value    Remarks
1        FT     0.002971   Random
2        BFT    0.162606   Random
3        RT     0.437274   Random
4        CSF    0.090936   Random
5        CSB    0.162606   Random

Table 21.5 Random test Experiment_3

Sl. No   Test   P-value    Remarks
1        FT     0.162606   Random
2        BFT    0.048716   Random
3        RT     0.834308   Random
4        CSF    0.090936   Random
5        CSB    0.834308   Random

1. Frequency Test (FT): checks whether the numbers of 0s and 1s in the complete bit sequence are approximately equal, i.e., each occurs with proportion about 1/2 (0.5).
2. Block Frequency Test (BFT): checks whether the frequency of 1s in each Fq-bit block is approximately Fq/2.
3. Runs Test (RT): determines whether the oscillation of the bit stream between 1s and 0s is too fast or too slow.
4. Cumulative Sums Forward and Backward (CSF/CSB): examines the maximal excursion (from 0) of the random walk defined by the cumulative sum of adjusted (−1, +1) digits in the sequence.
A small counting sketch of the raw statistics behind these tests follows below.
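The following Java sketch computes only the raw statistics behind the FT, BFT, and RT checks; the NIST suite [29] converts such statistics into P-values, and that step is omitted here. All names and the sample key are ours.

```java
// Illustrative raw statistics for the frequency, block-frequency, and runs tests.
public class RandomnessChecks {
    // FT statistic: proportion of 1s should be close to 1/2.
    static double onesProportion(String bits) {
        long ones = bits.chars().filter(c -> c == '1').count();
        return ones / (double) bits.length();
    }

    // BFT statistic: proportion of 1s per Fq-bit block should be near 1/2.
    static double[] blockProportions(String bits, int fq) {
        int blocks = bits.length() / fq;
        double[] p = new double[blocks];
        for (int b = 0; b < blocks; b++)
            p[b] = onesProportion(bits.substring(b * fq, (b + 1) * fq));
        return p;
    }

    // RT statistic: total number of maximal runs of identical bits.
    static int runs(String bits) {
        int r = 1;
        for (int i = 1; i < bits.length(); i++)
            if (bits.charAt(i) != bits.charAt(i - 1)) r++;
        return r;
    }

    public static void main(String[] args) {
        String key = "0110100111000101";
        System.out.println(onesProportion(key)); // expect ~0.5 for a random key
        System.out.println(java.util.Arrays.toString(blockProportions(key, 4)));
        System.out.println(runs(key));
    }
}
```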


Tables 21.3, 21.4 and 21.5 report the statistical randomness tests for experiments 1, 2 and 3 with a key size of 128 bits. It can be observed that all tests satisfy P-value ≥ 0.01. Hence, the key auto-generated for every new block of data achieves randomness in this scheme.

21.8 Conclusion The strength of the proposed protocol lies in the initial key and the auto-generated keys. The initial key does not require a synchronized key exchange between the sender and the receiver: it is generated from the pre-shared pool of candidate keys using the R_i selection procedure, and the dynamic keys are then generated with improved security, which can be used in various cryptographic applications. Across all experiments, our scheme has a higher AHD and an at-par SD compared with other related existing schemes. The randomness of the scheme has been tested using the NIST test suite, where every experiment yielded P-values ≥ 0.01. Considering all these parameters, our scheme offers more secure data transmission and can be practically combined with cryptosystem algorithms such as RSA, DES, and AES to approach perfect security.

21.9 Future Work The proposed scheme can be extended by applying n iterations to select the highest-peak chromosome key from the initial population. Further, artificial neural networks can be combined with GA to generate a truly random key.

References
1. Li, C.T., Lee, C.C., Liu, C.J.: A robust remote user authentication scheme against smart card security breach. In: IFIP Annual Conference on Data and Applications Security and Privacy, Springer, Berlin, Heidelberg, pp. 231–238 (2011)
2. Biham, E.: A fast new DES implementation in software. In: Proceedings of International Symposium on Foundations of Software Engineering, pp. 260–273 (1997)
3. Eberle, H.: A high-speed DES implementation for network application. In: Proceedings of International Conference on Cryptology, pp. 521–539 (1992)
4. Shannon, C.E.: Mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (1948)
5. Shannon, C.E.: Communication theory of secrecy systems. Bell Syst. Tech. J. (1949)


6. Bhunia, C.T., Mondal, G., Samaddar, S.: Theory and application of time-variant key in RSA and that with selective encryption in AES. In: Proceeding of EAIT (Elsevier Publications, Calcutta CSI), pp. 219–221 (2008) 7. Singh, B.K., Banerjee, S., Dutta, M.P., Bhunia, C.T.: Generation of automatic variable key to make secure communication. In: Proceedings of the International Conference on Recent Cognizance Wireless Communication & Image Processing, pp. 317–323 (2016) 8. Chakarabarti, P., Bhuyan, B., Chowdhuri, A., Bhunia, C.T.: A novel approach towards realizing optimum data transfer and automatic variable key (AVK) in cryptography. Int. J. Comput. Sci. Netw. Secur. 8, 241–250 (2008) 9. Goswami, R.S., Chakraborty, S.K., Bhunia, A., Bhunia, C.T.: New approaches towards generation of automatic variable key to achieve perfect security. In: 10th International Conference Information Technology, IEEE Computer Society, pp 489–49 (2013) 10. Goswami, R.S., Banerjee, S., Dutta, M.P., Bhunia, C.T.: Absolute key variation technique of automatic variable key in cryptography. In: Proceedings of the 8th International Conference on Security of Information and Networks, ACM Publisher, pp. 65–67 (2016) 11. Banerjee, S., Dutta, M.P., Bhunia, C.T.: A novel approach to achieve the perfect security through AVK over insecure communication channel. J. Inst. Eng. India Ser. B 98(2), 155–159 (2017) 12. Dutta, M.P., Banerjee, S., Bhunia, C.T.: An approach to generate 2-dimensional AVK to enhance security of shared information. Int. J. Secur. Its Appl. 9(10), 147–154 (2015) 13. Chukhu, C., Goswami, R.S., Banerjee, S., Bhunia, C.T.: An approach to generate variable keys based on vertical horizontal mechanism. Int. J. Secur. Its Appl. 11(3), 61–70 (2017) 14. Aarti, S., Suyash, A.: Key generation using genetic algorithm for image encryption. IJCSMC 2(6), 376–383 (2013) 15. Sonia, J., Adeeba, J.: Generating the best fit key in cryptography using genetic algorithm. Int. J. Comput. Appl. 98(20), 0975–8887 (2014) 16. Kumar, A., Chatterjee, K..: An efficient stream cipher using Genetic Algorithm. In: Wireless Communications, Signal Processing and Networking (WiSPNET), IEEE, pp. 2322–2326 (2016) 17. Kalaiselvi, K., Kumar, A.: Generation of fittest key using genetic algorithm to enhance the performance of AES-128 bit algorithm. J. Adv. Res. Dyn. Control Syst. (Special Issue 02/2017), JARDCS. ISSN 1943-023X 18. Sokouti, M., Sokouti, B., Pashazadeh, S.: Genetic-based random key generator (GRKG): A new method for generating more-random keys for one-time pad cryptosystem. Neural Comput. Appl., Springer. 22(8), 1667–1675 (2013) 19. Jawaid, S., Saiyeda, A., Suroor, N.: Selection of fittest key using genetic algorithm and autocorrelation in cryptography. J. Comput. Sci. Appl. 3(2), 46–51 (2015) 20. Sagar, V., Kumar, K.: A symmetric key cryptography using genetic algorithm and error back propagation neural network. In: 2nd International Conference on Computing for Sustainable Global Development (INDIACom), IEEE, pp. 1386–1391 (2015) 21. Anil, K., Ghose, M.K.: Information security using genetic algorithm and chaos. JISJ: A Glob. Perspect. 18, 306–315 (2009) 22. Sindhuja, K., Devi, P.S.: A symmetric key encryption technique using genetic algorithm. Int. J. Comput. Sci. Inform Tech. 5(1), 414–416 (2014) 23. Aditi, B., Shailender, K..: Genetic algorithm with elitism for cryptanalysis of Vigenere cipher. In: International conference on issue and challenges in intelligent computing techniques, IEEE, pp. 373–377 (2014) 24. 
Mondal, S., Mollah, T.K., Samanta, A., Paul, S.: A survey on network security using genetic algorithm. Int. J. Innov. Res. Sci. Eng. Technol. 5(1), 319–8753 (2016) 25. Xexeo, J., Souza, W., Torres, R., Oliveira, G., Linden, R.: Identification of keys and cryptographic algorithms using genetic algorithm and graph theory. IEEE Lat. Am. Trans. 9(2), 178–183 (2011) 26. Som, S., Chatergee, N.S., Mandal, J.K.: Key-based bit-level genetic cryptographic technique (KBGCT). In: 7th International Conference on Information Assurance and Security (IAS), IEEE, pp. 240–245 (2011)


27. Elaine, R., Kevin, K., Shivashankar, B.N.: Artificial Intelligence, 3rd edn. McGraw Hill Publication (2008)
28. Goswami, R.S., Chakraborty, S.K., Bhunia, C.T.: New techniques for generating of automatic variable key in achieving perfect security. J. Inst. Eng. (India) Ser. B 95, 197–201 (2014)
29. Andrew, R., Soto, J., Nechvatal, J., Smid, M., Barker, E.: Statistical test suite for random and pseudorandom number generators for cryptographic applications. NIST Special Publication (2010)

Chapter 22

Variable Step Size Firefly Algorithm for Automatic Data Clustering

Mandakini Priyadarshani Behera, Archana Sarangi, Debahuti Mishra, and Srikanta Kumar Mohapatra

Abstract Nowadays, with the rapid growth of available data, proper data analysis techniques are required. Clustering is a popular unsupervised data analysis process that identifies sets of similar or dissimilar objects based on their characteristics, and data clustering algorithms can be applied to a wide range of areas. Several clustering methods have been proposed to resolve data clustering issues. The popular firefly algorithm (FA) is a nature-inspired optimization algorithm based on the natural light emission of fireflies. In this paper, a modified firefly algorithm is used to solve automatic data clustering problems. The suggested algorithm is compared with the popular particle swarm optimization (PSO) algorithm and the standard firefly algorithm. Experimental results are computed on six medical datasets from the UCI and Kaggle machine learning repositories. In addition, the Davies-Bouldin (DB) cluster validity index is applied to analyze the experimental results of the discussed algorithms. The performance study shows that the modified firefly algorithm, namely the variable step size firefly algorithm (VSSFA), performs better than PSO and FA.

M. P. Behera · A. Sarangi (B) · D. Mishra
Department of Computer Science & Engineering, ITER, Siksha 'O' Anusandhan Deemed to be University, Bhubaneswar, Odisha, India
S. K. Mohapatra
Tata Consultancy Services, Bhubaneswar, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_22

22.1 Introduction The core idea of data clustering is to divide datasets based on their similarity and dissimilarity properties. Many algorithms are available for the data clustering task, and each performs differently depending on its nature and the type of samples available [1]. In the last several years, data clustering has performed well in research areas such as machine learning, artificial intelligence, web mining, text mining, genetics, education, microbiology, sociology, economics, and medical science. To tackle the complexity of large real-world datasets, scientists are


developing algorithms regularly. However, conventional clustering methods cannot manage the complexity of huge real-world data. Recently, several popular nature-inspired swarm intelligence (SI) techniques have motivated researchers to handle data clustering problems. SI techniques consist of a group of simple agents communicating locally with each other and with their environment [2]. The firefly algorithm (FA) in particular has been used by researchers to successfully tackle clustering problems, owing to its simplicity, efficiency, and robustness. Among swarm intelligence algorithms, FA has attracted attention for its strength in global optimization, its ability to solve nonlinear and multimodal optimization problems, its parameter tuning, and its ability to subdivide a group of individuals into subgroups [3]. The main purpose of this paper is to propose an automatic data clustering algorithm based on a modified firefly algorithm. Most clustering algorithms require prior knowledge of the number of clusters, which is unavailable in many real-world problems, so determining the number of clusters automatically is a challenging task. In this work, a modified version of the popular firefly algorithm, the variable step size firefly algorithm, is used to perform automatic data clustering: a variable step size is integrated into the standard firefly algorithm to obtain more efficient optimization results. In addition, the modified firefly algorithm is employed to determine the optimal number of clusters for six different datasets based on the Davies-Bouldin (DB) cluster validity index [4]. The experimental analysis shows that the suggested VSSFA clustering algorithm outperforms the existing PSO and FA clustering algorithms. The remainder of this study is organized as follows. Section 22.2 reviews related work. Section 22.3 presents the suggested method. Section 22.4 analyzes the experimental results. Finally, Sect. 22.5 concludes the study with remarks on future scope.

22.2 Related Work Data clustering is an unsupervised data mining technique in which patterns are categorized according to the similarity or dissimilarity of their properties. Researchers have developed many clustering algorithms for cluster analysis. In the literature, several nature-inspired metaheuristic data clustering algorithms, such as particle swarm optimization, ant colony optimization, the bat algorithm, the artificial bee colony, and the firefly algorithm, have been used. In automatic data clustering using a hybrid firefly particle swarm optimization algorithm, FA is integrated with PSO to reduce the disadvantages of both individual algorithms and to resolve automatic data clustering issues [3]. Automatic clustering using genetic algorithms proposed an automated data clustering method based on a genetic algorithm in which K is unknown [5]. Another automatic data clustering algorithm, sustainable automatic data clustering using a hybrid PSO algorithm with mutation,


is proposed in [6]. In that study, an innovative hybridized clustering algorithm is suggested by integrating PSO with a mutation operator, and the proposed algorithm has been applied to different network data. An automatic data clustering algorithm based on differential evolution is presented in [7], where an automatic clustering method based on the differential evolution algorithm is developed to calculate the number of clusters automatically. The works discussed above are strong motivation for further research: they give good clustering results but also suffer from a few disadvantages and ambiguous results. We aim to obtain more efficient clustering results and proper parameter tuning by applying our proposed algorithm to the automatic data clustering problem.

22.3 Variable Step Size Firefly Algorithm The firefly algorithm (FA) is a nature-inspired metaheuristic swarm intelligence algorithm [8], inspired by the flashing behavior of fireflies. The flashing light of fireflies can be formulated so that it is correlated with the objective function to be optimized, which makes it possible to develop novel optimization algorithms. The standard firefly algorithm has three idealized hypotheses:
• All fireflies are unisex, so a brighter firefly can attract less bright fireflies regardless of their sex.
• The degree of attraction is proportional to the brightness; hence the light intensity decreases as the distance between any two fireflies increases. If no firefly is brighter than a given one, it moves randomly.
• The degree of brightness of each firefly is calculated using the objective function.
In the firefly algorithm there are two vital issues: the variation in light intensity and the formulation of the amount of attractiveness. In the simplest form, the attractiveness of a firefly depends on its brightness, which is associated with the objective function. The light intensity I of a firefly is inversely proportional to the square of the distance r between two fireflies [8]:

I = \frac{1}{r^2}  (22.1)

As the attractiveness of a firefly is proportional to the light intensity seen by nearby fireflies, the attractiveness β(r) of a firefly is determined by [8]:

β(r) = β_0 e^{-γ r^2}  (22.2)

where β_0 is the attractiveness at r = 0 and γ is the light absorption coefficient.


Each firefly has its distinct attraction β, which varies with the distance between two fireflies I and J located at X_I and X_J, respectively. The distance between fireflies I and J is evaluated as [8]:

r_{IJ} = \sqrt{(X_I - X_J)^2 + (Y_I - Y_J)^2}  (22.3)

The movement of firefly I attracted toward a brighter firefly J is defined as [8]:

X_I = X_I + β_0 e^{-γ r_{IJ}^2} (X_I - X_J) + α \left( Rand - \frac{1}{2} \right)  (22.4)

To enhance the efficiency of the standard firefly method, its convergence rate should be increased by reducing its limitations. More can be gained by properly balancing global exploration against local exploitation; for this purpose the step size α should be adjusted dynamically. In the standard firefly algorithm the step size α is constant, so the algorithm does not follow a proper searching mechanism: both the exploration and the convergence rate depend on α. For a proper balance between the exploration and exploitation abilities of the standard FA, a large step size is required initially, which should gradually decrease as the iterations increase. Hence a dynamic adjustment of the step size α is needed, defined as [9]:

α(T) = \frac{0.4}{1 + \exp(0.015 \times (T - MaxGeneration)/3)}  (22.5)

where T is the current iteration number and MaxGeneration is the maximum number of iterations. In this work, we propose the variable step size firefly algorithm, a modification of the standard firefly algorithm, to resolve automatic data clustering issues. The usefulness and performance of the suggested method are evaluated using the DB cluster validity index, which is used both to find the optimal number of clusters and to obtain a proper division of the clusters; the fitness of every firefly solution is calculated with the DB index. The best solution is determined from the smallest mean and standard deviation values of the individual algorithms on the six datasets.
Procedure for VSSFA
Step 1 Initialize every firefly with random cluster centers.
Step 2 Evaluate the fitness of the initialized population using the DB cluster validity index.
Step 3 Evaluate the light intensity as in Eq. (22.1).
Step 4 Determine the light absorption coefficient γ.
Step 5 Evaluate the variable step size α as in Eq. (22.5).
Step 6 Move each firefly toward a more attractive firefly according to Eq. (22.4) to update its position (cluster centers).
Step 7 Calculate the latest solution and update the light intensity.
Step 8 Update the locations of the fireflies based on rank and record the current best solution.
Step 9 If the termination condition is met, output the best solution; otherwise, return to Step 2.
A code sketch of this position update follows below.
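The following Java sketch shows the VSSFA position update for one pair of fireflies, combining Eqs. (22.2), (22.4) and the variable step size of Eq. (22.5). All parameter values are illustrative; note that the attraction term below moves x toward the brighter firefly y, the usual firefly-algorithm convention.

```java
import java.util.Random;

// A compact sketch of one VSSFA move; names and parameters are illustrative.
public class VssfaStep {
    static final Random RAND = new Random();

    // Eq. (22.5): step size decays smoothly with the iteration count T.
    static double alpha(int t, int maxGeneration) {
        return 0.4 / (1 + Math.exp(0.015 * (t - maxGeneration) / 3.0));
    }

    // Move firefly x toward the brighter firefly y (Eq. (22.4) style update).
    static double[] move(double[] x, double[] y, double beta0, double gamma,
                         int t, int maxGen) {
        double r2 = 0;
        for (int d = 0; d < x.length; d++) r2 += (x[d] - y[d]) * (x[d] - y[d]);
        double beta = beta0 * Math.exp(-gamma * r2);   // Eq. (22.2) attraction
        double a = alpha(t, maxGen);                   // variable step size
        double[] next = new double[x.length];
        for (int d = 0; d < x.length; d++)
            next[d] = x[d] + beta * (y[d] - x[d]) + a * (RAND.nextDouble() - 0.5);
        return next;
    }

    public static void main(String[] args) {
        double[] x = { 0.2, 0.8 }, y = { 0.6, 0.4 };   // two cluster centers
        System.out.println(java.util.Arrays.toString(move(x, y, 1.0, 1.0, 1, 200)));
    }
}
```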

22.4 Result Analysis The proposed algorithm is implemented in MATLAB, and experiments are conducted on six health-related datasets. The covid dataset is collected from the Kaggle machine learning repository, whereas the other five datasets are taken from the UCI machine learning repository [10]. The results obtained from the suggested method are compared with both the standard PSO and FA algorithms. The parameters of our proposed algorithm are listed in Table 22.1. For the automatic data clustering analysis of the three discussed algorithms, the best, worst, mean, and standard deviation values are reported with 200 maximum iterations and a population of 50, over 10 independent runs. The experiments are conducted by executing the three algorithms individually on the six datasets. The numerical results for PSO, FA, and VSSFA are presented in Tables 22.2, 22.3 and 22.4, respectively; in these tables, best, worst, mean, and standard deviation refer to the best clustering value, the worst value, the mean value, and the standard deviation.

Table 22.1 Parameter setting of VSSFA

Parameters    Value
nPop          50
Maxit         200
Alpha         0.2
Alpha_damp    0.98

Table 22.2 Numerical results of PSO based on DB index on 10 independent runs

Dataset used    Best     Worst    Mean     StaDev
Breast cancer   0.507    0.5711   0.1169   0.1226
Eye             0.0085   1.3777   0.5696   0.5745
HCV             0.1847   1.1201   0.5668   0.3873
Covid           0.3923   0.6446   0.4899   0.0680
Heart failure   0.3772   0.4792   0.4014   0.0324
Diabetes        0.4652   0.5672   0.4972   0.0350


Table 22.3 Numerical results of FA based on DB index on 10 independent runs

Dataset used    Best     Worst    Mean     StaDev
Breast cancer   0.0507   0.0575   0.0520   0.0019
Eye             0.0086   0.9245   0.4022   0.3977
HCV             0.1870   0.8178   0.4583   0.2747
Covid           0.3924   0.5104   0.4408   0.0534
Heart failure   0.3772   0.3852   0.3797   0.2702
Diabetes        0.4653   0.5203   0.4860   0.0211

Table 22.4 Numerical results of VSSFA based on DB index on 10 independent runs

Dataset used    Best     Worst    Mean     StaDev
Breast cancer   0.0507   0.0575   0.0520   0.0019
Eye             0.0105   0.9061   0.3947   0.3899
HCV             0.1848   0.7881   0.4387   0.2592
Covid           0.3923   0.5222   0.4293   0.0467
Heart failure   0.3772   0.3853   0.3795   0.0026
Diabetes        0.4652   0.5131   0.4836   0.0191

Table 22.5 presents the comparison of VSSFA with PSO and FA: both the mean and the standard deviation values of the three algorithms on the six individual datasets based on the DB index. The mean and standard deviation values for VSSFA are given in boldface. A deeper analysis of the obtained mean and standard deviation values clearly shows that the proposed VSSFA offers superior automatic data clustering results compared with PSO and FA. Figures 22.1, 22.2, 22.3, 22.4, 22.5 and 22.6 present the convergence graphs of PSO, FA and VSSFA for all datasets based on the DB index; the dotted orange line represents the VSSFA algorithm. It is clearly observed that the proposed VSSFA converges faster than the other two algorithms, while FA performs better than PSO.


Table 22.5 Comparison results of VSSFA with PSO and FA based on DB index on 10 independent runs

Dataset used    Algorithm   Mean     Standard deviation
Breast cancer   PSO         0.1169   0.1226
                FA          0.0520   0.0019
                VSSFA       0.0520   0.0019
Eye             PSO         0.5696   0.5745
                FA          0.4022   0.3977
                VSSFA       0.3947   0.3899
HCV             PSO         0.5668   0.3873
                FA          0.4583   0.2747
                VSSFA       0.4387   0.2592
Covid           PSO         0.4899   0.0680
                FA          0.4408   0.0534
                VSSFA       0.4293   0.0467
Heart failure   PSO         0.4014   0.0324
                FA          0.3797   0.2702
                VSSFA       0.3795   0.0026
Diabetes        PSO         0.4972   0.0350
                FA          0.4860   0.0211
                VSSFA       0.4836   0.0191

Bold values (here, the VSSFA rows) indicate superiority over the other two implemented algorithms.

Fig. 22.1 Breast cancer dataset


Fig. 22.2 Eye dataset

Fig. 22.3 HCV dataset

22.5 Conclusion In this study, a modified firefly algorithm, VSSFA, has been suggested for solving automatic data clustering problems. The main idea is to determine the optimal number of clusters automatically, in real time, for large and complex datasets. The analysis of all the discussed algorithms has been conducted using the DB cluster validity index on six health-related datasets. In terms of efficiency, optimal clustering results, comparatively better convergence rate,


Fig. 22.4 Covid dataset

Fig. 22.5 Heart failure dataset

and the ability to handle optimization problems, VSSFA outperforms the other two algorithms discussed above. The experimental analysis shows that the proposed VSSFA algorithm performs better than PSO and FA on the automatic data clustering problem. Hence, the suggested algorithm can be applied to many optimization problems in the future.


Fig. 22.6 Diabetes dataset

References
1. Figueiredo, E., Macedo, M., Siqueira, H.V., Santana, C.J., Gokhale, A., Bastos-Filho, C.J.A.: Swarm intelligence for clustering—A systematic review with new perspectives on data mining. Eng. Appl. Artif. Intell. 82, 313–329 (2019). ISSN 0952-1976. https://doi.org/10.1016/j.engappai.2019.04.007
2. Abraham, A., Das, S., Roy, S.: Swarm intelligence algorithms for data clustering. In: Maimon O., Rokach L. (eds.) Soft Computing for Knowledge Discovery and Data Mining. Springer, Boston, MA (2008). https://doi.org/10.1007/978-0-387-69935-6_12
3. Agbaje, M.B., Ezugwu, A.E., Els, R.: Automatic data clustering using hybrid firefly particle swarm optimization algorithm. IEEE Access 7, 184963–184984 (2019). https://doi.org/10.1109/ACCESS.2019.2960925
4. Davies, D.L., Bouldin, D.W.: A cluster separation measure. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227 (April 1979). https://doi.org/10.1109/TPAMI.1979.4766909
5. Liu, Y., Wu, X., Shen, Y.: Automatic clustering using genetic algorithms. Appl. Math. Comput. 218(4), 1267–1279 (2011). ISSN 0096-3003. https://doi.org/10.1016/j.amc.2011.06.007
6. Sharma, M., Chhabra, J.K.: Sustainable automatic data clustering using hybrid PSO algorithm with mutation. Sustain. Comput. Inf. Syst. 23, 144–157 (2019). ISSN 2210-5379. https://doi.org/10.1016/j.suscom.2019.07.009
7. Tsai, C., Tai, C., Chiang, M.: An automatic data clustering algorithm based on differential evolution. In: 2013 IEEE International Conference on Systems, Man, and Cybernetics, pp. 794–799 (2013). https://doi.org/10.1109/SMC.2013.140
8. Yang, X.S.: Firefly algorithms for multimodal optimization. In: Watanabe O., Zeugmann T. (eds.) Stochastic Algorithms: Foundations and Applications. SAGA 2009. Lecture Notes in Computer Science, vol 5792. Springer, Berlin, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04944-6_14


9. Sarangi, A., Priyadarshini, S., Sarangi, S.K.: A MLP equalizer trained by variable step size firefly algorithm for channel equalization. In: 2016 IEEE 1st International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES), pp. 1–5 (2016). https://doi.org/10.1109/ICPEICES.2016.7853679
10. Bache, K., Lichman, M.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA, USA. [Online] (2013). Available: http://archive.ics.uci.edu/ml/

Chapter 23

GWO Based Test Sequence Generation and Prioritization

Gayatri Nayak, Mitrabinda Ray, Swadhin Kumar Barisal, and Bichitrananda Patra

Abstract Optimizing the regression testing process has proven critical to the development of high-quality software. Finding non-redundant and optimized test sequences remains a challenge in regression testing, and a variety of optimization strategies can be used to meet this challenge. Grey Wolf Optimization (GWO) has grown in prominence as an optimization technique for improving scientific and engineering solutions. However, the original GWO requires a better, updated objective function. Our proposed work defines an improved objective function and uses the GWO technique to generate optimal test paths. In this work, the average influence factor of each test node along a path is calculated, and the test sequences are then prioritized based on their rank. Experiments on various moderately sized Java programs validate our simulation.

23.1 Introduction It is critical to test software thoroughly in order to assure high software quality [1, 13]. Testing has been found to account for up to 60% of the overall time spent developing software [2]. White-box testing exercises the internal code of the program; in this approach it is difficult to derive the control flow paths of the given software, yet checking the control flow paths obtained from that software is essential. As a result, it is clear that the efficacy of testing must be

23.1 Introduction It is critical to thoroughly test software in order to assure higher software quality [1, 13]. Testing has been found to account for up to 60% of the overall time spent developing software [2]. The internal code of the program is tested using white box testing approach. In this testing approach, it is difficult to find the control flow path from the given software. It’s critical to check the control flow path that can be obtained from the same software. As a result, it is clear that the efficacy of testing must be G. Nayak (B) · M. Ray · S. K. Barisal · B. Patra Department of Computer Science and Engineering, Siksha ‘O’ Anusandhan Deemed to be University, Bhubaneswar, Odisha, India e-mail: [email protected] M. Ray e-mail: [email protected] S. K. Barisal e-mail: [email protected] B. Patra e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_23


increased. To improve the efficiency of testing procedures, coverage-based criteria are often used [2, 12, 14]. To achieve extensive coverage, test cases must be generated so that all CFG nodes are covered. The symmetric matrix algorithm [4] and heuristics-based approaches [5, 6] are two examples of graph-based testing techniques; their performance, however, has been found unsatisfactory [7]. Branch coverage [2] and path coverage [3] alone are less effective and often yield unsatisfactory results. More effective optimization solutions for test sequence generation are therefore necessary, with which the time and money spent on testing can be reduced dramatically. A test sequence is made up of test nodes arranged in a specific order, determined by the program's execution flow. In graph-based testing, the CFG is usually used as an intermediate representation of the input program. Test sequences help to simulate the testing process by establishing test inputs; during this simulation, effective tests are selected and prioritized depending on their effectiveness value. Four characteristics contribute to the popularity of meta-heuristic algorithms: simplicity, adaptability, a derivation-free mechanism, and avoidance of local optima. In contrast to gradient-based optimization procedures, most meta-heuristic techniques optimize problems stochastically through a derivation-free process; as a result, randomized meta-heuristics improve on traditional optimization algorithms in avoiding local optima. Two recently developed meta-heuristics, Biogeography Based Optimization (BBO) and Grey Wolf Optimization (GWO), have shown promising results in remote sensing, travelling tournaments, and related problems [10, 11]. BBO is most suitable for multidimensional discontinuous functions, but we found that GWO fits our proposed work better. The suggested technique uses the GWO algorithm, guided by condition coverage requirements, to generate an optimal path; GWO is based on wolves attacking prey and roaming the solution space.

23.1.1 Motivation Thorough software testing can minimize the total cost and time of the Software Development Life Cycle (SDLC) by reducing the maintenance charges incurred throughout it. Designing test sequences for industrial software is difficult because of its size, and proper testing is essential to detect software flaws or defects in a single effort. Test sequences are created and executed to expose these flaws [8]. Generating and executing all test sequences requires considerable time and money; as a result, the number of test sequences must be reduced before prioritization. In this paper, we first explain the fundamentals of the Grey Wolf Optimization algorithm. Section 23.3 outlines the proposed strategy and the modules employed. The detailed implementation and results are given in Sect. 23.4.


Section 23.5 compares the suggested approach with related work. Section 23.6 concludes the paper and details possible future research goals.

23.2 Basic Concept This section reviews the basic principles necessary to understand our proposed strategy.

23.2.1 GWO Algorithm A literature review shows that GWO [9, 10] is now widely used in science and engineering, and the GWO technique outperforms other well-known meta-heuristic algorithms [10, 11]. It is an effective heuristic for generating new solutions in the search space, and it naturally and efficiently handles highly non-linear, multi-modal optimization problems; its convergence toward the globally optimal response is very fast. GWO is a recent meta-heuristic inspired by the distinctive leadership behavior and hunting mechanism of grey wolves. Like other population-based intelligence optimization algorithms, it starts by fixing the population size. As a population-based meta-heuristic, it can to some extent avoid stagnation in local optima [11], and it has a strong ability to converge to the optimal solution. GWO generally favors exploitation, so effective global search is not always guaranteed and GWO may occasionally fail to find a globally optimal solution; its fundamental search mechanism is primarily a random walk, so it cannot always solve every problem successfully. Within the pack, wolves have a strict social hierarchy: based on their fitness values [8], they are divided into four types, labeled alpha (α), beta (β), delta (δ), and omega (ω). The first wolf is the α, the second the β, and the third segment of wolves the δ. The pack tracks down the prey and attacks by renewing positions relative to the wolves' territories; the best three wolves are the α, β, and δ, and the remaining wolves, the ω, follow them and update their positions accordingly. The alpha wolf, the pack's leader, heads the line of command; the β assists the alpha in decision making, obtaining food, caring for the young wolves, and so on. Omega wolves are referred to as lone wolves since they play a small role in the pack.
GWO Algorithm
REQUIRE: Grey wolf population


ENSURE: Optimal solution
1. Initialize a, A and C
2. Compute the fitness value of each wolf and initialize \vec{X}_α, \vec{X}_β, and \vec{X}_δ
3. Compute the distance vectors of the wolves \vec{D}_α, \vec{D}_β, and \vec{D}_δ
4. for (i < maxGen) do
5. For each generation, update each wolf's position as per Eq. (23.1)
6. Update the values of A and C using Eqs. (23.2) and (23.3), respectively
7. Update the fitness value of each wolf
8. Update \vec{X}_α, \vec{X}_β, and \vec{X}_δ
9. i = i + 1
10. end for
11. Display result

\vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3}  (23.1)

\vec{A} = 2 \vec{a} \cdot \vec{r}_1 - \vec{a}  (23.2)

\vec{C} = 2 \cdot \vec{r}_2  (23.3)
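For illustration, the Java sketch below performs one GWO position update for a single wolf in one dimension, guided by the alpha, beta, and delta wolves, using Eqs. (23.1)–(23.3). The sample values are ours, and the linear decay of a from 2 to 0 is the usual GWO convention, assumed here.

```java
import java.util.Random;

// A sketch of one GWO position update (Eqs. 23.1-23.3); values illustrative.
public class GwoUpdate {
    static final Random RAND = new Random();

    // X_k = X_leader - A * |C * X_leader - X|  (encircling behavior)
    static double guidedBy(double leader, double x, double a) {
        double A = 2 * a * RAND.nextDouble() - a;   // Eq. (23.2)
        double C = 2 * RAND.nextDouble();           // Eq. (23.3)
        return leader - A * Math.abs(C * leader - x);
    }

    public static void main(String[] args) {
        double alpha = 1.0, beta = 1.2, delta = 0.8; // current best three wolves
        double x = 3.0;                              // wolf being updated
        int t = 10, maxGen = 100;
        double a = 2.0 * (1 - t / (double) maxGen);  // a decays over iterations
        double x1 = guidedBy(alpha, x, a);
        double x2 = guidedBy(beta, x, a);
        double x3 = guidedBy(delta, x, a);
        System.out.println((x1 + x2 + x3) / 3.0);    // Eq. (23.1)
    }
}
```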

23.2.2 Adjacency Matrix In graph theory, an adjacency matrix is a square matrix used to represent a finite graph; its entries indicate whether pairs of vertices are adjacent. In the special case of a finite simple graph, the adjacency matrix is a (0,1)-matrix with zeros on its diagonal. Figure 23.2 shows an example CFG, and its adjacency matrix is given in Table 23.1.

Table 23.1 Adjacency matrix of the CFG

Node   1  2  3  4  5  6  7  8
1      0  1  0  0  0  0  0  0
2      0  0  1  0  0  0  0  0
3      0  0  0  1  0  1  0  0
4      0  0  0  0  1  0  0  0
5      0  0  0  0  0  1  0  0
6      0  1  0  0  0  0  0  0
7      1  0  0  0  0  0  0  1
8      0  0  0  0  0  0  0  0
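As a small illustration, the Java sketch below stores the matrix of Table 23.1 and queries each node's successors, the basic lookup needed while traversing the CFG. The class name is ours.

```java
// Storing and querying the adjacency matrix of Table 23.1.
public class AdjacencyDemo {
    static final int[][] ADJ = {
        {0,1,0,0,0,0,0,0}, {0,0,1,0,0,0,0,0}, {0,0,0,1,0,1,0,0},
        {0,0,0,0,1,0,0,0}, {0,0,0,0,0,1,0,0}, {0,1,0,0,0,0,0,0},
        {1,0,0,0,0,0,0,1}, {0,0,0,0,0,0,0,0},
    };

    public static void main(String[] args) {
        for (int i = 0; i < ADJ.length; i++) {
            StringBuilder succ = new StringBuilder();
            for (int j = 0; j < ADJ.length; j++)
                if (ADJ[i][j] == 1) succ.append(j + 1).append(' ');
            System.out.println("node " + (i + 1) + " -> " + succ.toString().trim());
        }
    }
}
```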


23.2.3 Objective Function/Fitness Function The influence factor (IF) of a wolf j at node i is calculated from the cyclomatic complexity at node i (CC_i). The objective function is defined in Eq. (23.4), where nc_i denotes the number of conditions at node i of the graph. According to our literature review [8, 9], most existing work performs well without accounting for the number of condition nodes in the intermediate CFG or the CC of the input program; we therefore define an improved fitness function compared with the objective functions in the current literature. The random function rand() in Eq. (23.4) generates random numbers in the standardized range [0, 1] and acts as an offset or bias value; the use of such random numbers is common in the standard literature [9, 10], and the GWO objective function uses these values.

fit(n_{ij}) = \frac{nc_i}{CC_i} \cdot rand()  (23.4)
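A minimal Java sketch of Eq. (23.4) is shown below. The CC_i values are taken from Table 23.2, while the per-node condition counts nc_i used here are illustrative assumptions, not values from the paper.

```java
import java.util.Random;

// A sketch of the Eq. (23.4) objective: fit = (nc_i / CC_i) * rand().
public class InfluenceFitness {
    static final Random RAND = new Random();

    static double fitness(int nc, int cc) {
        return (nc / (double) cc) * RAND.nextDouble();   // Eq. (23.4)
    }

    public static void main(String[] args) {
        int[] cc = { 4, 3, 2, 1, 1, 1, 1, 1 };   // CC_i values from Table 23.2
        int[] nc = { 1, 1, 2, 1, 1, 1, 1, 1 };   // illustrative condition counts
        for (int i = 0; i < cc.length; i++)
            System.out.printf("node %d: %.3f%n", i + 1, fitness(nc[i], cc[i]));
    }
}
```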

23.3 Proposed Approach Our proposed approach generates and prioritizes test sequences using the GWO algorithm.

23.3.1 Overview Our method is named Grey Wolf Optimization Algorithm for Test Sequence Prioritization (GWOTSP). It defines an influence-metric-based procedure, using the GWO algorithm, for automating the generation of ideal test sequences. The suggested GWOTSP technique is based on three algorithms, and its pseudo-code is given in Algorithm 3. In the first stage, a Java program is provided as the input for testing. Figure 23.1 depicts the framework of our proposed work. Based on the control flow graph obtained for the input program, an adjacency matrix is created, capturing the relations between nodes and the possible execution paths: if there is a direct link between two nodes, the corresponding cell of the matrix stores 1; otherwise it stores 0. The Grey Wolf algorithm is run with the adjacency matrix (Table 23.1) and the node complexity matrix (Table 23.2) as inputs. The objective function is formulated using two metrics: the cyclomatic complexity and the number of conditions in the conditional statements of the input


Fig. 23.1 Work flow of proposed technique

Table 23.2 Node complexity metrics and results

Node_i   1    2    3    4    5    6     7     8
CC_i     4    3    2    1    1    1     1     1
NC_i     0.1  0.1  0.8  0.1  0.1  0.66  0.75  0.1

program. The condition coverage criterion is more responsive to the control flow. GWO is then used to traverse the graph and generate test sequences.

23.3.2 Generation of Control Flow Graph Algorithm 2 is used to produce a CFG from the input Java program. We then calculate the CC_i of each node in the CFG, and these results, the CC_i and the influence factor (IF) of each node, are used in the Grey Wolf Optimization algorithm (Algorithm 1) to generate test sequences. After executing GWO, the suggested framework picks the linearly independent (LI) paths; the influence factor of each path is then computed, and each path is ranked based on its mean IF value. To improve performance, the GWO objective function is rewritten with the CC_i and IF values as parameters: CC helps determine the input program's complexity, so these two parameters aid in the creation of ideal test sequences. Several sample Java programs were used to test this test sequence creation technique.
Algorithm 2: Generation of Control Flow Graph (CFG)


Input: Java Program
Output: CFG and CC
1. Create a window matrix to contain the output CFG
2. Initialize visited_status(e) = 0, visited_status(n) = 0 and P(x,y) = (c1,c2)
3. For k = 0 to line_count
4. Set p(x,y) for the current_node position
5. If (readline() is not a comment line and not a decision line)
6. Draw a node, number it and set visited_status(n) = 1
7. Set node_count += 1
8. Set the y-coordinate to draw the next node
9. If (readline() is not a condition line), then
10. Draw left_node and set node_count += 1
11. Else
12. Draw right_node and set node_count += 1
13. End if
14. Merge the edges with the next node, and update the CFG
15. End for
16. Display the CFG and compute CC using the formula e − n + 2
17. End

Algorithm 3: GWO Test Sequence Prioritization Algorithm
Input: Program Under Test (PUT)
Output: Optimized Test Sequence
1. Input the Program Under Test
2. Construct the CFG using Algorithm 2
3. Apply the GWO algorithm on this CFG
4. Record all paths
5. Select the possible linearly independent paths
6. Calculate the mean influence factor of each path
7. Rank the test sequences according to the IF values
8. Display the prioritized order of the test sequences
9. End
A code sketch of the ranking step (6–7) follows below.
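The following Java sketch illustrates steps 6–7 of Algorithm 3: computing the mean influence factor per path and sorting the paths in descending order. The sample values are taken from the last column of Table 23.3 for illustration; the class and path names are ours.

```java
import java.util.*;

// A sketch of ranking test paths by the mean influence factor of their nodes.
public class PathPrioritizer {
    static double mean(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s / v.length;
    }

    public static void main(String[] args) {
        Map<String, double[]> paths = new LinkedHashMap<>();
        paths.put("TS1", new double[] { 2.6, 3.95, 6.02, 67.1 });
        paths.put("TS2", new double[] { 2.6, 3.95, 6.02, 67.1, 73.93 });
        // Sort paths by mean influence, highest (priority 1) first.
        paths.entrySet().stream()
             .sorted((a, b) -> Double.compare(mean(b.getValue()), mean(a.getValue())))
             .forEach(e -> System.out.printf("%s mean=%.2f%n",
                                             e.getKey(), mean(e.getValue())));
    }
}
```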

GWOTSP (Grey Wolf Optimization based Test Sequence Prioritization) is our suggested method for prioritizing test sequences; its pseudo-code is given in Algorithm 3. We now describe it in more depth. Algorithm 2 converts the input program to its control flow graph (CFG), and the information about each node is represented in the adjacency matrix. The Grey Wolf method is run with the adjacency matrix and the directed matrix as inputs. Algorithm 2 accepts a Java program as input and outputs the CC value. Connecting the grey wolf behavior to our strategy: in our approach, the attacking movement of wolves occurs between the nodes of the CFG obtained from the software under test.


Backtracking over nodes occurs in this search tree to improve the intermediate solution, which leads toward the globally optimal value. We repeat the execution for multiple iterations until a better result is obtained. During program execution, the leadership traits of individual wolves, as well as the average attacking behavior at each node, are updated so that stronger wolves attract weaker ones.

23.4 Implementation The recommended method is implemented on a 64-bit Windows 10 computer with 4 GB of RAM and a 2.33 GHz Core i3 processor. The GWO algorithm is implemented using NetBeans 8.2 with JDK 1.8. First, the CFG of the input program Bubblesort.java is generated using Algorithm 2; Fig. 23.2 depicts the CFG for the provided program, which has eight nodes and ten edges. This also implies that these nodes are important in directing wolves through the solution space. The proposed method is then used to generate the best test sequences: at every iteration of the GWOTSP Algorithm 3, the influence values are updated, and after running GWOTSP we obtain the best test sequences. Four optimal test sequences (paths) are obtained from this experiment; Table 23.4 lists these paths as node sequences.
Fig. 23.2 Control flow graph (CFG)


23.4.1 Experimental Result We used Java applications as input during the implementation of this work. As an example of how our proposed strategy works in practice, consider the small code sample Bubblesort.java in Listing 1: it has 12 lines of code, three of which are predicates while the rest are plain statements.
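Listing 1 itself is not reproduced in this extraction; the sketch below is a hypothetical reconstruction of the kind of program meant, a bubble sort with exactly three predicates (the two loop conditions and the comparison), and should not be taken as the authors' exact listing.

```java
// Hypothetical reconstruction of Listing 1 (Bubblesort.java).
public class Bubblesort {
    public static void sort(int[] a) {
        for (int i = 0; i < a.length - 1; i++) {         // predicate 1
            for (int j = 0; j < a.length - 1 - i; j++) { // predicate 2
                if (a[j] > a[j + 1]) {                   // predicate 3
                    int tmp = a[j];                      // swap the pair
                    a[j] = a[j + 1];
                    a[j + 1] = tmp;
                }
            }
        }
    }
}
```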

23.4.2 Prioritization After the best test sequences are generated, we prioritize them. Table 23.3 shows the mean influence quality of each wolf at each node during the prioritization procedure. The sum of the mean influence leadership values of the nodes present in a test path is then calculated; this is referred to as the Influence Factor of the test path. In our experiment, wolves were injected into each node randomly. Because of the disparity in their leadership qualities, a lesser wolf (less effective test data) approaches a stronger wolf (effective test data); this wolf movement suggests a path, and these paths are saved as viable graph traversal solutions. After a large number of iterations, the Grey Wolf method returns the optimum path. Table 23.4 shows the four test sequences revealed by our experiment. Each test path is then given a rank based on its mean leadership value, and finally the resulting test paths are prioritized using these rank values, as shown in Table 23.5.

Table 23.3 Influence factor values at an iteration

n/w   W1     W2     W3     W4     W5     W6     W7    W8     W9     W10
N1    0.853  1.29   1.6    1.82   2.01   2.16   2.29  2.40   2.51   2.6
N2    1.30   1.86   2.43   2.78   3.05   3.28   3.48  3.65   3.81   3.95
N3    1.99   3.02   3.68   4.23   4.66   5.01   5.31  5.57   5.81   6.02
N4    1.74   2.65   3.25   3.71   4.08   4.39   4.65  4.88   5.09   5.27
N5    3.22   4.87   5.98   6.81   7.48   8.03   8.51  8.93   9.30   9.64
N6    22.46  33.93  43.3   47.43  52.06  55.93  59.2  62.2   64.76  67.1
N7    24.99  37.6   46.06  52.43  57.5   61.73  65.4  68.56  71.4   73.93
N8    8.6    25.83  31.53  36.03  39.4   42.4   44.9  47.1   49.03  50.8

Table 23.4 Generated node sequences (NS)

Test path (TP)   Node sequence (NS)
TP1              N1 → N2 → N3 → N6
TP2              N1 → N2 → N3 → N6 → N7
TP3              N1 → N2 → N3 → N4 → N5 → N7
TP4              N1 → N2 → N3 → N6 → N7 → N8

Table 23.5 List of test sequences obtained from the control flow graph shown in Fig. 23.2

Test sequence (TS)   Mean influence   Priority
TS 1                 280.96           3
TS 2                 448.65           2
TS 3                 223.62           4
TS 4                 456.45           1

According to this table, Test Sequence 4 has the greatest mean and is therefore given the highest priority (i.e., 1). Similarly, Test Sequence 3 has the smallest mean and is assigned the lowest priority (i.e., 4). As a result, we can assert that Test Path 4 has a better chance of finding defects than Test Path 3 (Table 23.5).

23.5 Comparison with Related Work Doerner et al. [15] proposed a method for selecting a set of test paths using a Markov usage model. They calculated the probability of failure due to improper selection of a test path and determined the losses incurred at failure time; because the testing cost therefore varies across test sequences, they applied Ant Colony Optimization for test sequence generation and prioritization. In our approach, by contrast, the random behavior is generated from the influence factor weight at each node of the graph, and we propose a linear method for defining the objective function for test sequence generation and prioritization. Li et al. [16] proposed a technique to generate test sequences directly from UML state machine diagrams using Ant Colony Optimization; the generated test sequences are always feasible, non-redundant, and achieve the required test adequacy criterion. We instead propose a different probabilistic method for defining the objective function, and where they generate feasible test sequences, ours are optimal. Sharma et al. [17] presented an algorithm to generate test sequences from the state machine transitions of the given software, using the control flow graph as an intermediate representation and applying ACO to find test sequences with maximum coverage and little redundancy. They optimized time and coverage score, whereas we optimize for prioritized test sequences.


23.6 Conclusion and Future Work Our proposed method efficiently generates test sequences for Java source code, optimized by the suggested GWOTSP algorithm. Advanced aspects of GWOTSP, such as the IF and CC values, help route wolves through conditional nodes while traversing the obtained CFG: CC_i provides better guidance at conditional nodes and supports condition coverage of the input programs, while the influence factor helps wolves make faster decisions during graph traversal. The novelty of our method lies in how effectively it guides wolves through the graph, so that efficient test paths are developed. The simulation of our experiment shows that the GWO algorithm is an effective strategy for generating efficient test sequences.

References

1. Nayak, G., Ray, M.: Modified condition decision coverage criteria for test suite prioritization using particle swarm optimization. Int. J. Intell. Comput. Cybern. (2019)
2. Mall, R.: Fundamentals of Software Engineering. PHI Learning Pvt. Ltd. (2018)
3. Khanna, M., et al.: Test case prioritisation during web application testing. Int. J. Comput. Appl. Technol. 56, 230–243 (2017)
4. Bhattacherjee, V., Suri, D., Mahanti, P.: Application of regular matrix theory to software testing. Eur. J. Sci. Res. 12, 60–70 (2005)
5. Sharma, B., Girdhar, I., Taneja, M., Basia, P., Vadla, S., Srivastava, P.R.: Software coverage: a testing approach through ant colony optimization. In: International Conference on Swarm, Evolutionary, and Memetic Computing, pp. 618–625. Springer, Berlin, Heidelberg (2011)
6. Srivastava, P.R., Baby, K.M., Raghurama, G.: An approach of optimal path generation using ant colony optimization. In: TENCON 2009–2009 IEEE Region 10 Conference, pp. 1–6. IEEE (2009)
7. Kumar, S., Ranjan, P., Rajesh, R.: An overview of test case optimization using meta-heuristic approach. Recent Adv. Math. Stat. Comput. Sci. 475–485 (2016)
8. Jiang, T., Zhang, C.: Application of grey wolf optimization for solving combinatorial problems: job shop and flexible job shop scheduling cases. IEEE Access 6, 26231–26240 (2018)
9. Ozsoydan, F.B.: Effects of dominant wolves in grey wolf optimization algorithm. Appl. Soft Comput. 83 (2019)
10. Mirjalili, S., Mirjalili, S.M., Lewis, A.: Grey wolf optimizer. Adv. Eng. Softw. 69, 46–61 (2014)
11. Catal, C., Mishra, D.: Test case prioritization: a systematic mapping study. Software Qual. J. 21(3), 445–478 (2013)
12. Barisal, S.K., Dutta, A., Godboley, S., Sahoo, B., Mohapatra, D.P.: MC/DC guided test sequence prioritization using firefly algorithm. Evol. Intel. 14(1), 105–118 (2021)
13. Zhao, Y., Dong, J., Peng, T.: Ontology classification for semantic-web-based software engineering. IEEE Trans. Serv. Comput. 2, 303–317 (2009)
14. Barisal, S.K., Suvabrata Behera, S., Godboley, S., Prasad Mohapatra, D.: Validating object-oriented software at design phase by achieving MC/DC. Int. J. Syst. Assur. Eng. Manage. 10(4), 811–823 (2019)
15. Doerner, K., Gutjahr, W.J.: Extracting test sequences from a Markov software usage model by ACO. In: Genetic and Evolutionary Computation Conference, pp. 2465–2476. Springer, Berlin, Heidelberg (2003)


16. Li, H., Lam, C.P.: Software test data generation using ant colony optimization. In: International Conference on Computational Intelligence, pp. 1–4 (2004)
17. Sharma, B., Girdhar, I., Taneja, M., Basia, P., Vadla, S., Srivastava, P.R.: Software coverage: a testing approach through ant colony optimization. In: International Conference on Swarm, Evolutionary, and Memetic Computing, pp. 618–625. Springer (2011)

Part IV

Intelligent Computing

Chapter 24

Artificial Intelligent Approach to Predict the Student Behavior and Performance G. Nagarajan, R. I. Minu, T. R. Saravanan, Samarjeet Borah, and Debahuti Mishra

Abstract Because of the massive amount of data in educational databases, predicting student performance remains a challenging but fruitful task. This notion is used in a variety of evaluations, and its field of application is continually expanding; examples include student grades, attendance, semester scores, grade points, conduct, well-being, educational program activities, and so on. The amount of information stored by recent trends is rapidly increasing. The main goal of this work, an intelligent approach to anticipating student conduct and performance, is to create a model for predicting student behavior and performance. In an instructional situation, student performance in training is really important. Student data is stored in a database in order to improve student attitudes and behavior. This study demonstrates the patterns and kinds of analyses that should be conducted in order to improve the training process. We predict student traits and produce a report for staff and parents; in light of the results, we provide suggestions on how to improve their performance. Predicting student achievement is all the more difficult given the massive amount of data in educational archives. This research also looks at how the predicted data may be used to determine the most important attributes in a student data set. With the use of educational data mining processes, we can significantly boost student achievement and accomplishment. It has the potential to benefit students, teachers, and educational institutions.

G. Nagarajan (B) Sathyabama Institute of Science and Technology, Chennai, Tamilnadu, India R. I. Minu · T. R. Saravanan SRM Institute of Science and Technology, Kattankulathur, Tamilnadu, India e-mail: [email protected] S. Borah Sikkim Manipal Institute of Technology, Sikkim Manipal University, Sikkim, India D. Mishra ITER, Siksha ‘O’ Anusandhan University, Bhubaneswar, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_24


24.1 Introduction

Significant progress in improving student behavior and performance, as well as the management of a large student database, has altered our educational framework. In the world of information technology, data mining systems are among the hottest research areas, and researchers are concentrating their efforts on improving them [1–3]. Predicting student success is a straightforward example of knowledge mining. Alongside these improvements, student detail services keep getting better, and access to student services is continually improving [4–6]. Many of these efforts seek to enhance student behavior and performance over time by automating education. In any case, development in the field of education necessitates watching the complete details of student growth. The goal of this work is to anticipate student behavior and performance. Education is a major and important domain that is not getting enough attention right now [7–9]. The amount of data stored in educational databases has been growing rapidly in recent years. The underlying database contains information about a student's progress in terms of performance and conduct [10, 11]. In instructional settings, the ability to predict student performance in training is critical. Various educational data mining approaches predict a student's scholastic achievement under instructional conditions based on psychological and environmental factors. Data mining algorithms, for example, use classification and decision tree techniques to evaluate student performance [12]. Overall semester marks, practical lab work, attendance, paper presentations, end-semester marks, mini projects, strengths and weaknesses, and so on should all be available for quality examination. The information about the student, as well as their behavior and performance, is stored in a database, so we can predict and retrieve it at a later time. All of the student's information was gathered for the analysis, and a report was produced. The acquired details are processed and delivered as a report, and students and staff can access their details whenever they need them. The scope of the project provides an insightful technique for predicting student performance and behavior by automating the data. The product gathers all of a student's information and provides simple and effective data storage for students. Student performance prediction is faring poorly in our current educational systems. If the performance of students can be anticipated well in advance, it can maintain or improve the quality of education by predicting student subject interests and student-level activities, and it helps improve their performance in schools, colleges, and educational institutions. Identifying dropout points is also possible in this way [13, 14]. By means of machine learning procedures alongside EDM, a continuous assessment system is practiced by several institutions today. These schemes are useful in improving the students' performance, and benefiting regular students is the prime motto of continuous assessment systems. Preprocessing pipelines and data transformations are the result of effort in the strategic layout of machine learning algorithms; they contribute to data representation to accommodate dynamic machine learning and address the drawbacks of current learning algorithms [15, 16]. Study findings on several deep learning applications were found that can be applied in various fields like image processing, natural language processing, and object detection. In order to predict students' performance, knowledge discovery is recommended here to mine rules from a dataset of learning management systems. Deep learning and data mining techniques are used: the deep learning classifier MLP and other classifiers like KNN and Naïve Bayes are used in our work. A model is developed by applying the classifiers to our data, and 10-fold cross-validation is performed. Parameters like accuracy, specificity, sensitivity, the Kappa statistic, and the ROC curve are considered for evaluating the classifiers.
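As a rough sketch of the evaluation pipeline described above (scikit-learn is used purely for illustration, since the chapter does not name its toolkit, and the iris data stands in for a real student dataset):

from sklearn.datasets import load_iris  # stand-in for a student dataset
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)  # replace with student features and labels

classifiers = {
    "MLP": MLPClassifier(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# 10-fold cross-validation, as practised in the study.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")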

24.2 Related Work

Several surveys and articles exist in the research literature on EDM, and a few authors have provided a broad outline of the subject [17, 18]. Kaur et al. [19] proposed a medical diagnostic system built with artificial intelligence algorithms. Traditional diagnostic systems were completely dependent on manual decisions, but combining manual diagnostic approaches with the possibilities of artificial intelligence can lead to error-free prediction of diseases. The authors combined fuzzy logic, machine learning, and deep learning approaches to develop an automated disease prediction system. The system accepts MRI images as inputs and generates feature maps using convolution methods. The proposed method attains a good accuracy level in comparison with other existing diagnostic approaches. Topal et al. [20] proposed an artificial intelligence based approach for determining the travel preferences of Chinese people. The proposed system analyzes the historical data of the TripAdvisor application to evaluate the interest of Chinese people in visiting Turkey. The authors additionally analyze all other influencing factors that lead people to select Turkey as their favorite destination. It uses artificial intelligence classification methods for analyzing the bulk amount of data; to validate the performance of the analysis approach, the authors use k-fold cross-validation. Ezzat et al. [21] proposed a hybrid approach based on artificial intelligence for diagnosing faults in aerospace systems. Aerospace systems require a high accuracy level in their operational environments: a small error in such systems may lead to total damage, so even a single error must be treated as a critical scenario. Such errors may be far beyond our predictions, and applying artificial intelligence in such critical applications can help avoid unpredicted or unnoticed errors and faults. The proposed method uses the binary grasshopper optimization algorithm for feature selection and an artificial neural network for fault prediction. Artificial intelligence has been applied to existing applications to increase accuracy and ease of use. This paper focuses on the work that relates to this investigation; in the following, we present several strategies and theories that support the research.

24.3 Existing System

Our suggested architecture is based on previous research in affect detection, deep learning-based emotion recognition, and ongoing mobile distributed computing. We investigate these developments in depth and determine the computational requirements of a framework that incorporates them, and we conduct a framework feasibility analysis in light of these requirements. Although the state-of-the-art research in the great majority of the components we suggest in our framework is advanced enough to realize the framework, the main challenge is (i) combining these technologies into an all-encompassing framework structure, (ii) making algorithmic adjustments to ensure consistent performance, and (iii) assessing legitimate instructional factors for use in the computations.

24.4 Proposed System

In the suggested framework, the acquired details are processed and distributed as a report, and students and staff can access their details whenever they need them. After gathering all of the student's information and conducting the analysis, the report is created. Automating the data is a practical technique for predicting student performance and behavior. The product gathers all of a student's information and is capable of providing simple and practical data storage for students. The administrator, staff, and students are each given a user ID and password. The administrator is provided with a feature to upload videos captured in the classroom. From the videos, images of students are pre-processed; upon pre-processing, the students' conduct is predicted. The staff and students are permitted to give feedback on student conduct and performance. The advantages of the proposed system include:

• Improving the students' performance and conduct.


Fig. 24.1 Overview of the proposed system

• To motivate students to progress.
• To point each and every student in the right direction.
• To increase student performance, identify their strengths, and re-examine the complete details of a specific student database.
• Taking a broad view of the approaches utilized to increase student performance (Fig. 24.1).

The primary goal of this initiative is to improve student performance in classrooms by focusing on a few key factors. Education is a necessary component of a country's improvement and progress; it allows a country's citizens to be civilized and well-behaved. New strategies are being developed these days to discover knowledge from educational databases. To examine the patterns and attitudes of students toward education, data is analyzed from several dimensions, classified, and the correlations described. This inspired us to embark on analyzing student datasets. Data collection, categorization, and classification are all done by hand.

24.5 Module Description

24.5.1 Feature Extraction

Faces are spotted using HAAR detection (95% accuracy) in a real-time method, and features are extracted to accommodate small head movements of the subject while working in a real-time setting. Using pair-wise action point distances, we add points using Active Shape Model (ASM) identification.
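A minimal OpenCV sketch of this face-spotting step might look like the following (the input image path is a placeholder, and the subsequent ASM point fitting is omitted; the cascade file itself ships with OpenCV):

import cv2

# Load the Haar cascade bundled with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

frame = cv2.imread("student.jpg")  # hypothetical input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Detect faces; the parameters tolerate small head movements.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    face_roi = gray[y:y + h, x:x + w]  # region passed on to feature extraction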


24.5.2 Emotion Expression Classification

The utilization of temporal information is one of the fundamental variations in classifier selection and design. The spatial-domain approach is a classification approach that does not use time information. A typical spatial method is the artificial neural network, in which the entire image is used as the input to the neural network or image-processing stage; PCA and ICA are two examples. Thanks to the feature-vector-space approach, a general classification method can be employed as a spatial approach. For facial expression recognition, principal component analysis was used, and the classifier was a support vector machine (SVM).
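A compact sketch of this spatial PCA-plus-SVM pipeline is given below (the face data, label set, and component count are illustrative placeholders, not the chapter's actual configuration):

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
import numpy as np

# Placeholder data: 200 flattened 48x48 face images with 3 expression labels.
X = np.random.rand(200, 48 * 48)
y = np.random.randint(0, 3, size=200)

# PCA reduces the raw pixel space; an SVM classifies the projected features.
model = make_pipeline(PCA(n_components=50), SVC(kernel="rbf"))
model.fit(X, y)
predictions = model.predict(X[:5])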

24.5.3 Current Emotion Detection System

Emotion detection requires the facial picture to be in a desired orientation. Faces have been used to determine the origin of facial emotional features. Filtering and edge detection have been presented as methods for feature extraction. Then, the processed image was used to find some ideal parameters using a Genetic Algorithm (GA). A new fitness function has been suggested for retrieving the ocular parameters with the GA.

24.5.4 Facial Expression Recognition

According to our literature review, the SVM (with a non-linear kernel) performs best in a real-time context. As a result, we use an SVM to recognize facial expressions by feeding it a collection of reduced features (1463). We can only distinguish three facial expressions with this method: neutral, yawning, and smiling. Because it has no non-overlapping qualities with these expressions, the sleeping expression is incorrectly classed as neutral. To address this issue, we used the ensemble method, which entails creating three separate classifiers trained with:

• the sleeping and neutral expression frames
• expression frames of neutral, smiling, and yawning.

24.6 Conclusion

Here, a classification task is applied to the student database to predict the students' division based on the past database. The objective of developing this project is to predict student conduct and performance. Predicting students' performance is largely valuable in helping teachers and students improve their learning and teaching processes. The administration of the student details becomes much simpler, more productive, and less tedious. It is easy for the faculty and students to access the records and reports present in the framework: the student and staff each have a separate login ID with which they can log in, see the details of student performance and conduct, and generate the statistical report. We can track student improvements easily and monitor semester marks, grade points, lab work, practical marks, mini projects, extra-curricular/co-curricular activities, assignment marks, placement details, strengths and weaknesses, and so forth. We were able to achieve satisfactory results in predicting students' performance on a specific task. The classification models built here are used for predicting and tracking students' performance and for giving proper interventions and guidance throughout the learning process.

References

1. Mangaroska, K., Giannakos, M.N.: Learning analytics for learning design: A systematic literature review of analytics-driven design to enhance learning. IEEE Trans. Learning Technol. (2018)
2. Lukarov, V., Chatti, M.A., Schroeder, U.: Learning analytics evaluation—Beyond usability. In: Rathmayer, S., Pongratz, H. (eds.) Proceedings of the DeLFI Workshops, CEUR Workshop Proceedings, Aachen, Germany, pp. 123–131 (2015)
3. Dawson, S., Gašević, D., Siemens, G., Joksimovic, S.: Current state and future trends: A citation network analysis of the learning analytics field. In: Proceedings of the Fourth International Conference on Learning Analytics and Knowledge—LAK'14, Indianapolis, IN, USA, pp. 231–240 (24–28 Mar 2014). ACM Press, New York, NY, USA
4. Shneiderman, B.: The eyes have it: A task by data type taxonomy for information visualizations. In: Proceedings of the 1996 IEEE Symposium on Visual Languages, Boulder, CO, USA (3–6 Sept 1996)
5. Cambruzzi, W., Rigo, S.J., Barbosa, J.L.V.: Dropout prediction and reduction in distance education courses with the learning analytics multitrail approach. J. Univers. Comput. Sci. 21, 23–47 (2015)
6. Agudo-Peregrina, Á.F., Iglesias-Pradas, S., Conde-González, M.Á., Hernández-García, Á.: Can we predict success from log data in VLEs? Classification of interactions for learning analytics and their relation with performance in VLE-supported F2F and online learning. Comput. Hum. Behav. 31, 542–550 (2014)
7. Papamitsiou, Z., Economides, A.A.: Learning analytics and educational data mining in practice: A systematic literature review of empirical evidence. J. Educ. Technol. Soc. 17, 49–64 (2014)
8. Romero, C., López, M.I., Luna, J.M., Ventura, S.: Predicting students' final performance from participation in on-line discussion forums. Comput. Educ. 68, 458–472 (2013)
9. Prieto, L.P., Rodríguez Triana, M.J., Martínez Maldonado, R., Dimitriadis, Y.A., Gašević, D.: Orchestrating learning analytics (OrLA): Supporting inter-stakeholder communication about adoption of learning analytics at the classroom level. Australas. J. Educ. Technol. 35, 14–33 (2019)
10. IMS Global Learning: Learning Tools Interoperability, v1.3 (2018). Available online: https://www.imsglobal.org/activity/learning-tools-interoperability (accessed on 29 Nov 2019)
11. Gray, G., Mcguinness, C., Owende, P.: Non-cognitive factors of learning as early indicators of students at-risk of failing in tertiary education. In: Non-cognitive Skills and Factors in Educational Attainment, pp. 199–237 (2016)


12. Sembiring, S., Zarlis, M., Hartama, D., Ramliana, S., Wani, E.: Prediction of student academic performance by an application of data mining techniques. In: International Conference on Management and Artificial Intelligence, IPEDR, vol. 6, pp. 110–114
13. Campagni, R., Merlini, D., Sprugnoli, R., Verri, M.C.: Data mining models for student careers. Expert Syst. Appl. 42, 5508–5521 (2015)
14. Nirmalraj, S., Nagarajan, G.: Biomedical image compression using fuzzy transform and deterministic binary compressive sensing matrix. J. Ambient Intell. Humanized Comput. 1–9 (2020)
15. Ezhilarasi, R., Minu, R.I.: Automatic emotion recognition and classification. Proc. Eng. 38, 21–26 (2012)
16. Simpson, S.V., Nagarajan, G.: A fuzzy based co-operative blackmailing attack detection scheme for edge computing nodes in MANET-IOT environment. Future Gener. Comput. Syst. (2021)
17. Simpson, S.V., Nagarajan, G.: An edge based trustworthy environment establishment for internet of things: An approach for smart cities. Wireless Netw. 1–17 (2021)
18. Simpson, S.V., Nagarajan, G.: A table based attack detection (TBAD) scheme for internet of things: An approach for smart city environment. In: 2021 International Conference on Emerging Smart Computing and Informatics (ESCI), pp. 696–701. IEEE (2021)
19. Kaur, S., Singla, J., Nkenyereye, L., Jha, S., Prashar, D., Joshi, G.P., El-Sappagh, S., Saiful Islam, M., Riazul Islam, S.M.: Medical diagnostic systems using artificial intelligence (AI) algorithms: Principles and perspectives. IEEE Access 8, 228049–228069 (2020)
20. Topal, I., Kürşad Uçar, M.: Hybrid artificial intelligence based automatic determination of travel preferences of Chinese tourists. IEEE Access 7, 162530–162548 (2019)
21. Ezzat, D., Ella Hassanien, A., Darwish, A., Yahia, M., Ahmed, A., Abdelghafar, S.: Multiobjective hybrid artificial intelligence approach for fault diagnosis of aerospace systems. IEEE Access 9, 41717–41730 (2021)

Chapter 25

Graph Based Automatic Keyword Extraction from Odia Text Document Mamata Nayak and Nilima Das

Abstract In natural language processing (NLP), keyword extraction is the automated process of identifying a set of terms that represents the information in the data being processed. Keywords are the collection of words that provide a compact representation of a document, and these words are extensively used for automatic indexing. Since there is immense availability of Odia text, the importance of keyword identification from Odia script has increased a lot. Realizing the importance of keyword extraction from Odia text, this article presents a simple unsupervised, undirected, weighted graph-based keyword extraction procedure for Odia text. The extracted keywords are analyzed by computing the weights of the nodes in the graph generated from the text. The performance of the proposed technique has been evaluated in terms of precision, recall, and F-measure. It is observed from the experimental results that the proposed graph-based technique can effectively extract keywords from Odia text with minimum computational complexity because of its implementation simplicity.

25.1 Introduction Text is unstructured data within digital forms, which may contain underlying information. Hidden evidence in text can be terms including keywords and key phrases. Keywords are the collection of words that provide a compact representation of a document. Also the keywords support anchors as hyperlinks between documents that enable users to quickly access related materials. These words are extensively used for automatic indexing, topic modeling, question answering system, summarization of document, automatic clustering as well as classification and many more. Since finding keywords physically is a prolonged process and costly, the computerized techniques can be used which can save time and economy. Furthermore, the M. Nayak (B) · N. Das Faculty of Engineering and Technology, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India e-mail: [email protected] N. Das e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_25


importance of keyword extraction for Odia script also plays a vital role. Odisha was recognized as an individual state on 1st April 1936, during the British rule and it consisted of the places where generally Odia is spoken. Odia is one of the primary languages in India. Odia is one of the languages to which the Government of India has awarded the distinction of classical language. Classical language status is given to languages which have a rich heritage and independent nature. There is hardly any research work found in the literature based on Odia language. This article presents a system that can extract keywords from a given Odia text which can be used for text summarization and other purposes. The organization of the remaining part of this research article is described as follows: Sect. 25.2 illustrates the existing literature on graph based approach as well as keyword/key phrase extraction for different languages, Sect. 25.3 describes the model used in this literature. As this literature is the first attempt toward finding of keywords, the data corpus used for implementation is illustrated in Sect. 25.4. Section 25.5 explains the models through examples with experimental results, and conclusion is given in Sect. 25.6.

25.2 Literature Survey

Currently there is a great deal of research on keyword extraction. For languages with rich resources like English, a lot of work has been done to extract keywords from a given block of text. Keyword extraction algorithms can be broadly divided into two categories, supervised and unsupervised. Some examples of unsupervised algorithms are TF-IDF (Term Frequency–Inverse Document Frequency) [1], Text-Rank [1], and LDA (Latent Dirichlet Allocation) [2]. The authors in [2] have combined the supervised technique SVM with unsupervised algorithms to optimize the keyword extraction process. They have used SVM ranking to rank the candidate keywords after extracting their important features using the unsupervised techniques. In [3] the authors have proposed a graph-based key-phrase extractor. They have used a basic and straightforward graph-based syntactic representation for text and web documents. For each distinct word, only one vertex is formed, irrespective of the number of times it is present in the text; thus, every vertex in the graph is unique. Using a directed graph, it tries to extract multi-word key phrases from the text by giving importance to the order of word occurrence. The work in [4, 5] uses a KeyRank approach to extract proper key phrases from English documents. It explores every possible candidate for the key phrases from the text and then assigns them ranks to decide the top N key phrases. It uses a sequential pattern mining method with gap constraints in order to extract key-phrase candidates for assigning KeyRank. An effectiveness evaluation measure, pattern frequency with entropy, is also proposed for ranking the candidate key phrases. The work in [6] suggests an unsupervised graph-based keyword extraction method called Keyword Extraction using Collective Node Weight, which determines the importance of a keyword by using a variety of effective factors. This method uses node-edge rank centrality with a node weight that depends on different parameters like frequency, centrality, position, and the strength of neighboring nodes. These factors are used to calculate the significance of a node. The implementation of the model is divided into four stages: preprocessing, textual graph representation, node weight assignment, and keyword extraction. In the preprocessing stage the meaningless symbols are removed from tweets so that useful keywords can be extracted. In the second stage a graph is constructed in which one vertex represents one token; for each token there is a vertex in the graph. The edges are constructed for pairs of tokens present in the original texts without changing the order of appearance in the text. The model estimates the weight of a node based on the above-mentioned parameters. The last phase is keyword extraction, which involves recognizing keywords from a text that can properly characterize the subject of the given text. In [7], the authors have used graph convolutional networks for text categorization. After building a single text graph for a corpus based on co-occurring words and related words, a text graph convolutional network is trained on the corpus. The text network is initialized with one-hot representations for words and documents to learn their embeddings in the document. The generated features are trained with a supervised learning algorithm to classify new unlabeled documents. Presently, though there is an immense availability of Odia text, hardly any work has been done for keyword extraction from archives written in Odia script. Also, so far, there is no work on keyword search accomplished using a graph data structure, even though uncertain graphs have been widely used in many applications. Realizing the importance of keyword extraction from Odia text, this article presents a simple unsupervised, undirected, weighted graph-based keyword extraction procedure for Odia text.

25.3 Unsupervised Techniques for Ranking and Keyword Extraction

25.3.1 Graph Based Text-Rank Model

Text-Rank is a graph-based model which uses the web page-ranking method. It considers the document as a graph, and every node in the graph represents a candidate for the keywords to be extracted from the document. An edge is formed between two words if they are present in the same sentence. The PageRank algorithm is used to calculate the weight of every node. The iterative formula [1, 8] for calculating the weight (rank) of every node is:

W(V_i) = (1 - f) + f \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} W(V_j)    (1)


Here V represents the set of vertices, E denotes the set of edges, and N is the total number of words. f is the damping factor, the probability of jumping from a given vertex to another random vertex in the graph; its value lies between 0 and 1 and is generally taken as 0.85, as in the Text-Rank implementation. Out(V_j) represents the set of outgoing links of node V_j, and |Out(V_j)| is the out-degree of V_j. In(V_i) is the set of inbound links of node V_i, and W(V_i) is the weight of node V_i. The top vertices having higher ranks are considered as the keywords. Finally, keywords are collapsed into multi-word key phrases.

25.3.2 TF-IDF

TF–IDF stands for term frequency–inverse document frequency. It tries to establish the importance of a word in a text using numerical statistics. It can be used as a weighting factor for retrieving information, mining text, and modeling. The TF–IDF value rises proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. TF–IDF is one of the most accepted term-weighting methods used nowadays: 83% of text-based recommender systems in digital libraries use TF–IDF [9].

In this article the authors have used a graph-based Text-Ranking method for keyword extraction from a given Odia text. The Text-Ranking method has been used because of its simplicity, popularity, and efficiency. Text-Rank is an algorithm based on PageRank, which is often used in keyword extraction and text summarization. PageRank is for webpage ranking, and Text-Rank is for text ranking; the webpage in PageRank is the text in Text-Rank, so the basic idea is the same. The proposed method uses a graph-based Text-Rank method to calculate the rank of the words present in the text. It estimates the importance of a node from its linked neighbors and their neighbors. The algorithm used here can be described as follows (a sketch of Steps 2–5 is given after the list):

Step 1: Preprocess the document, i.e., for each sentence consider only the nouns and objects and discard the rest of the text.
Step 2: Form windows of size k each.
Step 3: For each ordered pair of words in a window, construct a directed edge.
Step 4: Calculate the weight of each node as in Eq. (1).
Step 5: Repeat Step 4 for some finite number of iterations.
Step 6: After the final iteration, remove from the list the words having weights less than a pre-decided threshold value.
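A minimal sketch of Steps 2–5, assuming the stop words have already been removed and using NetworkX as in the implementation described later:

import networkx as nx

words = ["samaja", "manisa", "swikara", "anyatha", "pratibha"]  # preprocessed tokens
k, f = 3, 0.85  # window size and damping factor

G = nx.DiGraph()
G.add_nodes_from(set(words))
# Step 3: a directed edge for each ordered pair inside every window of size k.
for i in range(len(words) - k + 1):
    window = words[i:i + k]
    for a in range(len(window)):
        for b in range(a + 1, len(window)):
            G.add_edge(window[a], window[b])

# Steps 4-5: iterate Eq. (1) a finite number of times.
W = {v: 1.0 for v in G}
for _ in range(30):
    W = {v: (1 - f) + f * sum(W[u] / G.out_degree(u)
                              for u in G.predecessors(v))
         for v in G}

# Step 6: keep words whose weight exceeds a threshold.
keywords = [v for v, w in W.items() if w > 0.9]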

The text used as input to the system is shown below.


After extracting the stop words, the text contains the following words. The stop words are the words in a sentence that are not useful for determining importance; only nouns and verbs in a sentence are considered.

25.4 Implementation and Experimental Results

Figure 25.2 shows the transliteration of the selected words: the first column represents the node number, the second column represents the corresponding word for that node, and the third column is the transliteration of that word in English. The words are grouped into a fixed number of windows; the size of each window is taken as 3 here. The windows formed are: …, and so on. A graph is created in which every word represents a node. If a particular word appears more than once, only one node is created for that word; that is, if a word is new in the text, then a node is formed in the graph. An edge is added for two nodes (words) if they co-occur within a certain window. Figure 25.1 shows the graph constructed for the words that are selected after performing Step 1 of the algorithm. The graph has been generated using the NetworkX package of Python. The words along with their node numbers and their corresponding transliterations in English are presented in Fig. 25.2. Figure 25.3 shows the matrix that contains the count of inbound links to a particular node from the other nodes: for every node there is a row in the matrix, and every column in that row shows the number of inbound links from the node in the column to the node in that row. Any pair of words in a window is considered to have an edge between them. Figure 25.4 shows the calculated weights of the nodes after the first iteration of Step 4 of the algorithm, and Fig. 25.5 shows the calculated weights after the second iteration; it can be seen that the weights change at the end of each iteration. After a fixed number of iterations, the algorithm terminates. Figure 25.6 shows the calculated weights of the nodes after the final iteration, with the nodes arranged according to their weights.


Fig. 25.1 Graph of the matrix

Fig. 25.2 Transliteration of the selected words

(Content of Fig. 25.2; the Odia representation column is shown in the figure itself.)

NN | Transliteration in English
0 | samaja
1 | manisa
2 | swikara
3 | anyatha
4 | pratibha
5 | birodhare
6 | guna
7 | adhikara
8 | bikasara
9 | strusti
10 | antarnihita
11 | abashyaka
12 | nihati
13 | patharodha
14 | banchi
15 | manaba
16 | sumita
17 | sugandhita
18 | antaraya
19 | padiba

The final table contains the words having higher weights. The words are selected based on some threshold value, which is considered here as 0.9. Figure 25.7 shows the final keywords with their corresponding weights.


Fig. 25.3 Matrix representing the inbound and outbound links

Fig. 25.4 Weights of the nodes after 1st iteration

(Figure 25.4 lists each selected word with its weight after the first iteration; the weights range from 0.15 to 3.125.)

25.5 Result Analysis

The proposed approach is used for indexing documents written in the Odia language. As no dataset is available for this language, three different datasets have been created, relevant to geography, history, and science, referred to as Doc1, Doc2, and Doc3. After preprocessing of the text, each dataset contains 5000 words. Due to the unavailability of a database, the keywords are first selected manually to be compared with the predicted keywords. Some persons were invited to identify the keywords manually from the documents.


Fig. 25.5 Weights of the nodes after 2nd iteration


Fig. 25.6 Weights of the nodes after final iteration


Fig. 25.7 Weights of the nodes after applying threshold

(The final weights in Fig. 25.6 range from 2.766 down to 0.15 in descending order. The six words whose weights exceed the 0.9 threshold, headed by samaja (2.766), manisa (1.450), and swikara (1.140), are the keywords retained in Fig. 25.7.)


Table 25.1 Manual versus predicted results

Odia text docs | Actual keywords found manually | Total keywords from EXP | Actual keywords from EXP (TP) | Falsely predicted keywords (FP) | Missed keywords (FN)
Doc1 | 450 | 510 | 430 | 80 | 20
Doc2 | 400 | 460 | 360 | 100 | 40
Doc3 | 530 | 580 | 500 | 80 | 30

Table 25.2 Statistical parameters

Docs | Precision in % (w=3) | Precision in % (w=5) | Recall in % (w=3) | Recall in % (w=5) | F-measure in % (w=3) | F-measure in % (w=5)
Doc1 | 84 | 88 | 95 | 97 | 89 | 93
Doc2 | 78 | 84 | 90 | 93 | 83 | 89
Doc3 | 86 | 91 | 94 | 95 | 90 | 94

The intersection of the sets identified by the persons for each document is taken into consideration, and the resultant keywords are compared with the experimentally extracted keywords. The results of the experiments executed on these three documents are analyzed to test the performance of the proposed method. The measures used to evaluate the relevance of the keywords extracted by the proposed approach to the manually assigned keywords are precision, recall, and F-measure. Table 25.1 shows the values of these terms; the experimental values shown in this table are generated when the size of each window is taken as 3, and each row shows the results for a particular document. Table 25.2 shows a comparison between the precision, recall, and F-measure percentages obtained with different window sizes: the first column of each measure gives the percentage for a window size of 3 (w = 3) and the second column gives the percentage for a window size of 5 (w = 5). It can be observed that the results improve with increased window size.
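As a quick check of these figures, the Doc1 statistics for w = 3 follow directly from the counts in Table 25.1:

# Counts for Doc1 (w = 3) from Table 25.1.
tp, fp, fn = 430, 80, 20

precision = tp / (tp + fp)  # 430 / 510
recall = tp / (tp + fn)     # 430 / 450
f_measure = 2 * precision * recall / (precision + recall)

# Truncated percentages match the Doc1 row of Table 25.2: 84, 95, 89.
print(int(precision * 100), int(recall * 100), int(f_measure * 100))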

25.6 Conclusion

In this paper the authors established an unsupervised graph-based Text-Ranking method for extracting keywords from a given Odia text. The empirical results suggest that the proposed approach has the best precision. Its step complexity is O(N * I), which is linear, where N represents the number of nodes in the graph and I is the total number of iterations. Therefore, it is better than supervised algorithms, which have high computational complexity because of their complex training process. The method used here is also language independent, so it can be used with any language. The main disadvantage of this method lies in the procedure used for the removal of the stop words. The removal of the stop words is a crucial step in the algorithm: if all the stop words can be removed from the text, the algorithm can be implemented efficiently to find the keywords. The authors have used the Odia vocabulary to form a database of the stop words. However, some words, such as 'padiba' and 'nihati', are extracted as keywords even though literally they are not keywords; these kinds of words should have been removed from the text in the preprocessing stage. In future work, the authors will try to remove all such words from the text before finding the keywords. The proposed method is also not capable of finding key phrases, which are combinations of keywords, from a given text, so this may also be considered an extended research direction for the authors.

References

1. Mihalcea, R., Tarau, P.: TextRank: Bringing order into texts. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411 (2004)
2. Wang, Q., Sheng, V.S., Wu, X.: Document-specific keyphrase candidate search and ranking. Expert Syst. Appl. 97, 163–176 (2018)
3. Litvak, M., Last, M., Aizenman, H., Gobits, I., Kandel, A.: DegExt—A language-independent graph-based keyphrase extractor. Adv. Intell. Soft Comput. 86, 121–130 (2011)
4. Cai, X., Cao, S.: A keyword extraction method based on learning to rank. In: Proceedings—2017 13th International Conference on Semantics, Knowledge and Grids, SKG 2017, pp. 194–197 (2017)
5. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: Kea: Practical automated keyphrase extraction. In: Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pp. 129–152. IGI Global (2005)
6. Biswas, S.K., Bordoloi, M., Shreya, J.: A graph based keyword extraction model using collective node weight. Expert Syst. Appl. 97, 51–59 (2018)
7. Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classification (Sept 2018)
8. Yuan, Y., Wang, G., Chen, L., Wang, H.: Efficient keyword search on uncertain graph data. IEEE Trans. Knowl. Data Eng. 25(12), 2767–2779 (2013)
9. Beel, J., Langer, S., Gipp, B.: TF-IDuF: A novel term-weighting scheme for user modeling based on users' personal document collections. Bibliothek der Universität Konstanz (2017)

Chapter 26

An Attempt for Wordnet Construction for Odia Language Tulip Das and Smita Prava Mishra

Abstract Wordnet is an effective and powerful tool used in natural language processing (NLP). It is also beneficial for retrieving information for words with similar meanings, i.e., semantic processing. It consists of a database, called a lexical database, which contains the related vocabulary of a language. The words are grouped into synsets. Wordnet is very useful as it has a variety of technological applications in different languages, which has motivated many researchers to construct wordnets in different languages. Creation of a wordnet for a low-resource language like Odia becomes time-consuming and expensive if done from scratch; thus, wordnets can be developed by following the format of another wordnet. The words are translated into the destination language and associated with the particular synsets. In this paper, an attempt has been made to implement this using the "Expansion Approach." The main advantage of this method is that the resulting wordnet in the destination language is aligned to the source wordnet and the interlingual index (ILI). The main purpose of constructing a complete wordnet is to create a synonym set for a low-resource language like Odia.

26.1 Introduction

Wordnet can be defined as a huge structure of words in the form of a graph in which nouns, verbs, adjectives, and adverbs are arranged in synonym sets with different relationships linking the different sets [1, 2]. Wordnets are used to control different entity components on a web page. A wordnet does not have any particular domain; it can also be called a "computerized dictionary" [3, 4]. Wordnets represent synsets by "conceptual semantics and lexical relationships" among words [5] and classify the words of a specific language into a number of groups [6, 7]. Wordnet has a huge variety of technical applications. It is an effective resource for researchers in the fields

26.1 Introduction Wordnet can be defined as a huge structure of words in the form of a graph in which nouns, verbs, adjectives, and adverbs arranged in synonym sets with different relationship linking the different sets [1, 2]. Wordnets are used to control different entity components on a web page. It does not have any particular domain. It can also be called as a “computerized dictionary” [3, 4]. They represent synsets by “conceptual semantics and lexical relationships” among words [5]. It classifies the specific language words into number of groups [6, 7]. Wordnet has a huge variety of technical applications. It is an effective resource for the researchers in the field T. Das (B) · S. P. Mishra Institute of Technical Education and Research, SOA University, Bhubaneswar, India S. P. Mishra e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_26


of computational linguistics, text processing, and other related areas. This makes wordnet very useful in natural language processing and has inspired many researchers to construct wordnets in different languages, including low-resource languages like Odia. The merge approach and the expansion approach are two different methods of wordnet construction. The structure of the paper is as follows: the current section introduces the topic; Sect. 26.2 describes the related work in the field of wordnet construction; Sect. 26.3 discusses the methods of wordnet construction and the procedure for Odia wordnet construction; Sect. 26.4 discusses the experimental setup; the advantages and disadvantages of the expansion approach are discussed in Sect. 26.5; some applications of wordnet are described in Sect. 26.6; and Sect. 26.7 discusses the conclusion and the future work that can be done in this area.

26.2 Related Work

The Thai wordnet was constructed taking the Princeton WordNet (PWN) as source. The corresponding paper describes the "Semi-automatic Construction of Thai WordNet" and the method applied for the Asian WordNet: the wordnet is generated using an existing bilingual dictionary, with the PWN synsets automatically aligned and then manually translated after alignment [8]. The method proposed in "Automatic Wordnet Development for Low-Resource Languages using Cross-Lingual Word Sense Disambiguation (WSD)" was executed to construct the Persian wordnet, and the resulting wordnet was evaluated by performing several experiments [9]. In the paper "Experiences in Building the Konkani Wordnet using the Expansion Approach," Konkani is the destination wordnet and the Hindi wordnet is the source. The Konkani wordnet was in its initial stage of development, and 1969 core Hindi synsets had been included in it [2]. Many projects are in progress to create wordnets in most of the Indian languages using this approach with the Hindi wordnet as the source synset.

26.3 Methods for Wordnet Construction and Procedure for Odia Wordnet Construction

There are two main methodologies for the construction of wordnets [9, 10]. They are described below.

i. Merge Approach: In the merge approach, wordnets are built from scratch. It is a very tedious and costly approach: it takes a huge amount of time for construction, but the accuracy of the resulting wordnet is higher. This method is used for the creation of wordnets for high-resource languages.

ii. Expansion Approach: In the expansion approach, wordnets are built from existing wordnets. It is easier than the merge approach and saves a lot of time, though the wordnets built may resemble the other wordnets. This method is used for the creation of synsets for low-resource languages. In the expansion approach, the wordnet is created by translating the words in the synsets of a previously existing wordnet into the destination language, and a central shared database common to all languages is maintained. Here, the English wordnet is the source wordnet and the target is the Odia wordnet [10, 11]. Manual Construction: In this step, the source synset is translated into the destination-language synset word by word to find out whether each word is present in the destination language [12, 13]. Automatic Synset Alignment: In this step, the word-to-word equivalence is very useful for choosing the right synonym for the right concept [14, 15]; every single word is lexicalized according to the existence of each concept in the chosen low-resource language.

Comparing the two approaches, wordnet construction for low-resource languages preferably follows the "Expansion Approach." Thus, an attempt has been made here to construct the Odia wordnet by implementing the same procedure.

26.4 Experimental Setup

The solution approach for the construction of the Odia wordnet is described below, and the procedure is depicted in Fig. 26.1. To construct the Odia wordnet, Google Colaboratory, a web IDE for Python, is used here, and the Natural Language Toolkit (NLTK) package is installed [7, 16]. A dataset of 200 English words was taken. Each word was translated into Odia, from which the definition of the word in Odia, its synsets, and its part of speech are obtained, along with the usage of the word in an example. The dataset below depicts how an Odia wordnet can be constructed from the English wordnet; Table 26.1 represents a sample. Here, the synsets and the part of speech of a particular word, that is, whether it is a noun, verb, adjective, or adverb, can be found, followed by an example. Following the above steps for the other words, the Odia wordnet can be constructed; an example is depicted in Fig. 26.2. During the translation process, the following significant characteristics have to be considered carefully (a sketch of the lookup stage follows the list).

i. The concepts which may not exist in the source wordnet will be skipped in the Odia wordnet [2].
ii. Some concepts of the source wordnet may not be equal to words in the Odia wordnet.
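A small sketch of the lookup stage is given below, using NLTK's bundled English WordNet; the English-to-Odia translation step is represented by a hypothetical translate_to_odia helper, since the paper does not name its translation tool:

import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def translate_to_odia(text):
    # Hypothetical placeholder for the English-to-Odia translation step.
    return text

# For each English word, collect POS, definition, synonyms, and examples.
for syn in wn.synsets("happy"):
    entry = {
        "pos": syn.pos(),
        "definition": translate_to_odia(syn.definition()),
        "synonyms": [translate_to_odia(l.name()) for l in syn.lemmas()],
        "examples": [translate_to_odia(e) for e in syn.examples()],
    }
    print(entry)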

As wordnet has a huge contribution to natural language processing, an attempt is made here to construct a wordnet in Odia by using a Python IDE. A set of words

Fig. 26.1 Procedure for construction of Odia wordnet

Table 26.1 English words and their synsets in Odia


Fig. 26.2 Example of Odia wordnet

were taken in English as the source language; the example in Fig. 26.2 depicts this. Similarly, Table 26.1 depicts examples from the sample dataset used for the construction of the Odia wordnet.

26.5 Advantages and Disadvantages of Expansion Approach

Though wordnet is a very old concept, it is still not available for some languages because of the pros and cons discussed below.

Advantages: In the expansion approach, synsets are created with reference to previously existing wordnets in other related languages rather than creating them from the beginning, as is done in the merge approach [14]. This makes the construction of a wordnet easier in comparison with the merge approach and saves an enormous amount of time [1, 13]. The wordnet created with the expansion approach can also be utilized in other research works where there is a lack of corpus [8].

Disadvantages: The wordnets constructed using the expansion approach are much influenced by the source language, which lessens the accuracy of the destination language [15]. Co-occurrence calculation among different words in the destination language needs a massive corpus, which is generally not available for low-resource languages [6]; creation of a wordnet using the expansion approach requires a comparatively larger corpus. Language-specific concepts need to be processed in a different way if they are to be added to the wordnet [14].

26.6 Application

Wordnets serve many purposes, such as word sense disambiguation, information retrieval, and automatic classification of texts [17]. Wordnets can also be used for determining similar words, automatic generation of crossword puzzles, machine translation, etc. They can be used for interlinking different vocabularies and for summarizing texts automatically. The usages are indeed manifold.

26.7 Conclusion and Future Works

Wordnet is a very essential part of natural language processing. Wordnet may not have many direct applications in natural language processing, but it is valuable to the NLP community; several other databases are available, but wordnets are still important. Wordnets are basically developed with the merge approach or the expansion approach. Comparing the two, it is observed that developing a wordnet for a low-resource language like Odia with the expansion approach is easier than with the merge approach. Many words have a huge variety of meanings; therefore, linking the contextual words is a difficult task. Linkage and conceptualization of a particular topic which is not present in the source language is a matter of concern. Lexicographers should be given the freedom to coin new words, and coverage of synsets alone cannot resolve ambiguity. These are some of the challenges that can be taken care of while constructing the wordnet in the future.

References

1. Fišer, D.: Leveraging parallel corpora and existing wordnets for automatic construction of the Slovene wordnet. In: Language and Technology Conference, pp. 359–368. Springer, Berlin, Heidelberg (2007)
2. Walawalikar, S., Desai, S., Karmali, R., Naik, S., Ghanekar, D., D'Souza, C., Pawar, J.D.: Experiences in building the Konkani wordnet using the expansion approach
3. Petrolito, T., Bond, F.: A survey of wordnet annotated corpora. In: Proceedings of the Seventh Global WordNet Conference, pp. 236–245 (2014)
4. Fadaee, M., Ghader, H., Faili, H., Shakery, A.: Automatic WordNet construction using Markov chain Monte Carlo. Polibits 47, 13–22 (2013)


5. Jha, N.K., Jethva, A., Parmar, N., Patil, A.: A review paper on deep web data extraction using WordNet. Int. Res. J. Eng. Technol. 3(3), 1003–1006 (2016)
6. Vossen, P., Bond, F., McCrae, J.P.: Toward a truly multilingual global wordnet grid. In: Proceedings of the Eighth Global WordNet Conference, pp. 25–29 (2016)
7. Stoyanova, I., Koeva, S., Leseva, S.: Wordnet-based cross-language identification of semantic relations. In: Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, pp. 119–128 (2013)
8. Leenoi, D., Supnithi, T., Aroonmanakun, W.: Building a gold standard for Thai WordNet. In: Proceedings of the International Conference on Asian Language Processing 2008 (IALP2008), pp. 78–82. COLIPS (2008)
9. Taghizadeh, N., Faili, H.: Automatic wordnet development for low-resource languages using cross-lingual WSD. J. Artif. Intell. Res. 20(56), 61–87 (2016)
10. Shahzad, K., Pervaz, I., Nawab, A.: WordNet based semantic similarity measures for process model matching. In: BIR Workshops, pp. 33–44 (2018)
11. Nguyen, P.T., Pham, V.L., Nguyen, H.A., Vu, H.H., Tran, N.A., Truong, T.T.: A two-phase approach for building Vietnamese wordnet. In: Proceedings of the 8th Global WordNet Conference, Bucharest, Romania, pp. 259–264 (2016)
12. Saedi, C., Branco, A., Rodrigues, J., Silva, J.: Wordnet embeddings. In: Proceedings of the Third Workshop on Representation Learning for NLP, pp. 122–131 (2018)
13. Narayan, D., Chakrabarti, D., Pande, P., Bhattacharyya, P.: An experience in building the Indo WordNet: a wordnet for Hindi. In: First International Conference on Global WordNet, Mysore, India, vol. 24 (2002)
14. Bouchlaghem, R., Elkhlifi, A., Faiz, R.: Tunisian dialect Wordnet creation and enrichment using web resources and other Wordnets. In: Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pp. 104–113 (2014)
15. Miháltz, M., Hatvani, C., Kuti, J., Szarvas, G., Csirik, J., Prószéky, G., Váradi, T.: Methods and results of the Hungarian WordNet project. In: Proceedings of the Fourth Global WordNet Conference, GWC, pp. 387–405 (2008)
16. https://colab.research.google.com
17. Bond, F., Paik, K.: A survey of wordnets and their licenses. Small 8(4), 5 (2012)

Chapter 27

A Deep Learning Approach for Face Mask Detection Dibya Ranjan Das Adhikary, Vishek Singh, and Pawan Singh

Abstract The whole world has been passing through a very difficult time since the outbreak of Covid-19. Wave after wave of this pandemic is hitting people very hard across the globe, and we have lost around 3.8 million lives so far. Moreover, the impact of the pandemic and the pandemic-induced lockdowns on the lives and livelihoods of people in the developing world is very significant. Till now there is no one-shot remedy available to stop this pandemic; however, its spread can be controlled by social distancing, frequent hand sanitization, and the use of face masks in public places. So, in this paper, we propose a model to detect the face masks of people in public places. The proposed model uses the OpenCV module to pre-process the input images; it then uses a deep learning classifier, MobileNetV3, for face mask detection. The accuracy of the proposed model is almost 97%. The proposed model is very light and can be installed on any mobile or embedded system.

27.1 Introduction
Since December 2019, the world has been facing a global crisis due to the Covid-19 pandemic. The disease is caused by a virus named "SARS-CoV-2", which is invisible and considered very deadly in nature. Covid-19 allegedly originated in Wuhan, China, has spread globally since then, and has so far affected more than 114 countries. The virus is highly communicable and spreads through the air when people infected with Covid-19 sneeze or communicate with others [1]. Droplets from the mouth and nose of an infected person who sneezes infect persons in close proximity [2]. The World Health Organization (WHO), along with medical institutions, medical research centers, and doctors, has been studying the virus since 2019 and has come up with a list of measures that can limit the spread of the Covid-19 disease; the list includes wearing a mask for all people above the age of 2 years, maintaining a distance of at least 6 feet from other persons, and regular washing of hands with soap or use of sanitizers [3]. In order to break the chain of infection, we need to follow these measures. However, in a country like India, with a population of 1.2 billion people, it is very difficult to maintain social distancing among people in public places. So, in order to enforce these measures, several rules and regulations have been imposed; still, people are not following them. Though it would be very hard to develop systems to detect whether people are following all the measures, a system can be developed to detect face masks. Face mask detection at public places is therefore the need of the hour, as detecting face masks manually is very difficult and engages a lot of manpower. This paper aims at detecting people who are not wearing any mask, or who do not wear a mask properly, in public places. The system proposed in this paper takes video footage from CCTV cameras installed in the public domain; facial images are extracted from this footage and then given to a deep learning model that classifies them as masked or unmasked. The model uses a Convolutional Neural Network (CNN) to extract features from images, which are then learned by multiple hidden layers. When the proposed system identifies people without masks, it alerts the concerned authority so that necessary action can be taken. This paper is arranged in the following manner. The recent work in this field is organized in Sect. 27.2. Section 27.3 describes the framework used for this work. In Sect. 27.4, the proposed methodology for the development of the entire system is described. Section 27.5 contains the results of the proposed system. Lastly, the conclusion and future work are described in Sect. 27.6.

27.2 Related Works
Face mask detection is the process of identifying whether a person is covering his/her face with a mask or not. The method is similar to detecting any specific object in a scene, and there have been many useful contributions in the field of object detection. At first, a system was proposed by Waranusast et al. [4] in 2013 for the detection of motorcycle safety helmets. The proposed system first extracts moving objects and classifies them as motorcyclists or other moving objects, and then further classifies the motorcyclist class into motorcyclists with and without safety helmets using K-Nearest Neighbors (KNN). However, the proposed system achieved an accuracy of only 74%. Later, various authors improved upon this work using various techniques. Notably, Silva et al. [5] improved the accuracy using a Multi-Layer Perceptron, Rubaiyat et al. [6] used the Circle Hough Transform (CHT) technique for the detection of helmets on construction workers, Vishnu et al. [7] used a CNN for increased accuracy, and Siebert et al. [8] used RetinaNet to provide a very high accuracy.


Though systems exist for other kinds of detection, Nieto-Rodríguez et al. [9] were the first to propose a model for the detection of medical personnel without a surgical mask in an operating room. The proposed system first identifies faces and then determines which faces are with and without a surgical mask. However, due to the unavailability of a proper dataset, the system was trained and tested on the limited set of data that was available. Similarly, Issenhuth et al. [10] proposed another such approach, which can detect surgical masks in the operating room more accurately. With the emergence of the Covid-19 virus, this problem has gained more importance, and model after model has been proposed to provide a more accurate solution. In this regard, Qin and Li [11] proposed super-resolution and classification networks (SRCNet) to classify images with and without facial masks. Similarly, there exists another study [12] that uses deep learning to classify people with and without face masks in smart cities, but the accuracy of that system depends on the number of faces in a frame. Another such approach was proposed by Loey et al. [13] using a cocktail of deep learning and classical machine learning; the proposed model uses decision trees, Support Vector Machines (SVM), and ensemble algorithms for classification of the datasets. Singh et al. [14] proposed another similar model using YOLOv3 and Faster R-CNN (F-RCNN); the authors did a comparative study of both models and found YOLOv3 to be more effective in a normal setup.

27.3 Framework Used
Before describing the proposed methodology, let us have a look at the framework that has been used for developing this system.

Convolutional Neural Network (CNN): A CNN is a form of neural network that is mostly used in image processing. It is a feed-forward network that filters spatial data, and this paradigm is most likely to be found wherever a computer is detecting objects in an image; the model can also be applied to projects involving natural language processing. The word "convolutional" in this network refers to the filtering process: given a complex image, a CNN simplifies it so that it can be better processed and understood. A CNN has numerous layers. Like any other neural network, it has rectified linear unit (ReLU) layers and a fully connected layer. As data passes through each layer of the network, the ReLU layer functions as an activation function, ensuring non-linearity; without it, the data fed into each layer would lose its dimensionality, which is something we want to keep. Meanwhile, classification on the dataset can be done using the fully connected layer. The most significant layer is the convolutional layer. It operates by applying a filter to a set of picture pixels in an array, and a convolutional feature map is the result of this procedure. It is similar to looking through a window at an image, which helps us to see details we would not otherwise be able to notice. This also speeds up processing by reducing the number of parameters that the network must process; the output of this stage is the pooled feature map. These processes perform feature extraction, in which the network constructs a representation of the input image based on its own mathematical principles. If we need to perform classification, we move into the fully connected layer; to accomplish this, everything must be flattened, since this part of the network can only process linear (one-dimensional) data. A CNN can be trained in a variety of ways. Using auto-encoders is one of the most popular: this allows data to be compressed into a small space while executing calculations in the CNN's first layers; once this is complete, the data is reconstructed using additional layers that up-sample it. Generative adversarial networks (GANs) are another technique to train the model.

MobileNetV3: It is a state-of-the-art computer vision recognition system that is twice as fast as its predecessor, MobileNetV2, while maintaining the same accuracy level. Object identification, categorization, and semantic segmentation are all included. Depthwise separable convolution is used in the classifier with the purpose of decreasing the overall complexity. It also decreases the cost and the size of the model, making it suitable for devices having inadequate processing ability.
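To make the framework concrete, the following is a minimal sketch of how a MobileNetV3-based two-class mask classifier could be assembled in Keras. The input size, dropout rate, and frozen-backbone strategy are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: MobileNetV3-Small backbone with a two-class head.
# Assumes TensorFlow >= 2.4; all hyperparameters are illustrative.
import tensorflow as tf

def build_mask_classifier(input_shape=(224, 224, 3)):
    # Pre-trained backbone; the classification top is dropped so that
    # a custom two-class ("Mask" / "No Mask") head can be attached.
    base = tf.keras.applications.MobileNetV3Small(
        input_shape=input_shape, include_top=False, weights="imagenet")
    base.trainable = False  # fine-tune only the new head at first

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(2, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the backbone and training only the head is one common way to exploit pre-trained ImageNet weights on a small dataset; the backbone can later be unfrozen for further fine-tuning.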

27.4 Proposed Methodology
Figure 27.1 depicts the overall operation of the proposed model. The aim of the proposed model is to increase efficiency without using too many resources. At first, the input data are cleaned and pre-processed using the OpenCV module. For this work, we have used OpenCV as it provides a Single Shot Multibox Detector (SSD) [15] with ResNet-18 [16] as its backbone architecture. With the OpenCV module, the system is light enough to be used on embedded devices. Further, we have used a pre-trained MobileNetV3 model [17] for the detection of face masks.

Fig. 27.1 The overall architecture of the proposed model

Dataset Used: In order to predict whether a person is wearing a mask or not, the model first has to be trained with an appropriate dataset. However, finding the right dataset is a very tiresome task, as only a few datasets are available, each with a very limited number of images; moreover, these datasets contain a lot of noise. So, in this work, we have used the dataset by Chandrika Deb [18], which is basically a collection of the Real-World Masked Face Dataset (RMFD), Kaggle datasets, and images collected using the Bing Image Search API. The dataset contains 4095 images, 2165 with masks and 1930 without masks. Some sample images from the dataset used in this work are provided in Fig. 27.2.

Fig. 27.2 Sample images of people having mask and no mask from the dataset

Data Pre-processing: The dataset we are utilizing comprises pictures of various colors, sizes, and orientations. An excellent dataset determines the accuracy of the model trained on it. So, instead of using the dataset directly, we process it to remove all repeated and noisy images. First, we manually deleted all the repeated images; next, all noisy or corrupted images were deleted manually. The pre-processing of the data is then done using the OpenCV module: each image is converted to a greyscale image and re-sized, since we need a common size for all images before feeding them to the neural network. Pre-processing is a routine that accepts a dataset folder as input, loads all the images, and resizes them to fit the model. Once the list is assembled, the pictures are turned into tensors and then converted into a NumPy array for quicker computation. The accuracy of our model improved considerably after pre-processing. Once pre-processing is done, data augmentation is used to further improve the model's accuracy during training.

Data Augmentation: To improve the accuracy of the proposed model, a large amount of data is required to train it. We employed data augmentation to expand the amount of data because the dataset we used was insufficient. In this phase, we rotate and flip each image in the dataset in order to expand it.
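A minimal sketch of the pre-processing and augmentation steps just described, using OpenCV and NumPy; the 224 × 224 target size and the particular flip/rotation choices are assumptions for illustration, not the authors' exact settings.

```python
# Sketch of the pre-processing pipeline: load a folder of images,
# convert to greyscale, resize to a common shape, and stack into a
# NumPy array; simple flip/rotate augmentation then expands the data.
import os
import cv2
import numpy as np

def load_and_preprocess(folder, size=(224, 224)):
    images = []
    for name in sorted(os.listdir(folder)):
        img = cv2.imread(os.path.join(folder, name))
        if img is None:          # skip unreadable/corrupted files
            continue
        grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        images.append(cv2.resize(grey, size))
    # Scale pixel values to [0, 1] for faster, more stable training.
    return np.asarray(images, dtype=np.float32) / 255.0

def augment(batch):
    # Horizontal flips and 90-degree rotations as simple augmentations.
    flipped = batch[:, :, ::-1]
    rotated = np.rot90(batch, k=1, axes=(1, 2))
    return np.concatenate([batch, flipped, rotated], axis=0)
```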


27.5 Results and Discussion
The experiments are executed using a Jupyter Notebook with a Python 3.8 kernel on a Windows 10 system having an Intel Core i5-9300HF, 8 GB RAM, and NVIDIA GeForce GTX 1050 3 GB GDDR5 dedicated graphics. The whole system works in two steps: first it detects faces in an image or live video stream using a trained face detector from OpenCV, and then it passes the found faces to the "mask predictor", which classifies the images into two classes, namely "Mask" and "No Mask". The system uses a face mask detector training script in Python, which accepts the input dataset, loads and pre-processes the images, and labels them using TensorFlow, Keras, and Sklearn. It also fine-tunes the model with the MobileNetV3 classifier using pre-trained ImageNet weights. To train and test the system we have used a custom dataset of 4095 images covering both "Mask" and "No Mask". We divided the whole data into two groups, training and testing, with a split size of 0.8; this indicates that 80% of the total pictures are used as the training set for training the model, while the remaining 20% of the photos go into the test set and are used for testing the accuracy of the model. The proposed model is evaluated using metrics like accuracy, recall, precision, the confusion matrix, and the F1 score, together with a comparison of models. Matplotlib was used to create the confusion matrix. Out of 1638 photos used for validation, the confusion matrix in Fig. 27.3 shows 698 true positives, 48 false negatives, 2 false positives, and 890 true negatives. The proposed model thereby achieved 96.94% accuracy, 99.71% precision, 93.56% recall, and a 96.53% F1 score.

Fig. 27.3 Confusion matrix

Face mask detection is appropriately applied by the developed system to every frame of a real-time webcam video stream. It outputs the confidence probability of "Mask" or "No Mask" in the static images and real-time video streams examined. The system can be easily integrated with various embedded devices with limited computational capacity as it uses the MobileNetV3 architecture. Screenshots from the working face mask detector are supplied in Figs. 27.4 and 27.5. A green bounding box around a person's face indicates that he or she is wearing a mask, and a red bounding box indicates that he or she is not. The value at the top of the rectangle is the confidence probability of wearing or not wearing a mask. To demonstrate the ability to detect face masks accurately, the proposed model was compared with existing state-of-the-art models like LeNet-5 [19], AlexNet [20], and SSDMNV2 [21]. Table 27.1 compares the accuracy of these models with that of the proposed model. It can be observed that the proposed model outperformed the other compared models.
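The reported figures can be re-derived directly from the confusion-matrix counts quoted above; the following quick check uses TP = 698, FN = 48, FP = 2, TN = 890 from Fig. 27.3.

```python
# Re-deriving the reported metrics from the confusion matrix counts.
tp, fn, fp, tn = 698, 48, 2, 890

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 0.9695 -> 96.94%
precision = tp / (tp + fp)                    # 0.9971 -> 99.71%
recall    = tp / (tp + fn)                    # 0.9356 -> 93.56%
f1 = 2 * precision * recall / (precision + recall)  # 0.9653 -> 96.53%

print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, "
      f"recall={recall:.4f}, f1={f1:.4f}")
```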

Fig. 27.4 Face masks detection in images with people with mask and without mask

Fig. 27.5 Face mask detection in images with people with mask

Table 27.1 Comparison of accuracy between different models

Model            Accuracy (%)   Year
LeNet-5          84.6           1999
AlexNet          89.2           2012
SSDMNV2          92.64          2020
Proposed model   96.94          2021

27.6 Conclusions and Future Scope
This paper proposes a face mask detector that helps in detecting whether a person is using a face mask in public places. Deep learning methods are used in the proposed system to achieve this. The model is trained with a variety of facial images, both with and without masks. The accuracy of the proposed system is measured to be 96.94%. The proposed system has an indirect life-saving ability and can surely reduce the spread of coronavirus. The system can be installed in various places like parks, shopping malls, airports, and railway stations. We have not implemented this in a real-world scenario, so our next task will be to do so. Further, we would like to extend this work to detect people without helmets on motorbikes and people not wearing seatbelts in cars.

References
1. Dhand, R., Li, J.: Coughs and sneezes: their role in transmission of respiratory viral infections, including SARS-CoV-2. Am. J. Respir. Crit. Care Med. 202(5), 651–659 (2020)
2. Kähler, C.J., Hain, R.: Fundamental protective mechanisms of face masks against droplet infections. J. Aerosol Sci. 148, 105617 (2020)
3. World Health Organization: Considerations for quarantine of individuals in the context of containment for coronavirus disease (COVID-19): interim guidance, 19 March 2020 (No. WHO/2019-nCoV/IHR_Quarantine/2020.2). World Health Organization (2020)
4. Waranusast, R., Bundon, N., Timtong, V., Tangnoi, C., Pattanathaburt, P.: Machine vision techniques for motorcycle safety helmet detection. In: 2013 28th International Conference on Image and Vision Computing New Zealand (IVCNZ 2013), pp. 35–40. IEEE (2013)
5. Silva, R.R.V., Aires, K.R.T., Veras, R.D.M.S.: Helmet detection on motorcyclists using image descriptors and classifiers. In: 2014 27th SIBGRAPI Conference on Graphics, Patterns and Images, pp. 141–148. IEEE (2014)
6. Rubaiyat, A.H., Toma, T.T., Kalantari-Khandani, M., Rahman, S.A., Chen, L., Ye, Y., Pan, C.S.: Automatic detection of helmet uses for construction safety. In: 2016 IEEE/WIC/ACM International Conference on Web Intelligence Workshops (WIW), pp. 135–142. IEEE (2016)
7. Vishnu, C., Singh, D., Mohan, C.K., Babu, S.: Detection of motorcyclists without helmet in videos using convolutional neural network. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 3036–3041. IEEE (2017)
8. Siebert, F.W., Lin, H.: Detecting motorcycle helmet use with deep learning. Accid. Anal. Prev. 134, 105319 (2020)
9. Nieto-Rodríguez, A., Mucientes, M., Brea, V.M.: System for medical mask detection in the operating room through facial attributes. In: Iberian Conference on Pattern Recognition and Image Analysis, pp. 138–145. Springer, Cham (2015)
10. Issenhuth, T., Srivastav, V., Gangi, A., Padoy, N.: Face detection in the operating room: comparison of state-of-the-art methods and a self-supervised approach. Int. J. Comput. Assist. Radiol. Surg. 14(6), 1049–1058 (2019)
11. Qin, B., Li, D.: Identifying facemask-wearing condition using image super-resolution with classification network to prevent COVID-19. Sensors 20(18), 5236 (2020)
12. Rahman, M.M., Manik, M.M.H., Islam, M.M., Mahmud, S., Kim, J.H.: An automated system to limit COVID-19 using facial mask detection in smart city network. In: 2020 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), pp. 1–5. IEEE (2020)
13. Loey, M., Manogaran, G., Taha, M.H.N., Khalifa, N.E.M.: A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic. Measurement 167, 108288 (2021)
14. Singh, S., Ahuja, U., Kumar, M., Kumar, K., Sachdeva, M.: Face mask detection using YOLOv3 and faster R-CNN models: COVID-19 environment. Multimedia Tools Appl. 80(13), 19753–19768 (2021)
15. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: single shot multibox detector. In: European Conference on Computer Vision, pp. 21–37. Springer, Cham (2016)
16. Ayyachamy, S., Alex, V., Khened, M., Krishnamurthi, G.: Medical image retrieval using ResNet-18. In: Medical Imaging 2019: Imaging Informatics for Healthcare, Research, and Applications, International Society for Optics and Photonics, vol. 10954, p. 1095410 (2019)
17. Qian, S., Ning, C., Hu, Y.: MobileNetV3 for image classification. In: 2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), pp. 490–497. IEEE (2021)
18. https://github.com/chandrikadeb7/Face-Mask-Detection/tree/master/dataset
19. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
20. Alom, M.Z., Taha, T.M., Yakopcic, C., Westberg, S., Sidike, P., Nasrin, M.S., Asari, V.K.: The history began from AlexNet: a comprehensive survey on deep learning approaches. arXiv preprint arXiv:1803.01164 (2018)
21. Nagrath, P., Jain, R., Madan, A., Arora, R., Kataria, P., Hemanth, J.: SSDMNV2: a real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2. Sustain. Cities Soc. 66, 102692 (2021)

Chapter 28

A Computational Intelligence Approach Using SMOTE and Deep Neural Network (DNN) Madhusmita Sahu and Rasmita Dash

Abstract Knowledge of the spatial distribution of tree species is decisive for accurately governing and observing tree-covered ecosystems, specifically in mixed forests of the temperate zone. In our study, we focus on ASTER data to classify tree species of temperate forests in Japan. A dataset is imbalanced if a disproportion exists in the distribution of its class levels. Such problems are usually ignored in the machine learning field, which yields the poorest performance on the minority class levels. So, to enhance the sensitivity of the minority class levels, a synthetic minority over-sampling technique (SMOTE) is adopted along with a Deep Neural Network (DNN). The DNN is used to classify the forest cover dataset into four class levels. This model, which comprises over-sampling (SMOTE) and a DNN, increases the accuracy to 94% and provides a way to enhance the identification of forest types and mapping over a large region.

28.1 Introduction
Forest ecosystems are heavily affected by the increasing pressure of dynamic factors such as wildfires, diseases, population growth, forest insects, and global warming. Wooded areas also develop through the natural regeneration of abandoned agricultural land and through afforestation and reforestation. Accurate classification of tree species is therefore very important for the management and maintenance of forest areas.


The surface of the earth is observed by sensors measuring radiation. The signal depends on how the radiation interacts with the atmosphere and the terrestrial surface of the earth. With the evolution of spectrometers, multispectral and multitemporal remote-sensed images are acquired with 12-bit or higher quantization, increased signal-to-noise ratios, and multi-directional sampling capabilities. This advancement accelerates the development of sophisticated models and opens modern prospects for studying more detailed aspects of the environment with the help of these advanced sensors [1]. The multispectral data from ASTER sensors can be used to map forest types and to determine their biochemical and biophysical properties, including for tree species with low counts in the forest. With the advancement of high-spatial-resolution sensors, forest type classification has become a major research target for forest inventory [2]. A dataset is not balanced if the proportion of classes is not equally distributed. Consider the illustration of pixel delineation of mammogram images that are probably cancerous [3]. The evaluation of machine learning approaches is typically measured by predictive accuracy, but this is not suitable for imbalanced data, as the cost of different errors fluctuates significantly. A common mammography dataset may contain 98% regular and 2% irregular pixels. A straightforward strategy of always predicting the majority class would deliver an accuracy of 98%; however, such a strategy fails to detect the minority class at a fair rate while achieving a minor error rate on the majority class. The issue of imbalanced classes can be addressed in two different ways in machine learning. Firstly, a specific cost can be incurred on training samples [4, 5]; secondly, the dataset can be resampled, either by increasing the count of the minority classes or by decreasing the count of the majority classes [6, 7]. In our model, over-sampling has been chosen to enhance the size of the training dataset, and a deep learning technique is adopted for the classification of forest types. The performance is evaluated and compared using four classification metrics. The organization of the paper is as follows: Sect. 28.2 highlights the methodologies involved in the proposed model, Sect. 28.3 describes the proposed model with a diagram, and the results of the experiments are analyzed in Sect. 28.4. Finally, Sect. 28.5 concludes the paper.

28.2 Methodology Used
28.2.1 Dataset
In our study, the forest dataset, which was obtained from the UCI repository, comprises training data and testing data. These multispectral and multitemporal data were collected by the ASTER sensor. Based on their spectral properties at wavelengths from visible to near-infrared, there are four types of forest [8], listed in Table 28.1. The data contain 27 features, which are listed below.


Table 28.1 List of forest types with their class level

Forest types       Class level
Sugi               0
Hinoki             1
Mixed deciduous    2
Nonforest land     3

Name of the feature                        Information about each feature
b1–b9                                      ASTER image bands containing spectral information in the green, red, and near-infrared wavelengths for three dates (Sept. 26, 2010; March 19, 2011; May 08, 2011)
pred_minus_obs_S_b1–pred_minus_obs_S_b9    Predicted spectral values (based on spatial interpolation) minus actual spectral values for the 's' class (b1–b9)
pred_minus_obs_H_b1–pred_minus_obs_H_b9    Predicted spectral values (based on spatial interpolation) minus actual spectral values for the 'h' class (b1–b9)

The output (forest type map) can be used to identify and/or quantify the ecosystem services (e.g., carbon storage, erosion protection) provided by the forest.

28.2.2 Over-Sampling
Training a feed-forward neural network on an imbalanced dataset may not determine each class perfectly [9]. The authors of [9] stated that the neural network's learning rate can be modified with statistical information about the classes in the dataset: they computed an attention factor from the size of the training samples, and this attention factor influences the learning rate of the network. They demonstrated the effect by increasing the number of minority-class samples to balance the training dataset, which enhanced the accuracy of the minority class. So, we have adopted over-sampling in the pre-processing stage. In the over-sampling approach, the class with smaller cardinality is over-sampled by generating more "synthetic" samples [10]. The synthetic samples of the minority class are created through skew and rotation operations along the line segments joining the k-nearest neighbors within the minority class. We analyze and compare the results of four over-sampling techniques: SMOTE, Borderline SMOTE, Borderline SMOTE SVM, and ADASYN, as sketched below.
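As a minimal sketch, the four over-samplers compared in this chapter can be applied with the imbalanced-learn library as follows; the default neighbour settings and the variable names are illustrative assumptions, not the exact configuration used here.

```python
# Sketch: balancing a training set with the four over-samplers
# evaluated in this chapter, using imbalanced-learn defaults.
from collections import Counter
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN

samplers = {
    "SMOTE": SMOTE(random_state=0),
    "Borderline SMOTE": BorderlineSMOTE(random_state=0),
    "Borderline SMOTE SVM": SVMSMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
}

def balance_all(X_train, y_train):
    balanced = {}
    for name, sampler in samplers.items():
        X_res, y_res = sampler.fit_resample(X_train, y_train)
        print(name, Counter(y_res))  # per-class counts after resampling
        balanced[name] = (X_res, y_res)
    return balanced
```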


Table 28.2 Confusion matrix

                    Predicted negative    Predicted positive
Actual negative     TN                    FP
Actual positive     FN                    TP

28.2.3 DNN
In machine learning, deep neural network (DNN) architectures involving many layers and a large number of parameters have emerged. A DNN extends the conventional neural network with additional layers and parameters. The DNN family has evolved into different structures for different applications, including the multiple-layer perceptron (MLP), convolutional neural networks (CNN) [11], and deep belief networks (DBN), which are adopted for recognition and image classification [12].

28.2.4 Performance Metrics
A machine learning algorithm is generally evaluated through a confusion matrix, as shown in Table 28.2. The columns and rows of the matrix indicate the predicted class and the actual class, where TN (True Negatives) represents the number of correctly classified negative samples, FN (False Negatives) represents the number of positive samples incorrectly classified as negative, TP (True Positives) represents the number of correctly classified positive samples, and FP (False Positives) represents the number of negative samples incorrectly classified as positive. Based on these TN, FN, FP, and TP factors, the following metrics are calculated:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{28.1}$$

$$\text{Sensitivity (Recall)} = \frac{TP}{TP + FN} \tag{28.2}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{28.3}$$

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{28.4}$$
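For reference, Eqs. (28.1)–(28.4) translate directly into code. The sketch below computes one-vs-rest counts for a chosen class from label arrays; the function and variable names are illustrative.

```python
# Direct implementation of Eqs. (28.1)-(28.4) from one-vs-rest counts
# for a single class of a multi-class problem.
import numpy as np

def per_class_metrics(y_true, y_pred, positive_class):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos_t, pos_p = y_true == positive_class, y_pred == positive_class
    tp = np.sum(pos_t & pos_p)
    tn = np.sum(~pos_t & ~pos_p)
    fp = np.sum(~pos_t & pos_p)
    fn = np.sum(pos_t & ~pos_p)
    accuracy  = (tp + tn) / (tp + tn + fp + fn)         # Eq. (28.1)
    recall    = tp / (tp + fn)                          # Eq. (28.2)
    precision = tp / (tp + fp)                          # Eq. (28.3)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (28.4)
    return accuracy, recall, precision, f1
```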


Fig. 28.1 Proposed model

28.3 Proposed Model
The objective of our work is to build a model that can handle imbalanced datasets and classify samples into four types of forest; such a model would be useful for maintaining a forest inventory. The model comprises several steps, as shown in Fig. 28.1. The multitemporal and multispectral data collected by the ASTER sensor are available in the UCI repository in the form of training and testing datasets. Both datasets are concatenated and then normalized to the range (0,1) using the max–min technique. The whole dataset is then partitioned into training, testing, and validation sets. The imbalanced training dataset is balanced through the SMOTE technique and used to train the DNN classification model for better efficiency on the minority classes. The performance of the model is evaluated using classification metrics such as recall, precision, accuracy, F1-score, and support.

28.4 Experiments
We have collected the Japanese forest dataset, which comprises a training dataset of size (198 × 27) and a testing dataset of size (327 × 27); 27 spectral features are used to identify the type of forest. As mentioned in Fig. 28.1, the training and testing datasets are first concatenated to form a single set of data of size (525 × 27). The values are then normalized to the range (0,1) using the max–min technique, and the dataset is split into 80% training data and 20% testing data. The classes of this training dataset are not equally distributed, so it is called an imbalanced dataset. The imbalance can be visualized by the bar chart and scatter plot in Fig. 28.2. The forest dataset consists of four classes: class 0, class 1, class 2, and class 3. The number of samples is not equally distributed among the classes: Class = 0, n = 119 (19.071%); Class = 1, n = 75 (12.019%); Class = 2, n = 68 (10.897%); Class = 3, n = 156 (25.000%).

Fig. 28.2 Class = 0, n = 119 (19.071%), Class = 3, n = 156 (25.000%), Class = 2, n = 68 (10.897%), Class = 1, n = 75 (12.019%)

A balanced dataset can be generated through over-sampling techniques, i.e., by increasing the number of samples of the minority classes. From Fig. 28.2, we notice that classes 0, 1, and 2 are minority classes compared to class 3. Accordingly, four over-sampling techniques have been used and compared in our experiment. Scatter plots are drawn to analyze the distribution of samples of each class after SMOTE, Borderline SMOTE, Borderline SMOTE with SVM, and ADASYN in Figs. 28.3, 28.4, 28.5 and 28.6, respectively. The size of the training dataset increases as samples are added by the over-sampling techniques.

Fig. 28.3 Class = 0, n = 156 (25%), Class = 3, n = 156 (25%), Class = 2, n = 156 (25%), Class = 1, n = 156 (25%)

Fig. 28.4 Class = 0, n = 160, Class = 3, n = 160, Class = 2, n = 160, Class = 1, n = 160

Fig. 28.5 Class = 0, n = 153, Class = 3, n = 153, Class = 2, n = 153, Class = 1, n = 153

Fig. 28.6 Class = 0, n = 155, Class = 3, n = 155, Class = 2, n = 155, Class = 1, n = 155

The DNN is then trained with the training dataset generated by each over-sampling technique, each having an equal distribution of classes. The structure of the DNN contains one input layer of 27 neurons, 3 hidden layers with ReLU activation functions, and one output layer with a SoftMax activation function; the network is optimized with the ADAM optimizer. The structure of the DNN is shown in Fig. 28.7.

Fig. 28.7 Architecture of DNN

The model is evaluated through classification metrics; Tables 28.3, 28.4, 28.5, 28.6 and 28.7 show the performance of the proposed model. From these tables, the accuracy of the model without over-sampling and with SMOTE, Borderline SMOTE, Borderline SMOTE with SVM, and ADASYN is 88%, 94%, 92%, 89%, and 94%, respectively. This means that using an over-sampling technique enhances the accuracy of the DNN. Among the four over-sampling techniques, we next examine the sensitivity of the minority classes under each technique. The sensitivity of the minority classes is demonstrated in Figs. 28.8, 28.9, 28.10 and 28.11 for the corresponding over-sampling techniques.
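The DNN just described (27 input neurons, three ReLU hidden layers, a four-way SoftMax output, and the ADAM optimizer) can be sketched in Keras as follows; the hidden-layer widths are not stated above and are therefore assumed for illustration.

```python
# Sketch of the described DNN: 27 inputs, 3 ReLU hidden layers,
# 4-class softmax output, trained with the Adam optimizer.
# Hidden-layer widths (64/32/16) are assumed for illustration.
import tensorflow as tf

def build_dnn(n_features=27, n_classes=4):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Typical use: fit on a SMOTE-balanced training set, e.g.
# model = build_dnn()
# model.fit(X_balanced, y_balanced, validation_data=(X_val, y_val))
```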


Table 28.3 Performance metrics of DNN without SMOTE

Class      Precision   Recall   F1-score   Support
0          0.828       0.935    0.878      31
1          0.842       0.889    0.864      18
2          0.933       0.823    0.875      17
3          0.944       0.871    0.906      39
Accuracy   0.885

Table 28.4 Performance metrics of DNN with SMOTE

Class      Precision   Recall   F1-score   Support
0          0.965       0.965    0.965      29
1          0.882       0.937    0.909      16
2          0.933       0.933    0.933      15
3          0.954       0.933    0.944      45
Accuracy   0.942

Table 28.5 Performance metrics of DNN with borderline SMOTE

Class      Precision   Recall   F1-score   Support
0          0.937       0.909    0.923      33
1          0.905       0.950    0.926      20
2          0.833       1.000    0.909      15
3          0.970       0.891    0.929      37
Accuracy   0.923

Table 28.6 Performance metrics of DNN with borderline SMOTE and SVM

Class      Precision   Recall   F1-score   Support
0          0.933       0.823    0.875      34
1          0.867       0.867    0.867      15
2          0.875       0.933    0.903      15
3          0.886       0.951    0.917      41
Accuracy   0.895

Table 28.7 Performance metrics of DNN with ADASYN

Class      Precision   Recall   F1-score   Support
0          1.000       0.974    0.987      39
1          0.867       0.867    0.867      15
2          1.000       0.933    0.965      15
3          0.895       0.944    0.918      36
Accuracy   0.942


Fig. 28.8 Comparison of the sensitivity of each class between DNN without SMOTE and DNN with SMOTE

Fig. 28.9 Comparison of the sensitivity of each class between DNN without SMOTE and DNN with borderline SMOTE

From the above figures, we notice that the sensitivity of classes 0, 1, and 2 is enhanced with SMOTE at an accuracy of 94%. With the other over-sampling techniques, the sensitivity of the minority classes is not enhanced uniformly, although they achieve good accuracy.


Fig. 28.10 Comparison of the sensitivity of each class between DNN without SMOTE and DNN with borderline SMOTE SVM

Fig. 28.11 Comparison of the sensitivity of each class between DNN without SMOTE and DNN with ADASYN

28.5 Conclusion
The results show that the proposed model, which combines a DNN with SMOTE, can improve the accuracy of the minority classes as well as the overall accuracy. The model increased the sensitivity of the minority classes 0, 1, and 2 while reaching an accuracy of 94%; when SMOTE is not adopted, the sensitivity of the minority classes is lower and the accuracy is 88%. From this experiment, we conclude that when the DNN is trained with a balanced dataset, it comparatively increases the overall accuracy along with that of the minority classes.

References
1. Pinty, B., Widlowski, J.L., Taberner, M., Gobron, N., Verstraete, M.M., Disney, M., Zang, H.: Radiation transfer model intercomparison (RAMI) exercise: results from the second phase. J. Geophys. Res. Atmos. 109(D6) (2004)
2. Goodenough, D., Li, J., Asner, G., Schaepman, M., Ustin, S., Dyk, A.: Combining hyperspectral remote sensing and physical modeling for applications in land ecosystems. In: 2006 IEEE International Symposium on Geoscience and Remote Sensing, pp. 2000–2004. IEEE (2006)
3. Woods, K.S., Solka, J.L., Priebe, C.E., Kegelmeyer Jr., W.P., Doss, C.C., Bowyer, K.W.: Comparative evaluation of pattern recognition techniques for detection of microcalcifications in mammography. In: State of the Art in Digital Mammographic Image Analysis, pp. 213–231 (1994)
4. Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., Brunk, C.: Reducing misclassification costs. In: Machine Learning Proceedings 1994, pp. 217–225. Morgan Kaufmann (1994)
5. Domingos, P.: MetaCost: a general method for making classifiers cost-sensitive. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 155–164 (1999)
6. Ling, C.X., Li, C.: Data mining for direct marketing: problems and solutions. In: KDD, vol. 98, pp. 73–79 (1998)
7. Japkowicz, N.: The class imbalance problem: significance and strategies. In: Proceedings of the International Conference on Artificial Intelligence, vol. 56 (2000)
8. Masek, J.G., Hayes, D.J., Hughes, M.J., Healey, S.P., Turner, D.P.: The role of remote sensing in process-scaling studies of managed forest ecosystems. For. Ecol. Manage. 355, 109–123 (2015)
9. DeRouin, E., Brown, J., Beck, H., Fausett, L., Schneider, M.: Neural network training on unequally represented classes. In: Intelligent Engineering Systems Through Artificial Neural Networks, pp. 135–145 (1991)
10. Ha, T.M., Bunke, H.: Off-line, handwritten numeral recognition by perturbation method. IEEE Trans. Pattern Anal. Mach. Intell. 19(5), 535–539 (1997)
11. Sahu, M., Dash, R.: A survey on deep learning: convolution neural network (CNN). In: Intelligent and Cloud Computing, pp. 317–325. Springer, Singapore (2021)
12. Sahu, M., Tripathy, A.: A map based image retrieval technique for spatial database system. 1 (2012)

Chapter 29

Face Mask Detection in Public Places Using Small CNN Models Prabira Kumar Sethy, Susmita Bag, Millee Panigrahi, Santi Kumari Behera, and Amiya Kumar Rath

Abstract The spread of coronavirus among people in a crowded place can be prevented by making face masks mandatory, so that droplets from the mouth and nose do not reach the masses nearby. The negligence of some people, i.e., not wearing a mask, causes the spread of this pandemic. Therefore, persons who do not wear masks should be tracked at the entrances of public venues such as malls, institutions, and banks. The proposed mechanism warns whether an individual is wearing a mask or not. The proposed system is built on small CNN models so that it can be integrated into low-end devices at minimal cost. The small CNN models ShuffleNet and MobileNetV2 are evaluated in both transfer learning and deep learning settings, and the deep learning models have better performance than the transfer learning ones. The best deep learning approach, i.e., MobileNetV2 plus a Support Vector Machine, achieved 99.5% accuracy, 99.01% sensitivity, 100% precision, 0% FPR, 99.50% F1 score, 99.01% MCC, and 99.01% kappa coefficient.

29.1 Introduction
The worldwide COVID-19 coronavirus outbreak has made public mask use quite widespread. Masks are used to combat the spread of the virus through air transmission. Some people also use masks to hide their emotions from others, while some wear them out of self-consciousness about their appearance. Wearing a face mask or other covering over the nose and mouth reduces the chance of coronavirus infection by over 90% by reducing the forward distance traveled by exhaled breath, according to recent research from the University of Edinburgh [1]. According to the WHO, wearing a face mask and taking other precautions like social distancing and frequent hand washing can help minimize the spread of the coronavirus. Many countries have enacted their own regulations on face masks in response to these instructions, although many people have refused to heed government directives. Police are struggling to apprehend these individuals and are unable to locate them all. Face mask detection and object detection are two computer vision techniques that can aid the police in identifying people who are not wearing masks in public places and in controlling them. The two main reasons behind the virus's transmission, according to the WHO, are respiratory droplets and physical contact between persons [2]. If an infected person sneezes or coughs, respiratory droplets can spread through the air and reach other people nearby (within 1 m). Every government and public health department has issued advisories on social distancing and wearing masks [3]. Surgical masks, medical masks, procedural masks, and designer masks of various shapes are all worn by people [4]. To make wearing a face mask compulsory, specific processes must be created that require people to put on a mask before entering public locations. Many machine learning algorithms have been used in the past to analyze faces for security, authentication, and surveillance; however, widespread mask wearing can render such systems useless. The earliest attempts at face recognition were conducted in 1973 [5]. Handcrafted features and machine learning techniques were proposed by Nanni et al. to produce good classifiers for detection and recognition [6].

29.2 Related Work Mostly the previous research focused on face identification and reconstruction of the different images of the face. This research aims to develop a machine-based model to detect face mask in persons which can help to reduce the transmission of COVID-19. Scientists and researchers have found that using face masks reduces the spread of COVID-19 virus. In [7] a new method is adopted to wear a face mask. They were able to categorize three different types of wearing a face mask. Correct facemask wear, wrong facemask wear, and no facemask wear are the three different types which attained 98.60% accuracy. Principal Component Analysis (PCA) was used by Ejaz et al. [8] to identify the person on masked and unmasked facial recognition. They discovered that wearing masks had a significant impact on the accuracy of facial resonation utilizing the PCA. Considering larger datasets for proper detection of face masks, Akhil et al. [9] proposed an algorithm with 52,635 images with and without masks. The authors used original YOLO v4 and achieved the highest mean average precision of 71.69% among all YOLO versions. Tiny YOLO v4 has the greatest MAP of all tiny variations at 57.71%. In [10], the ResNet model is considered the baseline model, and transfer learning is implemented to obtain an accuracy of 98.2%. This proposed model achieves a low inference time and high accuracy, which is applicable for video surveillance. Emotions were classified in real-time in [11]. Models based on deep learning and classical machine learning, the authors of [12] created a 95% accurate face mask detection beep warning. Uncovering the truth about the usage of required medical masks in operating rooms was another group of authors’ goal.


The proposed system's accuracy was 95% [13]. More study is needed to prevent coronavirus dissemination via infected persons' droplets. Detected masks are encircled with green or red bounding boxes. The authors in [14] achieved 55% precision for YOLOv3 and 62% precision for the Faster R-CNN model, with 0.045 and 0.15 s inference times for the two models, respectively; for embedded or mobile systems, however, YOLOv3 requires a lot of processing power at inference. In a different direction, a GAN-based network can remove facial masks and reconstruct the image: Nizam et al. [15] propose such a method for generating a realistic full-face image.

29.3 Small CNN Architectures: MobileNetV2 and ShuffleNet
29.3.1 MobileNetV2
The MobileNet-v2 network is a residual network, i.e., a Directed Acyclic Graph (DAG) network with residual (or shortcut) connections that skip over the main network layers. The MobileNet-v2 convolutional neural network used here was pre-trained on a portion of the ImageNet database. Compared to MobileNetV1 [16], the goal of MobileNetV2 is to reduce computational complexity and the number of learned parameters while maintaining equivalent accuracy. Figure 29.1 shows the operational flow of MobileNetV2.

29.3.2 ShuffleNet
ShuffleNet is a newer CNN architecture that performs better than MobileNetV2 when implemented for mobile device applications. ShuffleNet achieves a low computational cost for devices with limited computing power, i.e., 10–150 MFLOPs [17]. Furthermore, ShuffleNet supports additional feature map channels, which aid in encoding more information and are particularly essential to the performance of very small networks. The fundamental building block, the ShuffleNet unit, consists of a first 1 × 1 point-wise group convolution followed by a channel shuffle operation. The operational flow of ShuffleNet is illustrated in Fig. 29.2.

29.4 Materials and Methodology
This section describes the dataset details and the adopted methodology.

Fig. 29.1 The operational flow of MobileNetV2

Fig. 29.2 The operational flow of ShuffleNet


Fig. 29.3 Images of people having face masks and no mask

29.4.1 Dataset Description
This research is conducted on a publicly available dataset, the COVID Face Mask Detection Dataset [18]. The dataset contains 1006 images: 503 with masks and 503 without masks. Figure 29.3 shows some samples from the dataset.

29.4.2 Methodology
The methodology involves two approaches: transfer learning and deep learning. In the transfer learning approach, the small CNN models MobileNetV2 and ShuffleNet are fine-tuned for classifying images of people wearing or not wearing masks; that is, the last layers of ShuffleNet and MobileNetV2 are refashioned for a two-class classification problem, and both small CNN models are evaluated individually. The transfer learning model for face mask detection is represented in Fig. 29.4. In the deep learning approach, the small CNN models serve as feature extractors: the last (SoftMax) layer is replaced by an SVM. The SVM has the advantage of enhancing the CNN model's performance, as it provides error correction with a one-vs-all coding design. The deep learning approach for face mask detection is depicted in Fig. 29.5 and sketched in code below.
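The authors' experiments were run in MATLAB (Sect. 29.5); as a language-neutral illustration, the deep-feature-plus-SVM idea can be sketched in Python as follows, where the globally pooled MobileNetV2 features and the linear kernel are assumptions, not the exact configuration used.

```python
# Sketch of the deep-feature + SVM approach: a pre-trained MobileNetV2
# acts as a fixed feature extractor and a binary SVM makes the
# mask / no-mask decision. Illustrative only; the authors' experiments
# were run in MATLAB 2019a.
import tensorflow as tf
from sklearn.svm import SVC

extractor = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", pooling="avg")

def deep_features(images):
    # images: float array of shape (n, 224, 224, 3), pixel values in [0, 255]
    x = tf.keras.applications.mobilenet_v2.preprocess_input(images)
    return extractor.predict(x)

def train_mask_svm(train_images, train_labels):
    feats = deep_features(train_images)       # one feature vector per image
    clf = SVC(kernel="linear")                # binary SVM on pooled features
    clf.fit(feats, train_labels)
    return clf
```

The same `deep_features` routine would be applied to test images, with the fitted SVM predicting the mask / no-mask label.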


Fig. 29.4 Face mask detection using small CNN model in transfer learning

Fig. 29.5 Face mask detection using small CNN with SVM

29.5 Result and Discussion
The two small CNN models, i.e., ShuffleNet and MobileNetV2, are evaluated in both the transfer learning and deep learning approaches. The models are executed on a Windows 10, Core i5 (5th generation), 8 GB RAM laptop with an in-built NVIDIA GeForce GPU, on the MATLAB 2019a platform. The training parameters, i.e., a mini-batch size of 32, an initial learning rate of 0.001, a maximum of 10 epochs, and a validation frequency of 5, are used along with the learning method stochastic gradient descent with momentum (SGDM). In addition, in the deep learning approach, the deep features of MobileNetV2 and ShuffleNet are fed to the SVM for the detection of face masks. The training and validation progress of ShuffleNet and MobileNetV2 in the transfer learning approach is shown in Figs. 29.6 and 29.7, respectively. Again, the confusion matrices of ShuffleNet and MobileNetV2 in the transfer learning and deep learning approaches are depicted in Figs. 29.8 and 29.9, respectively. Further, the evaluation scores, based on accuracy, sensitivity, precision, FPR, F1 score, MCC, and kappa coefficient, are recorded in Table 29.1.


Fig. 29.6 Training and validation progress of ShuffleNet

Fig. 29.7 Training and validation progress of Mobilenetv2

Fig. 29.8 Confusion matrix of ShuffleNet (left) and MobileNetV2 (right) in transfer learning

Fig. 29.9 Confusion matrix of ShuffleNet (left) and MobileNetV2 (right) in deep learning

Table 29.1 Performance measures in % of small CNN models

Classification models   Accuracy   Sensitivity   Precision   FPR     F1 score   MCC     Kappa coefficient
ShuffleNet              75.25      61.39         84.93       10.89   71.26      52.55   50.50
MobileNetV2             71.78      59.41         78.95       15.84   67.80      44.96   43.56
ShuffleNet-SVM          98.51      98.02         99.00       0.99    98.51      97.03   97.03
MobileNetV2-SVM         99.50      99.01         100.00      0.00    99.50      99.01   99.01

29.6 Conclusion
An alert for wearing a mask at the entrances of public places is necessary to protect people against the coronavirus. Droplets released from the mouth and nose of an affected person can infect the person in front if sneezing or coughing occurs in crowded areas and on public transport. Therefore, it is extremely necessary to wear a face mask properly while traveling or in any public place. The proposed model efficiently detects people not wearing masks properly and has achieved 99.5% accuracy, 99.01% sensitivity, 100% precision, 0% FPR, 99.50% F1 score, 99.01% MCC, and 99.01% kappa coefficient. This work may be extended to detect face masks in real-time video surveillance of mass gatherings.


References
1. Godoy, L.R.G., et al.: Facial protection for healthcare workers during pandemics: a scoping review. BMJ Glob. Health 5(5) (2020)
2. WHO Coronavirus Disease (COVID-19) Dashboard. https://covid19.who.int/. Last accessed 21 July 2021
3. Qin, B., Li, D.: Identifying facemask-wearing condition using image super-resolution with classification network to prevent COVID-19. Sensors 20(18), 5236 (2020)
4. Esposito, S., Principi, N., Leung, C.C., Migliori, G.B.: Universal use of face masks for success against COVID-19: evidence and implications for prevention policies. Eur. Resp. J. 55(6) (2020)
5. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678 (2014)
6. Nanni, L., Ghidoni, S., Brahnam, S.: Handcrafted vs. non-handcrafted features for computer vision classification. Pattern Recogn. 71, 158–172 (2017)
7. Inamdar, M., Mehendale, N.: Real-time face mask identification using FaceMaskNet deep learning network. Available at SSRN 3663305 (2020)
8. Ejaz, M.S., Islam, M.R., Sifatullah, M., Sarker, A.: Implementation of principal component analysis on masked and non-masked face recognition. In: IEEE International Conference on Advances in Science, Engineering and Robotics Technology (2019)
9. Kumar, A., Kalia, A., Verma, K., Sharma, A., Kaushal, M.: Scaling up face masks detection with YOLO on a novel dataset. Optik 239, 166744 (2021)
10. Sethi, S., Kathuria, M., Kaushik, T.: Face mask detection using deep learning: an approach to reduce risk of coronavirus spread. J. Biomed. Inform. 120, 103848 (2021)
11. Hussain, S.A., Al Balushi, A.S.A.: A real time face emotion classification and recognition using deep learning model. J. Phys. Conf. Ser. 1432(1), 012087 (2020)
12. Maurya, P., Nayak, S., Vijayvargiya, S., Patidar, M.: COVID-19 face mask detection. In: 2nd International Conference in Advanced Research in Science, Engineering and Technology, pp. 20–34. Paris, France (2020)
13. Nieto-Rodríguez, A., Mucientes, M., Brea, V.M.: System for medical mask detection in the operating room through facial attributes. In: Iberian Conference on Pattern Recognition and Image Analysis, pp. 138–145. Springer, Cham (2015)
14. Singh, S., Ahuja, U., Kumar, M., Kumar, K., Sachdeva, M.: Face mask detection using YOLOv3 and faster R-CNN models: COVID-19 environment. Multimedia Tools Appl. 80(13), 19753–19768 (2021)
15. Din, N.U., Javed, K., Bae, S., Yi, J.: A novel GAN-based network for unmasking of masked face. IEEE Access 8, 44276–44287 (2020)
16. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
17. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856 (2018)
18. https://www.kaggle.com/prithwirajmitra/covid-face-mask-detection-dataset

Chapter 30

LSTM-RNN-Based Automatic Music Generation Algorithm R. I. Minu, G. Nagarajan, Samarjeet Borah, and Debahuti Mishra

Abstract In recent years, recurrent neural network models have defined the normative process in producing efficient and reliable results in various challenging problems such as translation, voice recognition, and image captioning, thereby making huge strides in learning useful feature representations. With these latest advancements in deep learning, RNNs have garnered fame in computational creativity tasks such as even that of music generation. In this paper, we investigate the generation of music as a sequence-to-sequence modeling task. We compare the performance of multiple architectures to highlight the advantages of using an added attention mechanism for processing longer sequences and producing more coherent music with a better quality of harmony and melody. The dataset used for training the models consists of a suitable number of MIDI files downloaded from the Lakh MIDI dataset. To feed the MIDI files to the neural networks, we use piano roll as the input representation. The recurrent neural networks learn long-term temporal specific structures and patterns present in musical data. Once we are done with training, we can generate the musical data using a primer of musical notes.

R. I. Minu, SRM Institute of Science and Technology, Kattankulathur, Tamil Nadu, India; G. Nagarajan, Sathyabama Institute of Science and Technology, Chennai, Tamil Nadu, India; S. Borah, Sikkim Manipal Institute of Technology, Sikkim Manipal University, Sikkim, India; D. Mishra, ITER, Siksha 'O' Anusandhan University, Bhubaneswar, India

30.1 Introduction

A human brain can easily interpret sequential data due to our ability to utilize the contextual information provided by the understanding of previous inputs. Traditional neural networks are plagued by the loss of comprehension of precious preceding information. Recurrent neural networks evade this issue by using loops to allow

for the persistence of information and are widely used for natural language processing problems [1, 2]. Music is, after all, the ultimate language. Many well-known artists have created compositions with creativity and unwavering intent. Musicians like Johann Sebastian Bach are renowned for their prolific talent in generating music that contains a very strong degree of underlying musical structure [3, 4]. Our work is an attempt to develop and compare recurrent neural network models that concisely convey the idea of tunefulness and consonance and can similarly learn such musical structure. We do so by providing a purposeful policy to embody notes in music, using piano roll matrices as our input and target representations, and build networks involving multi-layered LSTMs and encoder-decoders with and without an added attention layer. Our goal, therefore, is to evaluate generative models that harness the power of recurrent neural networks to produce musical pieces with tunefulness and consonance that are pleasing to the ear, as if composed by humans [5, 6]. We aim to use the best and latest innovations in deep learning for the generation of music. We are very excited to find new tools for music creation. This will enable anyone with little musical knowledge to create their own complex music easily and quickly [7, 8].

30.2 Related Work

30.2.1 Melody-RNN

Melody-RNN was developed under Google's open-source project Magenta. Project Magenta is constantly working on finding new techniques to generate art and music, possibly generating pleasing music without any human intervention. Melody-RNN is built as a two-layered LSTM network. At present, Melody-RNN comes in three variations [9–11]. The first is the normal two-layered LSTM network, which utilizes one-hot encoded note sequences extracted from MIDI melodies as input and output targets for the LSTM. The second is Lookback-RNN, which introduces customized inputs and labels that let the model comfortably extract patterns occurring across one and two bars. The final one is Attention-RNN, which contains an added attention layer that lets the model focus on important aspects of past data without the need to store all of that data in the RNN cell's current state [12–14].

30.2.2 Biaxial-RNN

Biaxial-RNN is Daniel Johnson's impressive RNN music composition undertaking [15, 16]. Its blueprint outlines properties such as:


• Creation of polyphonic music based on the choice of a meaningful collection of notes or chords.
• Orderly repetition of identical notes, i.e., playing A# two times is fundamentally different from keeping A# pressed for two beats.
• Understanding tempo: the ability to create music based on a predefined time signature.
• Note-independence: being able to transpose musical notes up and down within the same network framework.

An important point to note is that many RNN-based music generation techniques are independent of time but dependent on notes. However, if we transpose a piece up one octave, we would wish the model to produce a nearly identical musical piece rather than something extremely divergent. The Biaxial-RNN model therefore contains two axes (a time axis and a note axis) that permit historical information to be properly utilized along both the time axis and the note axis [17, 18].

30.2.3 WaveNet

The WaveNet model has been heavily influenced by earlier work at Google, specifically a model named PixelRNN that was also created by Google's DeepMind division. WaveNet is special because it uses raw audio as the training data; even a single second of sound contains more than 16,000 samples, so making a single prediction conditioned on all the preceding samples is an extremely daunting task. WaveNet circumvents this issue by using a fully convolutional neural network along the time axis, wherein the convolutional layers are dilated so that the receptive field expands exponentially with increasing depth and covers hundreds of timesteps within each stack of computation in the network. The output is therefore more meaningful and is far better than the latest text-to-speech models, closing the gap to human-level performance by more than 50%. During the training phase, the input data consists of actual waveforms. After training is completed, we can sample from the network's generated probability distribution to create authentic music. Generating samples like this one timestep at a time is very computationally straining, but it is equally important for producing highly convincing-sounding music. The WaveNet model also needs a large amount of training data so that it can learn to produce good-quality musical pieces [19].
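To make the dilation idea concrete, the sketch below (illustrative only, not DeepMind's implementation; the layer count and filter width are assumptions) stacks causal 1-D convolutions whose dilation rate doubles at every layer, so the receptive field grows exponentially with depth:

```python
import tensorflow as tf

def dilated_stack(num_layers=8, filters=32):
    # Raw audio enters as a (timesteps, 1) sequence of samples.
    inputs = tf.keras.Input(shape=(None, 1))
    x = inputs
    for i in range(num_layers):
        # dilation_rate = 1, 2, 4, ..., 128: with kernel_size=2 the receptive
        # field after num_layers layers spans 2**num_layers past samples.
        x = tf.keras.layers.Conv1D(filters, kernel_size=2, padding="causal",
                                   dilation_rate=2 ** i, activation="relu")(x)
    # A 1x1 convolution produces the next-sample prediction at every timestep.
    outputs = tf.keras.layers.Conv1D(1, kernel_size=1)(x)
    return tf.keras.Model(inputs, outputs)
```

With eight layers the stack already sees 256 past samples per prediction, which is how WaveNet-style models cover long audio contexts without recurrence.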

30.3 System Architecture

We have implemented four system architectures, as follows.


30.3.1 LSTM

Recurrent neural networks face a problem while training on long sequences: they are not able to remember long-term dependencies. They face the problem of vanishing gradients because in recurrent neural networks we use backpropagation through time (BPTT) for calculating gradients. In BPTT, the gradients become smaller or larger as we propagate backward through the network after each time step, because recurrence brings repeated multiplication, which causes gradients to become very small or very large; this leads to the layers at the start of the network being updated far less than the layers at the end. The solution is to use the long short-term memory (LSTM) model. The core concept in LSTMs is to make sure information resides within protected memory cells that are not damaged or affected by the normal flow of data in the recurrent network. The dilemma of writing to, reading from, and forgetting, via the operation of the gates, is controlled at the meta-level and is learned during the training process. Thus, the gates have their own weights, which are differentiable and enable BPTT and the normative process of learning; hence, every LSTM cell adapts its weights on the basis of its input data to reduce the overall loss. We utilized a long short-term memory (LSTM) recurrent neural network (RNN), as shown in Figs. 30.1 and 30.2, with two layers and dropout to create a regularized design that predicts the notes at the next time step. This architecture lets the user choose different hyperparameters such as the sequence length, batch size, and learning rate. This whole work was performed using the TensorFlow library.
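A minimal sketch of this two-layer LSTM, built with the TensorFlow/Keras API, is given below; the unit counts and dropout rate are assumptions for illustration, while the piano-roll dimensions (49 notes, 50 timesteps) follow the representation used in this paper:

```python
import tensorflow as tf

NUM_NOTES, SEQ_LEN = 49, 50  # piano-roll dimensions used in this work

model = tf.keras.Sequential([
    # The first recurrent layer returns the full sequence so the second
    # layer can also operate per timestep.
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3,
                         input_shape=(SEQ_LEN, NUM_NOTES)),
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3),
    # One independent sigmoid per note: several notes may be on at once.
    tf.keras.layers.Dense(NUM_NOTES, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Because each note gets its own sigmoid output, the prediction at every timestep is a vector of per-note probabilities rather than a single class, matching the polyphonic setting described later.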

Fig. 30.1 LSTM architecture


Fig. 30.2 LSTM model architecture

30.3.2 LSTM with Attention

A very challenging problem in utilizing machine learning to produce long sequences, such as music, is creating long-term structure. Long-term structure comes easily to people, but it is difficult for neural networks. In the absence of long-term structure, the music generated by neural nets tends to sound aimless and random. Humans translating a rather long sentence usually pay special attention to the word they are presently translating. LSTM neural networks achieve this functionality via an added attention mechanism, focusing on important portions of the sequential data they are provided with. The implementation is shown in Figs. 30.3 and 30.4. The attention mask is created using a content-based mechanism. The attention layer issues a query concerning the important aspects to focus on. Every element of the input sequence is multiplied by this query to create a score that measures how well it matches the query. These scores are converted into an attention mask using the softmax function to produce the attention distribution. The mask is then fed to the output layers to produce the next note for each time step. This whole work was performed using the TensorFlow library.
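The scoring-and-softmax step just described can be sketched in a few lines (the shapes and the dot-product scoring choice are assumptions for illustration):

```python
import tensorflow as tf

def attend(encoded, query):
    # encoded: (batch, timesteps, units) -- the sequence being attended over
    # query:   (batch, units)            -- "what should we focus on?"
    scores = tf.einsum("btu,bu->bt", encoded, query)     # one score per timestep
    weights = tf.nn.softmax(scores, axis=-1)             # attention distribution
    context = tf.einsum("bt,btu->bu", weights, encoded)  # weighted summary vector
    return context, weights
```

The `context` vector is what gets passed on to the output layers; the `weights` are the attention mask itself and can be inspected to see which past timesteps the model relied on.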

30.3.3 Encoder-Decoder

An encoder-decoder architecture is a special type of neural net with a set of hidden layers; an important constraint that we have used is that the output nodes are


Fig. 30.3 LSTM with attention

Fig. 30.4 LSTM with attention model architecture

of the same dimension as the input nodes. The decoder is meant to use the latent representation, or embedding, of the input produced by the encoder. Training an encoder-decoder network is performed using conventional supervised learning. In reality, the power of the encoder-decoder lies in its ability to learn the encapsulated information for all the input elements in a more compressed latent representation. The hidden layer is made to have fewer units than the dimensions of the input data: the encoder's job is to compress the data, and the decoder's job is to produce, as well as possible, the best next-step prediction using the compressed data. This enables the encoder-decoder network to uncover important and useful features for encoding the data into latent variables. The extracted latent variables are also called embedding vectors. Once the net is trained, extracting features from an input


Fig. 30.5 Encoder-decoder architecture

is simple: a feed-forward pass of the network allows us to obtain all of the activations of the output layer. In our model, we have utilized an LSTM of two layers with dropout and batch normalization to produce an encoder that converts our data into its latent representation, and similarly a decoder that enables us to predict the subsequent note for each time step based on the embedding produced by the encoder; a RepeatVector layer passes the embedding vector with the appropriate shape to the decoder, as illustrated in Fig. 30.5. The TimeDistributed layer makes it possible for our model to distribute the predictions over discrete timesteps and produce the output with the specified sequence length. This whole work was performed using the TensorFlow library along with Keras.
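A minimal Keras sketch of this layout is shown below; the layer sizes are assumptions, while the RepeatVector and TimeDistributed layers play exactly the roles described above:

```python
import tensorflow as tf

NUM_NOTES, SEQ_LEN = 49, 50

model = tf.keras.Sequential([
    # Encoder: compress the 50x49 piano-roll window into one embedding vector.
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3,
                         input_shape=(SEQ_LEN, NUM_NOTES)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.LSTM(128),            # final state = the embedding vector
    # RepeatVector hands one copy of the embedding to every decoder timestep.
    tf.keras.layers.RepeatVector(SEQ_LEN),
    # Decoder: unroll the embedding back into a sequence of predictions.
    tf.keras.layers.LSTM(256, return_sequences=True, dropout=0.3),
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(NUM_NOTES, activation="sigmoid")),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```

The bottleneck LSTM forces the window's content through a 128-dimensional embedding, which is what gives the encoder-decoder its compression behaviour.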

30.3.4 Encoder-Decoder with Attention

The encoder-decoder design for recurrent neural networks such as LSTMs is turning out to be a very useful approach for a wide variety of sequence-to-sequence prediction tasks in the field of natural language processing. It is, however, burdened by input data of high sequential length, as shown in Fig. 30.6. The attention technique is intended as a way to overcome this drawback of the encoder-decoder architecture, in which an embedding of the input data into a vector of fixed length is decoded at each output time step, thereby increasing the ability of the model to learn a lengthy sequence. Instead of creating a single fixed context embedding of the input data, the attention layer creates a context vector fitted especially to every output prediction. The decoder then outputs one result at a time, combined with further layers corresponding to the attention weights, until it outputs y_hat, i.e., the prediction for the output at the present time step. This added attention layer


Fig. 30.6 Encoder-decoder with attention

basically calculates a measure of how well each encoded sequence element suits the decoder's prediction at each timestep and helps the model generalize better to longer sequences. Our deep learning implementation was done in PyTorch.
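One decoder step of this scheme might look as follows in PyTorch (a sketch under assumed shapes; `out_layer` is a hypothetical `nn.Linear` mapping the concatenated state and context to the 49 note probabilities):

```python
import torch
import torch.nn.functional as F

def decoder_step(dec_hidden, enc_outputs, out_layer):
    # dec_hidden:  (batch, units)            -- current decoder state
    # enc_outputs: (batch, timesteps, units) -- all encoder states
    scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)       # attention weights per input step
    context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
    # Concatenate state and context, then predict y_hat for this time step.
    y_hat = torch.sigmoid(out_layer(torch.cat([dec_hidden, context], dim=1)))
    return y_hat, weights
```

Because a fresh `context` is computed at every output step, the decoder is no longer forced to squeeze the whole input through one fixed-length vector, which is precisely the long-sequence benefit described above.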

30.4 Implementation

In this section, we discuss training the network and how we produce the musical content. The procedure is shown in Fig. 30.7.

Fig. 30.7 Flow diagram of music generation

Dataset: The Lakh MIDI dataset houses 176,581 MIDI files. The objective of this dataset is to promote large-scale music information retrieval.

Data preprocessing: First, we downloaded a suitable number of MIDI files from the Lakh MIDI dataset. Using the Mido library in Python, we extracted the tempo (beats per minute) and resolution (ticks per second) of each track. With this information we crafted a piano roll matrix of shape (number of possible notes {49}, sequence of timesteps {50}) for each track. This piano roll representation was used throughout all our models. We then created input and target sequences from the piano roll matrices such that, assuming a sliding window of 50, the first 50 columns are fed to the model as the input and the next 50 are the targets which the model tries to predict (see the sketch after the list below). In this manner we form the dataset on which the model is trained. A point to note about the piano roll representation is that it allows for

capturing multiple notes played simultaneously and is, therefore, an ideal form of input representation for generating music.
• Training: Our model comprises different architectures involving LSTMs and autoencoders, with and without an added attention mechanism, as specified in the section above. The model is trained to extract the true underlying conditional probability distribution of the notes that can be played at a particular time step, which is essentially governed by the notes played in preceding time steps. The prediction of the model at each time step t is the probability of playing each musical note at that time step based on prior note choices. Hence, the network is basically estimating the maximum likelihood of each note in a training sequence within the bounds of the conditional distribution. As we are generating polyphonic music, i.e., multiple notes being on at the same time, this is a multi-label classification problem, and we therefore utilize the binary cross-entropy loss function. We have executed our training in TensorFlow, Keras, and PyTorch. Once training has been completed, the probability distribution has been absorbed by the model, and we sample from this distribution, which in turn allows us to generate new sequences. The network generates a unique sequence projected one step in time into the future. The respective prior inputs for every time step are used to help the LSTM layers compose the notes in the next time period. A primer melody is provided as a sample, based on which the next notes are picked from the learned distribution.
• Data post-processing: The output of our model is in piano roll representation. This output is converted back to MIDI using the Python Mido library. The output notes are pruned at each timestep and configured with respect to the tempo, resolution, and user-preferred sequence length. The MIDI file is then saved in the project folder, ready to be played by a media playback application such as Windows Media Player.
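The preprocessing step referenced above can be sketched as follows (a simplified illustration, not the authors' exact code: the quantization step, the lowest note, and the event handling are all assumptions; the roll is built time-major and can be transposed to the notes-by-timesteps shape mentioned earlier):

```python
import numpy as np
import mido

NOTE_LOW, NUM_NOTES, SEQ_LEN = 36, 49, 50
TICKS_PER_STEP = 120  # assumed quantization of MIDI ticks into timesteps

def piano_roll(path):
    """Build a binary (timesteps, NUM_NOTES) piano-roll matrix from a MIDI file."""
    mid = mido.MidiFile(path)
    events = []
    for track in mid.tracks:
        t = 0
        for msg in track:
            t += msg.time  # msg.time is the delta time in ticks
            if msg.type == "note_on" and msg.velocity > 0:
                events.append((t // TICKS_PER_STEP, msg.note - NOTE_LOW))
    steps = max(s for s, _ in events) + 1
    roll = np.zeros((steps, NUM_NOTES), dtype=np.float32)
    for step, note in events:
        if 0 <= note < NUM_NOTES:
            roll[step, note] = 1.0
    return roll

def windows(roll):
    """Slide a 50-step window: each input window's next 50 steps are its target."""
    xs, ys = [], []
    for i in range(0, len(roll) - 2 * SEQ_LEN, SEQ_LEN):
        xs.append(roll[i:i + SEQ_LEN])
        ys.append(roll[i + SEQ_LEN:i + 2 * SEQ_LEN])
    return np.array(xs), np.array(ys)
```

Stacking the windows from every downloaded track yields the training arrays consumed by the models above.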

30.5 Performance Evaluation

To evaluate the performance of our models, we use the binary cross-entropy loss on the training and test sets to see whether the model is learning sequence prediction correctly. Ours is a multi-label classification problem: at each time step, multiple notes can be on, so our target vector is a vector of 0s and 1s with a fixed number of labels C. This task is treated as C different binary and independent classification problems, where each output neuron decides whether a class will be present in our output.


Binary cross-entropy loss is also called sigmoid cross-entropy loss, as we apply a sigmoid activation on our output before calculating the cross-entropy loss, which is calculated as

$$CE = -\sum_{i=1}^{C'=2} t_i \log(f(s_i)) = -t_1 \log(f(s_1)) - (1 - t_1)\log(1 - f(s_1)) \tag{30.1}$$

For each class we calculate the binary cross-entropy, as the output of each class is independent of the other classes. With the sigmoid function, the output of the network represents a Bernoulli probability distribution for our class $C_j$. After training our model, we can set a threshold value for converting our output into the final representation containing a vector of 1s and 0s, by setting all the values greater than or equal to the threshold to 1 and all the values less than the threshold to 0.

$$P(c_j \mid x_i) = \frac{1}{1 + \exp(-z_j)} \tag{30.2}$$
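A small numerical sketch of Eqs. (30.1) and (30.2) is given below (the 0.5 threshold and the example logits are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    # Eq. (30.2): per-note Bernoulli probability from the raw logit z_j
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(targets, logits):
    # Eq. (30.1) averaged over all independent note classes
    p = sigmoid(logits)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

probs = sigmoid(np.array([2.1, -0.7, 0.3]))   # per-note probabilities
notes_on = (probs >= 0.5).astype(int)         # thresholded output -> [1, 0, 1]
```

This thresholding is what turns the model's continuous probabilities back into the binary piano-roll vector used for MIDI reconstruction.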

We can clearly see that the loss decreases with more complex models trained on the same dataset: the LSTM model has the highest validation loss, and when we add attention to the model the loss decreases considerably, so we can conclude that more complex models like the LSTM with attention and the encoder-decoder with attention give better performance than simpler models. We also measure the perplexity of the models, which is the weighted geometric average of the inverses of the probabilities; the formula for perplexity is

$$\exp\left(\sum_x p(x)\log_e \frac{1}{p(x)}\right) \tag{30.3}$$

It gives a measure of how well the model is able to predict the next sequence given the initial sequence, and of the complexity of the learned probability distribution. We found that when we used the simple LSTM the perplexity was high (3.5), as expected, and when we used the more complex model with attention the perplexity decreased to 1.2, so our model is good at predicting the next sequence. The result analysis is shown in Figs. 30.8, 30.9, 30.10 and 30.11. We also performed listening studies on the output produced and concluded that the models trained for longer durations produced better output in most cases, and that complex models such as the encoder-decoder with attention generally produced better quality than simpler models such as the vanilla LSTM.

Fig. 30.8 Loss plot for LSTM model

Fig. 30.9 Loss plot for LSTM with attention

Fig. 30.10 Loss plot for encoder-decoder

Fig. 30.11 Loss plot for encoder-decoder with attention

30.6 Conclusion

In this paper, we have described multiple architectures for generating polyphonic music. Through this project, we are able to use deep learning methods to create new music trained on existing music. We observed that adding an attention layer to both vanilla LSTMs and encoder-decoder-based architectures lowers the loss, improves the accuracy of the model for longer sequences, and improves the model's ability to extract useful features by focusing on the important aspects of musical data. The models we have used are able to generate music with both tunefulness and consonance and can be considered pleasant to the ear, as if created by a human being. The models provide a good balance between the local and global structures present in the data. The main goal we hope to achieve with this project is to foster the creative process using machine learning and to discover new and exciting rhythms and patterns in music, which can improve the quality of music generated by humans.

References 1. Roberts, A., Engel, J., Raffel, C., Hawthorne, C., Eck, D.: A hierarchical latent vector model for learning long-term structure in music. arXiv preprint arXiv:1803.05428 (2018) 2. Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., Simonyan, K.: Neural audio synthesis of musical notes with wavenet autoencoders. In: International Conference on Machine Learning (pp. 1068–1077). PMLR (July 2017) 3. Huang, A., Wu, R.: Deep learning for music. arXiv preprint arXiv:1606.04930 (2016) 4. Yang, L.C., Chou, S.Y., Yang, Y.H.: MidiNet: a convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847 (2017)


5. Hadjeres, G., Pachet, F., Nielsen, F.: Deepbach: a steerable model for bach chorales generation. In: International Conference on Machine Learning (pp. 1362–1371). PMLR (July 2017) 6. Jaques, N., Gu, S., Turner, R.E., Eck, D.: Generating music by fine-tuning recurrent neural networks with reinforcement learning (2016) 7. Jaques, N., Gu, S., Turner, R.E., Eck, D.: Tuning recurrent neural networks with reinforcement learning (2017) 8. Boulanger-Lewandowski, N., Bengio, Y., Vincent, P.: Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392 (2012) 9. Yamshchikov, I.P., Tikhonov, A.: Music generation with variational recurrent auto encoder supported by history. arXiv preprint arXiv:1705.05458 (2017) 10. Yu, L., Zhang, W., Wang, J., Yu, Y.: Seqgan: sequence generative adversarial nets with policy gradient. In: Thirty-First AAAI Conference on Artificial Intelligence (February 2017) 11. Nayebi, A., Vitelli, M.: Gruv: algorithmic music generation using recurrent neural networks. Course CS224D: Deep Learning for Natural Language Processing (Stanford) (2015) 12. Vuong, A., Evans, M.: RNN Music Composer 13. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014) 14. Govindan, N., Rajasekaran Indra, M.: Smart fuzzy-based energy-saving photovoltaic burp charging system. Int. J. Ambient Energy 39(7), 671–677 15. Sajith, P.J., Nagarajan, G.: Optimized intrusion detection system using computational intelligent algorithm. In: Advances in Electronics, Communication and Computing, pp. 633–639. Springer, Singapore (2021) 16. Nirmalraj, S., Nagarajan, G.: An adaptive fusion of infrared and visible image based on learning of sparse fuzzy cognitive maps on compressive sensing. J. Ambient Intell. Human. Comput., pp. 1–11 (2019) 17. Indra, M.R., Govindan, N., Satya, R.K.D.N., Thanasingh, S.J.S.D.: Fuzzy rule based ontology reasoning. J. Ambient Intell. Human. Comput., pp. 1–7 (2020) 18. Minu, R.I., Nagarajan, G., Suresh, A., Jayanthila Devi, A.: Cognitive computational semantic for high resolution image interpretation using artificial neural network (2016) 19. Nirmalraj, S., Nagarajan, G.: Biomedical image compression using fuzzy transform and deterministic binary compressive sensing matrix. J. Ambient Intell. Human. Comput., pp. 1–9 (2020)

Chapter 31

A KNN-PNN Decisioning Approach for Fault Detection in Photovoltaic Systems

Kirandeep Kaur, Simerpreet Singh, Manpreet Singh Manna, Inderpreet Kaur, and Debahuti Mishra

Abstract With the advancement in PV systems, it becomes essential to recognize and categorize defects in order to ensure the safety of photovoltaic systems. The goal of any power generation plant is to maximize energy production while minimizing losses and running expenses. The biggest challenge faced by researchers is to identify the various faults of PV systems. Over recent years, a large number of methods have been proposed by experts for identifying faults. However, these systems were not efficient enough and had several drawbacks, such as low fault tolerance and limited data used, that degrade their performance. In order to overcome these problems, this paper presents an effective approach to detect and classify faults. The proposed model utilizes two datasets with different faults present in them. This data is then split into two parts for training and testing purposes. The data collected for training is given to the proposed PNN-KNN network. The model is trained on this historical data in order to predict and categorize the various faults when testing data is given to it. Once the model is trained and tested, its performance is evaluated in terms of various performance parameters like accuracy, precision, recall, and F-score. The performance of the proposed PNN-KNN model is analyzed in the MATLAB software. The simulation outcomes prove that the proposed model is more accurate and precise in detecting and classifying faults.

K. Kaur · S. Singh, Department of Electrical Engineering, Bhai Gurdas Institute of Engineering and Technology, Sangrur, Punjab, India; M. S. Manna, SLIET Longowal, Punjab, India; I. Kaur, Great Alliance Foundation, Punjab, India; D. Mishra, Department of Computer Science and Engineering, Siksha 'O' Anusandhan (Deemed to be) University, Bhubaneswar, Odisha, India; e-mail: [email protected]


31.1 Introduction

Photovoltaic (PV) generating systems have made great progress and achieved popularity as a major renewable energy source (RES) in recent times [1]. Solar panels, also known as PV cells, are made of semiconducting materials that can directly convert light from the sun into energy. A PV system consists of one or more solar (PV) cells, a DC-to-AC power converter (also called an inverter), a mounting system to keep the photovoltaic modules in place, electrical interconnections, and other components. Whenever sunlight reaches the panels, they absorb it and detach electrons, which then flow to form a direct electrical current (DC) [2]. An MPPT controller, solar concentrators, energy management software, a battery system, a solar tracker, a charger, and other equipment may be included as options. A tiny photovoltaic system can power an individual consumer or a single isolated device, such as a light or a weather station. The basic diagram of a solar power plant is shown in Fig. 31.1. The low price of solar power and the improved reliability and efficiency of the electronics related to these generating systems are two of the key causes of the photovoltaic industry's rapid expansion. Supervision and effective performance monitoring of grid-connected photovoltaic systems are essential for ensuring efficient power harvesting and dependable electricity output at affordable prices [3].

Fig. 31.1 Solar power plant diagram


31.1.1 Types of PV Systems

Solar photovoltaic systems are categorized depending on the kind of architecture and the installation's purpose or goal. Basic PV systems fall into three categories: stand-alone, grid-linked, and hybrid.

Stand-alone PV system: One of the most prevalent forms of photovoltaic system in the world is the stand-alone system, which is often installed in houses and small businesses. These systems are not linked to the grid or utility in any way. Inverters, storage units, charge controllers, and modules are all parts of these networks [4].

Grid-linked PV systems: A grid-linked system may be a small plant or a medium- or large-scale solar system known as a photovoltaic power plant. These plants are typically megawatts in size and encompass many acres of land. Additional elements of photovoltaic power plants include power conditioning devices that shape the electricity (addressing power and voltage quality problems) before it is linked to the utility [5].

Hybrid PV systems: Photovoltaic systems are combined with other energy-generating devices, including hydro plants, wind turbines, and gas or diesel engines, to create hybrid photovoltaic systems. These additional units are designed to support the photovoltaic system during bad weather and, in most cases, at night [6].

Other than these, another type of photovoltaic system exists: the direct photovoltaic system, which does not use an inverter to convert DC to alternating current. However, the performance of these power-generating systems gets degraded because of faults that can occur at any time. Therefore, it is extremely important to detect and identify faults in PV modules in order to reduce production losses by decreasing the amount of time the system operates below its maximum power point [7].

31.1.2 Effect of Weather on PV Power Generation

A photovoltaic system is effective if the electronic equipment can extract a substantial amount of energy from the photovoltaic generator. The ability to collect the optimum amount of energy from a solar device is critical. As a result, the precision of the MPPT controller has become a critical aspect in the functioning of equipment for effective photovoltaic system applications. Because the quality of the sunlight changes over the course of the day, MPPT control is difficult, and it plays a significant part in deciding the amount of solar energy harvested by the PV generator (PVG). The climate plays a significant role in determining the amount of solar energy available, and issues in the power system can in turn cause maximum power point tracking problems.


31.1.3 Faults in PV Modules

Fault categorization in photovoltaic systems usually covers both the AC and DC sides of the system. Different forms of failures can damage a PV system, resulting in power loss. Two types of defects can be recognized based on fault duration: temporary flaws and permanent flaws. Temporary flaws, such as shadowing, fade away with time or can be manually removed in the case of dust, leaves, or bird droppings; after that, the photovoltaic system resumes its normal operation. Permanent defects, such as open and short circuits, line-to-line faults, etc., last for a long time and degrade the performance of the entire PV system [8]. An open-circuit fault arises when a current-carrying path connected in series with the load is accidentally broken or exposed. An open-circuit failure can happen for a variety of causes, including when the wires connecting the panels break or tear. A line-line fault occurs when two points of differing potential in photovoltaic panels or grids unexpectedly short-circuit; as a consequence of the electrical mismatch between photovoltaic panels, overcurrent will flow into the damaged string [9]. Furthermore, a ground fault is caused by an accidental link between the ground and a circuit. The current will flow to the ground via the shortcut, resulting in reduced output current and voltage [10]. In order to detect these faults at early stages, several methods have been proposed by various researchers over recent years, which are explained in the next section.

31.2 Literature Survey

A wide range of solutions has been presented by different experts in order to identify and detect, at early stages, the faults that can occur in PV systems; some of them are discussed here. Badr et al. [11] presented an automatic SVM-based defect prediction and identification scheme for the photovoltaic system in a grid-linked PV setup. Two power processing stages were used in the photovoltaic system, a DC-DC boost converter and a three-phase two-level VSI, to get the most out of the photovoltaic system by raising the output voltage. Kharage et al. [12] developed a multi-resolution signal decomposition (MSD) process and two ML approaches, namely FL and KNN, in order to detect and identify line-line and line-ground faults in the solar panel cluster. Sunil Rao et al. [13] used NN techniques for defect identification from monitoring sensors that collect information at each particular unit, to increase the efficiency of the system. Aref Eskandari et al. [14] presented a unique and smart fault detection and classification technique based on ML for LG and LL failures on the direct-current side of photovoltaic systems. Zhihua Li et al. [15] proposed spread spectrum time domain reflectometry (SSTDR) to detect and locate LG, LL, and open-circuit faults in operational electrical systems, such as energy systems and photovoltaic systems. Andrianajaina et al. [16]


proposed an ANOVA-based defect detection method for a solar power system. André Eugênio Lazzaretti et al. [17] described a monitoring system (MS) that measures electrical and atmospheric factors to create real-time and statistical data, enabling the estimation of plant efficiency metrics. Soffiah et al. [18] presented a method to detect short-circuit faults in grid-connected photovoltaic panels in order to minimize personnel effort and processing time in identifying the existence of a fault. Pradeep Kumar et al. [19] provided a simple defect detection approach based on wavelet packets, depending on the known array current and voltage data, to detect and distinguish various faults from partial shadowing conditions. From the literature survey conducted, it is observed that a large number of fault detection techniques have been proposed by various researchers in order to identify and detect faults in PV systems at early stages. PV is widely used these days as it is very effective and cuts costs to a great extent; its maintenance is easy and it offers many advantages, which makes it a popular area of research. However, existing systems were not efficient enough and had many drawbacks that degrade their overall performance. In traditional systems, the main focus was on faults caused during operation, but there are a number of other factors that can degrade the performance of the system. Working only with on-system faults therefore does not provide fault tolerance, so there is a need to consider at least one more fault-causing area other than on-system faults. Moreover, in traditional systems, the model was trained and tested utilizing only one dataset, which means that a different case presented for evaluation may not be detected accurately. These limitations of the existing models give rise to the requirement of designing a novel approach that not only overcomes the shortcomings but also enhances the model by upgrading the system.

31.3 Proposed Model

In order to overcome the limitations of the traditional systems, a new and effective model is proposed in this paper to detect faults in PV systems. The proposed system focuses on working with types of faults other than on-system faults. The proposed model works on two datasets instead of one: a weather-based dataset is considered along with the existing dataset used in traditional approaches. The reason for using this dataset is that if, for instance, it is winter, the PV system may not be able to provide sufficient current and a fault may arise; if the intelligent network is not trained with such information, it will not detect the fault, which degrades the performance of the system. The more cases with different scenarios that are available to the intelligent network, the more effective the system will be at detecting faults. In addition, the proposed scheme utilizes a KNN classifier along with the PNN in order to enhance the fault detection rate. The reason for using the KNN classifier with the PNN is that the PNN works effectively if the possible cases are given during training, but for inputs unrelated to the trained network the PNN might not provide effective results; in that case, the KNN gives results on the basis of the nearest-neighbour algorithm, and finally the classification or detection decisions of both classifiers are combined to obtain final detection results that are more efficient and accurate. The flow of the proposed model is shown in Fig. 31.2.

Fig. 31.2 Flow diagram of proposed model (Start; read dataset from .csv/.mat and perform preprocessing; split data into training and testing sets; train the PNN network and perform testing; train the KNN and perform testing; combine the decision outcomes for the final test results and evaluate the performance; End)
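The decision-combination idea can be sketched in Python as follows (an illustrative sketch only: the paper's implementation is in MATLAB, scikit-learn has no built-in PNN, and the simple Parzen-window/RBF estimator below merely stands in for one; the probability-averaging rule is likewise an assumption):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

class SimplePNN:
    """Toy probabilistic neural network: one RBF kernel per training sample."""
    def __init__(self, sigma=0.5):
        self.sigma = sigma
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y)
        self.classes = np.unique(self.y)  # sorted, same order sklearn uses
        return self
    def predict_proba(self, X):
        X = np.asarray(X, float)
        d2 = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(-1)
        k = np.exp(-d2 / (2 * self.sigma ** 2))  # kernel activations
        scores = np.stack([k[:, self.y == c].mean(1) for c in self.classes], 1)
        return scores / scores.sum(1, keepdims=True)

def combined_predict(pnn, knn, X):
    # Average the two classifiers' class probabilities, then take the argmax.
    p = (pnn.predict_proba(X) + knn.predict_proba(X)) / 2.0
    return pnn.classes[p.argmax(1)]

# Usage: fit both on the training split, then combine on the test split.
# pnn = SimplePNN().fit(X_train, y_train)
# knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# y_pred = combined_predict(pnn, knn, X_test)
```

Averaging probabilities lets the KNN dominate on inputs far from anything the PNN saw in training, which is exactly the failure mode the combination is meant to cover.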

31.3.1 Dataset Used

In the proposed work, two datasets are utilized instead of one. The first dataset is taken from GitHub and includes information regarding various internal faults and external disturbances; both internal faults and external disturbances result in a dip in current and voltage readings. The second dataset utilized in the proposed work is also taken from GitHub; its information was collected over 16 days from the functioning of a grid-tied solar plant under both problematic and regular conditions. This data is then split into two ".mat" files: one is called dataset_elec.mat, in which the DC electrical data for current and voltage is present, and the other is called


dataset_amb.mat, in which information about temperature, irradiance, and faults is available. Sample data for both datasets are given below in Tables 31.1 and 31.2.

Table 31.1 Sample dataset 1 used in proposed work

Irradiance  Temp    Current A  Current B  Voltage A  Voltage B
1.3729      2.3816  0.0608     0.0073     0.7143     0.555
1.3604      2.3816  0.0615     0.0064     0.6944     0.5427
1.5118      2.3883  0.0646     0.0067     0.711      0.5583
1.5534      2.392   0.0628     0.0067     0.6991     0.5465
1.4355      2.392   0.0606     0.0076     0.7035     0.5553

Table 31.2 Sample dataset 2 used in proposed work

Current  Voltage
7.7816   258.875
7.8348   256.882
8.2151   260.152
8.2384   259.731
8.5608   247.045
0.0985   1.9035
0.1101   0.7783

31.3.2 Working

The proposed model starts by gathering information regarding the various faults, such as degradation, line-to-line, and short-circuit faults, that can degrade the performance of the entire PV system. In this paper, two datasets are utilized, and their impact on current and voltage readings is also analyzed. The information gathered is then divided into two parts: one for training the network and the other for testing. The training data trains the model to detect and identify faults at earlier stages and to enhance the accuracy of the system. Once the model is trained, the next step is to evaluate its performance; for this, testing data is provided to the network. The next step followed by the proposed PNN-KNN system is classification, which is done using the KNN and PNN classifiers. The classification rate obtained by the model is then evaluated to check how efficiently and accurately the system can detect faults. The performance of the proposed PNN-KNN system is determined in terms of various performance parameters and is described in detail in the next section.


31.4 Results and Discussion

The performance of the proposed PNN-KNN model is tested and validated in the MATLAB simulation software for the two datasets. The simulation outcomes were obtained and compared with the conventional models in terms of various metrics such as accuracy, precision, recall, and F-score. A detailed description of the outcomes is given in this section.

31.4.1 Performance Evaluation for Dataset A

The effectiveness of the suggested PNN-KNN approach is analyzed in terms of various performance metrics like accuracy, recall, precision, and F-score. The proposed PNN-KNN model performs exceptionally well on all parameters, with accuracy, recall, precision, and F-score values close to 100%. These results make the proposed approach more accurate and effective in detecting faults in PV systems. The performance of the proposed PNN-KNN model is also observed and compared with traditional models in terms of accuracy, as shown in Fig. 31.3.

Fig. 31.3 Comparison graph for accuracy (x-axis: algorithm; y-axis: accuracy in %)

Figure 31.3 demonstrates the accuracy comparison between the proposed PNN-KNN model and the traditional Tree, KNN, SVM, and ANN models. From the graph, it is observed that the accuracy obtained by the conventional Tree and KNN models is 84.94 and 86.14%, respectively, whereas the accuracy achieved by the traditional SVM and ANN models came out to be 93.55 and 97.19%, respectively. On the other hand, the accuracy of the proposed PNN-KNN model came out to be 100%, which means that the proposed model is more accurate in detecting faults. The exact values achieved by the proposed PNN-KNN model and the conventional models are given in Table 31.3.

Table 31.3 Performance evaluation in terms of accuracy for Dataset A

Algorithm   Accuracy (%)
Tree        84.94
kNN         86.14
SVM         93.55
ANN         97.19
PNN + kNN   100.00

31.4.2 Performance Evaluation for Dataset B

The performance of the proposed PNN-KNN model is analyzed and compared with the conventional ELM, SA-RBF-ELM, and PNN models in terms of accuracy, as shown in Fig. 31.4.

Fig. 31.4 Comparison graph for accuracy for Dataset B (x-axis: algorithm; y-axis: accuracy in %)

Figure 31.4 represents the accuracy comparison between the proposed PNN-KNN model and the traditional ELM, SA-RBF-ELM, and PNN models. From the graph, it is observed that the accuracy attained by the traditional ELM, SA-RBF-ELM, and PNN models came out to be 97, 97.25 and 100%, respectively. In the case of the proposed PNN-KNN model, the accuracy achieved is also 100%, making it a more accurate and efficient model. The specific values obtained for accuracy are given in Table 31.4. From the graphs and tables, it is observed that the proposed PNN-KNN model is more accurate and effective in determining faults.

Table 31.4 Performance evaluation in terms of accuracy for Dataset B

Algorithm    Accuracy (%)
ELM          97
SA-RBF-ELM   97.25
PNN          100
PNN + kNN    100

31.5 Conclusions

In this paper, an effective and novel approach, PNN-KNN, is proposed in order to detect and identify different faults of PV systems. The performance of the proposed model is analyzed and validated in the MATLAB simulation software. The simulation results were obtained and compared with traditional models in terms of accuracy, precision, recall, and F-score. The results obtained for the first dataset in terms of accuracy for the traditional Tree, KNN, SVM, and ANN systems came out to be 84.94, 86.14, 93.55 and 97.19%, respectively, whereas the accuracy of the proposed PNN-KNN model came out to be 100%. Similarly, results were obtained for the second dataset: the accuracy achieved by the conventional ELM, SA-RBF-ELM, and PNN models is 97, 97.25 and 100%, respectively, whereas the accuracy obtained by the proposed PNN-KNN model came out to be 100%. This proves that the proposed PNN-KNN model is more efficient and effective in identifying faults in PV systems. As future scope, the work can be extended to handle combined faults under different weather conditions in a single model.

References 1. Muñoz-Cerón, E., Guglielminotti, L., Mora-López, L., Kichou, S., Silvestre, S.: Comparison of two PV array models for the simulation of PV systems using five different algorithms for the parameters identification. Renew. Energy (2016) 2. Sankonatti, J., Srividya, P.: Fault monitoring system for photovoltaic modules in solar panels using LabVIEW. In: 2018 International Conference on Recent Innovations in Electrical, Electronics & Communication Engineering (ICRIEECE), pp. 2366–2369 (2018) 3. Basnet, B., Chun, H., Bang, J.: An intelligent fault detection model for fault detection in photovoltaic systems. J. Sensors 2020:6960328, 11 (2020)


4. Shams el-dein, M.Z., Kazerani, M., Salama, M.M.A.: Optimal photovoltaic array reconfiguration to reduce partial shading losses. Sustainable Energy, pp. 145–153 (2013) 5. Ramdé, E.W., Tapsoba, G., Thiam, S., Azoumah, Y.: Siting guidelines for concentrating solar power plants in the Sahel: case study of Burkina Faso. Solar Energy, pp. 1545–1553 (2010) 6. Accessed online: https://www.renewableenergyworld.com/baseload/the-advantages-of-photovoltaic-systems/#gref on June 08 2021 7. Sánchez-Pacheco, F.J., Mora-López, L., Dominguez-Pumar, M., Silvestre, S., Kichou, S.: Remote supervision and fault detection on OPC monitored PV systems. Sol. Energy, pp. 424–433 (2016) 8. Appiah, A.Y., Ayawli, B.B.K., Frimpong, K., Zhang, X.: Review and performance evaluation of photovoltaic array fault detection and diagnosis techniques. Int. J. Photoenergy 2019:6953530, 19 (2019) 9. Lehman, B., Mosesian, J., de Palma, J., Lyons, R., Zhao, Y.: Line–line fault analysis and protection challenges in solar photovoltaic arrays. IEEE Trans. Ind. Electron. 60, 3784–3795 (2013) 10. Dunlop, J.P.: Photovoltaic Systems, 2nd ed. Amer. Tech., Orland Park, IL, USA (2010) 11. Abdel-Khalik, A.S., Badr, M.M., Hamad, M.S., Hamdy, R.A.: Fault detection and diagnosis for photovoltaic array under grid connected using support vector machine. In: IEEE Conference on Power Electronics and Renewable Energy (CPERE), pp. 546–553 (2019) 12. Korde, P.N., Kharage, S.S.: Improved fault detection and location scheme for photovoltaic systems. In: 2019 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), pp. 1–4 (2019) 13. Spanias, A., Tepedelenlioglu, C., Rao, S.: Solar array fault detection using neural networks. In: 2019 IEEE International Conference on Industrial Cyber Physical Systems (ICPS), pp. 196–200 (2019) 14. Eskandari, A., Milimonfared, J., Aghaei, M.: Fault detection and classification for photovoltaic systems based on hierarchical classification and machine learning technique. In: IEEE Transactions on Industrial Electronics (2020) 15. Wu, C., Li, Z., Yang, Z.: Simulation of fault detection in photovoltaic system based on SSTDR. In: 2020 5th International Conference on Power and Renewable Energy (ICPRE), pp. 257–261 (2020) 16. Razafimahefa, D.T., Sambatra, E.J.R., Heraud, N., Andrianajaina, T.: ANOVA fault detection method in photovoltaic system. In: 2020 7th International Conference on Control, Decision and Information Technologies (CoDIT), pp. 37–41 (2020) 17. André Eugênio, L., et al.: A monitoring system for online fault detection and classification in photovoltaic plants. Sensors (2020) 18. Soffiah, K., Manoharan, P.S., Deepamangai, P.: Fault detection in grid connected PV system using artificial neural network. In: 2021 7th International Conference on Electrical Energy Systems (ICEES), pp. 420–424 (2021) 19. Nagamani, C.I.G., Saravana, I.G.K., Pradeep, B., Jaya Bharata, R.M.: Online fault detection and diagnosis in photovoltaic systems using wavelet packets. IEEE J. Photovoltaics 8(1), 257–265 (2018)

Chapter 32

Detecting Phishing Websites Using Machine Learning

A. Sreenidhi, B. Shruti, Ambati Divya, and N. Subhashini

Abstract The entire world is digitizing at a rapid pace. However, this ever-evolving transformation comes with its fair share of vulnerabilities, opening the doors wider for cybercriminals. One of the common types of attacks criminals indulge in these days is phishing. It involves the creation of websites to dupe unsuspecting users into thinking that they are on a legitimate site, making them disclose confidential information like bank account details, usernames, and passwords. This paper aims to determine accurately whether a URL is reliable or not, in other words, whether it is a phished one. Machine learning (ML) based models provide an efficient way to detect these phishing attacks. This research paper focuses on using three different ML algorithms, Logistic Regression, Support Vector Machine (SVM), and Random Forest Classifier, in order to find the most accurate model to predict whether a given URL is safe or not. To achieve this, the respective models are trained using a pre-existing data set and then tested on whether they can accurately classify the websites or not. The algorithms are also compared based on performance measures like Precision, Accuracy, F1 Score, and Recall to deduce which of the three is the most efficient and reliable for classification and prediction.

A. Sreenidhi · B. Shruti · A. Divya · N. Subhashini, School of Electronics Engineering, Vellore Institute of Technology, Chennai, India; e-mail: [email protected]

32.1 Introduction

Phishing is a social engineering attack that is often used to steal a variety of confidential user data like login credentials and bank details. The attacker generally masquerades as a trusted entity, duping the victim into opening an expertly disguised message. The victim is further led to malicious links, which may lead to the installation of malware or initiate ransomware attacks to reveal sensitive information. A specific kind of phishing attack called email phishing makes use of legitimate-looking fake websites constructed to look identical to real sites to trick the victim into divulging sensitive information. Once the bait is taken, cybercriminals can use the information for fake account creation or identity theft. These attacks can have devastating results

if left unnoticed. For individuals, this may include unauthorized purchases, stealing of funds, or identity theft. The main aim of the phishers is to make the websites look as credible as possible, so that they are almost indistinguishable from the real site. Although there are a few tricks, like checking the URL, the graphics of the site, or who owns the website, they are not a sure-shot mechanism for accurate detection. Hence, the use of proper detection methods becomes imperative. Detection of phishing domains is basically a classification problem, i.e., it can have only two possible outcomes: any given website will either be a phishing website or a harmless one. Hence, labeled data with samples marked as genuine or fake websites is needed in the training phase. The dataset used in the training phase is imperative to building a successful detection mechanism; it must contain samples whose classes are precisely known beforehand. This is where machine learning (ML) comes into view. With ML, cybersecurity systems facilitate pattern analysis, active learning for the prevention of attacks, and dynamic response [6]. It aids cybersecurity teams in being more dynamic in threat prevention and in responding to active attacks in real time. In short, ML makes cybersecurity simpler, more dynamic, less expensive, and much more efficient [12]. In this work, ML algorithms, namely Logistic Regression, SVM, and Random Forest, are trained with a pre-existing data set to be able to classify websites as genuine or fraudulent. These algorithms are then compared on the basis of accuracy, precision, F1, and recall scores, and the better choice is concluded by listing out the respective pros and cons. The Logistic Regression, Random Forest, and SVM algorithms are trained using the same data set consisting of various features. The output label values indicate whether the given URL is a phishing URL or not: in the result column, a value of −1 denotes a phishing website and "1" represents a normal website. Each row in the dataset represents a website through various metrics. Hence, the ability of the ML algorithms to accurately classify these items into the respective classes is to be verified. Further, the paper is structured as follows: Section 32.2 describes the related works; Section 32.3 provides the details of the data set and methodology; Section 32.4 describes and discusses the results of the tests; Section 32.5 concludes the paper and provides directions for future work.

32.2 Literature Survey

The author [1] proposes a rule-based method for detecting phishing webpages, generated by studying numerous phishing websites. Decision Tree and Logistic Regression are utilized for rule application and to achieve maximum accuracy with optimum false positives and negatives. The rule set, however, is premature and is likely to produce false alarms. The authors [2] describe an approach for phishing attack detection by hyperlink analysis of the HTML source code, incorporating various hyperlink-specific features for detection. A dataset of phishing and non-phishing websites is utilized for


evaluation of performance. This approach detects phishing websites with high accuracy, the logistic regression classifier giving the highest accuracy. A model to detect phishing attacks using random forest and decision tree was proposed by the authors [3]. A standard dataset was used for ML training and processing. To analyze the attributes of the dataset, feature selection algorithms like Principal Component Analysis (PCA) were utilized. Maximum accuracy was achieved with the Random Forest algorithm. The paper [4] involves the determination of the features imperative for phishing detection. Fuzzy Rough Set (FRS) theory is used in order to select the most effective features from the available data sets. Random Forest classification gives the highest F-measure with FRS feature selection. This paper concludes that, in the absence of queries to external sources, a fast phishing detection mechanism can be achieved with robustness toward zero-day attacks. The features of a phishing data set were studied in paper [5] using feature selection techniques like gain ratio, Relief-F, information gain, and RFE. Diverse ML algorithms like bagging, SVM, NB, RF, NN, and k-nearest neighbors with PCA are used on the remaining and proposed features. Classification accuracy is improved through the proposed features with stacking and PCA. A literature survey on phishing website detection was done in paper [6]. It was concluded that tree-based classifiers are the most suitable among others; among all tree-based classifiers, DT and RF were found to be the best as the dataset grows. They studied about 18 research papers to conclude the best algorithm. The paper [7] involved the use of various file matching algorithms to determine whether a website is phishing or not. The string alignment and file matching techniques tested include Deep MD5 Matching, Main Index File MD5 matching, phishDiff, and Syntactical Fingerprinting. It was concluded that more than 90% detection rates could be achieved with reasonable false positive rates. The authors [8] presented a content-based approach using the Random Forest algorithm. Tenfold cross-validation using different dataset sizes was implemented. They concluded that the overall accuracy was extremely high, which implies that this method works effectively with large datasets, and the FPR was negligible. A comparison of three algorithms, a feature selection algorithm, Relief attribute selection, and the Random Forest Classifier, in terms of accuracy was done by the authors [9]. In the feature selection algorithm, the procedure starts with an empty feature set to which one feature at a time is added recursively until the accuracy is at its best. In the Relief attribute selection algorithm, the weight of each feature is calculated using the Euclidean distance and the Manhattan distance. The Random Forest Classifier has information gain, entropy, decision nodes, root nodes, and leaf nodes as its measured components. The authors [10] proposed a system with feature extraction using NLP techniques. During the tests, Random Forest, Sequential Minimal Optimization, and Naive Bayes algorithms were used. The hybrid approach was more successful than the other tests; within it, the RF was more successful than the other algorithms tested, with high success rates.


The authors [11] used five algorithms, namely Decision Tree, K-Nearest Neighbors, NB, RF, and SVM. These algorithms detected phishing attacks using ML by classifying the features in the dataset. The performance metric on which they compared the efficiency of the algorithms is accuracy. It was concluded that, even with good performance, the implementation time is high; therefore, it is hard to conclude which method is better. The authors [12] implemented four classifiers, Decision Tree, NB, SVM, and NN, using the MATLAB tool. The classifiers were used to detect phishing URLs and to group the URLs into the categories of phishing, suspicious, or legitimate. The results compared accuracy, TPR, and FPR for URL samples. Using an ensemble of trees in the decision tree model can improve accuracy. Various approaches have been used to date for the detection of phishing websites using learning algorithms. Random forest and decision tree seem to be popular models for this task. However, no comparative study has been made so far including models like Random Forest, Logistic Regression, and SVM to determine the most efficient classifier. To research this, the implementation of the three ML algorithms was taken up to determine their efficiency. In the data set used, a vast collection of various features is available, which minimizes the chance of outliers. The efficiency of the above three algorithms is then compared on the basis of various performance metrics and execution time.

32.3 Methodology This paper involves the comparison of three ML algorithms, Logistic Regression model, Random Forest Classifier, and SVM. The three ML methods are analyzed to detect if a given URL is a phished one or not. In this work, a predefined data set (phishcoop.csv) downloaded from UCI ML repository is used. It consists of 31 columns, with 30 features and one target. For an efficient classifier, a huge data set with relevant categories is needed. Hence, these 30 features are filtered down to 10 for the ease of computation. Each website entry will have a value of 0,1 or −1 for all the features. Value “−1” in the result column denotes a phishing website and a value of “1” denotes otherwise (Fig. 32.1). Each row in the dataset represents each website using various features as shown in Fig. 32.2.

Fig. 32.1 Preview of data set “phishcoop.csv”

32 Detecting Phishing Websites Using Machine Learning

357

Fig. 32.2 Features of the phishing websites data set

Fig. 32.3 Flowchart of system

Usage of models is achieved with the help of inbuilt libraries of python. The three classifiers are imported from an ML library “sklearn”. It features various ML algorithms and allows to split the data into testing and training data using the function “test_train_split” imported from the “model_selection” class. The required ML algorithm can be directly imported and the training data is used to train the respective models. After completion of training, testing is done using the testing data obtained in the previous step. As shown in Fig. 32.3, initially, feature extraction is performed from the given dataset. The selected features are expected to contain the appropriate information from the input URL such as IP address, URL length, etc. They are marked “1” when the particular URL shows a trait of a genuine URL and “−1” for the fake URL. Further, the data is split into testing (20% of dataset for testing the model) and training data (80% of dataset for training the model). Next, the three classifiers are built and trained using the training data sampled at the start. Once the models are trained for use, the testing data is passed through it. A confusion matrix is created for each model to determine accuracy, precision, F1-Score, and recall. Comparison is done between the three algorithms on the basis of the performance metrics and execution time to conclude which of the three is the best and most efficient model for detecting phishing websites.

358

A. Sreenidhi et al.

32.4 Results and Discussion To analyze the performance of the algorithms, evaluation parameters like Accuracy, Precision, Recall, F1 Score, Execution time are used. Accuracy is the ratio of all correctly predicted observations to the total number of observations. The ratio of all correctly predicted positive outcomes to the overall positive observations gives the Precision value. The ratio of all correctly predicted positive outcomes to the count of actual class is evaluated by the recall score. It considers only correctly predicted positive observations. The weighted average of Precision and Recall is the F1 score. Both false positives and false negatives are considered in this case. The confusion matrix of the algorithms displays the number of True Positives (TPs), True Negatives (TNs), False Positives (FPs), and False Negatives (FNs) in the form of a 2 × 2 matrix. It is color-coded similar to a heatmap. The confusion matrix of all the three algorithms is shown in Figs. 32.4, 32.5 and 32.6. Using the values obtained from the confusion matrix, the performance metrics: accuracy, precision, F1 score, and recall are calculated. To calculate these values, their respective functions are imported from the “sklearn.metrics” class. Finally, the execution time is also printed. Table 32.1 and Fig. 32.7 show the comparison of the output from all the classifiers. It is observed that the values of the performance metrics namely, Accuracy, Precision, Recall, and F1-Score are maximum for the Random Forest Classifier. Random Forest

Fig. 32.4 Confusion matrix-random forest

32 Detecting Phishing Websites Using Machine Learning

Fig. 32.5 Confusion matrix-logistic regression

Fig. 32.6 Confusion matrix-support vector machine

359

360

A. Sreenidhi et al.

Table 32.1 Comparison of metrics Metrics

RF model

SVM

LR

Accuracy

97.01%

96.43%

91.68%

Precision

95.94%

95.23%

91.15%

Recall

98.66%

98.33%

93.73%

F-measure

97.28%

96.75%

92.42%

Execution time

1.0048 s

5.1109 s

0.0809 s

Fig. 32.7 Graphical representation of model evaluation metrics

operates with an accuracy of 97.01% which is the highest of all models, followed by SVM (96.43%) and finally Logistic Regression (91.68%). Accuracy correlates to the percentage of correctly predicted observations, hence, Random Forest correctly predicts approx. 97 out of 100 samples correctly. It is also observed, that the execution time of SVM is the highest, i.e., it takes the most time to train the classifier. The quickest trainable model was observed to be Logistic Regression. It has the least training time, but with the downside of reduced accuracy. Accuracy is compromised in lieu of quick model building. In situations where extremely quick training time is desired with an acceptable compromise in accuracy values, Logistic Regression is the best option for detection of phishing websites. In the case of high-performance requirements as well as acceptable training time, Random Forest is the best choice. It offers the highest performance metrics among the three algorithms, along with low training time, which makes it an optimum classifier for phishing detection. Overall, Random Forest offers the most efficient features of the three algorithms, hence would be the most preferred model for further use.

32 Detecting Phishing Websites Using Machine Learning

361

32.5 Conclusion and Future Work This research work compares three ML Algorithms-Random Forest, Logistic Regression, and SVM to detect if the website is phished or genuine. Random Forest is found to be the most accurate and efficient model out of the algorithms compared, and the results obtained by this model are the most reliable. It was observed that with the increase in dataset, the accuracy of each model can be increased, but the computation time also increases. SVM has the second-highest accuracy but the drawback is that the execution time of SVM is the highest and takes the most time to train the classifier. Logistic Regression, though has less accuracy was found to be having the least execution time and hence could be used where a fast-training time is essential. In future, more features can be added to the dataset, which can lead to more improvement in the models implemented. Combining ML models with other phishing detection techniques can also be considered. For avoiding the problem of overfitting, clustering can be used to identify corrupted data samples and outliers.

References 1. Basnet, R.B., Sung, A.H., Quingzhong L.: Rule-based phishing attack detection. In: Proceedings of the International Conference on Security and Management Steering Committee of World Congress in Computer Science (World Comp) (2011) 2. Jain, A.K., Gupta, B.B.: A machine learning based approach for phishing detection using hyperlinks. J AIHC 10, 2015–2028 (2019) 3. Alam, M.N., Sarma, D., Lima, F.F., Saha, I., Ulfath, R.-E., Hossain, S.: Phishing attacks detection using machine learning approach. In: 2020 (ICSSIT), pp. 1173–1179 (2020). doi: https:// doi.org/10.1109/ICSSIT48917.2020.9214225 4. Mahdieh, D.D., Zabihimayvan.: Fuzzy rough set feature selection enhance detection of phishing attack. In: 2019 Systems (FUZZ-IEEE). IEEE (2019) 5. Zamir, A., Khan, H.U., Iqbal, T., Yousaf, N., Aslam, F., Anjum, A., Hamdani, M.: Phishing web site detection using diverse machine learning algorithms. Electron. Libr. 38(1), 65–80 (2019). https://doi.org/10.1108/EL-05-2019-0118 6. PurviPujara, et al.: Int J S Res CSE & IT.: Phishing website detection machine learning: a review. In: 2018 Sept-Oct-2018 3(7), 395–399 7. Wardman, B., Stallings, T., Warner, G., Skjellum, A.: High-performance content-based phishing attack detection.2011 eCrime Researchers Summit (2011). doi:https://doi.org/10.1109/ecrime. 2011.6151977 8. Akinyelu, A.A., Adewumi, A.O.: Phishing email classification using random forest ML technique. J. Appl. Math. 2014, 1–6 (2014). https://doi.org/10.1155/2014/425731 9. Joshi, A.N., Pattanshetti, T.R.: Phishing attack detection using feature selection techniques. In: International Conference on Communication and Information Processing (ICCIP-2019), College of Engineering, Pune, Wellesley Road, Shivaji Nagar, Pune, India 10. Buber1, E., Diri1, B., Sahingoz2, O.: NLP based phishing attack detection from URLs, 1 Computer Engineering Department, YildizTechical University, Istanbul, Turkey, 2 Computer Engineering Department, Istanbul Kultur University, 34158 Istanbul, Turkey, Chapter, March 2018

362

A. Sreenidhi et al.

11. Ahmad1, S.W., Ismail2, M., Sutoyo3, E., Kasim4, S., Mohamad5, M.: Comparative performance of machine learning methods for classification on phishing attack detection. Int. J. Adv. Trends Comput. Sci. Eng. 9(1.5) 2020 12. Kulkarni1, A., Brown, L.L.: Phishing websites detection using machine learning III2 Department of Computer Science, The University of Texas at Tyler Tyler, (IJACSA). Int. J. Adv. Comput. Sci. Appl. 10(7) (2019)

Chapter 33

Designing of Financial Time Series Forecasting Model Using Stochastic Algorithm Based Extreme Learning Machine Sarbeswara Hota, Arup Kumar Mohanty, Debahuti Mishra, Pranati Satapathy, and Biswaranjan Jena Abstract The financial forecasting research domain includes many directions, out of which forecasting of currency exchange rate attracts many researchers. The researchers have used various neural networks for the development of forecasting models due to the nonlinear and chaotic nature of exchange rate data. Different stochastic optimization algorithms have been hybridized with neural networks for further optimization of the network parameters. In this paper, Extreme Learning Machine (ELM) technique is applied to predict the exchange rate and the recently developed Sine Cosine Algorithm (SCA) is hybridized with this technique. The proposed ELM-SCA model is studied with respect to its prediction performance on a sample dataset for one-day ahead and seven-day ahead prediction. The experimental results demonstrate that the suggested ELM-SCA model produces better results than the ELM model in forecasting of SGD to INR.

S. Hota (B) Department of Computer Application, Siksha O Anusandhan Deemed to be University, Bhubaneswar, Odisha, India A. K. Mohanty Department of Computer Science and Information Technology, Siksha O Anusandhan Deemed to be University, Bhubaneswar, Odisha, India D. Mishra Department of Computer Science and Engineering, Siksha O Anusandhan Deemed to be University, Bhubaneswar, Odisha, India P. Satapathy Department of Computer Science and Applications, Utkal University, Bhubaneswar, Odisha, India B. Jena Tata Consultancy Services, Bhubaneswar, Odisha, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_33

363

364

S. Hota et al.

33.1 Introduction There is a visible difference between developed countries and underdeveloped countries. The big difference lies in the economic condition [1]. A numerous factors play major role in determining the economic condition of a country. The currency exchange rate is one of the major factors [2]. The currency exchange rate of a country is influenced by the political and social situation including COVID-19 pandemic. Since the last two decades, researchers in the field of financial engineering domain have developed forecasting models for the time series data prediction including the exchange rate prediction [3]. The models involving statistical algorithms were not able to capture the nonlinearity present in the time series data [4]. The literatures demonstrate the use of statistical and different neural network models for exchange rate forecasting task [5, 6]. The authors in [6] applied multi-layer perceptron for the short-term exchange rate forecasting of three different currencies. In [7], the cascaded FLANN model was proposed for the exchange rate forecasting domain and produced better results than FLANN model as demonstrated by the simulation results. In [8], one of the stochastic algorithms, i.e., Jaya algorithm was used with Extreme Learning Machine (ELM) to forecast the exchange rate. Similarly in [9], Water Cycle Algorithm was used with FLANN model for the same purpose. During the learning process of Single Hidden Layer Feed Forward Neural network (SLFN) using ELM, the randomness used in assigning the input weights and hidden biases lead to slow convergence. So different bio-inspired optimization algorithms have been used to optimally determine the ELM input weights. The purpose of this paper is to use the Sine Cosine Algorithm (SCA) for determining the initial weights of ELM. The organization of the paper is as follows. Sections 33.2, 33.3 and 33.4 describe the methodologies, experimental study, and conclusion of this work, respectively.

33.2 Methodology The SLFN with ELM model and the SCA algorithm are discussed as follows.

33.2.1 SLFN with ELM For SLFN, ELM is applied for the training mechanism [10]. ELM is applied in various domains due to its learning speed efficiency and fast convergence ability. The input layer weights and hidden layer biases are assigned randomly. The weights of the hidden layer are calculated using Moore–Penrose generalized inverse [11, 12]. Procedure: ELM

33 Designing of Financial Time Series Forecasting Model …

1. 2. 3.

   Consider a data set D = x j , y j | ∈ R d , j = 1, 2, . . . , n . Initialize input weight wi and bias bi (i = 1, 2, . . . , N ). Eq. (33.1) is applied to compute the hidden layer output as H. ⎛

g(W1 · X 1 + b1 ) g(W2 · X 1 + b2 ) ⎜ g(W1 · X 2 + b1 ) g(W2 · X 2 + b2 ) ⎜ H = ⎜ .. .. ⎝ . . g(W1 · X n + b1 ) g(W2 · X n + b2 ) 4.

365

⎞ . . . g(W N · X 1 + b N ) . . . g(W N · X 2 + b N ) ⎟ ⎟ ⎟ .. ⎠ ... .

(33.1)

. . . g(W N · X n + b N )

Determine the output weight β, where β = H † T , where H † is the MP generalized inverse and T = (y1 , y2 , y3 , . . . .., yn )T .

33.2.2 SCA Algorithm SCA is a recently proposed stochastic optimization algorithm which is used for finding solutions to various optimization problems [13]. SCA has also been used in various applications [14, 15]. This algorithm is motivated by two of the trigonometric functions, i.e., sine function and cos function. These functions are used in updating the positions of the candidate solutions. The various symbols used in this algorithm are given in Table 33.1. The Eq. (33.2) is used for determining the next positions. E i ( j + 1) =

E i ( j) + r1 sin(r2 ) +

r3 E best(i) ( j) − E i ( j)

, r4 < 0.5 E i ( j) + r1 cos(r2 ) + r3 E best(i) ( j) − E i ( j) , r4 ≥ 0.5

(33.2)

During updating position, the random variable r4 whose value lies between 0 and 1 equally determines whether to use sine function or cos function. The value of r1 determines whether the next solution lies between E(j) and Ebest or outside of this range. The parameter r3 puts a random weight to the current solution that defines Table 33.1 Symbol description

Symbol

Description

j

Current generation

E( j)

Current solution

E( j + 1)

Next solution

E best

Individual best solution achieved so far

r1 , r2 , r3 , r4

Random numbers

k

Constant

TotGen

Total number of generations

366

S. Hota et al.

the distance to the best solution. The parameter r 2 decides the distance towards the solution or outward from it. For getting the global optimal solution, the value of the parameter r1 is updated using Eq. (33.3). r1 = k − j ∗

k . TotGen

(33.3)

This algorithm starts with randomly initialized solutions and the initial best individual solution is derived using the Mean Square Error (MSE) value. The solutions are updated using Eq. (33.2) and then the value of r1 using Eq. (33.3). This process is repeated for a maximum number of generations.

33.2.3 ELM-SCA Model The ELM-SCA model has been used to optimize the performance of ELM [14]. In this paper, the ELM-SCA model is trained with 80% data and validated with 20% data. The training data are used in ELM to train the model. The input layer weights and hidden layer biases are optimally found using SCA algorithm. In SCA, the fitness function is considered as the MSE. After stopping criterion, the test data is used to determine the performance. The architecture of this proposed ELM-SCA model is shown in Fig. 33.1.

Fig. 33.1 Architecture of ELM-SCA model

33 Designing of Financial Time Series Forecasting Model …

367

Algorithm: ELM-SCA 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.

Initialize N and G as population size and number of generations. Randomly initialize N solutions for input weights and hidden biases. Evaluate the MSE for each solution Ej ∈ N. Update the best solution Ebest. T=1 Repeat steps 7 to 13 While ( T < G) Repeat steps 8 to 12 For i = 1 to N Update the parameters r1, r2, r3, r4. Update the position of each solution using Eq. (3) Calculate the fitness using MSE and update the best solution Ebest. T is incremented by 1. Return the best solution that represents input weights and hidden biases.

The architecture of this proposed ELM-SCA model is shown in Fig. 33.1.

33.3 Experimental Study The simulation study involves the procedures to preprocess the dataset and model validation. The currency exchange rate dataset of Singapore Dollar (SGD) to Indian Rupee (INR) from 04-11-2003 to 03-12-2018 is collected in this paper. This dataset contains 4000 data values. The statistical features, i.e., mean, variance are added for the forecasting task. The mini-max normalization algorithm is used to normalize the considered currency dataset. The model is trained with 80% data and validated with 20% data. The role of SCA is to find out the input layer weights and biases of the network. The population size is 50 and no. of generations is 100. For the performance metrics of ELM-SCA, Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) are calculated. The results are shown in Tables 33.2 and 33.3. The comparison of actual and predicted rates for both the models are shown in Fig. 33.2. Table 33.2 The architecture of this proposed results of 1-day ahead forecasting

Table 33.3 MSE Results of 7-days ahead forecasting

Methods

RMSE

MAPE

ELM

0.27654

0.4267

ELM-SCA

0.1859

0.2342

Methods

RMSE

MAPE

ELM

0.35467

0.2643

ELM-SCA

0.2663

0.1695

368

S. Hota et al.

Fig. 33.2 Actual and Calculated rate for (a) 1-day ahead using ELM model (b) 7-day ahead using ELM model (c) 1-day ahead using ELM-SCA model and (d) 7-day ahead using ELM-SCA model

Tables 33.2 and 33.3 show the performance measure values of ELM-SCA model. The RMSE and MAPE values are 0.1859 and 0.2842, respectively for 1-day ahead prediction. Similarly, the RMSE and MAPE values of ELM-SCA model are 0.2663 and 0.1695, respectively for 7-day ahead prediction. These results show efficiency of the ELM-SCA model over the ELM model in this work.

33.4 Conclusion In this work, the ELM-SCA hybrid method is proposed to forecast the exchange rate. The dataset is normalized using mini-max normalization. The ELM-SCA model is trained with 80% data and validated with 20% data. The training data are used in ELM to generate the MSE in one iteration. The input layer weights and biases

33 Designing of Financial Time Series Forecasting Model …

369

are optimally found out using SCA algorithm. Then the testing data is considered to validate the proposed model. For validation, different performance measures are evaluated to find the forecasting performance. The empirical results demonstrate that the proposed ELM-SCA model provided better results in forecasting of exchange rate than the basic ELM model.

References 1. Henrique, B.M., Sobreiro, V.A., Kimura, H.: Literature review: machine learning techniques applied to financial market prediction. Exp. Syst. Appl. (124), 226–251 (2019) 2. Anastasakis, L., Mort, N.: Exchange rate forecasting using a combined parametric and nonparametric self-organizing modelling approach. Exp. Syst. Appl. 36(10), 12001–12011 (2009) 3. Gardner, E.: Exponential smoothing: the state of the art. J. Forecast. 4(1), 1–28 (1985) 4. Rojas, I., Valenzuela, O.R., Guillén, F.A, Herrera, J., Pomares, H., Pasadas, M.: Soft-computing techniques and ARMA model for time series prediction. Neurocomputing 71, 519–537 (2008) 5. Tealab, A.: Time series forecasting using artificial neural networks methodologies: a systematic review. Future Comput. Inform. J. 3, 334–340 (2018) 6. Galeshchuk, S.: Neural networks performance in exchange rate prediction. Neurocomputing 172, 446–452 (2016) 7. Majhi, R., Panda, G., Sahoo, G.: Efficient prediction of exchange rates with low complexity artificial neural network models. Exp. Syst. Appl. 36(1), 181–189 (2009) 8. Das, S.R., Mishra, D., Rout, M.: A hybridized ELM-Jaya forecasting model for currency exchange prediction. J. King Saud Univ. Comput. Inform. Sci. (2017) 9. Hota, S., Mohanty, A.K., Satapathy, P., Mishra, D.: Currency exchange rate forecasting using FLANN-WCA model. In: 2019 International Conference on Applied Machine Learning (ICAML), pp. 114–119 (2019) 10. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: a new learning scheme of feed forward neural networks. Neural Netw. 2, 985–990 (2004) 11. Zhu, Q.Y., Qin, A.K., Suganthan, P.N., Huang, G.B.: Evolutionary extreme learning machine. Pattern Recogn. 38(10), 1759–1763 (2005) 12. Mirjalili, S.: SCA: a sine cosine algorithm for solving optimization problems. Knowl. Based Syst. 96, 120–133 (2016) 13. Nayak, D.R., Dash, R., Majhi, B., Wang, S.: Combining extreme learning machine with modified sine cosine algorithm for detection of pathological brain. Comput. Electr. Eng. 68, 366–380 (2018) 14. Nayak, D.R., Dash, R., Majhi, B., Wang, S.: Discrete ripplet-II transform and modified PSO based improved evolutionary extreme learning machine for pathological brain detection. Neurocomputing 282, 232–247 (2018)

Chapter 34

Twin Support Vector Machines Classifier Based on Intuitionistic Fuzzy Number Parashjyoti Borah, Ranjan Phukan, and Chukhu Chunka

Abstract Twin support vector machines (TWSVM) and the improved version of it, twin bounded support vector machines (TBSVM), are based on the idea of constructing two-class proximal hyperplanes placed at least unit relative distance away from the opposite class data samples. Both inherently possess outlier and noise sensitivity which is common in most of the traditional machine learning classification algorithms. Intuitionistic fuzzy twin support vector machines (IFTSVM), on the other hand, apply the concept of fuzzy membership determined in the same feature space as the mapped input. However, IFTSVM utilizes the fuzzy membership values only for those opposite class samples that violate the constraint that they have to be at a minimum of unit distance from the class proximal hyperplane. IFTSVM does not consider the proximal terms of the objective functions which also are affected by outliers and noise. This work studies incorporating intuitionistic fuzzy membership to the proximal terms of the objective functions that ensure the class hyperplanes are derived at minimal proximity. Furthermore, the proposed work employs costsensitive learning so that the error cost of each class is balanced even in case of datasets with high imbalance ratio. Experimental study proves the effectiveness of the proposed algorithm.

34.1 Introduction Machine learning has gained tremendous research interest in the recent years. Among the available state-of-the-art machine learning techniques, support vector machine (SVM) [1] is one very powerful classification algorithm originally designed to handle binary class problems. SVM constructs a pair of parallel class boundary hyperplanes P. Borah (B) · C. Chunka Department of Computer Science and Engineering, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India R. Phukan O/O Climate Research and Services, India Meteorological Department, Ministry of Earth Sciences, Pune 411005, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_34

371

372

P. Borah et al.

which are defined by the support vectors. Although SVM could achieve high generalization performance, its high training cost comes as a disappointment as the upper bound of training time is in cubical order of the training dataset size. Several studies have been carried out to address this issue of SVM and some of these studies could be availed in [2, 3]. Another interesting idea for partially resolving the high computational complexity issue of SVM is building two non-parallel hyperplanes in a way that is proximal to the respective class with the other class samples falling at a minimum of unit distance. This idea was proposed by Jayadeva et al. [4] and the classification technique was named as twin support vector machines (TWSVM). On solving smaller-sized optimization problems, TWSVM is evident to be approximately four times faster than SVM in theory and even faster than that experimentally. However, unlike SVM, TWSVM tries to minimize the empirical loss and does not follow the structural risk minimization (SRM) principle. Thus, TWSVM does not concern with optimizing the model complexity which also might lead to overfitting problems. An improvement on TWSVM is proposed in [5] to address the aforementioned issues of TWSVM. The improvement may be referred to as TBSVM in short. The only major and significant difference of TBSVM over TWSVM is that the objective functions of TBSVM include the regularization terms to optimize the model complexity. Basically, TBSVM can be viewed as an improvement of TWSVM. Despite its high generalization ability, SVM is prone to noise and outlier sensitivity issues and also lacks imbalance learning capability. One way of handling the issue of sensitivity to noisy data and outliers is by utilizing the degree of belongingness of a sample determined using some fuzzy membership functions. Thus, choosing an appropriate membership function may assign lower membership values to the outlier and noise samples as compared to the more representative samples. A way of incorporating fuzzy membership concept into the SVM formulation is discussed in [6]. Subsequently, many fuzzy-based methods are proposed for SVM and its variants [7, 8]. Similar to SVM, TWSVM and TBSVM also do not incorporate inherent mechanisms to tackle noise and outlier issues as well as class imbalance problems. Fuzzy membership concept can be used in the non-parallel variants of SVM and [9–11] are some of those evidences. Very recently, intuitionistic fuzzy twin support vector machines (IFTSVM) [12] are proposed which is also a fuzzy membershipbased method of TBSVM. The degree of membership of the samples is determined in the same feature space where the class hyperplanes are obtained. Thus, IFTSVM adapts to the non-linearity of a binary classification problem. However, like most of the fuzzy membership-based twin variants of SVM, IFTSVM utilizes the membership values of only those samples of the opposite class that violates the constraint of being at a minimum of unit distant away from the proximal hyperplanes. It would not be unfair to say that the outlier/noise samples of one class could also have the undesired influence on deriving its final proximal hyperplane. Utilizing intuitionistic fuzzy membership values in the proximal term of the objective functions of IFTSVM could offer significant control over the noise/outlier samples of the class. 
With that, assigning lower membership values to the noise/outlier points will weaken the influence of these points in minimizing the proximity of the class hyperplanes. In his paper, we propose an improvement on the existing IFTSVM formulations

34 Twin Support Vector Machines Classifier Based on Intuitionistic …

373

by applying intuitionistic fuzzy membership values not only to the error term but also to the proximal terms of the objective functions of TBSVM. Therefore, the first improvement on the efficient IFTSVM is the utilization of the fuzzy membership by encompassing them in the proximal terms along with the error terms of the objective functions of IFTSVM. Another inherent issue of the traditional machine learning classification algorithms is the inability to handle class imbalance problems. Some studies to address this issue for SVM and TWSVM are available at [13–15]. In a very recent study, authors have proposed affinity and class probability-based fuzzy support vector machine (ACFSVM) [14] which performs efficiently for class imbalance learning of SVM. In [12], the authors have suggested integrating class imbalance learning into IFTSVM as one future scope. The second improvement on IFTSVM is incorporating class imbalance learning by integrating cost-sensitive learning in IFTSVM formulations. In cost-sensitive learning, the error cost is scaled in a way to balance the aggregate costs of both classes. Synthetic Minority Over-sampling Technique (SMOTE) [16] is another efficient technique for resolving the class imbalance problem of classifiers. Due to its effectiveness, many studies have been carried out applying SMOTE in recent years [17–19]. SMOTE under-samples the majority class and over-samples the minority class by generating synthetic minority class samples. Thus, the balance between the classes is maintained. SMOTE is a technique that could be applied in the pre-processing phase of data and then the prepared data could be fed to the machine learning algorithms for training. It does not incorporate class imbalance learning into the problem formulation of the classification algorithm itself, thus, sometimes increasing the overhead. However, unlike SMOTE, cost-sensitive learning embeds mechanisms in the optimization problem of the machine learning classifiers so that the aggregate costs of error contributed to the classes in the optimization problem are balanced. In our proposed formulation, the aggregate cost of the proximal loss and the error loss of the objective functions are normalized using scaling constants. Thus, the proposed improvements increase robustness of IFTSVM against noises and outliers, and also equip class imbalance learning for imbalanced datasets. The organization of the remaining part of the paper is such that: the background works and related works to the current study are briefed in Sect. 34.2, the improved formulations of improved IFTSVM are elaborated and discussed in Sect. 34.3, numerical experiments and results are provided in Sect. 34.4, and the paper concludes with a brief summary of the overall study and findings in Sect. 34.5.

34.2 Background Review We discuss the background works of the proposed approach in this section. The input vectors xi ∈ n , i = 1, 2, . . . , m and the corresponding output yi ∈ {+1, −1} constitutes the input matrix X = (x1 , x2 , . . . , xm )t . Based on the class labels of the input samples, the input matrix X is separated to form the data matrices A and B,

374

P. Borah et al.

respectively consisting of samples of the +1 and the −1 class. The two-dimensional matrices A and B are of size (m 1 × n) and (m 2 × n), respectively.

34.2.1 Intuitionistic Fuzzy Membership Calculation The fuzzy membership function employed in this study is proposed in [20]. The degree of membership for the class samples is determined in the same input/feature space where the class hyperplanes are derived. The degree of membership depends not only on a sample’s degree of belongingness to its own class but also the degree of non-belongingness to the opposite class. The membership degree and the nonmembership degree are determined as follows. The Membership Function: Considering ϕ(xi ) be the feature mapping function that projects an input sample xi from the input space of given dimensions to a feature space with higher number of dimensions than the input space, the degree of membership of xi is obtained as below: ⎧ ||ϕ(xi ) − C + || ⎪ ⎪ , ⎨1 − r+ + δ f (xi ) = ⎪ ||ϕ(xi ) − C − || ⎪ ⎩1 − , r− + δ

yi = +1 (34.1) yi = −1

where C + (C − ) is the class center and r + (r − ) is the radius of the +1(−1) class. The small adjustable parameter δ > 0 is used to avoid division by zero situation. The separation between two points xi and x j in the feature space is calculated as, D(ϕ(xi ), ϕ(x j )) = ||ϕ(xi ) − ϕ(x j )||  = ϕ(xi )t ϕ(xi ) + ϕ(x j )t ϕ(x j ) − 2ϕ(xi )t ϕ(x j ).

(34.2)

Choosing an appropriate kernel function k(u, v) = ϕ(u) · ϕ(v) = ϕ(u)t ϕ(v), we get, D(ϕ(xi ), ϕ(x j )) =

 k(xi , xi ) + k(x j , x j ) − 2k(xi , x j ).

(34.3)

Similarly, the class centers can be determined as, ±

m 1  C = ± ϕ(xi ), m i=1 ±

yi = ±1

(34.4)

where m + = m 1 , m − = m 2 and the radiuses are computed as, r ± = max {||ϕ(xi ) − C + ||}, i = 1, 2, . . . , m ± . yi =±1

(34.5)

34 Twin Support Vector Machines Classifier Based on Intuitionistic …

375

Degree of Non-membership Calculation: The non-membership degree of a data sample is basically a factor of the ratio of the number of data points belonging to the opposite class to the total number of data points present in the neighborhood of that sample. The non-membership function can be given as below: fˆ(xi ) = (1 − f (xi ))ρ(xi )

(34.6)

where ρ(xi ) is the ratio of the number of opposite class data points present in the neighborhood of a certain radius π > 0 around the sample xi to the total number of training data points present within that neighborhood, expressed as follows: ρ(xi ) =

|{x j | ||ϕ(xi ) − ϕ(x j )|| ≤ π, yi = y j }| . |{x j | ||ϕ(xi ) − ϕ(x j )|| ≤ π }|

(34.7)

Here, |·| denotes the cardinality of the set ·. It can be seen that the non-membership values fall in the range [0, 1). Fuzzy Membership Calculation: Finally, a sample xi gets its fuzzy membership value based on the following equation:

si =

⎧ f (xi ), ⎪ ⎪ ⎪ ⎪ ⎨ 0, ⎪ ⎪ ⎪ ⎪ ⎩

fˆ(xi ) = 0 f (xi ) ≤ fˆ(xi )

1 − fˆ(xi ) , Otherwise. 2 − f (xi ) − fˆ(xi )

(34.8)

Interested readers can visit [12, 20] for further details on membership calculation.

34.2.2 IFTSVM As discussed in the previous section, the primary objective of TWSVM [4] is to construct two non-parallel hyperplanes that are at the closest possible distance to the respective class samples and at the same time, possibly maintain a minimum of unit distance from the opposite class samples with some tolerance. The optimization problem of TWSVM tries to minimize the proximal loss of the corresponding class samples in the objective functions while trying to restrict the other class samples from the class hyperplane at unit distance by setting a constraint. The opposite class samples that violate this constraint are support vectors and the amount by which this constraint is violated represents the error loss (slack variable). Further improvement is made on TWSVM by adding the regularization terms in its objective functions and TBSVM was proposed [5]. IFTSVM [12] utilizes the fuzzy membership values of these support vectors determined using Eq. (34.8) to lower the influence of outlier and noise points. With this, the outlier and noise points get lower membership values and

376

P. Borah et al.

consequently are made to contribute less in the optimization process. In proposing IFTSVM [12], this fuzzy membership concept is applied to the more stable variant, i.e., TBSVM [12]. The primal pair of optimization problems of IFTSVM are given below: 1 1 ||Pω1 + b1 e1 ||2 + C1 ||ω1 ||2 + C2 s2t ξ2 2 2 s.t. − (Qω1 + b1 e2 ) + ξ2 ≥ e2 , ξ2 ≥ 0

(34.9)

1 1 ||Qω2 + b2 e2 ||2 + C3 ||ω2 ||2 + C4 s1t ξ1 2 2 s.t. (Pω2 + b2 e1 ) + ξ1 ≥ e1 , ξ1 ≥ 0

(34.10)

min

and min

where, for i = 1 and 2 corresponding to the +1 and the −1 class, respectively, ωi are the weight vectors, bi are the intersection points or biases, ei are unit vectors of appropriate length, 0 are the vectors of zeroes of appropriate length, ξi are the slack vectors (also called vectors of slack variables), and si are the vectors of fuzzy membership values. C1 , C2 , C3 , C4 are the user-specified penalty parameters. In the linear case, the matrices P and Q are the input matrices A and B of the +1 class and the −1 class, respectively. However, in the case of nonlinear IFTSVM, the matrices P and Q represent the kernel matrices k(A, Xt ) and k(B, Xt ) formed using an appropriately chosen kernel function k(•, ∗) [12]. Considering the augmented matrices G = [P e1 ] and H = [Q e2 ], the dual optimization problem pair of IFTSVM is expressed below: min s.t.

1 t α H(Gt G + C1 I)Ht α − e2t α 2 0 ≤ α ≤ C 2 s2

(34.11)

1 t β G(Ht H + C3 I)Gt β − e1t β 2 0 ≤ β ≤ C 4 s1 .

(34.12)

and min s.t.

where α and β are the vectors of dual variables called the vectors of Lagrange’s multipliers, and I is an identity matrix.

34 Twin Support Vector Machines Classifier Based on Intuitionistic …

377

34.3 Proposed Twin Support Vector Machines Classifier Based on Intuitionistic Fuzzy Number IFTSVM discussed above handles the noise and outlier sensitivity issue of twin SVMs to a great extent. However, it is observed that the proximal terms of the objective functions of (34.9) and (34.10) also involve the class samples which makes these terms have influences on the noise and outlier samples. Thus, utilizing the fuzzy membership values in these proximal terms along with the error terms will provide us with a more complete and robust methodology. Therefore, the first improvement on the efficient IFTSVM is the utilization of the fuzzy membership of own class samples by encompassing them in the proximal term of the objective functions of IFTSVM. Another very common issue that traditional machine learning algorithms fail to address is the handling of datasets with high-class imbalance ratio. In [12], authors have suggested integrating class imbalance learning into IFTSVM as one future scope. The second improvement on IFTSVM is incorporating class imbalance learning by integrating cost-sensitive learning in IFTSVM formulations. The aggregate cost of the proximal loss and the error loss of the objective functions are normalized using a scaling constant proposed in [15]. Thus, the proposed improvements increase robustness of IFTSVM against noises and outliers, and also equip class imbalance learning for imbalanced datasets. The primal optimization problem pair of the proposed IFTSVM is presented below: min s.t.

1 1 (Pω1 + b1 e1 )t Z1 (Pω1 + b1 e1 ) + C1 (||ω1 ||2 + b12 ) + C2 z2t ξ2 2 2 − (Qω1 + b1 e2 ) + ξ2 ≥ e2 ,

(34.13)

ξ2 ≥ 0 and 1 1 (Qω2 + b2 e2 )t Z2 (Qω2 + b2 e2 ) + C3 (||ω2 ||2 + b22 ) + C4 z1t ξ1 2 2 (34.14) s.t. (Pω2 + b2 e1 ) + ξ1 ≥ e1 , ξ1 ≥ 0. min

The membership values in problem formulations (34.13) and (34.14) are proposed to be computed as below:  zk =

m sk , k = 1, 2. 2m k

(34.15)

k = 1 is for the +1 and k = 2 is for the −1 class. The scaling constant

Here, m normalizes the aggregate loss from each class and thus the optimization cost 2m k

378

P. Borah et al.

is balanced even though the class imbalance ratio is high. To fit the membership values into the proximal terms of the objective functions, the diagonal matrices Zk = diag(zk ) were formed from the membership vectors. Let us first consider the primal problem defined in (34.13). The Lagrangian equation corresponding to (34.13) is obtained as, 1 1 (Pω1 + b1 e1 )t Z1 (Pω1 + b1 e1 ) + C1 (||ω1 ||2 + b12 ) + C2 z2t ξ2 2 2 − αt (−(Qω1 + b1 e2 ) + ξ2 − e2 ) − γt ξ2 (34.16)

L(ω1 , b1 , ξ2 ) =

where, α and γ are the Lagrange’s multiplier vectors. By taking u1 =

ω1 , we can b1

re-write Eq. (34.16) in a simplified form as, 1 1 (Gu1 )t Z1 (Gu1 ) + C1 ||u1 ||2 + C2 z2t ξ2 2 2 − αt (−Hu1 + ξ2 − e2 ) − γt ξ2 .

L(u1 , ξ2 ) =

(34.17)

Taking the partial derivatives of (34.16) w.r.t. u1 and ξ2 then setting them to 0, we have, δL(u1 , ξ2 ) = 0 => u1 = −(Gt Z1 G + C1 I)−1 Ht α δu1

(34.18)

δL(u1 , ξ2 ) = 0 => C2 z2 − α − γ = 0 δξ2

(34.19)

Replacing the expression for u1 from (34.18) in (34.17) and using (34.19), the dual minimization problem corresponding to (34.13) is obtained as, min s.t.

1 t α H(Gt Z1 G + C1 I)−1 Ht α − e2t α 2 0 ≤ α ≤ C2 z2 .

Now, considering u2 =

ω2

(34.20)

, the dual optimization problem corresponding to b2 (34.14) can be obtained following a similar approach as above, which is given as, min s.t.

1 t β G(Ht Z2 H + C2 I)−1 Gt β − e1t β 2 0 ≤ β ≤ C4 z1 .

(34.21)

where β is the vector of Lagrange’s multipliers. After solving the quadratic programming problem in (34.21), the augmented vector u2 is obtained as,

34 Twin Support Vector Machines Classifier Based on Intuitionistic …

u2 = (Ht Z2 H + C3 I)−1 Gt β

379

(34.22)

Finally, an unseen sample x is classified to a certain class based on its proximity to the class hyperplanes. The closer hyperplane wins the race and the sample x is assigned to the corresponding class. The classification functions (linear and nonlinear) for the proposed approach are defined as below: 

xt ωi + bi For linear case: ψ(x) = arg min ||ωi ||2 i=1,2  k(xt , Xt )ωi + bi For nonlinear case: ψ(x) = arg min . ||ωi ||2 i=1,2

(34.23) (34.24)

34.4 Numerical Experiment Numerical experiments are conducted on 15 real-world benchmark datasets so that the competency of the proposed IFTSVM could be tested for real-world problem solving. To conduct a rigorous result analysis, generalization performance of the newly proposed IFTSVM approach is verified in comparison to the performance of state-of-the-art as well as some very recent related algorithms. The algorithms chosen for performance comparison are SVM, TWSVM, TBSVM, ACFSVM, and IFTSVM. The datasets are collected from the UCI [21] and the KEEL [22] repositories. For this experiment, 15 real-world datasets from different application areas are selected. The datasets are normalized to [0, 1] so that all features get equivalent importance. As real-world datasets are mostly linearly non-separable, we have conducted the experiments for the nonlinear case only. Although nonlinear kernels perform better than linear kernel in most cases, experiments with linear kernel could be conducted. The Gaussian kernel function [15] is one of the most widely used and effective nonlinear kernel functions in kernel-based learning methods. Its effectiveness in SVM-type learning for different application areas is demonstrated in [23–26]. The datasets collected for this study are from diverse fields and hence, in our experiments for this study, the Gaussian kernel is used due to its applicability and effectiveness in diverse areas. The proposed approach improves IFTSVM in terms of robustness to noise and outlier sensitivity and also enhances class imbalance learning. The evaluation measure considered for performance comparison is Area Under Receiver’s Curve (AUC) as AUC concerns both the classes and is a good representative measure for imbalanced datasets. The popular k-fold cross-validation approach is adopted with k = 5. Table 34.1 presents the classification performance of the compared methods and the proposed IFTSVM in terms of percentage AUC values. The standard deviations of AUC values for each dataset are also presented. In Table 34.1, the performance evaluation results of each method on each dataset are presented as “AUC ± Standard

75.1468 ± 5.6811

91.7167 ± 7.3613 90.7434 ± 6.0631

91.7928 ± 12.6125 93.7297 ± 13.2782 94 ± 13.4164

74.4776 ± 4.8519

92.4134 ± 9.9001

93.6008 ± 6.0323

Ecoli-0-1_vs_2-3-5 (244 × 7)

Ecoli-0-1_vs_5 (240 × 6)

Ecoli-0-1-4-7_vs_5-6 (332 × 94.1038 ± 6.4945 6)

Ecoli-0-2-3-4_vs_5 (202 × 7)

93.4137 ± 7.2015

91.2381 ± 6.0196

95.6561 ± 5.0542

94.3127 ± 7.8672

75.7231 ± 3.8442

71.5579 ± 4.5851

92.0925 ± 7.3847

94.2442 ± 6.9499

93.6203 ± 8.5939

75.7518 ± 6.6926

73.1756 ± 7.3125

Proposed IFTSVM

97.4689 ± 2.5309

95.1583 ± 8.5554 95.7368 ± 8.8156

91.4396 ± 5.6693 93.114 ± 6.8617

95.1907 ± 5.1481 95.8834 ± 5.256

94.3127 ± 7.8672 94.3127 ± 8.2106

75.6951 ± 3.5731 76.2286 ± 2.6219

73.4565 ± 6.1539 72.429 ± 6.7089

72.5641 ± 2.0522 73.0567 ± 2.1998

97.653 ± 2.2993

72.8538 ± 3.8385 73.8482 ± 4.734

87.8029 ± 1.6618 87.6268 ± 2.4392

IFTSVM

Ecoli-0-4-6_vs_5 (203 × 6)

92.1955 ± 6.065

90.4733 ± 8.6018

93.3355 ± 4.4418

93.341 ± 5.5649

92.8013 ± 4.107

(continued)

93.6058 ± 4.772

Ecoli-0-2-6-7_vs_3-5 (224 × 93.2726 ± 12.9881 90.9286 ± 18.9178 91.1786 ± 19.0342 93.0226 ± 12.8678 92.6642 ± 5.9363 94.9511 ± 9.9511 7)

94.0702 ± 8.593

69.4583 ± 7.4665

71.3891 ± 5.9184

97.4689 ± 2.5309 73.8614 ± 3.0115

Pima-Indians-Diabetes (768 × 8)

97.3697 ± 2.2293 72.1791 ± 2.5943

Heart-c (297 × 13)

96.9677 ± 1.8092 70.9012 ± 2.9419

97.1167 ± 1.9286

69.8645 ± 5.4831

86.6987 ± 3.5911

72.8814 ± 3.0438

71.1742 ± 1.5966

87.681 ± 2.3813

ACFSVM

German (1000 × 24)

70.8741 ± 3.8669

Bupa or Liver-Disorders (345 71.1275 ± 5.6082 × 6)

TBSVM

Ecoli (336 × 7)

87.0294 ± 3.3803

86.7141 ± 2.5016

Australian Credit (690 × 14)

TWSVM

SVM

Dataset (#Samples × #Attributes)

Table 34.1 Classification performance of SVM, TWSVM, TBSVM, ACFSVM, IFTSVM, and the proposed approach using Gaussian kernel

380 P. Borah et al.

92.2087 ± 6.4733 92.1955 ± 7.1042

94.9352 ± 2.5342

Ecoli4 (336 × 7)

Ecoli-0-1-3-7_vs_2-6 (311 × 92.9117 ± 4.9986 7)

87.4644 ± 4.2892

89.8119 ± 5.9408

Ecoli2 (336 × 7)

TWSVM

SVM

Dataset (#Samples × #Attributes)

Table 34.1 (continued)

93.7081 ± 5.1339

94.5525 ± 6.0929

91.0057 ± 4.0993

TBSVM

94.0731 ± 4.6244

96.2588 ± 4.4852

92.389 ± 3.7306

ACFSVM

Proposed IFTSVM

94.4059 ± 3.7792 94.9752 ± 4.3601

95.0288 ± 6.2369 98.735 ± 0.7006

91.7665 ± 5.7934 92.1226 ± 5.3752

IFTSVM

34 Twin Support Vector Machines Classifier Based on Intuitionistic … 381

382

P. Borah et al.

Table 34.2 Performance ranks of SVM, TWSVM, TBSVM, ACFSVM, IFTSVM, and the proposed approach using Gaussian kernel Dataset

SVM

TWSVM

TBSVM

ACFSVM

IFTSVM

Proposed IFTSVM

Australian credit

5

4

2

6

1

3

Bupa or liver-disorders

4

5

3

6

2

1

Ecoli

5

6

4

2.5

1

2.5

German

3

6

5

1

4

2

Heart-c

5

6

4

2

1

3

Pima-Indians-diabetes

6

5

3

2

4

1

Ecoli-0-1_vs_2-3-5

6

5

2

4

2

2

Ecoli-0-1_vs_5

5

6

2

4

3

1

Ecoli-0-1-4-7_vs_5-6

1

6

5

3

4

2

Ecoli-0-2-3-4_vs_5

3

6

5

4

2

1

Ecoli-0-2-6-7_vs_3-5

2

6

5

3

4

1

Ecoli-0-4-6_vs_5

5

6

3

2

4

1

Ecoli2

5

6

4

1

3

2

Ecoli4

4

6

5

2

3

1

Ecoli-0-1-3-7_vs_2-6

5

6

4

3

2

1

Average rank

4.27

5.67

3.73

3.03

2.67

1.63

Deviation”. Performance ranks of the algorithms on each dataset and their average ranks are computed and provided in Table 34.2. It is observed that the proposed approach could outperform the others in most of the datasets and achieves the best average rank among others.

34.5 Conclusion Improvements on IFTSVM are carried out to increase robustness against outliers and noises and to incorporate class imbalance learning. The proposed approach fits fuzzy membership degrees of the own class samples in the proximal terms of the objective functions to make it more robust against noises and outliers. Further, cost-sensitive learning is implemented for class imbalance learning. The proposed approach is tested on 15 real-world benchmark datasets of different imbalance ratios. The numerical experiments establish the applicability of the proposed approach. Further improvements could be carried out in the future to scale it to large-scale datasets and extension to multiclass classification problems.

34 Twin Support Vector Machines Classifier Based on Intuitionistic …

383

References 1. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995) 2. Suykens, J.A., Vandewalle, J.: Least squares support vector machine classifiers. Neural Process. Lett. 9(3), 293–300 (1999) 3. Mangasarian, O.L., Musicant, D.R.: Lagrangian support vector machines. J. Mach. Learn. Res. 1, 161–177 (2001) 4. Jayadeva, Khemchandani, R., Chandra, S.: Twin support vector machines for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell. 29(5), 905–910 (2007) 5. Shao, Y.H., Zhang, C.H., Wang, X.B., Deng, N.Y.: Improvements on twin support vector machines. IEEE Trans. Neural Netw. 22(6), 962–968 (2011) 6. Lin, C.F., Wang, S.D.: Fuzzy support vector machines. IEEE Trans. Neural Netw. 13(2), 464– 471 (2002) 7. Wang, Y., Wang, S., Lai, K.K.: A new fuzzy support vector machine to evaluate credit risk. IEEE Trans. Fuzzy Syst. 13(6), 820–831 (2005) 8. Liu, J., Zio, E.: A scalable fuzzy support vector machine for fault detection in transportation systems. Expert Syst. Appl. 102, 36–43 (2018) 9. Chen, S.G., Wu, X.J.: A new fuzzy twin support vector machine for pattern classification. Int. J. Mach. Learn. Cybern. 9(9), 1553–1564 (2018) 10. Borah, P., Gupta, D., Prasad, M.: Improved 2-norm based fuzzy least squares twin support vector machine. In: 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 412–419. IEEE (2018) 11. Chen, S., Cao, J., Huang, Z., Shen, C.: Entropy-based fuzzy twin bounded support vector machine for binary classification. IEEE Access 7, 86555–86569 (2019) 12. Rezvani, S., Wang, X., Pourpanah, F.: Intuitionistic fuzzy twin support vector machines. IEEE Trans. Fuzzy Syst. 27(11), 2140–2151 (2019) 13. Fan, Q., Wang, Z., Li, D., Gao, D., Zha, H.: Entropy-based fuzzy support vector machine for imbalanced datasets. Knowl.-Based Syst. 115, 87–99 (2017) 14. Tao, X., Li, Q., Ren, C., Guo, W., He, Q., Liu, R., Zou, J.: Affinity and class probability-based fuzzy support vector machine for imbalanced data sets. Neural Netw. 122, 289–307 (2020) 15. Borah, P., Gupta, D.: Robust twin bounded support vector machines for outliers and imbalanced data. Appl. Intell. 1–30 (2021) 16. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority oversampling technique. J. Artif. Intell. Res. 16, 321–357 (2002) 17. Douzas, G., Bacao, F.: Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE. Inf. Sci. 501, 118–135 (2019) 18. Fernández, A., Garcia, S., Herrera, F., Chawla, N.V.: SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905 (2018) 19. Pan, T., Zhao, J., Wu, W., Yang, J.: Learning imbalanced datasets based on SMOTE and Gaussian distribution. Inf. Sci. 512, 1214–1233 (2020) 20. Ha, M., Wang, C., Chen, J.: The support vector machine based on intuitionistic fuzzy number and kernel function. Soft. Comput. 17(4), 635–641 (2013) 21. Bache, K., Lichman, M.: UCI machine learning repository (2013) 22. Alcalá-Fdez, J., Sanchez, L., Garcia, S., del Jesus, M.J., Ventura, S., Garrell, J.M., Otero, J., Romero, C., Bacardit, J., Rivas, V.M., Fernández, J.C.: KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft. Comput. 13(3), 307–318 (2009) 23. Tay, F.E., Cao, L.: Application of support vector machines in financial time series forecasting. Omega 29(4), 309–317 (2001) 24. Kim, K.J.: Financial time series forecasting using support vector machines. Neurocomputing 55(1–2), 307–319 (2003)

384

P. Borah et al.

25. Huang, Z., Chen, H., Hsu, C.J., Chen, W.H., Wu, S.: Credit rating analysis with support vector machines and neural networks: a market comparative study. Decis. Support Syst. 37(4), 543– 558 (2004) 26. Kavzoglu, T., Colkesen, I.: A kernel functions analysis for support vector machines for land cover classification. Int. J. Appl. Earth Obs. Geoinf. 11(5), 352–359 (2009)

Chapter 35

Automatic Detection of Epileptic Seizure Based on Differential Entropy, E-LS-TSVM, and AB-LS-SVM Sumant Kumar Mohapatra and Srikanta Patnaik

Abstract The experimental work is to create an accurate machine learning system to detect the epileptic seizure. In this research work, Bonn university data sets are utilized. The differential entropy as feature is extracted from recorded EEG signal using iterative filtering decomposition (IFD) method. Next, the DE feature is inputted to adaptive boost LS-SVM (ABLSSVM) and enhanced LS-SVM (ELSSVM) classifier to classify the EEG segments. The proposed technique IFD-DE-ELSSVM outperforms as compared to the IFD-DE-ABLSSVM technique in all respect. This method also shows intense outcomes as compared to other existing methods on the same EEG data sets the proposed method will become a valuable experimental tool in clinical application and benefit epilepsy patients.

35.1 Introduction Epileptic seizures are generated due to abnormal neuronal activities [1]. There are different machine learning existing methods are developed to detect seizure. Mahmodian et al. [2] proposed method using cross-bispectrum technique which achieves sensitivity, specificity, and accuracy as 95.8%, 96.7%, and 96.8% respectively. In [3] authors proposed method by using differential entropy and an attention model. Sirpal et al. [4] proposed a method in multimodal EEG-fnirs recordings. In [5], a deep learning technique is proposed which gets sensitivity of 92.7% and Specificity of 90.8%. In [6], authors present a ABLSSVM classification approach. Nkengfack et al. [7] proposed a method using polynomial transforms, LDA, and SVM. In [8], authors describe an epileptic seizure method using iterative filtering decomposition method. In [9] authors present a machine learning approach that gets sensitivity, Specificity, and accuracy of 9%, 100%, and 99.5% respectively. This paper is organized as Sect. 35.2 covers the materials and methods which include S. K. Mohapatra (B) · S. Patnaik Siksha ‘O’ Anusandhan Deemed to be University, Bhubaneswar, India S. Patnaik e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_35

385

386

S. K. Mohapatra and S. Patnaik

Fig. 35.1 Plot of recorded EEG signal of 2000 samples

preprocessing, feature extraction, and performance computation. Section 35.3 sets the classifiers used for the output response and Sect. 35.4 elaborates the results and discussion. Finally, the conclusion of this paper is summarized in Sect. 35.5.

35.2 Materials and Methods 35.2.1 Clinical Dataset In this experimental analysis, Bonn University EEG database [10] is used. The whole database is divided into five sets as A, B, C, D, and E. The EEG sets A and B are recorded from healthy subjects and C and D EEG sets are preictal EEG signals. Finally, the set E is recorded from epilepsy patients.

35.2.2 Preprocessing The recorded EEG signals are preprocessed before decomposition. Due to 50 Hz power line signal, the EEG signal is affected. So, it is highly necessary to reject the power line by using the IIR notch filter and Butterworth bandpass filter centered at 40 Hz. The sudden spikes arise from the recorded are filtered out by MA filters. The filtered EEG signal is shown in Fig. 35.1.

35.2.3 Feature Extraction In this experimental study, the feature Differential Entropy (DE) is extracted by using the intermediate frequency (IMF) technique. This has the ability to create number of zero-crossing extrema. These IMFs are generated by Iterative Filter Decomposition

35 Automatic Detection of Epileptic Seizure Based …

387

(IFD) method [11] inspired by EMD. The differential entropy is expressed as ∞ h(x) = −∞

1

2

− (x−μ) 2σ 2

e √ 2π σ 2



1

2

− (x−μ) 2σ 2

log √ e 2π σ 2

 ∞   1 log 2π eσ 2 (35.1) dx = 2 −∞

where π and e are constant.

35.2.4 Performance Computation This experimental work assesses the performance of the proposed classifiers using sensitivity (SEN%), Specificity (SPE%), accuracy (ACC%), Positive predictive value (PPV%), Matthew’s correlation coefficient (MCC%) Area under curve (AUC) and Execution Time in second. The computational equations are, SEN% =

TP % TP + FN

(35.2)

SPE% =

TN % TN + FP

(35.3)

TP TP + TN + FP + FN

(35.4)

TP TP + FP

(35.5)

(TPXTN) − (FNXFP) T1XT2

(35.6)

ACC% =

PPV% = MCC =

35.3 Classifiers The proposed method for classification of normal EEG, interictal EEG, and ictal EEG signals, Adaboost LS-SVL(ABLSSVM) [5] and enhanced LSTSVM (ELSTSVM) [12] are used. The 10fold cross-validation is used. For more experimental work 90– 10% of training to testing data is used and the average of the statistical parameters are inputted as the output response.


35.4 Results and Discussion

The step-wise proposed model is represented in Fig. 35.2. The whole experimental analysis is implemented in MATLAB R2015a, and 10-fold cross-validation is used to calculate each statistical performance measure. The extracted features are arranged into a matrix of 240 feature vectors, and the rank of each feature is calculated using a t-test. Differential entropy is the most significant feature (p < 0.001) compared with the other entropy-based features. The classifiers used in this experimental analysis are evaluated with RBF, linear, and polynomial kernels, and these kernel functions produce good results in terms of the statistical parameters; seven statistical parameters are used for the computation. Two well-known classifiers, ABLSSVM and ELSTSVM, are used, and their outputs are compared with LS-SVM and LS-TSVM. The selected feature, differential entropy, inputted to LS-SVM achieved 91.13% specificity, 94.55% sensitivity, and 94.4% accuracy with 17.45 s of execution time.

Fig. 35.2 Step-wise representation of the proposed model

35 Automatic Detection of Epileptic Seizure Based …

389

Table 35.1 Comparative computation of different classifiers

Classifiers | SPE % | SEN % | ACC % | PPV % | MCC % | Execution time (s) | AUC
LS-SVM | 91.13 | 94.55 | 94.4 | 91.86 | 75.28 | 17.45 | 0.952
LS-TSVM | 94.52 | 96.22 | 96.48 | 92.35 | 82.18 | 15.23 | 0.968
ABLSSVM | 97.45 | 97.43 | 97.51 | 94.50 | 89.32 | 4.8 | 0.983
ELSTSVM | 99.3 | 97.23 | 99.39 | 99.53 | 97.68 | 2.3 | 0.999

Then using LS-TSVM, we achieved 94.52% specificity, 96.22% sensitivity, and 96.48% accuracy with 15.23 s of execution time. Table 35.1 shows the performance of the proposed methods. The ABLSSVM classifier achieved 97.45% specificity, 97.43% sensitivity, and 97.51% accuracy with 4.8 s of execution time; the achieved AUC is 0.983. Finally, ELSTSVM outperformed the others with 99.3% specificity, 97.23% sensitivity, and 99.39% accuracy with 2.3 s of execution time; its AUC is nearly 1. Table 35.2 shows a comparative analysis between state-of-the-art methods and the proposed method. The ROC performance plot between sensitivity and specificity is shown in Fig. 35.3; it gives a simple comparative analysis of the four classifiers in terms of AUC, with computed AUC values of 0.952, 0.968, 0.983, and 0.999 for LS-SVM, LS-TSVM, ABLSSVM, and ELSTSVM, respectively. Figure 35.4 shows the performance evaluation plot of the different classifiers summarized in Table 35.1; it shows that over 32 iterations, LS-SVM takes the highest execution time (17.45 s) compared to the other classifiers.

35.5 Conclusion

In this experimental work, two methods are proposed to detect epileptic seizures in EEG recordings. In the first method, the DE feature is inputted to the ABLSSVM classifier, which achieves 97.51%, 97.45%, and 97.43% accuracy, specificity, and sensitivity, respectively. In the second method, DE is inputted to the ELSTSVM classifier, which achieves 99.39%, 99.3%, and 97.23% accuracy, specificity, and sensitivity, respectively, with a very low execution time of 2.3 s. The proposed method can be used for different clinical applications in the future.

Table 35.2 Comparative analysis of the state of the art with the proposed method

Author(s) | References | Existing method | Accuracy (%) | Sensitivity (%) | Specificity (%)
N. Mahmoodian | [2] | Cross-bispectrum | 96.8 | 95.8 | 96.7
Jian Zhang | [3] | Differential entropy, attention model | 95.12 | 93.84 | 95.84
Sirpal et al. | [4] | fNIRS, multimodal EEG | NA | 89.7 | 95.5
Usman et al. | [5] | Deep learning technique | NA | 92.7 | 90.8
Al-Hadeethi, Hanan et al. | [6] | Adaptive boost LS-SVM | 99 | 99 | 99
Laurent Chanel | [7] | LDA and SVM | 97.8 | 98 | 98.12
Deba Prasad Dash | [8] | IFD, hidden Markov model | 99.60 | 96.64 | 99.28
M. Savadkoohi et al. | [9] | SVM | 99.5 | 99 | 100
Proposed method-1 | – | Differential entropy, ABLSSVM | 97.51 | 97.43 | 97.45
Proposed method-2 | – | Differential entropy, ELSTSVM | 99.39 | 97.23 | 99.3


Fig. 35.3 ROC plot showing the outputs of the different classifiers

Fig. 35.4 Plot shows the comparative analysis of statistical parameters


References

1. Ngugi, A., et al.: Incidence of epilepsy: a systematic review and meta-analysis. Neurology 77(10), 1005–1012 (2011)
2. Mahmoodian, N., Boese, A., Friebe, M., Haddadnia, J.: Epileptic seizure detection using cross-bispectrum of electroencephalogram signal. Seizure 66, 4–11 (2019)
3. Zhang, J., Wei, Z., Zou, J., Hao, F.: Automatic epileptic EEG classification based on differential entropy and attention model. Eng. Appl. Artif. Intell. 96, 103–112 (2020)
4. Sirpal, et al.: fNIRS improves seizure detection in multimodal EEG-fNIRS recordings. J. Biomed. Opt. 24(5), 051408, 1–18 (2019)
5. Usman, et al.: Epileptic seizures prediction using deep learning techniques. IEEE Access 8, 39998–40007 (2019)
6. Al-Hadeethi, H., et al.: Adaptive boost LS-SVM classification approach for time-series signal classification in epileptic seizure diagnosis applications. Expert Syst. Appl. 161, 113676 (2020)
7. Nkengfack, L.C.D., et al.: EEG signals analysis for epileptic seizures detection using polynomial transforms, linear discriminant analysis and support vector machines. Biomed. Signal Process. Control 62, 102–113 (2020)
8. Dash, D.P., Kolekar, M.H., Jha, K.: Multi-channel EEG based automatic epileptic seizure detection using iterative filtering decomposition and Hidden Markov model. Comput. Biol. Med. 116, 103–115 (2020)
9. Savadkoohi, M., Oladunni, T., Thompson, L.: A machine learning approach to epileptic seizure prediction using electroencephalogram (EEG) signal. Biocybernetics Biomed. Eng. 40(3), 1328–1341 (2020)
10. Andrzejak, R.G., Lehnertz, K., Mormann, F., Rieke, C., David, P., Elger, C.E.: Indications of nonlinear deterministic and finite dimensional structures in time series of brain electrical activity: dependence on recording region and brain state. Phys. Rev. E 64(6), 061907 (2001)
11. Sharma, R.R., et al.: Automated system for epileptic EEG detection using iterative filtering. IEEE Sens. Lett. 2, 1–4 (2018)
12. Ganaie, M.A., et al.: LSTSVM classifier with enhanced features from pre-trained functional link network. Appl. Soft Comput. J. 93, 106305 (2020)

Chapter 36

Classification of Arrhythmia ECG Signal Using EMD and Rule-Based Classifiers Prakash Chandra Sahoo and Binod Kumar Pattanayak

Abstract The objective of our work is to classify normal and abnormal ECG signals from an arrhythmia ECG signal using Empirical Mode Decomposition and rule-based classifiers, in order to support an exact diagnosis for heart patients. First, we used clinical datasets for the experimental work. Then, eight types of features were extracted from the ECG signals and combined. Next, with the help of EMD, all the features were fed into the classifiers to detect normal and abnormal ECG signals. After extensive experiments, we found that the ANFIS method achieved excellent performance compared to the other classifiers; the average area under the curve (AUC) is exactly 1. The ANFIS method is able to produce the best detection outcomes for proper clinical diagnosis in health applications.

36.1 Introduction

The ECG signal is generated by the stimulation of the heart muscle. Generally, the electrocardiogram is used for cardiac patients to check the condition of the heart, and it enables a precise analysis. There are five basic waves in the ECG signal: the S, R, Q, P, and T waves. In addition, another wave known as the U wave can be found, although only rarely. The P wave corresponds to atrial depolarization, while the S, R, and Q waves form the QRS complex. The electrocardiogram of a particular person may vary across time periods, while for a certain period of time the signals of two separate persons may be similar. Hence, changes in the electrocardiogram signal play a vital role in the analysis and classification of the disease [1, 2].

In 2013, Balasundaram et al. [3] used a singular value decomposition method based on wavelet analysis that achieves 93.7% accuracy. In 2015, Subbiah et al. [4] used an ECG signal method with an SVM classifier that gives 89% accuracy. In 2017, Maršánová et al. [5] used morphological and spectral features with SVM to give 98.6% accuracy. In 2018, Hassanien et al. [6] developed a time-domain feature extraction method combining SVM and Elephant Herding Optimization to give a 93.31% accurate result. In 2015, Kavitha et al. [7] used linear and nonlinear features from the HRV with an enhanced SVM classifier to give 93.38% accuracy. In 2018, Ashtiyani et al. [8] developed a DWT-based method using SVM to give a 97.14% accurate result. In 2008, Asl et al. [9] designed a support vector machine-based arrhythmia classification using reduced features of the heart rate variability signal. In 2016, Hu et al. [10] examined the distance function effect on K-nearest neighbor classification for medical datasets. In 2017, Zhang and Chen [11] designed LMD-based features for the automatic seizure detection of EEG signals using SVM. In 1999, Suykens et al. [12] developed the least squares support vector machine classifier. Using ANFIS, we obtain a result of 99.9% accuracy.

36.2 Proposed Method

See Fig. 36.1.

Fig. 36.1 Proposed system

Fig. 36.2 Feature extraction by using EMD

36.2.1 Clinical Dataset

For this experimental analysis, an arrhythmia database is used [9]. In this database, the signals were band-pass filtered at 0.1–100 Hz and digitized at 360 Hz. The database carries information about many types of anomalous beats. No extra preprocessing of the beats was applied, since the aim was to discover potential arrhythmia patterns using HRV features. If no anomalous beat was pointed out in the beat annotation, the rhythm in the segment was treated as normal sinus rhythm (NSR).

36.2.2 Feature Extraction

Using EMD, features are extracted from IMF1–IMF5: mean, SDNN, RMSSD, SDSD, pNN5, pNN10, pNN20, pNN50, HRV triangular index, TINN (LF, HF, LF/HF, total PSD), SD1/SD2, Fano factor, and Allan factor [10]. Figure 36.2 shows the feature extraction using the EMD method.
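The following sketch illustrates this step under stated assumptions: PyEMD provides the EMD, the input is an RR-interval series expressed in milliseconds, and only a subset of the listed features is computed, as an illustration rather than the authors' implementation.

```python
import numpy as np
from PyEMD import EMD  # pip install EMD-signal

def hrv_features(rr_ms):
    # rr_ms: RR-interval series in milliseconds
    feats = []
    for imf in EMD().emd(np.asarray(rr_ms, dtype=float))[:5]:  # IMF1..IMF5
        d = np.diff(imf)
        feats += [
            imf.mean(),                        # mean
            imf.std(ddof=1),                   # SDNN
            np.sqrt(np.mean(d ** 2)),          # RMSSD
            d.std(ddof=1),                     # SDSD
            100 * np.mean(np.abs(d) > 50),     # pNN50
        ]
    return np.array(feats)
```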

36.2.3 Classifiers Used

There are different types of classifiers for categorizing information. In this experiment, we used K-nearest neighbors (KNN) [10], SVM [11], least squares SVM (LS-SVM) [12], and ANFIS to demonstrate the effectiveness of our proposed classification approach.

36.2.4 Performance Measurements

This paper estimates the performance of the proposed classifiers using sensitivity (SEN%), specificity (SPE%), accuracy (ACC%), positive predictive value (PPV%), Matthews correlation coefficient (MCC%), area under the curve (AUC), and execution time in seconds. The computational equations are

$$\mathrm{SEN}\% = \frac{TP}{TP + FN} \quad (36.1)$$

$$\mathrm{SPE}\% = \frac{TN}{TN + FP} \quad (36.2)$$

$$\mathrm{ACC}\% = \frac{TP + TN}{TP + TN + FP + FN} \quad (36.3)$$

$$\mathrm{PPV}\% = \frac{TP}{TP + FP} \quad (36.4)$$

$$\mathrm{MCC}\% = \frac{(TP \times TN) - (FN \times FP)}{T_1 \times T_2} \quad (36.5)$$

36.3 Results and Discussion

The step-wise proposed model is represented in Fig. 36.1. In this research work, we extracted features from the arrhythmia signal and, after preprocessing with the help of Empirical Mode Decomposition, selected a few features and implemented four methods (KNN, SVM, LS-SVM, and ANFIS) to separate normal from abnormal ECG signals. Table 36.1 shows the performance obtained with these classifiers, evaluated in the form of the statistical parameters mentioned above, together with the AUC value and execution time. When we used KNN, it gave specificity, sensitivity, accuracy, PPV, MCC, execution time, and AUC of 92.53%, 95.88%, 94.8%, 92.89%, 79.82%, 18.28 s, and 0.912, respectively. When we used SVM, it gave 95.82%, 96.21%, 95.71%, 92.45%, 82.52%, 14.20 s, and 0.883. For LS-SVM, it gave 96.72%, 97.52%, 96.88%, 94.52%, 88.29%, 11.5 s, and 0.978. When we used ANFIS, it gave 99.28%, 99.52%, 99.9%, 98.25%, 96.52%, 4.8 s, and 1. From these experimental results, the ANFIS classifier gave outstanding results with the lowest execution time in every respect. Table 36.1 shows that the ANFIS classifier performed well for every set of experimental works. Figure 36.2 shows how the different classifiers performed, Fig. 36.3 shows that ANFIS outperforms on these parameters in every direction, and Fig. 36.4 shows the plot of classifiers, which makes the detection of ECG signals easy to compare.

Table 36.1 Performance computation of classifiers

Classifiers | SPE% | SEN% | ACC% | PPV% | MCC% | Execution time (s) | AUC
KNN | 92.53 | 95.88 | 94.8 | 92.89 | 79.82 | 18.28 | 0.912
SVM | 95.82 | 96.21 | 95.71 | 92.45 | 82.52 | 14.20 | 0.883
LS-SVM | 96.72 | 97.52 | 96.88 | 94.52 | 88.29 | 11.5 | 0.978
ANFIS | 99.28 | 99.52 | 99.9 | 98.25 | 96.52 | 4.8 | 1

Fig. 36.3 ROC plot of the proposed method

36.4 Comparative Analysis See Table 36.2.


Fig. 36.4 Plot shows the classification performances of classifiers

Table 36.2 Comparative analysis in terms of accuracy (ACC %) of existing methods with the proposed method on the database used in our work

References | Existing method | Accuracy (%)
[3] | Singular value decomposition (SVD) using wavelet analysis | 93.7
[4] | Detection of peak waveform in ECG signal | 89
[5] | Morphological and spectral features | 98.6
[6] | Time-domain feature extraction | 93.31
[7] | Linear and nonlinear features from the HRV | 93.38
[8] | DWT | 97.14
Proposed method | ANFIS | 99.9

36.5 Conclusion

In this research work, we extracted eight features from the arrhythmia database and combined them. These features were then used with the help of different rule-based classifiers. After extensive experiments, it was found that ANFIS performed best on the arrhythmia ECG signal in every respect, giving accuracy, sensitivity, specificity, PPV, and MCC of 99.9%, 99.52%, 99.28%, 98.25%, and 96.52%, respectively, with the lowest execution time.


References

1. Challis, R.E., Kitney, R.I.: Biomedical signal processing (in four parts). Part 1: time domain methods. Med. Biol. Eng. Comput. 509–524 (1990)
2. Hu, Y.H., Palreddy, S., Tompkins, W.: A patient adaptable ECG beat classifier using a mixture of experts approach. IEEE Trans. Biomed. Eng. 44(9), 891–900 (1997)
3. Balasundaram, K., Masse, S., Nair, K., Umapathy, K.: A classification scheme for ventricular arrhythmias using wavelets analysis. Med. Biol. Eng. Comput. 51, 153–164 (2013). https://doi.org/10.1007/s11517-012-0980-y
4. Subbiah, S., Patro, R., Subbuthai, P.: Feature extraction and classification for ECG signal processing based on artificial neural network and machine learning approach. In: International Conference on Inter Disciplinary Research in Engineering and Technology, pp. 50–57 (2015)
5. Maršánová, L., Ronzhina, M., Smíšek, R., et al.: ECG features and methods for automatic classification of ventricular premature and ischemic heartbeats: a comprehensive experimental study. Sci. Rep. 7(1), 11239 (2017). https://doi.org/10.1038/s41598-017-10942-6
6. Hassanien, A.E., Kilany, M., Houssein, E.H.: Combining support vector machine and elephant herding optimization for cardiac arrhythmias. CoRR arXiv:1806.08242 (2018)
7. Kavitha, R., Christopher, T.: An effective classification of heart rate data using PSO-FCM clustering and enhanced support vector machine. Indian J. Sci. Technol. 8(30) (2015)
8. Ashtiyani, M., Navaei Lavasani, S., Asgharzadeh Alvar, A., Deevband, M.R.: Heart rate variability classification using support vector machine and genetic algorithm. J. Biomed. Phys. Eng. 8(4) (2018)
9. Asl, B.M., Setarehdan, S.K., Mohebbi, M.: Support vector machine-based arrhythmia classification using reduced features of heart rate variability signal. Artif. Intell. Med. 44(1), 51–64 (2008)
10. Hu, L.-Y., Huang, M.-W., Ke, S.-W., Tsai, C.-F.: The distance function effect on K-nearest neighbor classification for medical datasets. Springerplus 5, 1304 (2016)
11. Zhang, T., Chen, W.: LMD based features for the automatic seizure detection of EEG signals using SVM. IEEE Trans. Neural Syst. Rehabil. Eng. 25, 1100–1108 (2017)
12. Suykens, J.A., et al.: Least squares support vector machine classifiers. Neural Process. Lett. 9(3), 293–300 (1999)

Chapter 37

A Comparative Analysis of Data Standardization Methods on Stock Movement Binita Kumari and Tripti Swarnkar

Abstract Prediction of stock market indices has attracted considerable attention due to its impact on economic development. Accurate prediction of stock market indices is important in order to reduce the uncertainty surrounding them and to arrive at effective financing decisions. The selection of a proper forecasting model is therefore highly valued, and it is always affected by the input data. The objective of this paper is to efficiently normalize input data in order to obtain accurate forecasting of stock movement, and to compare the resulting accuracy across different classifiers. This study compares three normalization techniques and their effect on forecasting performance. In our work, we implemented different classifiers, namely SVM, ANN, and KNN, for stock trend forecasting because of their risk management capabilities. This article deals primarily with the normalization of input data for the estimation of stock movement. Simulation was performed on six stock indices from different parts of the world market, and a performance review of the system was carried out. The study reveals the strong sensitivity of commonly used classifiers to the data standardization method chosen, as well as the need for a cautious approach when interpreting the results obtained.

37.1 Introduction

Data mining is a rapidly developing technology in the field of information processing. It has been linked to various fields, for example, the armed forces, engineering, science, administration, and business. Within the financial realm, data mining can be carried out to assist with the analysis of stock prices, financial evaluations, and so on. The forecasting of stock market movement is seen as a daunting task, as the financial market is a complex, dynamic, and non-linear system [1]. Numerous studies have focused on the use of different classifiers [6, 16, 20, 32] in the domain of financial market forecasting. Part of a stock market index forecasting system converts the criteria, whose values come in a variety of units and scales, to a standard and comparable numeric range. This step, called normalization, can have a profound impact on the result of the calculation. In this article, we emphasize the standardization of input data for stock market forecasting, and we contrast the classification results of the Support Vector Machine, K Nearest Neighbor, and Artificial Neural Network. The rest of the paper is organized as follows: In Sect. 37.2, we describe the related work. In Sect. 37.3, the methods and materials used in our work are discussed. Section 37.4 provides detail about the proposed prediction model. Section 37.5 deals with the results and discussion, followed by the conclusion and future work in Sect. 37.6.

37.2 Related Work

Understanding market dynamics and critical patterns is very attractive to stock traders and anyone wanting to select the right stock or, possibly, the best time to buy or sell stocks [30]. Many macro-financial components, such as political events, organization policies, and general monetary circumstances, affect stock prices [23]. According to the authors in [5], soft computing approaches are widely used for financial market problems and are useful techniques for predicting nonlinear behavior. The Support Vector Machine, as well as the Artificial Neural Network (ANN), has been utilized for stock forecasting by various researchers. However, even after so many dynamic models have been built, Artificial Neural Networks have a few drawbacks in the learning process that affect the results, as shown in [15]. As a consequence, a number of researchers prefer advanced systems that rely on a powerful statistical basis, such as SVM [10]. Various studies use SVM to interpret time series information [12, 33]. The SVM is a machine learning method popularized by Vapnik in 1995 and has been used for non-linear predictions due to its attractive properties and its high degree of performance on various problems. In [33], Tay and Cao compared neural frameworks for estimation and found the SVM superior to the multi-layer neural network framework for financial time series prediction. The Artificial Neural Network has been used in many domains, one of which is stock prediction: [3, 18, 19, 27, 32] are only a few of the studies on using Artificial Neural Networks to model stock prices, and [24, 33] include several more papers on using neural networks to predict stock market volatility. The results obtained through the use of ANNs are superior to those obtained through the use of linear and logistic regression models [4, 30].


A study using ANNs to forecast financial outcomes [6] yielded results with a 3 percent average failure rate. Even during the financial crisis, a Multi-Layer Perceptron model [16] with macroeconomic indicators used to estimate the Istanbul Stock Exchange produced a signal with a 73.7 percent correctness ratio, demonstrating the skill of ANNs in prediction. Artificial Neural Networks outperform the adaptive exponential smoothing approach in predicting market movement, according to a report [2]. Many economists and financial analysts have advocated for the presence of financial market nonlinearity and uncertainty [9].

Normalization is a necessary step in every technique where data-handling protocols are implemented, so a study of the application of normalization procedures in various areas has been carried out. A massive amount of analysis work preprocesses the data without examining the effect this has on the quality of the results; questions about this prerequisite were posed by the authors in [11, 22, 28]. For the oversampling of unbalanced datasets, a pre-processing method known as SMOTE ENN has been used in [34]. As addressed by the researchers in [22, 28], the behavior of a standardization method is further affected by the noise present within the data set. Han and Men [13] evaluated the effect of standardization on RNA-seq disease diagnosis. In a different paper, the authors of [17] assessed 14 standard normalization methods and developed a dynamic selection model in order to select the most appropriate standardization technique. Accordingly, it is clear from the literature that the normalization method chosen for a data mining task can have an effect on the accuracy of the result. In our paper, we take a closer look at the value of standardization for stock forecasting.

37.3 Methods and Materials

37.3.1 Data Sets

To confirm the effect of the normalization of input data on forecasting results, six different index datasets from different regions, namely India, the United States of America, Hong Kong, China, and Japan, have been considered. BSESN, NIFTY50, NASDAQ, HANG SENG, NIKKEI225, and the SSE composite index are selected as experimental data sets in this analysis. The analysis uses data from 30/11/2015 to 27/11/2020. The collected data comprise each day's high, open, low, and closing prices, which are used as input indicators. Data were obtained from Yahoo Finance (https://in.finance.yahoo.com/). Table 37.1 provides the details of the datasets used.

The purpose of our paper is to predict the direction of each day's record of index movement. A big issue with any such dataset is that there are no up/down class marks in it. Therefore, as described in [8], we use the c attribute, which indicates the variation


Table 37.1 Dataset description

Sl. No. | Dataset | Duration | No. of collected samples | No. of features
1 | BSESN | 30/11/2015 to 27/11/2020 | 1235 | 7
2 | NIFTY50 | 30/11/2015 to 27/11/2020 | 1235 | 7
3 | NASDAQ | 30/11/2015 to 27/11/2020 | 1235 | 7
4 | HANG SENG | 30/11/2015 to 27/11/2020 | 1235 | 7
5 | NIKKEI225 | 30/11/2015 to 27/11/2020 | 1235 | 7
6 | SSE composite index | 30/11/2015 to 27/11/2020 | 1235 | 7

in closing price. c has been used as an identifier for the class: "1" and "−1" convey that the coming day's index is higher or lower than the current index. Models for forecasting are then built, and their output is used to determine the predicted direction.
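A sketch of this label construction with pandas is shown below; the column name Close is assumed from the Yahoo Finance export.

```python
import numpy as np
import pandas as pd

def add_labels(prices: pd.DataFrame) -> pd.DataFrame:
    out = prices.copy()
    out["c"] = out["Close"].diff()              # variation in closing price
    nxt = out["Close"].shift(-1)                # the coming day's close
    out["label"] = np.where(nxt > out["Close"], 1, -1)
    return out.iloc[1:-1]                       # first row lacks c, last lacks a next day
```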

37.3.2 Normalization

Normalization is a data pre-processing phase in which we scale the input data to a small range. Essentially, standardization of the data is needed to manage attributes of different units and sizes, with the ultimate aim of achieving better results. If a mining technique has a random sampling component, standardizing the sample size also helps ensure that all sources are evaluated uniformly and that data-availability bias (and its corresponding misrepresentation of the data universe) is reduced. Standardization of input data therefore plays an essential role in the stock prediction process. The following three normalization methods have been used to analyze their effect on stock estimates:
1. Min–max
2. Z score
3. Robust

Table 37.2 Normalization methods used

Sl. No. | Normalization technique | Formula
1 | Z score | $A_i = \frac{A_i - \mu}{\sigma}$
2 | Robust | $A_i = \frac{A_i - \mathrm{median}}{75\,\mathrm{percentile} - 25\,\mathrm{percentile}}$
3 | Min–max | $A_i = \frac{A_i - \min A_i}{\max A_i - \min A_i}$

Table 37.3 Some widely adopted technical indicators

Relative index; Stochastic slow; Stochastic indicator; Disparity 5; 20-day bias; Rate of change (ROC); Momentum; Relative strength index (RSI); Disparity 10; Moving average oscillators (MAO)

The equations for the three normalization techniques utilized in our paper are shown in Table 37.2, where Ai denotes the ith dimension of the data set. As recorded by the researchers in [5, 29] and in the literature review, the normalization techniques set out in Table 37.2 are commonly used in a variety of fields, such as medicine, industry, finance, and business. On the basis of the literature survey, 70% of the data points are used as training data and the remaining 30% as test data. Aiming to boost the predictive scope of the model, we have produced an amalgamated data set consisting of the general features of the stock data as well as eighty-three technical indicators, some of which are listed in Table 37.3. Along with that, it includes c as indicated in [8] and the class label marked as −1 or 1.
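The three scalers of Table 37.2 can be sketched with scikit-learn as follows; fitting the scaler on the 70% training split only is our assumption of good practice rather than a step spelled out in this chapter.

```python
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

SCALERS = {
    "ZS": StandardScaler,   # (A_i - mu) / sigma
    "RS": RobustScaler,     # (A_i - median) / (75th - 25th percentile)
    "MM": MinMaxScaler,     # (A_i - min) / (max - min)
}

def normalize(train_X, test_X, method="ZS"):
    scaler = SCALERS[method]().fit(train_X)     # fit on the training split only
    return scaler.transform(train_X), scaler.transform(test_X)
```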

37.3.3 Technical Indicators

For stock market indices, the input features usually used are the opening price, closing price, highest price, lowest price, and total volume. Numerous studies have shown that technical indicators are useful for stock forecasting [14, 25]. Technical indicators can be calculated from the opening price, the lowest price, the highest price, and the trading volume of the data. Some commonly adopted technical indicators are shown in Table 37.3.
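As an illustration, three of the indicators in Table 37.3 (ROC, momentum, and RSI) can be computed from the daily closing price with pandas; the window lengths, including the 14-day RSI window, are conventional choices and not values reported in this chapter.

```python
import pandas as pd

def roc(close: pd.Series, n: int = 10) -> pd.Series:
    return 100 * (close - close.shift(n)) / close.shift(n)   # rate of change

def momentum(close: pd.Series, n: int = 10) -> pd.Series:
    return close - close.shift(n)                            # momentum

def rsi(close: pd.Series, n: int = 14) -> pd.Series:
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(n).mean()             # average gain
    loss = (-delta.clip(upper=0)).rolling(n).mean()          # average loss
    return 100 - 100 / (1 + gain / loss)                     # relative strength index
```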

37.3.4 Support Vector Machines

As indicated by the researchers in [11, 12], SVMs are supervised learning models that analyze data and learn patterns for use in classification and regression studies. An SVM works by creating hyperplanes in a multidimensional space that segregate instances of different class labels, and it is capable of managing both continuous and categorical variables. When employing SVM for prediction, the main point to be decided is the choice of kernel function, which many researchers have addressed for financial prediction. In our study, we implemented the radial basis function (RBF) kernel due to its wide use in the literature. Once the kernel function is chosen, two essential parameters (C, γ) need to be set: γ is the coefficient of the kernel function, and C is the cost parameter of the C-SVM. Clearly, the values of C and γ can affect the outcome of the SVM. In our analysis, C and γ were selected using a grid-search process.
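A minimal sketch of this grid search with scikit-learn follows; the grid values are illustrative, not the ones used in our experiments.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
# After search.fit(train_X, train_y), the selected values of C and gamma
# are available in search.best_params_.
```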

37.3.5 Artificial Neural Network

The structure and function of biological neural networks were used to design the ANN architecture. An ANN is made up of neurons that are organized in layers, much like neurons in the brain. The feed-forward neural network is a common neural network that has three layers: an input layer that receives external data for pattern recognition, an output layer that produces the solution, and a hidden layer that connects the other layers. Acyclic arcs link neighboring neurons in the input and output layers. The ANN learns datasets using a training algorithm that adjusts the neuron weights based on the error between the target and the actual output; in general, the backpropagation algorithm is used as the training algorithm.

37.3.6 K Nearest Neighbor

K Nearest Neighbors (KNN) is a well-known machine learning algorithm that has been applied to a massive number of data mining projects. The concept is that a large quantity of training data is used, with each data point described by a collection of variables. Every point is conceptually sketched in a high-dimensional space, with each axis corresponding to a separate variable. On receiving a new (test) data point, we want to identify the K nearest neighbors that are the most "similar" to it. The square root of the total count of points in the training data set (N) is commonly used as the value for K (if N is 400, K equals 20).

Fig. 37.1 A general structure for our proposed methodology
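The K = √N heuristic can be applied directly when constructing the classifier, as in this short sketch.

```python
import math
from sklearn.neighbors import KNeighborsClassifier

def make_knn(n_train: int) -> KNeighborsClassifier:
    # K = sqrt(N); e.g. N = 400 gives K = 20
    return KNeighborsClassifier(n_neighbors=max(1, round(math.sqrt(n_train))))
```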

37.4 Proposed Methodology

We first collected data for six different countries (Fig. 37.1). Then we generated a synthesized dataset by including another set of eighty-three technical indicators, c, and the class label. We then normalized the datasets with three different normalization techniques:
1. Min–max
2. Z score
3. Robust

Finally, we check the classification accuracy of the different datasets for the different classifiers, namely KNN, SVM, and ANN.

37.5 Results and Discussion

Data were obtained from Yahoo Finance for six index data sets: BSESN, NIFTY50, NASDAQ, HANG SENG, NIKKEI225, and the SSE composite index. In our article, the task is to predict the label of the daily stock value record as "1" or "−1", depicting an increase or a decrease in the closing price. Figures 37.2, 37.3, 37.4, 37.5, 37.6, and 37.7 show the closing price trends for the collected datasets.


Fig. 37.2 Closing price trend for BSESN

Fig. 37.3 Closing price trend for SSE

Fig. 37.4 Closing price trend for NASDAQ

Fig. 37.5 Closing price trend for NIFTY50

In addition to the opening price, the closing price, the lowest price, the highest price, and the total trading volume, 83 suitable technical indicators were considered as the initial feature pool. According to the researchers in [26, 35], technical indicators are a feasible means of representing the true market condition in financial time-series prediction and can be more informative than the raw values alone [26]. Thus, eighty-three technical indicators that are widely used all over the world were generated using Python packages; some of the technical indicators used in our analysis are given in Table 37.3. Based on the literature review, seventy percent of the data (closing prices) are used as training data, and the remaining thirty percent of the data points are used as test material. In order to improve the model's forecasting efficiency, we developed an amalgamated dataset, which needs to be normalized so that the prediction results are good. We evaluated three different standardization approaches for each of the six datasets. Throughout our analysis, the normalization strategies taken into consideration are:
1. Min–max
2. Z score
3. Robust


Fig. 37.6 Closing price trend for NIKKEI225

Fig. 37.7 Closing price trend for HANGSENG

We systematically track the forecasting efficiency and the effect of the normalization techniques Z score, Robust, and Min–max in combination with SVM, ANN, and KNN, using the same training and testing data sets for BSESN, NIFTY50, NASDAQ, HANG SENG, NIKKEI225, and the SSE composite index, respectively. The assessment of the models was carried out using different criteria, namely accuracy, F1 score, precision, and recall, for which the formulas are given below:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (37.1)$$

$$\mathrm{F1\ score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (37.2)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (37.3)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (37.4)$$

The efficiency results of KNN using Z score, Robust, and Min–max for the prediction of the two class names, namely up or down, for BSESN, NIFTY50, NASDAQ, HANG SENG, NIKKEI225, and the SSE composite index are listed in Table 37.4. Tables 37.4, 37.5, and 37.6 show the value of each performance measurement criterion using the three different normalization techniques for KNN, SVM, and ANN, respectively, for BSESN, NIFTY50, NASDAQ, HANG SENG, NIKKEI225, and the SSE composite index. The predictive output of SVM, ANN, and KNN differs when specific input data normalization methods are enforced for a dataset. Additionally, we also observe from Tables 37.4, 37.5, and 37.6 that the classification accuracy of ANN is better compared to SVM and KNN. Figures 37.8, 37.9, and 37.10 demonstrate the findings achieved for the various methods in Tables 37.4, 37.5, and 37.6.

Table 37.4 Accuracy results for different normalization techniques using KNN

Dataset | Normalization technique | Accuracy | Precision | Recall | F1-score
BSESN | MM | 0.66 | 0.68 | 0.65 | 0.65
BSESN | ZS | 0.69 | 0.72 | 0.65 | 0.67
BSESN | RS | 0.70 | 0.82 | 0.65 | 0.71
NASDAQ | MM | 0.69 | 0.72 | 0.65 | 0.65
NASDAQ | ZS | 0.69 | 0.72 | 0.65 | 0.65
NASDAQ | RS | 0.71 | 0.76 | 0.70 | 0.71
NIFTY50 | MM | 0.69 | 0.72 | 0.65 | 0.65
NIFTY50 | ZS | 0.71 | 0.76 | 0.70 | 0.71
NIFTY50 | RS | 0.70 | 0.74 | 0.61 | 0.71
NIKKEI25 | MM | 0.71 | 0.76 | 0.70 | 0.71
NIKKEI25 | ZS | 0.71 | 0.76 | 0.70 | 0.71
NIKKEI25 | RS | 0.70 | 0.74 | 0.68 | 0.69
HANGSENG | MM | 0.71 | 0.76 | 0.70 | 0.71
HANGSENG | ZS | 0.70 | 0.69 | 0.48 | 0.58
HANGSENG | RS | 0.68 | 0.70 | 0.67 | 0.69
SSE composite index | MM | 0.70 | 0.74 | 0.61 | 0.67
SSE composite index | ZS | 0.69 | 0.72 | 0.65 | 0.65
SSE composite index | RS | 0.68 | 0.70 | 0.67 | 0.69


Table 37.5 Accuracy results for different normalization techniques using SVM

Dataset | Normalization technique | Accuracy | Precision | Recall | F1-score
BSESN | MM | 0.61 | 0.62 | 0.59 | 0.61
BSESN | ZS | 0.66 | 0.68 | 0.65 | 0.65
BSESN | RS | 0.59 | 0.61 | 0.60 | 0.58
NASDAQ | MM | 0.59 | 0.61 | 0.60 | 0.58
NASDAQ | ZS | 0.62 | 0.58 | 0.65 | 0.61
NASDAQ | RS | 0.63 | 0.68 | 0.50 | 0.59
NIFTY50 | MM | 0.62 | 0.68 | 0.65 | 0.58
NIFTY50 | ZS | 0.65 | 0.52 | 0.59 | 0.60
NIFTY50 | RS | 0.63 | 0.68 | 0.60 | 0.61
NIKKEI25 | MM | 0.59 | 0.61 | 0.60 | 0.58
NIKKEI25 | ZS | 0.64 | 0.58 | 0.60 | 0.61
NIKKEI25 | RS | 0.64 | 0.60 | 0.51 | 0.58
HANGSENG | MM | 0.58 | 0.61 | 0.56 | 0.56
HANGSENG | ZS | 0.63 | 0.68 | 0.60 | 0.61
HANGSENG | RS | 0.63 | 0.58 | 0.50 | 0.59
SSE composite index | MM | 0.58 | 0.61 | 0.56 | 0.56
SSE composite index | ZS | 0.62 | 0.51 | 0.50 | 0.55
SSE composite index | RS | 0.61 | 0.60 | 0.64 | 0.58

From Fig. 37.8, we can observe that the RS method performs better for two datasets (BSESN, NASDAQ), the ZS method is better for one dataset (NIFTY50), the ZS and RS methods perform similarly for one dataset (NIKKEI25), and for the remaining two datasets (HANGSENG, SSE composite index) the MM method performs better. From Fig. 37.9, we can observe that the ZS method performs better for three datasets (BSESN, NIFTY50, SSE composite index), the RS method is better for one dataset (NASDAQ), and the ZS and RS methods perform similarly for two datasets (NIKKEI25, HANGSENG). From Fig. 37.10, we can observe that the ZS method performs better for four datasets (NASDAQ, NIKKEI25, HANGSENG, SSE composite index) and the RS method is better for two datasets (BSESN, NIFTY50); the ZS and MM methods perform almost similarly for one dataset (BSESN). Thus, it can be deduced that the precision of the interpretation depends on the normalization method used for the input data, along with further criteria such as parameter tuning. Variables with differing ranges or differing precisions carry different weights, and they can impact the ultimate result. Therefore, the final output of using the same normalization technique on various types of data sets together with


Table 37.6 Accuracy results for different normalization techniques using ANN

Dataset | Normalization technique | Accuracy | Precision | Recall | F1-score
BSESN | MM | 0.69 | 0.72 | 0.65 | 0.65
BSESN | ZS | 0.70 | 0.68 | 0.61 | 0.71
BSESN | RS | 0.95 | 0.82 | 0.74 | 0.79
NASDAQ | MM | 0.69 | 0.72 | 0.65 | 0.65
NASDAQ | ZS | 0.92 | 0.90 | 0.89 | 0.85
NASDAQ | RS | 0.91 | 0.88 | 0.90 | 0.83
NIFTY50 | MM | 0.71 | 0.76 | 0.69 | 0.70
NIFTY50 | ZS | 0.91 | 0.85 | 0.71 | 0.82
NIFTY50 | RS | 0.93 | 0.89 | 0.78 | 0.79
NIKKEI25 | MM | 0.67 | 0.71 | 0.65 | 0.67
NIKKEI25 | ZS | 0.93 | 0.90 | 0.88 | 0.82
NIKKEI25 | RS | 0.89 | 0.86 | 0.78 | 0.81
HANGSENG | MM | 0.68 | 0.70 | 0.67 | 0.69
HANGSENG | ZS | 0.96 | 0.83 | 0.85 | 0.80
HANGSENG | RS | 0.93 | 0.81 | 0.78 | 0.78
SSE composite index | MM | 0.64 | 0.68 | 0.60 | 0.61
SSE composite index | ZS | 0.96 | 0.83 | 0.75 | 0.84
SSE composite index | RS | 0.95 | 0.82 | 0.74 | 0.83

Fig. 37.8 Accuracy results for different normalization techniques using KNN


Fig. 37.9 Accuracy results for different normalization techniques using SVM

Fig. 37.10 Accuracy results for different normalization techniques using ANN

the same data mining methodology could be diverse. Likewise, applying various forms of normalization strategies to an individual data set can often produce varying results due to the properties of the underlying data set. Figures 37.8, 37.9, and 37.10 show the graphical representation of the accuracy results of the different normalization techniques for the prediction of the two class labels, that is, up or down, for the data sets with KNN, SVM, and ANN. As seen in Tables 37.4, 37.5, and 37.6, it can be argued that the use of various types of standardization methods on an individual data set may produce varying results due to the properties of the underlying data set. We may therefore conclude that the efficiency of the prediction depends on the normalization method used for the input data, together with further criteria.


Our study shows that applying the same normalization technique to various datasets can provide divergent performance rates. The outcomes of the prediction error assessment are therefore different when the dataset varies.

37.6 Conclusion and Future Work

The normalization method utilized for the input data significantly impacts the accuracy of machine learning processes. The selection of standardization techniques ought to depend on the features being predicted and on the principle of loss minimization, and each application of a classifier additionally requires choices regarding parameter tuning and related settings. Our study has shown that the choice of input data normalization technique considerably influences the accuracy results given by the classifiers, and that when different types of normalization methods are applied to the same data set using the same machine learning method, the outcomes may vary. In our study, we have considered six stock indices from various countries around the globe, namely India, China, Japan, Hong Kong, and the United States of America. For future work, we suggest exploring other classifiers for their accuracy results and behavior.

Declaration of Competing Interest None to declare.

Funding No funding.

Ethics Authors confirm that this manuscript has not been published elsewhere and that no ethical issues are involved.

References

1. Abu-Mostafa, Y.S., Atiya, A.F.: Introduction to financial forecasting. Appl. Intell. 6(3), 205–213 (1996)
2. Akel, V., Bayramoglu, M.F.: Financial Predicting with Artificial Neural Networks in Crisis Periods: The Case of ISE 100 Index. Balikesir (2008)
3. Al-Qaheri, H., Hassanien, A.E., Abraham, A.: Discovering stock price prediction rules using rough sets. Neural Netw. World J. 18–181 (2008)
4. Altay, E., Satman, M.H.: Stock market forecasting: artificial neural network and linear regression comparison in an emerging market. J. Financ. Manag. Anal. 18(2), 18 (2005)
5. Barak, S., Arjmand, A., Ortobelli, S.: Fusion of multiple diverse predictors in stock market. Inf. Fusion 36, 90–102 (2017)
6. Cao, L.J., Tay, F.H.: Support vector machine with adaptive parameters in financial time series forecasting. IEEE Trans. Neural Netw. 14(6), 1506–1518 (2003)
7. Chen, Y., Hao, Y.: A feature weighted support vector machine and K-nearest neighbor algorithm for stock market indices prediction. Expert Syst. Appl. 80, 340–355 (2017)
8. Chen, W.H., Shih, J.Y., Wu, S.: Comparison of support-vector machines and back propagation neural networks in forecasting the six major Asian stock markets. Int. J. Electron. Finance 1(1), 49 (2006)
9. de Faria, E.L., Albuquerque, M.P., Gonzalez, J.L., Cavalcante, J.T.P., Albuquerque, M.P.: Predicting the Brazilian stock market through neural networks and adaptive exponential smoothing methods. Expert Syst. Appl. 36(10), 12506–12509 (2009)
10. Fernandez-Lozano, C., Canto, C., Gestal, M., Andrade-Garda, J.M., Rabuñal, J.R., Dorado, J., et al.: Hybrid model based on Genetic Algorithms and SVM applied to variable selection within fruit juice classification. Sci. World J. 2013, 982438 (2013)
11. Garcia, L.P.F., de Carvalho, A.C.P.L.F., Lorena, A.C.: Effect of label noise in the complexity of classification problems. Neurocomputing 160, 108–119 (2015)
12. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf. Sci. (Ny) 180(10), 2044–2064 (2010)
13. Han, H., Men, K.: How does normalization impact RNA-seq disease diagnosis? J. Biomed. Inform. 85, 80–92 (2018)
14. Hsu, M.-W., Lessmann, S., Sung, M.-C., Ma, T., Johnson, J.E.V.: Bridging the divide in financial market forecasting: machine learners vs. financial economists. Expert Syst. Appl. 61, 215–234 (2016)
15. Huang, Z., Chen, H., Hsu, C.-J., Chen, W.-H., Wu, S.: Credit rating analysis with support vector machines and neural networks: a market comparative study. Decis. Support Syst. 37(4), 543–558 (2004)
16. Huang, W., Nakamori, Y., Wang, S.-Y.: Forecasting stock market movement direction with support vector machine. Comput. Oper. Res. 32(10), 2513–2522 (2005)
17. Jain, S., Shukla, S., Wadhvani, R.: Dynamic selection of normalization techniques using data complexity measures. Expert Syst. Appl. 106, 252–262 (2018)
18. Kaastra, I., Boyd, M.: Designing a neural network for forecasting financial and economic time series. Neurocomputing 10(3), 215–236 (1996)
19. Khan, Z.H., Alin, T.S., Hussain, M.A.: Price prediction of share market using artificial neural network (ANN). Int. J. Comput. Appl. 22(2), 42–47 (2011)
20. Kim, K.-J.: Financial time series forecasting using support vector machines. Neurocomputing 55(1–2), 307–319 (2003)
21. Kumari, B., Swarnkar, T.: Importance of data standardization methods on stock indices prediction accuracy. In: Advances in Intelligent Systems and Computing, pp. 309–318. Springer, Singapore (2020)
22. Leigh, W., Modani, N., Hightower, R.: A computational implementation of stock charting: abrupt volume increase as signal for movement in New York Stock Exchange Composite Index. Decis. Support Syst. 37(4), 515–530 (2004)
23. Majhi, B., Rout, M., Baghel, V.: On the development and performance evaluation of a multiobjective GA-based RBF adaptive model for the prediction of stock indices. J. King Saud Univ. Comput. Inf. Sci. 26(3), 319–331 (2014)
24. Mitra, S.K.: Optimal combination of trading rules using neural networks. Int. Bus. Res. 2(1) (2009)
25. Neely, C.J., Rapach, D.E., Tu, J., Zhou, G.: Forecasting the equity risk premium: the role of technical indicators. Manag. Sci. 60(7), 1772–1791 (2014)
26. Nikfarjam, A., Emadzadeh, E., Muthaiyah, S.: Text Mining Approaches for Stock Market Prediction, pp. 256–260. IEEE (2010)
27. Racine, J.: On the nonlinear predictability of stock returns using financial and economic variables. J. Bus. Econ. Stat. 19(3), 380–382 (2001)
28. Sáez, J.A., Galar, M., Luengo, J., Herrera, F.: Tackling the problem of classification with noisy data using Multiple Classifier Systems: analysis of the performance and robustness. Inf. Sci. (Ny) 247, 1–20 (2013)
29. Sahin, U., Ozbayoglu, A.M.: TN-RSI: Trend-normalized RSI indicator for stock trading systems with evolutionary computation. Procedia Comput. Sci. 36, 240–245 (2014)
30. Senol, D., Ozturan, M.: Stock price direction prediction using artificial neural network approach: the case of Turkey. J. Artif. Intell. 1(2), 70–77 (2008)
31. Tay, F.E.H., Cao, L.: Application of support vector machines in financial time series forecasting. Omega 29(4), 309–317 (2001)
32. Vanstone, B., Finnie, G.: Combining Technical Analysis and Neural Networks in the Australian Stockmarket. Bond University ePublications@bond. Information Technology Papers (2006)
33. Vanstone, B., Finnie, G.: An empirical methodology for developing stockmarket trading systems using artificial neural networks. Expert Syst. Appl. 36(3), 6668–6680 (2009)
34. Xie, B., Passonneau, R., Wu, L., Creamer, G.G.: Semantic Frames to Predict Stock Price Movement, pp. 873–883 (2013)
35. Yeh, T.-L.: Capital structure and cost efficiency in the Taiwanese banking industry. Serv. Ind. J. 31(2), 237–249 (2011)

Chapter 38

Implementation of Data Warehouse: An Improved Data-Driven Decision-Making Approach

L. S. Abinash Nayak, Kaberi Das, Srilekha Hota, Bharat Jyoti Ranjan Sahu, and Debahuti Mishra

Abstract IT industries and organizations are growing continuously, and Data Warehousing Technology plays an important role in this growth. A Data Warehouse has the characteristics of being subject-oriented, integrated, time variant, and non-volatile, which supports the management in taking necessary decisions. Data warehousing can be defined as a data hub or repository where a large amount of meaningful data is stored. It gives a clear visualization for the analysis of this data and helps in taking decisions for business growth. A data warehouse stores historical information and makes it simple for business analysts to analyze; the technology enables analysts to extract and view business data from different sources. Here a discussion has been made about the implementation of Data Warehouse technology. This paper presents a framework for how to maintain data in a data warehouse while collecting data from different data sources and making it usable inside the Data Warehouse.

38.1 Motivation of the Work

The motivation behind using DWH technology is to develop a well-structured and improved decision support system by drawing information from millions of rows. This technology can improve data integrity and quality, suits the organizational environment efficiently for the growth of the organization, and gives us clarity about what is involved in the ongoing administration of an effective business intelligence system.

38.2 Introduction

A Data Warehouse (DWH) satisfies the properties of being subject-oriented, integrated, time variant, and non-volatile [1]. Subject-oriented means it allows users to directly access the database by subject. A DWH collects data from different sources, but the data available in the DWH is in an integrated form. The time-variant property means that the data in the DWH is extensive in nature and maintained historically. The non-volatile property means that we are able to access new data alongside the existing data in the DWH. It gives the management, CEO, business heads, and analysts an effective visualization for taking smart and effective decisions for the organization. A DWH is an environment designed to be used by experts, who write queries to get the desired results and analyze the data from different perspectives [2]. The importance of the available data can be measured through complex graphical presentations built on fact and dimension tables: facts are the parameters that are studied from various points of view to take necessary business decisions, and dimensions are the background against which the facts are described. As the size of an online analytical processing (OLAP) server is very large, it is used for data warehousing purposes, storing historical data over the long term. As the amount of data grows rapidly in the real world and in many organizations, we need to handle that data by rearranging it in a Data Warehouse, as a result of which we can retrieve efficient data from otherwise unorganized sources. To implement a Data Warehouse for any small or big organization, basically two types of approaches are followed: the Top-down approach, otherwise known as Inmon's approach, and the Bottom-up approach, otherwise known as Kimball's approach.

38.3 Literature Review

Many researchers and authors have used concepts such as integrating different types of data, and many authors have extracted important information from unorganized raw data. Different approaches and procedures have been adopted to handle source data available in different forms at various locations. Rieger et al. [2] introduced a meta-database that integrates quantitative and qualitative information using associated management functionalities. The authors Bleyberg and Ganesh [3] proposed building a text warehouse using the concept of a dynamic multidimensional model, where the index table of a snowflake schema is the origin of the model and information is extracted from the warehouse according to the syntactic groups committed to the data. Zamfir et al. [4] discussed a multidimensional warehouse structure where queries are written to obtain output based on the documents in a Data Warehouse. Tseng et al. [5] described three procedures for integrating structured and unstructured data inside a data warehouse to make an efficient business system; data warehousing enhances the quality of decision making by integrating data from various sources. Baars and Kemper [6] focused on the different methods of integrating structured and unstructured data for analysis by business analysts, discussing their efficiency and challenges. According to Alqarni et al. [7], structured and unstructured data are depicted in a multi-layer schema; to integrate various types of unstructured documents, linguistic matching and WordNet are used to identify the similarity between the data in both sources. El Moukhi et al. [8] discussed the state of the art and the future challenges of data warehousing technologies.

38.4 State of the Art

After studying the work of many authors, we have found that in Inmon's approach, i.e., the Top-Down approach, data is first collected from various sources and loaded into the centralized Data Warehouse system through a data staging area, where it is cleansed, de-duplicated, and transformed into a common format. Different Data Marts are then organized in such a way that they take data directly from the Data Warehouse and display it to the users. This approach is robust for building an effective Data Warehouse for long-term purposes where the business requirement is large. Secondly, in Kimball's approach, i.e., the Bottom-Up approach, when data is extracted from the source areas into the data staging area, Data Marts are first built and then combined together to create the broader Data Warehouse; data flows from the data staging area into the Data Warehouse through the Data Marts. This approach is followed when the business requirements are small and the business is for short-term purposes.

To create a successful Data Warehouse environment, an ETL (extract, transform, load) tool and process is a prerequisite. Gathering scattered data from the real-time environment is not possible without an ETL tool, so an ETL tool is used to collect the data from the sources and maintain the repository. An ETL tool is very much needed for an organization to consolidate data from heterogeneous sources into an effective data warehousing environment: ETL tools gather data, interpret it, and integrate large volumes of raw data from multiple source platforms. All the collected data is loaded into a single database for easy access; the data is then processed and converted into a significant form by joining, reformatting, filtering, merging, and aggregation. Finally, we can visualize the data in a graphical representation that is easy to understand.

38.5 Experimental Analysis For a successful implementation of Data Warehouse in our project, we have followed Top-Down approach where first we have loaded the data into Data Warehouse and then we have extracted the data from the Data Warehouse to the different Data Marts according to the requirements or subjects that is nothing but subject-oriented data. Then we retrieved the data that is available in Data Marts for our respective subjects. We have considered the “ORACLE SQL Developer” (Version 20.4.1) server as our OLAP server where we have designed our Data Warehouse and different Data Marts. We have loaded data into the Oracle server and extracted the required data through the Data Marts whenever it is necessary. To understand it practically we can follow the below steps. We have taken screenshots of different steps during the implementation process for better understanding purposes. Step-1 See Fig. 38.1. This is a snapshot of Oracle “OLAP Server”, i.e., nothing but a Data Warehouse server. In this server, we can find a large and huge amount of data is stored and organized for an organization. Step-2 See Fig. 38.2. Here we can see that different Data Marts are designed and organized according to the subject. These Data Marts are directly connected to the Data Warehouse. We can access these data marts with respect to the business requirements. As per the subject

Fig. 38.1 Oracle server


Fig. 38.2 Data marts

As per the subject requirements, we can separately connect the data marts to the Data Warehouse and visualize the data for our analysis purpose.

Step-3 See Fig. 38.3. As shown in the figure, we have connected the last Data Mart directly with the server. Now, whatever data we require from the last data mart

Fig. 38.3 Data mart connectivity with data warehouse


Fig. 38.4 Writing query to retrieve data from the data warehouse through the data mart

according to the subject, we can extract by writing a query in the worksheet. We can find the subject-oriented data from that particular data mart of the Data Warehouse without disturbing the data in the other data marts, so it satisfies the subject-oriented property of a Data Warehouse.

Step-4 See Fig. 38.4. In the above image, we can see that for a particular Data Mart we have written the query to get the result from the Data Warehouse through that particular Data Mart only. The output is shown in the next step.

Step-5 See Fig. 38.5. In the above image, we have extracted the desired data through the query from the Data Warehouse. We have observed that the "Actual Output" matches the "Expected Output". So, this is how we have successfully implemented a Data Warehouse intended for long-term use in the organization.
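The same retrieval can also be scripted outside the SQL Developer worksheet. Below is a minimal Python sketch of Steps 4-5, assuming a hypothetical SALES_MART data mart with a SALES_FACT table; the connection details and object names are placeholders, not the actual schema used in our implementation.

```python
# Querying one data mart of the warehouse programmatically (illustrative).
import oracledb  # python-oracledb, the successor of cx_Oracle

# Connect to the OLAP server (Oracle) -- credentials are placeholders.
conn = oracledb.connect(user="dw_user", password="dw_pass",
                        dsn="localhost/XEPDB1")

# Retrieve subject-oriented data through a single data mart only,
# leaving the other marts undisturbed.
sql = """
    SELECT region, SUM(amount) AS total_sales
    FROM   sales_mart.sales_fact
    GROUP  BY region
"""
with conn.cursor() as cur:
    for region, total_sales in cur.execute(sql):
        print(region, total_sales)
conn.close()
```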

38.6 Conclusion and Future Scope

Data warehousing is one of the key technologies for any organization or enterprise to handle analytical areas such as CRM, ERP, supply chain, products, and customers.


Fig. 38.5 Data extraction from data warehouse

According to some authors, data has grown so extensively that it must be treated as an important part of the industrial decision-making process, and it needs to be incorporated into the warehouse, and the structures derived from it, for information retrieval, knowledge discovery, and business intelligence. In this paper, we have presented a framework of the different approaches proposed by various authors to deal with the data warehouse. We have studied the related works done by different authors and implemented a Data Warehouse successfully. In the future, we will try to develop a framework for manipulating data to get better results from the ongoing research works. The future will bring real-time data warehouse updates with the ability to give organizations a view of ongoing projects and to take action either manually or through a condition triggered by the data in the warehouse. This will increase the productivity of the decisions made by decision-makers or analysts by designing a warehouse of consistent, subject-oriented, and historical data. It combines data from different inconsistent sources into a compatible view of the organization, which allows business analysts to perform more substantive, accurate, and consistent analysis.


References

1. Kimball, R., Ross, M., Thornthwaite, W., Mundy, J., Becker, B.: The Data Warehouse Lifecycle Toolkit, 2nd edn. Wiley (2008). ISBN: 978-0-470-14977-5
2. Rieger, B., Kleber, A., von Maur, E.: Metadata-based integration of qualitative and quantitative information resources approaching knowledge management. In: ECIS (2000)
3. Bleyberg, M.Z., Ganesh, K.: Dynamic multi-dimensional models for text warehouses. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Nashville, Tennessee (2000)
4. Maria Zamfir, B., Paranjape, P.S.: A content delivery strategy for text warehouses. In: IEEE International Conference on Systems, Man and Cybernetics, vol. 4, pp. 2322–2325 (2001)
5. Tseng, F.S.C., Chou, A.Y.H.: The concept of document warehousing for multidimensional modeling of textual-based business intelligence. Decis. Support Syst. 42, 727–744 (2005)
6. Baars, H., Kemper, H.-G.: Management support with structured and unstructured data—an integrated business intelligence framework. Inf. Syst. Manag. 25(2), 132–148 (2008)
7. Alqarni, A.A., Pardede, E.: Integration of data warehouse and unstructured business documents. In: 15th International Conference on Network-Based Information Systems (2012)
8. Garani, G., Helmer, S.: Integrating star and snowflake schemas in data warehouses. Int. J. Data Wareh. Min. 8(4), 22–40 (2012)
9. Ishikawa, H., Ohta, M., Kato, K.: Document warehousing: a document-intensive application of a multimedia database. In: Eleventh International Workshop on Research Issues in Data Engineering, pp. 25–31 (2001)
10. Perez, J.M., Berlanga, R., Aramburu, M.J., Pedersen, T.B.: A relevance-extended multidimensional model for a data warehouse contextualized with documents. In: Proceedings of the 8th ACM International Workshop on Data Warehousing and OLAP, pp. 19–28 (2005)
11. Ravat, F., Teste, O., Tournier, R.: OLAP aggregation function for textual data warehouse. In: Proceedings of the International Conference on Enterprise Information Systems (ICEIS 2007), DISI, INSTICC Press, pp. 151–156 (2007)
12. Lin, C.X., Ding, B., Han, J., Zhu, F., Zhao, B.: Text cube: computing IR measures for multidimensional text database analysis. In: Eighth IEEE International Conference on Data Mining, ICDM '08 (2008)
13. Zhang, D., Zhai, C., Han, J.: Topic cube: topic modeling for OLAP on multidimensional text databases. In: Proceedings of the 9th SIAM International Conference on Data Mining, pp. 1123–1134 (2009)
14. Prasad, K.S.N., Ramakrishna, S.: Text analytics to data warehousing. (IJCSE) Int. J. Comput. Sci. Eng. 02(06), 2201–2207 (2010)
15. Thenmozhi, M., Vivekanandan, K.: An ontology based hybrid approach to derive multidimensional schema for data warehouse. Int. J. Comput. Appl. (0975-8887) 54(8) (2012)
16. Sharma, Y., Nasri, R., Askand, K.: Building a data warehousing infrastructure based on service oriented architecture. In: 2012 International Conference on Cloud Computing Technologies, Applications and Management (ICCCTAM), Dubai, pp. 82–87 (2012)
17. Kassem, G., Turowski, K.: Matching of business data in a generic business process warehousing. In: International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, pp. 284–289 (2018)
18. Sreemathy, J., Joseph, V.I., Nisha, S., Prabha, I.C., Priya, R.M.G.: Data integration in ETL using TALEND. In: 6th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, pp. 1444–1448 (2020)
19. El Moukhi, N., El Azami, I., Mouloudi, A.: Data warehouse state of the art and future challenges. In: International Conference on Cloud Technologies and Applications (CloudTech), Marrakech, pp. 1–6 (2015)
20. Diouf, P.S., Boly, A., Ndiaye, S.: Variety of data in the ETL processes in the cloud: migration and validation: state of the art. In: IEEE International Conference on Innovative Research and Development (ICIRD), Bangkok, pp. 1–5 (2018)


21. Mandal, S., Maji, G.: Integrating telecom CDR and customer data from different operational databases and data warehouses into a central data warehouse for business analysis. Int. J. Eng. Tech. Res. 5, 516–523 (2016)

Chapter 39

An Empirical Comparison of TOPSIS and VIKOR for Ranking Decision-Making Models

Sidharth Samal and Rajashree Dash

Abstract Multi-criteria decision making (MCDM) comes under operational research, whose sole purpose is to compare, rank, and select among several alternatives evaluated on multiple and conflicting criteria. Several MCDM methods have been used by decision-makers to choose one alternative among several decision-making models that meets their goals, objectives, desires, and values. Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) and VlseKriterijumska Optimizacija I Kompromisno Resenje (VIKOR) are two prevailing MCDM methods built upon an aggregating function that represents proximity to the optimal solution. TOPSIS uses vector normalization and VIKOR uses linear scale normalization to transform all criteria into a uniform range. The VIKOR method provides a compromise solution based on maximum group utility and minimum individual regret, whereas TOPSIS elects a solution that is farthest from the negative ideal solution and closest to the ideal solution. In this paper, a correlative inquiry of these two methods is demonstrated with an analytical case study, bringing out their similar and distinct features.

39.1 Introduction

MCDM is a complex process in which various alternatives are evaluated against different criteria to find the best among all the alternatives. Multi-criteria ranking of alternatives helps the decision-maker in formulating managerial decisions: the MCDM process ranks the alternatives with respect to distinct criteria so that the most favorable one can be chosen. MCDM methods can be used in various real-life conflict management scenarios. Apart from selecting the best alternative, the decision-maker also has to express his/her preferences in terms of the various criteria, i.e., which criteria matter the most and which have the least significance in the decision. To capture this, weights can be assigned to each individual criterion by the decision-maker.


MCDM techniques are handy tools for evaluating and selecting among multiple alternatives based on distinct criteria. A wide range of MCDM techniques have been proposed by researchers. As these techniques vary on a procedural basis, they may produce different ranking lists for the same problem. In the literature, many researchers have used MCDM techniques like TOPSIS [1–9], VIKOR [10–17], PROMETHEE [18], GRA [18], ELECTRE [18], etc. for solving decision-making problems such as supplier selection [5, 13], classifier ranking and selection [8, 9], and design and manufacturing [11, 12]. In [8], the authors implemented TOPSIS for ranking heterogeneous classifiers used for stock index price movement prediction. In [9], the authors employed TOPSIS for ranking different classifier models and selecting the best ones as base classifiers in an ensemble model for predicting the movement of stock index prices. In this study, two prominent MCDM methods, i.e., TOPSIS and VIKOR, are compared, the key focus being the normalization scheme, the aggregation function, and the ranking system. The procedures of the two MCDM methods are also compared, and a comparative investigation of the techniques is presented with a mathematical case study. The rest of this article is organized in the following manner. Section 39.2 gives an overview of the MCDM process and the TOPSIS and VIKOR methods. Section 39.3 provides a numerical analysis of the two methods. In the end, concluding remarks are presented in Sect. 39.4.

39.2 MCDM Process

The MCDM process involves various steps, beginning with formulating the first alternative solution and ending with selecting the optimal alternative. In this paper, we have focused on two of the most popular and widely applicable MCDM techniques, TOPSIS and VIKOR. The generalized working principle of these MCDM techniques is presented in Fig. 39.1. The first step is to generate various alternative solutions for a proposed problem; any of these alternatives could be selected as the most feasible solution. In the next step, multiple evaluation criteria are determined and used to evaluate the proposed alternatives. One or more MCDM techniques can then be utilized to evaluate all the alternatives with respect to the multiple criteria. MCDM processes may vary between techniques, but normalization and criteria weight assignment remain almost identical in TOPSIS and VIKOR. As the ranges of criteria values may vary significantly from one another, a normalization scheme is adopted to make the MCDM process more useful and effortless, and all values are mapped into a uniform range (mostly between 0 and 1). The decision-maker has to assign weights to all the evaluation criteria, as these play a significant role in the MCDM process: the weights determine the importance or precedence of the criteria and, in general, they sum to 1. Finally, TOPSIS and VIKOR provide a ranking-based system in which all the alternatives are assessed and ordered based on the multiple criteria. TOPSIS ranks the alternatives, whereas VIKOR ranks them and also provides a compromise solution. The final ranking assists the decision-maker in electing the most advantageous alternative amidst all the alternatives.

Fig. 39.1 MCDM process overview

39.2.1 A Brief Overview of TOPSIS

TOPSIS, one of the most prevailing MCDM methods, was proposed by Ching-Lai Hwang and Yoon in 1981 [1]. It attempts to pick the alternative that is simultaneously farthest from the negative ideal solution and closest to the positive ideal solution. The negative ideal solution maximizes the cost criteria and minimizes the benefit criteria, whereas the positive ideal solution maximizes the benefit criteria and minimizes the cost criteria. The fundamental steps of the TOPSIS ranking system are represented in Fig. 39.2. Initially, in step 1, a decision matrix is formed by evaluating K alternatives over N criteria, and it is normalized. In step 2, a weighted standardized decision matrix is created. In step 3, the negative ideal solution and the positive ideal solution are calculated. This is followed by calculating the separation measures from the negative and positive ideal solutions in step 4. Finally, the relative closeness index is measured in step 5. The ranking list of alternatives is found by ordering the relative closeness values in a descending sequence. The TOPSIS method has been widely applied to real-world decision-making problems across diverse fields: business and marketing [2, 3], supply chain management [4, 5], health care [6, 7], and classifier algorithm selection [8, 9] are a few of the crucial areas of application where researchers have implemented TOPSIS.

Fig. 39.2 TOPSIS flow chart: decision matrix → standardized decision matrix → weighted standardized decision matrix (using the criteria weights) → ideal and negative ideal solutions → separation measures from the ideal and negative ideal solutions → relative closeness to the ideal solution → rank
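The five steps above reduce to a few lines of linear algebra. The following is a minimal Python sketch of TOPSIS, assuming all criteria are benefit criteria (as in the case study of Sect. 39.3) and that the weights are supplied by the decision-maker; the function name is ours.

```python
import numpy as np

def topsis(matrix, weights):
    """Rank alternatives: vector-normalize, weight, then measure separation
    from the ideal and negative ideal solutions (benefit criteria assumed)."""
    X = np.asarray(matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Steps 1-2: vector normalization, then weighting
    V = w * (X / np.linalg.norm(X, axis=0))
    # Step 3: ideal and negative ideal solutions
    ideal, neg_ideal = V.max(axis=0), V.min(axis=0)
    # Step 4: Euclidean separation measures SM* and SM-
    s_plus = np.linalg.norm(V - ideal, axis=1)
    s_minus = np.linalg.norm(V - neg_ideal, axis=1)
    # Step 5: relative closeness R*; higher is better
    closeness = s_minus / (s_plus + s_minus)
    rank = closeness.argsort()[::-1].argsort() + 1  # 1 = best
    return closeness, rank
```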

39.2.2 A Brief Overview of VIKOR

In 1998, S. Opricovic proposed the VIKOR method for multi-criteria optimization of complex systems, where the aim is to rank alternatives and determine the optimal one from a set evaluated on multiple conflicting criteria. It provides a ranking list based on the relative closeness to the ideal solution [10]. Apart from ranking, the method also provides a list of compromise solutions. VIKOR-based compromise ranking has four significant steps, in which K alternatives are evaluated based on N criteria. In steps 1 and 2, the utility and regret measures of the alternatives are estimated with respect to the individual criteria, yielding the maximum group utility and the minimum individual regret. In steps 3 and 4, the majority support is calculated, which produces a ranking as well as a compromise solution list. The analytical procedure is represented in Fig. 39.3. VIKOR has been applied in various fields of research and decision-making systems: design and manufacturing [11, 12], supply chain management [13, 14], business and marketing management [15, 16], and risk management [17] are a few of its distinct areas of application.

Fig. 39.3 VIKOR flowchart
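The compromise ranking can be sketched the same way. Below is a minimal Python version of VIKOR under the same benefit-criteria assumption, with the conventional weight v = 0.5 on the "majority rule" strategy.

```python
import numpy as np

def vikor(matrix, weights, v=0.5):
    """Compute utility S (MGU), regret R (MIR) and the compromise index Q
    (VR); benefit criteria assumed, v balances group utility vs regret."""
    X = np.asarray(matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    f_best, f_worst = X.max(axis=0), X.min(axis=0)
    # Steps 1-2: linear-scale normalized distances, utility and regret
    D = w * (f_best - X) / (f_best - f_worst)
    S, R = D.sum(axis=1), D.max(axis=1)
    # Steps 3-4: majority support Q; lower is better
    Q = v * (S - S.min()) / (S.max() - S.min()) \
        + (1 - v) * (R - R.min()) / (R.max() - R.min())
    rank = Q.argsort().argsort() + 1  # 1 = best (closest to 0)
    return S, R, Q, rank
```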


39.2.3 Comparative Study of TOPSIS and VIKOR

TOPSIS and VIKOR share many similarities on a procedural basis, although their mathematical formulas differ. The key characteristics of TOPSIS and VIKOR are compiled here to interpret the discrepancies between the two MCDM methods.

• Normalization: Initially, regularization of the decision matrix is performed by both TOPSIS and VIKOR; however, the normalization schemes differ. TOPSIS uses linear and vector scale normalization, whereas VIKOR uses only linear normalization.
• Aggregation function: TOPSIS and VIKOR have distinct aggregation functions. TOPSIS introduces an aggregation function that considers the relative distances to both the ideal and the negative ideal solution to obtain the ranking index. VIKOR uses an aggregation function based on the Lp metric, considering the maximum group utility and minimum individual regret to produce a ranking list and a compromise solution list.
• Ranking: The sole purpose of the two MCDM techniques is to assess the alternatives based on distinct criteria and produce a ranking list. In TOPSIS, the alternative whose ranking index value is closest to 1 is considered optimal, whereas in VIKOR the alternative whose ranking index value is closest to 0 is considered optimal. In general, the TOPSIS ranking index is arranged in descending order, whereas the VIKOR ranking index is arranged in ascending order.

The features of TOPSIS and VIKOR are summarized in Table 39.1.

39.3 Simulation

Let us consider a model selection problem in which we have to evaluate three distinct decision-making models, ALT-1, ALT-2, and ALT-3, against three criteria, CR-1, CR-2, and CR-3, respectively. The experimentation has been conducted on a PC equipped with a 2.20 GHz Intel(R) Core(TM) i7-8750H CPU and 16 GB of RAM using MATLAB 2016a. As we have to evaluate the feasibility of three alternatives based on three criteria, this can be considered an MCDM problem. The individual models and their respective criteria values are represented in Table 39.2, which serves as the decision matrix. It is assumed that all the criteria considered have equal importance and that the weights sum to 1. The two MCDM frameworks, TOPSIS and VIKOR, have been implemented to evaluate the given problem, and the outcomes are displayed in Table 39.3. The values of the separation measures from the "ideal solution" (SM*) and from the "negative ideal solution" (SM−) for TOPSIS, and of the "maximum group utility" (MGU*) and "minimum individual regret" (MIR−) for VIKOR, are represented in Fig. 39.4.

Table 39.1 Features of TOPSIS and VIKOR

Sl. No. | Features | TOPSIS | VIKOR
1 | Developed by | Ching-Lai Hwang and Yoon, 1981 [1] | Serafim Opricovic, 1979 [10]
2 | Procedure | A decision matrix is obtained by evaluating K alternatives in terms of N distinct criteria (common to both methods)
3 | Normalization scheme | Linear scale and vector normalization | Linear scale normalization
4 | Dependency on normalization | Regularized values may depend on the evaluation system | Regularized values do not depend on the evaluation system
5 | Aggregation function | Relative closeness | Based on the Lp metric
6 | Ranking advantages | It only ranks the alternatives with respect to different criteria | It ranks as well as provides a compromise solution
7 | Relative distance from ideal solutions | It considers both the relative distances from the ideal and negative ideal solutions | It applies certain conditions for evaluating the relative distances among the proposed solutions

Table 39.2 Decision matrix

Alternatives | CR-1  | CR-2  | CR-3
ALT-1        | 0.875 | 0.860 | 0.870
ALT-2        | 0.817 | 0.824 | 0.818
ALT-3        | 0.682 | 0.742 | 0.641

Table 39.3 Results after TOPSIS and VIKOR ranking

      | TOPSIS: SM* | SM−   | R*   | Rank | VIKOR: MGU* | MIR− | VR   | Rank
ALT-1 | 0           | 0.079 | 1    | 1    | 0           | 0    | 0    | 1
ALT-2 | 0.02        | 0.05  | 0.72 | 2    | 0.28        | 0.12 | 0.29 | 2
ALT-3 | 0.07        | 0     | 0    | 3    | 1           | 0.4  | 1    | 3

Fig. 39.4 Position of alternatives with respect to SM* and SM− for TOPSIS and MGU* and MIR− for VIKOR

Figure 39.4 shows that for TOPSIS the value for the best alternative should be the highest, whereas for VIKOR it should be the lowest among all alternatives. From an initial observation of the TOPSIS values, which are in descending order, we can see that model ALT-1 is the top-ranked model for the TOPSIS method. Similarly, the VIKOR values, which are in ascending order, give model ALT-1 as the compromise solution for the VIKOR method. As model ALT-1 is the best-ranked alternative in both TOPSIS and VIKOR, it can be considered the best alternative among all three for the given problem. The results obtained from TOPSIS show that ALT-1 is the optimal alternative, being least distant from the ideal solution and most distant from the negative ideal solution, i.e., (SM*, SM−) = (0, 0.079) (Fig. 39.4), and ALT-1 has the highest relative closeness value, i.e., R* = 1, among all three alternatives (Table 39.3). Similarly, the results of VIKOR prescribe that ALT-1 is the best compromise solution, as its maximum group utility and minimum individual regret values, i.e., (MGU*, MIR−) = (0, 0), are the smallest among all three alternatives (Fig. 39.4). Furthermore, ALT-1 is ranked first in the set of compromise solutions, i.e., VR = 0 for ALT-1 (Table 39.3). The above experimental study illustrates the difference between the MCDM methods TOPSIS and VIKOR. Figure 39.4 demonstrates the variation in the positions of the alternatives: the highest-ranked alternative for TOPSIS (ALT-1) remains closer to 1 and the lowest-ranked alternative (ALT-3) remains closer to 0, which is inverted for the VIKOR method, where the best alternative ALT-1 remains closer to 0 and the worst, ALT-3, remains closer to 1. Though the two MCDM methods use different procedures, normalization techniques, and aggregation metrics, their outcome for the considered problem remains the same.
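Feeding the decision matrix of Table 39.2 with equal weights into the two sketches given earlier reproduces the ordering of Table 39.3; the values agree up to rounding and normalization details.

```python
# Applying the topsis() and vikor() sketches above to Table 39.2.
decision_matrix = [[0.875, 0.860, 0.870],
                   [0.817, 0.824, 0.818],
                   [0.682, 0.742, 0.641]]
weights = [1/3, 1/3, 1/3]  # equal criteria importance, summing to 1

closeness, t_rank = topsis(decision_matrix, weights)
S, R, Q, v_rank = vikor(decision_matrix, weights)
print(closeness, t_rank)  # ~[1.00, 0.73, 0.00], ranks [1, 2, 3]
print(Q, v_rank)          # ~[0.00, 0.29, 1.00], ranks [1, 2, 3]
```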

39.4 Conclusion

TOPSIS and VIKOR both take the distance to the ideal solution as their basic set-up. According to the comparative analysis, TOPSIS evaluates alternatives based on their respective distances from the ideal and negative ideal solutions, whereas VIKOR measures the closeness of alternatives to a single point of reference, i.e., the ideal solution. Although the two MCDM techniques use different aggregation functions and normalization schemes, their implications in real-world decision-making problems such as chemical engineering, waste management, classifier ranking, supplier selection, material selection, and so on are similar. According to the literature on MCDM techniques, TOPSIS and VIKOR can be used for the same decision-making problems, and the former is neither superior nor inferior to the latter. Only the mathematical steps of the two methods have been analyzed in this paper, to demonstrate their various similarities and differences.


References

1. Hwang, C.L., Yoon, K.: Methods for multiple attribute decision making. In: Multiple Attribute Decision Making, pp. 58–191 (1981)
2. Aydogan, E.K.: Performance measurement model for Turkish aviation firms using the rough-AHP and TOPSIS methods under fuzzy environment. Expert Syst. Appl. 38(4), 3992–3998 (2011)
3. Zandi, F., Tavana, M.: A fuzzy group quality function deployment model for e-CRM framework assessment in agile manufacturing. Comput. Ind. Eng. 61(1), 1–19 (2011)
4. Kahraman, C., Engin, O., Kabak, Ö., Kaya, İ.: Information systems outsourcing decisions using a group decision-making approach. Eng. Appl. Artif. Intell. 22(6), 832–841 (2009)
5. Chen, C.T., Lin, C.T., Huang, S.F.: A fuzzy approach for supplier evaluation and selection in supply chain management. Int. J. Prod. Econ. 102(2), 289–301 (2006)
6. Krohling, R.A., Campanharo, V.C.: Fuzzy TOPSIS for group decision making: a case study for accidents with oil spill in the sea. Expert Syst. Appl. 38(4), 4190–4197 (2011)
7. Yue, Z.: An extended TOPSIS for determining weights of decision makers with interval numbers. Knowl.-Based Syst. 24(1), 146–153 (2011)
8. Dash, R., Samal, S., Rautray, R., Dash, R.: A TOPSIS approach of ranking classifiers for stock index price movement prediction. In: Soft Computing in Data Analytics, pp. 665–674 (2019)
9. Dash, R., Samal, S., Dash, R., Rautray, R.: An integrated TOPSIS crow search based classifier ensemble: in application to stock index price movement prediction. Appl. Soft Comput. 85, 105784 (2019)
10. Opricovic, S.: Multicriteria optimization of civil engineering systems. Faculty of Civil Engineering, Belgrade 2(1), 5–21 (1998)
11. Çalışkan, H., Kurşuncu, B., Kurbanoğlu, C., Güven, Ş.Y.: Material selection for the tool holder working under hard milling conditions using different multi criteria decision making methods. Mater. Des. 45, 473–479 (2013)
12. Ahmadi, A., Gupta, S., Karim, R., Kumar, U.: Selection of maintenance strategy for aircraft systems using multi-criteria decision making methodologies. Int. J. Reliab. Qual. Saf. Eng. 17(03), 223–243 (2010)
13. Shemshadi, A., Shirazi, H., Toreihi, M., Tarokh, M.J.: A fuzzy VIKOR method for supplier selection based on entropy measure for objective weighting. Expert Syst. Appl. 38(10), 12160–12167 (2011)
14. Chen, L.Y., Mujtaba, B.G.: Assessment of service quality and benchmark performance in 3C wholesalers: forecasting satisfaction in computers, communication and consumer electronics industries. Int. J. Bus. Forecast. Mark. Intell. 1(2), 153–163 (2009)
15. Kang, D., Park, Y.: Review-based measurement of customer satisfaction in mobile service: sentiment analysis and VIKOR approach. Expert Syst. Appl. 41(4), 1041–1050 (2014)
16. Kuo, M.S.: A novel interval-valued fuzzy MCDM method for improving airlines' service quality in Chinese cross-strait airlines. Transp. Res. Part E: Logist. Transp. Rev. 47(6), 1177–1193 (2011)
17. Ou Yang, Y.P., Shieh, H.M., Leu, J.D., Tzeng, G.H.: A VIKOR-based multiple criteria decision method for improving information security risk. Int. J. Inf. Technol. Decis. Mak. 8(02), 267–287 (2009)
18. Kou, G., Lu, Y., Peng, Y., Shi, Y.: Evaluation of classification algorithms using MCDM and rank correlation. Int. J. Inf. Technol. Decis. Mak. 11(01), 197–225 (2012)

Chapter 40

An Efficient Learning Model Selection for Dengue Detection

Miranji Katta, R. Sandanalakshmi, Gubbala Srilakshmi, and Ramkumar Adireddi

Abstract Significant advances are being made in the development of diagnostic tools in the biomedical sector nowadays. However, finite-precision flaws in diagnostic technologies significantly mislead comprehensive medical therapies. In this regard, machine learning is an important programming tool in the early detection of the most life-threatening diseases from existing clinical findings. The machine-learning method offers a framework for creating sophisticated, automated, and objective algorithms with decision-making capability for the processing of multimodal and high-dimensional biomedical data. The prediction performance of several machine-learning algorithms in disease diagnosis, using pre-processed datasets based on percentage selection, is simulated and compared in this article. The dengue ailment is used as the application on which several machine-learning algorithms are evaluated, and the best of these techniques is then incorporated in the Raspberry Pi. The random forest approach outperforms the others in the simulation findings, with a 95.60% true positive rate, 94.39% classification accuracy, 93.08% precision, and 0.69 and 0.494 mean absolute and root mean square errors, respectively. In practice, the experimental setting for the analogous machine-learning method yields 78.56% accuracy. During this project, Jupyter, Raspberry Pi, and Python IDLE with OpenCV, Google Colab, and the Weka toolbox are used for simulation to evaluate the performance of the various methods. The same is implemented on a Raspberry Pi 4 for validation.



40.1 Introduction

Machine learning is based on the branch of algorithms that models high-level abstractions in gathered data using decision trees, base-level analysis, and linear and non-linear conversion techniques [1]. Nowadays, big data is broken down into pieces as a result of the interdisciplinary collaboration of machine-learning algorithms and accessible datasets [2]. Disease diagnosis, particularly in the biomedical sector, is difficult and invasive, with complicated medical procedures and a long turnaround time for assessment [3]. To address these issues and to implement cost-effective and efficient disease detection, automated decision support systems were created. The field of medical sciences offers an abundance of data on clinical assessments, patient data, pharmaceutical processes, and statistical evaluation reports on chronic ailments. Here, data organization is also one of the difficult aspects, necessitating appropriate tools to extract and process data efficiently. Various machine-learning algorithms are used to build classifiers that split the data based on its characteristics and classify it into two or more groups. Classifiers of this kind are employed in data analysis and illness detection [4]. In the healthcare sector, ML algorithms are mainly employed and adopted to successfully monitor data using different techniques. Data management methods have undergone a major transformation as a result of the contemporary digital era. By providing accurate data recordings, ML methods are highly effective for the investigation of clinical results and may significantly help identify health problems in hospitals [5]. To execute an algorithm, the precise patient record is provided as input, and the results are evaluated without any human interference using previously available comparable data, referred to as datasets. Classification techniques may assist physicians in detecting diseases more quickly and accurately; such classifiers may also help a person with no previous expertise diagnose a problem [6]. The electronic health record, which includes high-dimensional patterns and different data sets, is processed using machine-learning methods. Its primary goal is to incorporate data processing reports, because each patient generates massive amounts of clinical records such as X-ray pictures, medicines, blood tests, physical examination reports, DNA sequences, pharmaceuticals, and other previous medical histories. In this paper, we look at pre-processed discrete images of tiny blood cells and other documented data that have been subjected to ML, which allows for diagnostic accuracy and decision making with appropriate medical prescriptions [7]. Multiple iterative trials are also carried out to identify the optimum model for diagnosing the issue by fine-tuning the learning model. Based on their knowledge, comprehension, and trust in their perception, practitioners choose the most competent model to set up the experiment [8]. This research focuses on dengue disease prediction using machine-learning algorithms, since early detection of dengue saves many lives during an outbreak. According to the literature, dengue is a mosquito-borne ailment that mostly affects the tropical south; the Aedes aegypti mosquito is responsible for the propagation of this disease [9]. According to WHO estimates, this deadly ailment kills approximately 500,000 people each year. Since January, 2% of Indian tribes have perished, and 28% have been afflicted with dengue; the predominance of the disease has been reported in West Bengal and Orissa [10]. During the same time, the number of confirmed fatalities from the chronic disease rose from 110 to 242. This infectious disease attacks the body by producing a high temperature and flu-like symptoms such as headache, irritability, achy joints, tiredness, and infection with minimal bleeding from the gums. Dengue fever is often affected by climatic and socioeconomic factors: temperature, humidity, wind speed, and rainfall all contribute to the spread of dengue disease [11]. The prime objective of this article is to anticipate infectious disease by employing different machine-learning techniques. First, the available datasets of dengue patients were gathered via a proper channel, and prediction began with the classification of these datasets. The following sections go through the classification techniques. The information in this article is organized as follows: Sect. 40.2 includes an overview of the relevant literature; Sect. 40.3 provides a detailed description of the proposed work; the findings of the experiments are shown in Sect. 40.4; finally, in Sect. 40.5, the concluding remarks are presented.

40.2 Related Works

Several authors have concentrated on various machine-learning algorithms for disease prediction, and they agree that these approaches are effective in the detection of many diseases. One author interprets the decision tree as a tool for data analysis and proposes a set of consequence characteristics derived from temporal data; the decision tree model is employed to characterize ailments from various patient medications and other health records. Another considers the need for pattern recognition to identify disease mentions from microscopic blood cell patterns and other important clinical data features. This section contains a review of the literature on serious diseases detected using ML methods, listed in Table 40.1.

40.3 Proposed Work

This section discusses the different machine-learning methods used for the early identification of dengue.

Table 40.1 Classification accuracy of various machine-learning algorithms in disease detection

MLT | Author | Year | Disease | Data resource | Analysis tool | Accuracy in %
AI (CNN, SVM) | Mathur et al. [23] | 2020 | Cardiovascular | 344 patients data—MEDLINE | — | —
SVM, ANN | Caio Davi et al. [24] | 2019 | Dengue | 102 Brazilian patients | — | —
SVM | Khan et al. [25] | 2018 | Hepatitis B | 119 samples from PAEC | — | 98
SVM | Kavakiotis et al. [26] | 2017 | Diabetes disease | — | WEKA | —
CNN | Perdomo et al. [27] | 2016 | Diabetes through retinopathy (DME) | EHR | — | —
J48, Naive Bayes | Iyer et al. [14] | 2015 | Diabetes disease | Pima Indian diabetes dataset | WEKA | 76.95, 79.56
Bayes net, SVM, FT | Otoom et al. [28] | 2015 | Coronary artery disease | UCI | WEKA | 84.5, 84.5, 85.1
Naive Bayes | Vembandasamy et al. [16] | 2015 | Cardiovascular disease | Diabetic Research Institute, Chennai | WEKA | 86.41
Adaboost | Sen et al. [15] | 2014 | Diabetes disease | Pima Indian diabetes dataset from UCI-ML repository | WEKA | 82.31, 77.86, 77.47, 66.40
Naive Bayes, Grading, LogiBoost | Chaurasia et al. [17] | 2013 | Cardiovascular | UCI | WEKA | 86, 92, 90
SVM | Kumari et al. [12] | 2013 | Diabetes | UCI | MATLAB 2010a | 78.02
Naive Bayes, Naive Bayes updatable, FT, K*, J48, LMT, Neural networks | Ba-Alwi and Hintaya [22] | 2013 | Hepatitis | UCI | WEKA | 96.52, 84.00, 87.10, 83.47, 83.00, 83.66, 70.41
Naive Bayes, SVM | Parthiban et al. [18] | 2012 | Cardiovascular | Cardio research center in Chennai | WEKA | 74.00, 94.60
Naive Bayes | Sarvar et al. [13] | 2012 | Type-2 diabetes | — | SQL with MATLAB | 71.42
SVM | Fathima and Manimaglal [29] | 2012 | Dengue | KIPM and surveys of hospitals Chennai and Tirunelveli, India | R Project version 2.12.2 | 90.42
Genetic algorithm and fuzzy logic | Ephzibah et al. [30] | 2011 | Diabetes | UCI | MATLAB | 87.00
FFNN with BP, Naive Bayes | Karlik et al. [21] | 2011 | Hepatitis | UCI | Rapid Miner | 98.00, 97.00
CART, ID3, C4.5 | Sathyadevi et al. [20] | 2011 | Hepatitis | UCI | WEKA | 95.13, 64.82
GA and SVM (hybrid approach) | Tan et al. [19] | 2009 | Cardiovascular disease | UCI | LIBSVM, WEKA | 84.07, 83.24
MFNN (multilayer feed forward neural networks) | Ibrahim et al. [31] | 2005 | Dengue | From 252 hospitalized patients | MATLAB neural network | 90.02


Fig. 40.1 Symptoms of dengue. Source: https://www.disabled-world.com/health/dengue.php

40.3.1 Symptoms

The markers in Fig. 40.1 help us to assess and address a dengue infection early. Sometimes the symptoms are modest or are mistaken for those of other viral diseases, leading to an emergency. Dengue haemorrhagic fever, associated with high fever, lymphatic and vascular damage, bleeding from the nose and gums, enlargement of the liver, and circulatory failure, is also included, as is diarrhoea. The symptoms might cause severe distress, haemorrhage, and death [32, 33] (Fig. 40.1).

40.3.2 Datasets Resources

Data for this study were obtained from Right Health Medical Laboratories (RHML), Hyderabad, Telangana State; a few datasets from the public health ministry of Andhra Pradesh [34, 35]; a few datasets from the Govt. District Hospital, West Godavari, Andhra Pradesh; and a few more datasets from online resources. From the collected data, 64% was used for training and 36% for testing.

40.3.3 Criteria for Percentage Selection of Trained and Tested Data

The percentage selection of data for training and testing can be seen through the two fundamental approaches to problem-solving, i.e., interpreting the situation based on intuition or on theory. However, there are intermediate approaches, for example using philosophy to implement the method; such an approach is known as guided empiricism. This way of thought has been the mainstay of Karl Popper's approach to falsification in the philosophy of science [36]. Here, the percentage frequency distribution of 100 samples was found to be the optimal PDF range for picking the best ratio of train and test samples from the pre-processed data sets. Following the Brute Force Technique (BFT) [37], among the different combinations of training and testing samples we found 64:36 to be the optimized ratio for achieving good accuracy, precision, and error rate with significant system performance, as shown in Fig. 40.2. The system performance may include configuration, training time, testing time, execution time, etc., and it also varies with respect to the system configuration.

Fig. 40.2 Percentage selection of train and test data based on the Brute Force Technique
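A minimal sketch of this brute-force percentage selection is given below, assuming pre-processed features X and labels y and a random forest as a representative classifier; the candidate ratios are illustrative, not the exact grid used in the study.

```python
# Brute-force search over train/test split ratios (BFT), as described above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def best_split_ratio(X, y, ratios=(0.50, 0.60, 0.64, 0.70, 0.80)):
    best = (None, 0.0)
    for r in ratios:  # r = fraction of samples used for training
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=r, random_state=42, stratify=y)
        acc = RandomForestClassifier(random_state=42).fit(
            X_tr, y_tr).score(X_te, y_te)
        if acc > best[1]:
            best = (r, acc)
    return best  # e.g. (0.64, ...) would correspond to the 64:36 split
```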

40.3.4 Proposed Algorithm for Evaluation of the Best ML Technique

To get reliable results from the simulated output, the following algorithm is used to assess the performance of the various machine-learning algorithms.

Step-1: Collect datasets from online/offline resources.
Step-2: Refine the collected datasets according to the required attributes.
Step-3: Pre-process the images in the collected datasets using the Sobel filter to highlight abnormalities.
Step-4: Divide the collected data for training and testing, based on a percentage selection of 64–36%.
Step-5: Apply various machine-learning algorithms to the processed datasets (a sketch of Steps 5 and 6 follows the list).
Step-6: Identify the performance metrics.
Step-7: Decide based on the statistical analysis of the performance metrics.
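The sketch below illustrates Steps 5 and 6, training several classifiers on the 64:36 split and tabulating their accuracy; scikit-learn stand-ins replace the WEKA algorithms (e.g., a decision tree for J48), and make_classification is only a placeholder for the dengue dataset.

```python
# Steps 5-6: run several classifiers on the 64:36 split and compare them.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.64,
                                          random_state=42)
models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "J48 (decision tree)": DecisionTreeClassifier(random_state=42),
    "Adaboost M1": AdaBoostClassifier(random_state=42),
}
for name, model in models.items():  # one accuracy figure per technique
    print(name, model.fit(X_tr, y_tr).score(X_te, y_te))
```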

40.3.5 Process Flow Chart

See Fig. 40.3.

Fig. 40.3 Flow graph for best ML technique selection and PCD (point of care diagnostics): data from online/offline sources feed an internal database; a refiner and a data/image pre-processing stage precede the division into train and test data; different ML techniques are executed and their performance metrics identified. If the metrics are unsatisfactory, further ML techniques are explored; otherwise, the selected technique is loaded into the controller (Raspberry Pi), where the trained data is stored in system memory, test data arrives via a smartphone or computer system, and a red or green LED (or a display message) indicates the decision


40.3.6 ML Algorithms Used

In this work, the classification algorithms used are Naive Bayes, Random Forest, KNN, J48, SVM, Adaboost M1, REP TREE, SMO, and LWL.

40.4 Results and Discussion

To assess the feasibility of the developed work, simulation-based tests were performed using the open-source tool WEKA 7.0. This toolkit includes prediction model components and has been used successfully in bioinformatics. Four performance indicators are used to assess the aforementioned algorithms: True Positive (TP) Rate, Classification Accuracy (CA), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE). A confusion matrix is a tool frequently used to assess the results of a classifier on a validation dataset whose actual values are known. The confusion matrix itself is easy to understand, but the language associated with it may be perplexing [38]. Here two predicted classes are conceivable, "yes" and "no": if we are forecasting an ailment, "yes" means that the patient has the ailment and "no" means that the patient does not.

TP Rate: the true positive rate is the proportion of positives that are correctly identified as such. It is also known as sensitivity, and it is computed as follows:

TPR = TP / (TP + FN)    (40.1)

Classification Accuracy refers to the fraction of correctly classified samples among the input samples:

CA = CS / n = (TP + TN) / (TP + FN + TN + FP)    (40.2)

where CS represents the correctly classified samples and n indicates the number of samples.

Mean Absolute Error: it is the average of the absolute errors:

MAE = (1/n) Σ_{i=1..n} |p_i − t_i|    (40.3)

Precision: it is defined as the fraction of relevant instances among the retrieved instances:

P = TP / (TP + FP)    (40.4)

Root Mean Square Error: it is the square root of the sum of squared errors divided by the number of predictions:

RMSE = sqrt((1/n) Σ_{i=1..n} e_i²)    (40.5)
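Equations 40.1–40.5 can be computed directly from the confusion-matrix counts and the prediction errors, as in the following sketch:

```python
# Computing the five performance metrics of Eqs. 40.1-40.5.
import numpy as np

def evaluation_metrics(tp, tn, fp, fn, predictions, targets):
    """Return TPR, classification accuracy, precision, MAE and RMSE."""
    tpr = tp / (tp + fn)                      # Eq. 40.1
    ca = (tp + tn) / (tp + tn + fp + fn)      # Eq. 40.2 (CS / n)
    precision = tp / (tp + fp)                # Eq. 40.4
    p = np.asarray(predictions, dtype=float)
    t = np.asarray(targets, dtype=float)
    mae = np.mean(np.abs(p - t))              # Eq. 40.3
    rmse = np.sqrt(np.mean((p - t) ** 2))     # Eq. 40.5
    return tpr, ca, precision, mae, rmse
```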

The performances of the various algorithms used are presented in Table 40.3 and pictorially represented in Figs. 40.4 and 40.5. From the performance parameters of Table 40.3 and the analysis of Figs. 40.4 and 40.5, the random forest algorithm shows significantly superior performance to all other algorithms except Adaboost M1. The simulation and performance of each algorithm depend on several factors, like the number of available data sets, the number of trained and tested samples, the simulation system configuration, etc. The binary-classifier confusion matrix from which the metrics are computed is shown in Table 40.2.

Table 40.2 Confusion matrix—binary classifier [38]

Binary classifier | Predicted No | Predicted Yes
Actual No         | TN           | FP
Actual Yes        | FN           | TP

Fig. 40.4 True positive rate, classification accuracy and precision as performance parameters of the different algorithms (NB, RF, KNN, SVM, J48, Adaboost M1, SMO, REP TREE, LWL, ZeroR)

Fig. 40.5 Mean absolute error and root mean square error over the various algorithms

Table 40.3 Various algorithms with true positive rate, classification accuracy, precision, mean absolute error, and root mean square error calculated from the confusion matrix with binary classifiers

Algorithm   | TPR (%) | CA (%) | Precision (%) | MAE   | RMSE
Naïve Bayes | 78.79   | 80.46  | 89.36         | 0.865 | 1.180
RF          | 95.60   | 94.39  | 93.08         | 0.650 | 0.494
KNN         | 81.90   | 84.42  | 93.07         | 1.562 | 1.712
SVM         | 89.58   | 81.31  | 91.08         | 0.863 | 0.961
J48         | 78.79   | 80.85  | 90.18         | 1.032 | 1.456
Adaboost M1 | 91.66   | 92.9   | 95.83         | 0.655 | 0.855
SMO         | 71.80   | 67.06  | 84.33         | 1.467 | 2.075
REP TREE    | 71.34   | 70.05  | 84.95         | 0.456 | 0.645
LWL         | 77.67   | 69.47  | 78.85         | 1.450 | 2.052
ZeroR       | 70.24   | 63.25  | 82.34         | 1.345 | 2.134

As per the data obtained from the binary-classifier confusion matrix (Table 40.2), in contrast with Adaboost M1 the random forest algorithm shows better performance by 5.49% for TPR and 2.04% for classification accuracy, but in view of precision it shows poorer performance, by −3.62%. However, the mean absolute and root mean square errors of the random forest algorithm (RFA) are far better than those of Adaboost M1. Based on these metrics, the random forest algorithm is finally chosen to build the independent diagnosing system for point-of-care diagnostics. Figure 40.6 shows the simple experimental setup to detect dengue disease. In this experiment, the necessary library files and the trained model are initially loaded into the Raspberry Pi, and LEDs are connected to the GPIO male headers as shown in the figure. Test data in the form of an image is applied to the Raspberry Pi through a connected system or smartphone. Based on how well the test data matches the trained data, a red or green LED indicates whether dengue is detected or not. The matching threshold is fixed at 75% to avoid false alarms.
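A minimal sketch of this decision logic is shown below, assuming a trained random forest saved as rf_dengue.joblib, a hypothetical feature-extraction helper, and LEDs on illustrative BCM pins 17 and 27; per Fig. 40.6, green signals dengue detected and red signals no dengue.

```python
# LED decision logic on the Raspberry Pi (pin numbers are illustrative).
import cv2
import joblib
import RPi.GPIO as GPIO

RED_PIN, GREEN_PIN, THRESHOLD = 17, 27, 0.75  # 75% matching threshold

def extract_features(path):
    # Hypothetical stand-in: flatten a resized grayscale image with OpenCV.
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.resize(img, (32, 32)).ravel()

GPIO.setmode(GPIO.BCM)
GPIO.setup([RED_PIN, GREEN_PIN], GPIO.OUT)

model = joblib.load("rf_dengue.joblib")          # trained model in memory
prob = model.predict_proba([extract_features("test_image.png")])[0][1]

if prob >= THRESHOLD:                            # confident positive only,
    GPIO.output(GREEN_PIN, GPIO.HIGH)            # to avoid false alarms
    GPIO.output(RED_PIN, GPIO.LOW)               # green = dengue detected
else:
    GPIO.output(RED_PIN, GPIO.HIGH)              # red = no dengue detected
    GPIO.output(GREEN_PIN, GPIO.LOW)
```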


Fig. 40.6 a Experimental setup with Raspberry Pi4 with LED as output decision indicator based on input test data. b Shows decision as “no” with red LED glow—no dengue detected. c Decision as “yes” with green LED glow—dengue detected

40.5 Conclusion

As per the analysis of the results, it is observed that the role of machine-learning algorithms is quite significant in disease detection, but their accuracy depends on several factors, like the datasets, the number of trained and tested samples, the system configuration, etc. According to the simulation results, among the existing machine-learning algorithms, for the given dengue dataset the random forest algorithm and Adaboost M1 show better performance than the remaining ML algorithms tabulated above. Between RFA and ADM1, RFA has 5.49% and 2.04% higher TPR and CA, respectively, and lower error values than the ADM1 algorithm. Finally, it is concluded that the random forest algorithm gives the best performance in disease diagnosis according to the statistics, and the experimental results are quite satisfactory, with 78.56% accuracy.


References

1. Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 160(1), 3–24 (2007)
2. Jeansoulin, R.: Review of forty years of technological changes in geomatics toward the big data paradigm. ISPRS Int. J. Geo-Inf. 5(9), 155 (2016)
3. Katta, M., Sandanalakshmi, R., Narendra Kumar, M., Prakash, J.: Static and dynamic analysis of carbon nano tube cantilever for nano electro mechanical systems based applications. J. Comput. Theor. Nanosci. 17(5), 2151–2156 (2020)
4. Nikam, S.S.: A comparative study of classification techniques in data mining algorithms. Oriental J. Comput. Sci. Technol. 8(1), 13–19 (2015)
5. López-Martínez, F., Núñez-Valdez, E.R., García-Díaz, V., Bursac, Z.: A case study for a big data and machine learning platform to improve medical decision support in population health management. Algorithms 13(4), 102 (2020)
6. Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., Lichtendahl, Jr., K.C.: Data Mining for Business Analytics: Concepts, Techniques, and Applications in R. Wiley, Hoboken, NJ (2017)
7. Mittal, S., Hasija, Y.: Applications of deep learning in healthcare and biomedicine. In: Deep Learning Techniques for Biomedical and Health Informatics, pp. 57–77. Springer, Cham (2020)
8. Khan, S., Yairi, T.: A review on the application of deep learning in system health management. Mech. Syst. Signal Process. 107, 241–265 (2018)
9. World Health Organization: Dengue and severe dengue. No. WHO-EM/MAC/032/E. World Health Organization, Regional Office for the Eastern Mediterranean (2014)
10. Ganguly, N.: Slums of India. MJP Publisher, Chennai (2019)
11. Internet sources. https://www.who.int/news/item/09-12-2020-who-reveals-leading-causes-of-death-and-disability-worldwide-2000-2019
12. Kumari, V.A., Chitra, R.: Classification of diabetes disease using support vector machine. Int. J. Eng. Res. Appl. 3(2), 1797–1801 (2013)
13. Ijaz, A., Babar, S., Sarwar, S., Shahid, S.U.: The combined role of allelic variants of IRS-1 and IRS-2 genes in susceptibility to type 2 diabetes in the Punjabi Pakistani subjects. Diabetol. Metab. Syndr. 11(1), 1–6 (2019)
14. Iyer, A., Jeyalatha, S., Sumbaly, R.: Diagnosis of diabetes using classification mining techniques. arXiv preprint arXiv:1502.03774 (2015)
15. Sen, S., Das, P., Debnath, B.: A data mining approach for genetic diabetes prediction (2019)
16. Vembandasamy, K., Sasipriya, R., Deepa, E.: Heart diseases detection using Naive Bayes algorithm. Int. J. Innov. Sci. Eng. Technol. 2(9), 441–444 (2015)
17. Chaurasia, V., Pal, S.: Early prediction of heart diseases using data mining techniques. Caribb. J. Sci. Technol. 1, 208–217 (2013)
18. Parthiban, G., Rajesh, A., Srivatsa, S.K.: Diagnosis of heart disease for diabetic patients using naive Bayes method. Int. J. Comput. Appl. 24(3), 7–11 (2011)
19. Tan, K.C., Teoh, E.J., Yu, Q., Goh, K.C.: A hybrid evolutionary algorithm for attribute selection in data mining. Expert Syst. Appl. 36(4), 8616–8630 (2009)
20. Sathyadevi, G.: Application of CART algorithm in hepatitis disease diagnosis. In: 2011 International Conference on Recent Trends in Information Technology (ICRTIT), pp. 1283–1287. IEEE (2011)
21. Karlik, B.: Hepatitis disease diagnosis using backpropagation and the naive Bayes classifiers. IBU J. Sci. Technol. 1(1) (2012)
22. Ba-Alwi, F.M., Hintaya, H.M.: Comparative study for analysis the prognostic in hepatitis data: data mining approach. Int. J. Sci. Eng. Res. 4(8), 680–685 (2013)
23. Mathur, P., Srivastava, S., Xu, X., Mehta, J.L.: Artificial intelligence, machine learning, and cardiovascular disease. Clin. Med. Insights: Cardiol. 14, 1179546820927404 (2020)
24. Davi, C., Pastor, A., Oliveira, T., de Lima Neto, F.B., Braga-Neto, U., Bigham, A.W., Bamshad, M., Marques, E.T.A., Acioli-Santos, B.: Severe dengue prognosis using human genome data and machine learning. IEEE Trans. Biomed. Eng. 66(10), 2861–2868 (2019)


25. Khan, S., Ullah, R., Khan, A., Ashraf, R., Ali, H., Bilal, M., Saleem, M.: Analysis of hepatitis B virus infection in blood sera using Raman spectroscopy and machine learning. Photodiagn. Photodyn. Ther. 23, 89–93 (2018)
26. Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., Chouvarda, I.: Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116 (2017)
27. Perdomo, O., Otalora, S., Rodríguez, F., Arevalo, J., González, F.A.: A novel machine learning model based on exudate localization to detect diabetic macular edema (2016)
28. Otoom, A.F., Abdallah, E.E., Kilani, Y., Kefaye, A., Ashour, M.: Effective diagnosis and monitoring of heart disease. Int. J. Softw. Eng. Appl. 9(1), 143–156 (2015)
29. Fathima, A., Manimegalai, D.: Predictive analysis for the arbovirus-dengue using SVM classification. Int. J. Eng. Technol. 2(3), 521–527 (2012)
30. Ephzibah, E.P.: Cost effective approach on feature selection using genetic algorithms and fuzzy logic for diabetes diagnosis. arXiv preprint arXiv:1103.0087 (2011)
31. Ibrahim, F., Taib, M.N., Wan Abas, W.A.B., Guan, C.C., Sulaiman, S.: A novel dengue fever (DF) and dengue haemorrhagic fever (DHF) analysis using artificial neural network (ANN). Comput. Methods Progr. Biomed. 79(3), 273–281 (2005)
32. Khare, R.K., Raj, P.: Dengue fever with thrombocytopenia and its complications: a hospital based study. J. Adv. Med. Dental Sci. Res. 5(3), 72 (2017)
33. Paessler, S., Walker, D.H.: Pathogenesis of the viral hemorrhagic fevers. Annu. Rev. Pathol. 8, 411–440 (2013)
34. https://data.world/datasets/dengue
35. https://nvbdcp.gov.in/index4.php?lang=1&level=0&linkid=431&lid=3715
36. Kaplan, A.: The Conduct of Inquiry: Methodology for Behavioural Science. Routledge, New York (2017)
37. Bayardo, Jr., R.J.: Brute-force mining of high-confidence classification rules. In: KDD, vol. 97, pp. 123–126 (1997)
38. https://en.wikipedia.org/wiki/Machine_learning

Chapter 41

A Modified Convolution Neural Network for Covid-19 Detection

Rasmiranjan Mohakud and Rajashree Dash

Abstract COVID-19 has turned into a critical health problem around the world. Since the start of its spread, numerous Artificial Intelligence-based models have been created for foreseeing the conduct of the infection and recognizing its contamination. One of the efficient methods of detecting COVID-19 pneumonia is through chest X-ray image analysis. As there are lots of patients in emergency clinical conditions, it would be tedious and difficult to analyze loads of X-ray images manually, so an automated, AI-based system can be helpful to predict infection due to COVID-19 in less time. In this study, a Modified Convolution Neural Network (CNN) model is suggested to predict COVID-19 infections from chest X-ray images. The proposed model is designed based on state-of-the-art models like GoogleNet, U-Net, and VGGNet. The model is fine-tuned using fewer layers than the existing models while retaining acceptable accuracy. The model is implemented on 724 chest X-ray images from the COVID-19 image data collection and is able to produce 93.5% accuracy, 93.0% precision, 93.5% recall, and 92.5% F1-Score, respectively.

41.1 Introduction

COVID-19 was declared a pandemic in 2020 by the World Health Organization, affecting 220 countries and territories over the planet. By June 2021, the epidemic had already taken over 3,849,865 people's lives and infected more than 177,764,675 around the world [1]. Reverse Transcriptase-Polymerase Chain Reaction (RT-PCR) [2] is the popularly used method for COVID-19 detection. It has high specificity, but it is also pricey, slow, and in high demand. The most popular and least expensive way of distinguishing COVID-19 is chest X-rays, but it is difficult to identify the virus manually using this method, so an automated AI-based method is required to do this work.


Researchers have already proposed different AI-based methods for the detection of COVID-19. The model proposed by Bassi et al. in [3], a neural network pre-trained on ImageNet and fine-tuned with a "twice transfer learning approach", uses X-ray images to classify or detect COVID-19 disease; it has shown accuracy of up to 100% on chest X-ray images. Das et al. in [4] propose a state-of-the-art architecture in which "DenseNet201, Resnet50V2, and Inceptionv3" are merged to predict the class value using a weighted average ensemble approach; to test the efficiency of the model, Das et al. used publicly available chest X-ray images of COVID-19. Muhammad et al. in [5] predict COVID-19 disease using supervised AI models, with algorithms such as "logistic regression, decision tree, support vector machine, naive Bayes, and artificial neural network"; their study uses a dataset of positive and negative COVID-19 instances from Mexico, reporting accuracy up to 94.99%, sensitivity up to 93.34%, and specificity up to 94.30% for the decision tree, support vector machine, and naive Bayes, respectively. Narin et al. in [6] suggested five pre-trained CNN models ("ResNet101, ResNet50, ResNet152, Inception-ResNetV2 and InceptionV3") for detecting COVID-19 and pneumonia-infected patients using chest X-ray radiographic pictures; the pre-trained ResNet50 model was found to have the best classification performance on the chest X-ray image data shared by Dr. Joseph Cohen on his GitHub. Covid-Net by Wang et al. in [7], an open-source and freely available CNN, was contributed for the identification of COVID-19 patients from chest X-ray scans. Deep neural networks have been adopted in many classification and recognition systems due to their remarkable success in solving complex tasks. CNN is one of the popular deep neural networks widely used by researchers for image classification tasks due to its ability to classify complex image objects with acceptable accuracy. In the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14), a deep CNN known as GoogleNet achieved the new state of the art for classification and detection; the main hallmark of that work is its efficient utilization of the computing resources inside the network. Earlier, a deep CNN was utilized to classify the 1.2 million images of the ImageNet LSVRC-2010 competition into 1000 distinct classes; for faster training, it used non-saturating neurons and an extremely efficient GPU implementation of the convolution operation. VGGNet (ConvNet) explored the impact of convolution network depth on accuracy in the large-scale image recognition setting; the primary contribution of that work is the use of small convolution filters to deepen the network, showing that a significant improvement over prior state-of-the-art models can be achieved by deepening to 16–19 weight layers. The U-Net architecture consists of an encoding path to encode the features and an equivalent decoding path that expands the selected features back to the original input size. Automated skin cancer detection using hyper-parameter optimized CNN, proposed by Mohakud et al. in [8, 9], uses nature-inspired algorithms to tune the hyper-parameters of the CNN. Chest X-ray based detection of COVID-19 patients is challenging because trained specialists may not always be accessible, particularly in remote regions; additionally, the radiological appearances related to COVID-19 are new and unfamiliar to many specialists lacking prior experience with COVID-19 positive patients. We propose a straightforward and economical deep learning-based procedure to classify COVID-positive cases using chest X-ray pictures; using this strategy, a near-accurate detection of COVID-19 positive patients can be achieved in less time.

41 A Modified Convolution Neural Network for Covid-19 Detection

457

patients. We proposed a straightforward and economical profound learning-based procedure to characterize COVID decisive utilizing chest X-ray pictures. Utilizing this strategy a close precise detection of COVID-19 decisive patients should be possible in less time. As a part of this research, main contributions are as follows: • We proposed a CNN architecture that can be used to predict the class of COVID-19 decisive patients. • Data set is prepared by choosing explicit chest X-ray images from arXiv: 2003.11597. • Model performance is measured using standard metrics and compared with existing benchmark models. Immediate parts of the paper are coordinated as follows: Sect. 41.2 describes the proposed method, Sect. 41.3 presents the experimental result and analysis, Sect. 41.4 concluded the proposed work.

41.2 Modified CNN Model for Covid-19 Detection from Chest X-Ray Images

Due to the COVID-19 pandemic situation, a cost-effective, efficient, and automated method is required to classify a chest X-ray image as COVID-19 or not. CNN has played a vital role in medical image processing for a decade. CNN has the capability to automate both feature detection and classification of images, and its end-to-end training process and accuracy attract many researchers to apply it to different problems. Many variants of the CNN, such as GoogLeNet, ConvNet, and the ImageNet CNN, are available; these models differ in the number of convolution layers and the size and number of kernels used. In this work, we propose a modified CNN model which is a stack of layers comprising convolution, max-pool, dropout, batch normalization, and dense layers. The proposed CNN model takes a chest X-ray image and classifies it into two classes, COVID-19 or non COVID-19. The modified CNN model is designed with fewer layers than the existing CNN models. The proposed model's schematic diagram is presented in Fig. 41.1.

The suggested CNN model consists of eight convolution layers, two max-pool layers, two dropout layers, and three dense layers. The kernel size (KS) of all eight convolution layers is 3 × 3, and all are activated by the ReLU activation function. The numbers of kernels (NK) of the convolution layers are 20, 40, 60, 80, 100, 100, 100, and 100, respectively. The max-pool layers have a pooling size of 2 × 2, the dropout layers have a dropout rate of 0.2, and the dense layers have sizes (DS) of 500, 500, and 2, respectively. The last dense layer has size 2, matching the number of classes predicted by our model. All dense layers, including the last one, use the softmax activation function. A flatten layer is only used to convert the n-dimensional array into a one-dimensional array so that it can be densely connected to the dense layers. The convolution layers select the features from the input images; the NK of the initial convolution layers is kept small to filter simple features, whereas later layers have greater NK to filter complex features. The max-pool layers downsize the features. Dropout and batch normalization regularize the model: the dropout layers drop some of the features, while batch normalization normalizes each layer's output batch-wise so that the model does not overfit. Finally, the dense layers predict the correct class.

Fig. 41.1 Schematic diagram of the proposed modified CNN
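To make the layer stack concrete, the following is a minimal Keras sketch of an architecture matching this description. It is a sketch under stated assumptions, not the authors' exact code: the 100 × 100 × 3 input size is taken from Sect. 41.3, the positions of the pooling, dropout, and batch-normalization layers are assumptions (the text gives only their counts), and the softmax activation on every dense layer follows the text as written.

```python
from tensorflow.keras import layers, models

def build_modified_cnn(input_shape=(100, 100, 3), num_classes=2):
    """Sketch of the modified CNN: 8 conv, 2 max-pool, 2 dropout,
    batch normalization, and 3 dense layers, per the description."""
    model = models.Sequential()
    # Eight 3 x 3 convolution layers with NK = 20, 40, 60, 80, 100, 100, 100, 100
    kernel_counts = [20, 40, 60, 80, 100, 100, 100, 100]
    model.add(layers.Conv2D(kernel_counts[0], (3, 3), activation="relu",
                            padding="same", input_shape=input_shape))
    for i, nk in enumerate(kernel_counts[1:], start=1):
        model.add(layers.Conv2D(nk, (3, 3), activation="relu", padding="same"))
        if i == 3:  # assumed position of the first 2 x 2 max-pool + dropout
            model.add(layers.MaxPooling2D((2, 2)))
            model.add(layers.Dropout(0.2))
    model.add(layers.MaxPooling2D((2, 2)))   # second max-pool (assumed position)
    model.add(layers.Dropout(0.2))
    model.add(layers.BatchNormalization())
    model.add(layers.Flatten())
    # Dense sizes DS = 500, 500, 2; the text states softmax on all dense layers
    model.add(layers.Dense(500, activation="softmax"))
    model.add(layers.Dense(500, activation="softmax"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])      # binary cross-entropy, Eq. (41.8)
    return model
```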

41.2.1 Performance Metrics

The proposed model is assessed using accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC). These are calculated from the confusion matrix as follows:

Accuracy = (TP + TN) / (TP + FP + FN + TN)    (41.1)

Precision = TP / (TP + FP)    (41.2)

Recall = TP / (TP + FN)    (41.3)

F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (41.4)

True Negative Rate = TN / (TN + FP)    (41.5)

False Positive Rate = FP / (TN + FP)    (41.6)

The Receiver Operating Characteristic (ROC) curve is an assessment measure for classification problems. The ROC curve uses probabilistic thresholds to plot the true positive rate against the false positive rate. The summary of the ROC curve is represented by the Area Under the Curve (AUC), which measures the capability of the classifier to distinguish between classes:

AUC = ∫_0^1 ROC    (41.7)

The loss function used in our model is the binary cross-entropy loss, as in Eq. (41.8), where N is the number of data points, Y_i is the actual class, and P_i is the class predicted by the CNN model:

Loss = −∑_{i=1}^{N} Y_i log(P_i)    (41.8)
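As a sketch of how these quantities can be computed in practice with scikit-learn (the package the chapter later uses for the confusion matrix), consider the following; y_true and y_score are illustrative placeholders for the test labels and the model's predicted probabilities for the COVID-19 class.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])           # toy ground truth
y_score = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.7, 0.1, 0.3])
y_pred = (y_score >= 0.5).astype(int)                 # thresholded predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy :", accuracy_score(y_true, y_pred))   # Eq. (41.1)
print("Precision:", precision_score(y_true, y_pred))  # Eq. (41.2)
print("Recall   :", recall_score(y_true, y_pred))     # Eq. (41.3)
print("F1-score :", f1_score(y_true, y_pred))         # Eq. (41.4)
print("TNR      :", tn / (tn + fp))                   # Eq. (41.5)
print("FPR      :", fp / (tn + fp))                   # Eq. (41.6)
print("AUC      :", roc_auc_score(y_true, y_score))   # Eq. (41.7)
```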

41.3 Experimental Result Discussion

The simulations are carried out by collecting the chest X-ray images of COVID-19 and non COVID-19 patients from the COVID-19 image data collection, arXiv:2003.11597, 2020 [10]. This data set consists of chest X-ray and CT scan images of COVID-19 and non COVID-19 patients. It includes 463 COVID-19 and 261 non COVID-19 images. The labels of the images are prepared by assigning 1 to COVID-19 and 0 to non COVID-19. The images of the data set are RGB images of different sizes, such as 522 × 582, 1024 × 995, and 2000 × 2000. We split the data set into two parts, keeping 55% for training and 45% for testing. The details about the dataset are given in Table 41.1.

Our model is implemented in Python 3 with Keras 2.2. For image preprocessing we use OpenCV (cv2), for plotting the results we use matplotlib, and for the confusion matrix we use the sklearn package of Python.

Table 41.1 Dataset used for proposed model

Dataset | Total images | COVID-19 | Non COVID-19 | Training (%) | Testing (%)
Chest X-ray image data collection | 724 | 463 | 261 | 55 | 45

All experiments are carried out on the Google Colab Pro platform with a Tesla GPU, 16 GB RAM, NVIDIA-SMI 465.27, and CUDA version 11.2. After preparing the dataset, we resized all the images to 100 × 100 × 3 by using an image cropping method, to save model training time. We trained the model using 55% of the dataset for 300 epochs. During training, the model reached an accuracy of up to 97% and a loss of 0.2052. The model was then tested with the 45% test split, where it obtained a test accuracy of 93.59% and a test loss of 0.2736. The accuracy and loss of the proposed model are given in Table 41.2. Figure 41.2 illustrates the accuracy and loss of the training and validation data over the epochs. From this plot, it can be seen that the training accuracy becomes smooth after about 100 epochs, whereas the testing accuracy becomes smooth after about 200 epochs, and the losses of both training and testing decrease smoothly across all epochs. The model prediction confusion matrix is given in Table 41.3. The class-wise precision, recall, and F1-score of the proposed model are summarized in Table 41.4. The micro average of precision, recall, and F1-score is calculated by averaging the class-wise scores. It can be seen from Table 41.4 that the micro average precision, recall, and F1-score are around 93.0, 93.5, and 92.5, respectively. Figure 41.3 shows the ROC curve of the proposed model and its AUC. Finally, the model performance is compared in terms of accuracy with other state-of-the-art models such as VGG-19, ResNet-50, and COVID-Net, as presented in Table 41.5. The proposed model's accuracy is 10.5%, 3.1%, and 0.2% higher than VGG-19, ResNet-50, and COVID-Net, respectively.

Table 41.2 Accuracy and loss of proposed CNN model

Model | Training accuracy (%) | Testing accuracy (%) | Training loss | Testing loss
Proposed CNN | 97.00 | 93.59 | 0.2052 | 0.2736
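A hedged sketch of the preprocessing and training loop described above is given next: the 100 × 100 resize, [0, 1] pixel scaling, 55/45 split, and 300 epochs follow the text, while the directory layout, the use of plain resizing in place of the cropping mentioned, and the stratified split are illustrative assumptions.

```python
import cv2
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

def load_images(paths, label):
    """Read images, resize to 100 x 100, and scale pixels to [0, 1]."""
    data = []
    for p in paths:
        img = cv2.resize(cv2.imread(p), (100, 100))
        data.append((img / 255.0, label))
    return data

covid = load_images(glob("data/covid/*.png"), 1)      # assumed directory layout
normal = load_images(glob("data/non_covid/*.png"), 0)
images, labels = zip(*(covid + normal))
X, y = np.array(images), np.array(labels)

# 55% training / 45% testing split, as in Table 41.1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.55, test_size=0.45, stratify=y, random_state=42)

model = build_modified_cnn()                          # sketch defined earlier
model.fit(X_train, to_categorical(y_train, 2), epochs=300,
          validation_data=(X_test, to_categorical(y_test, 2)))
```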

Fig. 41.2 a Epoch versus accuracy of training and testing data. b Epoch versus loss of training and testing data of the proposed CNN model


Table 41.3 Confusion matrix of proposed model

Actual \ Predicted | COVID-19 | Non COVID-19
COVID-19 | 156 | 24
Non COVID-19 | 0 | 144

Table 41.4 Classification report of the proposed model

Class | Precision | Recall | F1-score | Support
COVID-19 | 1.00 | 0.87 | 0.93 | 180
Non COVID-19 | 0.86 | 1.00 | 0.92 | 144
Micro average | 0.930 | 0.935 | 0.925 |

Fig. 41.3 ROC and AUC of the proposed CNN model

Table 41.5 Accuracy of proposed model compared with other state-of-the-art models

Model | Accuracy (%)
VGG-19 [7] | 83.0
ResNet-50 [7] | 90.6
COVID-Net [7] | 93.3
Proposed CNN | 93.5

41.4 Conclusion

In this study, we introduced a modified deep CNN for the detection of COVID-19 pneumonia infections from chest X-ray images. The suggested CNN classifier, with just 18 layers, is able to detect COVID-19 cases with a test accuracy of 93.59%. Such a small CNN may not be the best for other datasets, but for this chest X-ray image data set it gives promising results. The proposed model will be helpful in accelerating the computer-aided screening process of the infection. We plan to expand the experimental work by validating the approach with larger datasets, depending on their availability. Finally, future work will focus on model trimming and quantization to improve efficiency and allow deployment on mobile devices.

References

1. Hopkins, J.: COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE). Coronavirus Resource Center (2020)
2. Wang, W., Xu, Y., Gao, R., Lu, R., Han, K., Wu, G., Tan, W.: Detection of SARS-CoV-2 in different types of clinical specimens. JAMA 323(18), 1843–1844 (2020)
3. Bassi, P.R., Attux, R.: A deep convolutional neural network for COVID-19 detection using chest X-rays. Res. Biomed. Eng. 1–10 (2021)
4. Das, A.K., Ghosh, S., Thunder, S., Dutta, R., Agarwal, S., Chakrabarti, A.: Automatic COVID-19 detection from X-ray images using ensemble learning with convolutional neural network. Pattern Anal. Appl. 1–14 (2021)
5. Muhammad, L.J., Algehyne, E.A., Usman, S.S., Ahmad, A., Chakraborty, C., Mohammed, I.A.: Supervised machine learning models for prediction of COVID-19 infection using epidemiology dataset. SN Comput. Sci. 2(1), 1–13 (2021)
6. Narin, A., Kaya, C., Pamuk, Z.: Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks. Pattern Anal. Appl. 1–14 (2021)
7. Wang, L., Lin, Z.Q., Wong, A.: COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Sci. Rep. 10(1), 1–12 (2020)
8. Mohakud, R., Dash, R.: Designing a grey wolf optimization based hyper-parameter optimized convolutional neural network classifier for skin cancer detection. J. King Saud Univ. Comput. Inf. Sci. (2021). https://doi.org/10.1016/j.jksuci.2021.05.012
9. Mohakud, R., Dash, R.: Survey on hyperparameter optimization using nature-inspired algorithm of deep convolution neural network. In: Intelligent and Cloud Computing, pp. 737–744 (2021)
10. Cohen, J.P., Morrison, P., Dao, L., Roth, K., Duong, T.Q., Ghassemi, M.: COVID-19 Image Data Collection: Prospective Predictions are the Future. arXiv preprint arXiv:2006.11988 (2020)

Chapter 42

Bi-directional Long Short-Term Memory Network for Fake News Detection from Social Media

Suprakash Samantaray and Abhinav Kumar

Abstract These days, social media is one of the main news sources for people throughout the planet because of its minimal expense, easy accessibility, and fast spreading. However, social media can sometimes include unverified messages and carries a significant risk of exposure to counterfeit or fake news, which may mislead readers. Therefore, finding fake news on social media is one of the important natural language processing tasks. In this work, we have proposed a bi-directional long short-term memory (Bi-LSTM) network to identify COVID-19 fake news posted on Twitter. The performance of the proposed Bi-LSTM network is compared to six popular classical machine learning classifiers: Naïve Bayes, KNN, Decision Tree, Gradient Boosting, Random Forest, and AdaBoost. For the classical machine learning classifiers, uni-gram, bi-gram, and tri-gram word TF-IDF features are used, whereas for the Bi-LSTM model word embedding features are used. The proposed Bi-LSTM network performed best among the implemented models and achieved a weighted F1-score of 0.94 in identifying COVID-19 fake news from Twitter.

42.1 Introduction

In December 2019, the World Health Organization (WHO) reported pneumonia of unknown cause in the city of Wuhan, China. Those cases were diagnosed as acute pneumonia with dryness, fatigue, fever, cough, and breathing difficulty. In January 2020, WHO named the infection "Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2)", with the disease later called COVID-19; this infection spread everywhere in the world and turned into a worldwide plague. The widespread use of online media has come with the misleading of the general public by fake news [1–3]. Counterfeit news means false facts presented like news, which compromises the reliability of what is recorded, presented, and published using social networks as though it were true news.

Likewise, in the digital era we live in now, online media is easy to use, and there are enormous obstacles to limiting and controlling the spread of misleading content, for various reasons; one of them is the free flow of information and data across different social media platforms like Twitter, YouTube, Facebook, and Instagram. During this situation, one report tells that around one in every two residents across the United Kingdom (U.K.), United States (US), Spain, Argentina, Canada, South Korea, and Germany claim they have seen misleading information on social networks related to COVID-19 (Coronavirus). Due to the overflowing spread of fake and deceiving news, some commentators readily refer to the most recent flood of falsehood accompanying the Coronavirus pandemic. These days, online media is one of the main news sources for people throughout the planet for its minimal expense, easy availability, and fast spreading [4–9]. Nonetheless, it can now and again include unverified messages and has a huge danger of exposure to "counterfeit news", which may deceive readers [10, 11].

Recently, several works have been proposed by different researchers [12–16] to identify fake news from social media. Al-Zaman [12] investigated five different characteristics of COVID-19 fake news that is circulating on social media. Kolluri and Murthy [13] developed the CoVerifi platform that combines the power of machine intelligence and human input to determine the trustworthiness of posted news. Müller et al. [14] proposed a transformer-based model that experimented with different COVID-19 related datasets to identify fake news.

In this work, we propose a Bi-LSTM based model to identify COVID-19 related fake news from Twitter. Along with the Bi-LSTM network, we have also implemented several conventional machine learning classifiers and compared the result of the proposed Bi-LSTM model with them. The rest of the sections are organized as follows: Sect. 42.2 reviews related work, Sect. 42.3 discusses the proposed methodology, Sect. 42.4 lists the results, and Sect. 42.5 concludes the paper with some future directions.

42.2 Related Work

The COVID-19 epidemic has had a detrimental influence on several aspects of life. As a result, the Coronavirus pandemic and its consequences have become a topic of conversation on social media. However, not all social media posts are accurate.

Many of them propagated false information, causing panic among readers, distorting others' truths, and so exacerbating the pandemic's impact. The identification of false news is one of the current hot issues in artificial intelligence research. Fake news producers frequently combine true stories with false information to deceive the viewer. Several researchers have published papers on identifying fake news on social media [1, 2, 12]. Al-Zaman [12] conducted a thorough investigation of five distinct characteristics of COVID-19 fake news that is spreading on social media. The CoVerifi platform, a web application, was proposed by Kolluri and Murthy [13]; it combines the power of machine intelligence and human input to determine the trustworthiness of posted news. CT-BERT, a transformer-based model pre-trained on a specified target domain, was utilized in [14]; for five COVID-19-related datasets, the authors found F1-scores ranging from 0.802 to 0.833. The problem of identifying whether COVID-19-related social media messages are fake or realistic was investigated by Felber [15], who addressed the issue by integrating standard machine learning methods with language characteristics such as n-grams, readability, emotional tone, and punctuation, achieving a weighted average F1-score of 95.19% with the best linear SVM-based method. Raha et al. [17] used transformer models such as BERT, RoBERTa, Electra, and others in their research; they achieved their best performance with the Electra model, with an F1-score of 0.98. Zubiaga et al. [18] collected a variety of content-based and social factors. For content-based features, they gathered word vectors, part-of-speech tags, word count, use of the question mark, and the time of user account creation; on social parameters, they gathered information such as the tweet count, follow ratio, verified account, age, and so on. They employed several classifiers, including SVM, Random Forest, CRF, and Maximum Entropy, and obtained an F1-score of 0.607 using the CRF model. Singh et al. [1] proposed a hybrid attention-based LSTM model with an attention layer that incorporated various content- and user-based features and achieved cutting-edge performance with an F1-score of 0.88. Kumar et al. [2] proposed a multi-layered dense neural network to identify fake news articles written in the Urdu language and achieved an F1-score of 0.81. Kar et al. [4] presented a BERT-based model integrating additional features derived from COVID-19 tweets. They tested it with English, Hindi, and Bengali datasets, and their suggested models obtained F1-scores of 0.98, 0.79, and 0.81 for the English, Hindi, and Bengali datasets, respectively. Furthermore, they presented a zero-shot learning strategy for the Hindi and Bengali datasets and obtained F1-scores of 0.81 and 0.78, respectively, for COVID-19 false news detection. Apuke and Omar [5] investigated a number of factors impacting the spread of COVID-19 fake news on social media and discovered that, among Nigerians, altruism is the most important element in spreading bogus news.

42.3 Methodology

This section discusses the proposed methodology in detail. The overall flow diagram for the proposed model can be seen in Fig. 42.1. In this work, six different conventional machine learning models, namely (i) Naïve Bayes, (ii) Decision Tree, (iii) Adaptive Boosting, (iv) Gradient Boosting, (v) Random Forest, and (vi) K-Nearest Neighbor (KNN), were implemented. Along with these six conventional machine learning classifiers, a Bi-directional Long Short-Term Memory (Bi-LSTM) network was proposed. To validate the proposed model, we used the dataset published by Patwa et al. [19]. The overall data statistics used in this study can be seen in Table 42.1. In the case of the conventional machine learning classifiers, uni-gram, bi-gram, and tri-gram Term Frequency-Inverse Document Frequency (TF-IDF) feature vectors were used. In the case of the Bi-LSTM network, a 100-dimensional word embedding vector was used for each word to build the sentence matrix. For the deep learning model, we fixed a maximum sequence length of 30 words as input to the proposed Bi-LSTM network. We experimented with different learning rates, batch sizes, and optimizers for the Bi-LSTM model, and in these extensive experiments we found that a learning rate of 0.001 and a batch size of 32 performed best with the Adam optimizer. The proposed Bi-LSTM network was trained for 50 epochs.

Fig. 42.1 Overall flow diagram for the proposed model for COVID-19 fake news detection

Table 42.1 Data statistics used to validate the proposed model

Class | Training | Testing | Validation
Fake | 3060 | 1120 | 1120
Real | 3360 | 1020 | 1020
Total | 6420 | 2140 | 2140


The supplied input vector is processed in two directions by the Bi-LSTM network: one pass from past to future and another from future to past. As a result, the Bi-LSTM is able to retain information from both the past and the future when performing the prediction task. To implement the conventional machine learning classifiers, the Sklearn Python library was used, and all the conventional machine learning models use the default parameters defined in the Sklearn library. For the deep learning model, the Keras library with TensorFlow as a backend is used to build the model.
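A minimal Keras sketch of this set-up is shown below. The 100-dimensional embeddings, sequence length of 30, Adam optimizer with learning rate 0.001, batch size of 32, and 50 epochs follow the text; the vocabulary size, the number of LSTM units, and the toy data are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN, VOCAB, EMB_DIM = 30, 20000, 100            # VOCAB is an assumption

train_texts = ["drinking water cures covid", "who issues new mask guidance"]
y_train = np.array([1, 0])                          # toy labels: 1 fake, 0 real

tokenizer = Tokenizer(num_words=VOCAB)
tokenizer.fit_on_texts(train_texts)
X_train = pad_sequences(tokenizer.texts_to_sequences(train_texts),
                        maxlen=MAX_LEN)             # sentence matrix, length 30

model = models.Sequential([
    layers.Embedding(VOCAB, EMB_DIM, input_length=MAX_LEN),
    layers.Bidirectional(layers.LSTM(64)),          # 64 units is an assumption
    layers.Dense(1, activation="sigmoid"),          # fake (1) vs real (0)
])
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, batch_size=32, epochs=50)
```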

42.4 Results

This section lists the results for the conventional machine learning and deep learning models. The performance of the proposed models is measured with precision, recall, F1-score, the AUC-ROC curve, and the confusion matrix. A weighted average of the precision, recall, and F1-score is also calculated by considering the available data samples of each of the classes in the testing set. The results for the different conventional machine learning and deep learning models are listed in Table 42.2. As can be seen from Table 42.2, in the case of the conventional machine learning models, the Random Forest classifier performed best with a weighted F1-score of 0.92, whereas the proposed Bi-LSTM network achieved a weighted precision, recall, and F1-score of 0.94. The AUC-ROC curve and confusion matrix for the proposed Bi-LSTM network can be seen in Figs. 42.2 and 42.3, respectively. The AUC-ROC curve (Fig. 42.2) shows that the proposed Bi-LSTM model achieves an AUC of 0.97 for both the real and fake classes. According to the confusion matrix plotted in Fig. 42.3, out of every 100 samples from the fake class, the suggested model classifies 93 as fake; similarly, out of every 100 samples from the real class, the suggested model correctly identifies 94 as real. The performance of each of the implemented models in terms of weighted F1-score is plotted in Fig. 42.4.
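The classical baselines and the weighted averages of Table 42.2 can be reproduced along the following lines; a sketch, assuming toy data, with the uni- to tri-gram TF-IDF features of Sect. 42.3 and scikit-learn defaults elsewhere.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

train_texts = ["toy fake tweet", "toy real tweet"]   # illustrative placeholders
test_texts = ["another fake tweet", "another real tweet"]
y_train, y_test = [1, 0], [1, 0]                     # 1 = fake, 0 = real

vec = TfidfVectorizer(ngram_range=(1, 3))            # uni-, bi-, and tri-grams
X_train, X_test = vec.fit_transform(train_texts), vec.transform(test_texts)

clf = RandomForestClassifier().fit(X_train, y_train)
# classification_report prints per-class and weighted-average
# precision, recall, and F1-score, as reported in Table 42.2
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["real", "fake"]))
```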



Table 42.2 Results for the different machine and deep learning models

Classifier | Class | Precision | Recall | F1-score
Naïve Bayes | Fake | 0.91 | 0.89 | 0.90
Naïve Bayes | Real | 0.90 | 0.92 | 0.91
Naïve Bayes | Weighted Avg | 0.91 | 0.91 | 0.91
Decision Tree | Fake | 0.87 | 0.88 | 0.88
Decision Tree | Real | 0.89 | 0.88 | 0.89
Decision Tree | Weighted Avg | 0.88 | 0.88 | 0.88
Adaptive Boosting | Fake | 0.89 | 0.90 | 0.89
Adaptive Boosting | Real | 0.91 | 0.89 | 0.90
Adaptive Boosting | Weighted Avg | 0.90 | 0.90 | 0.90
Gradient Boosting | Fake | 0.89 | 0.89 | 0.89
Gradient Boosting | Real | 0.90 | 0.90 | 0.90
Gradient Boosting | Weighted Avg | 0.89 | 0.89 | 0.89
Random Forest | Fake | 0.90 | 0.93 | 0.91
Random Forest | Real | 0.93 | 0.90 | 0.92
Random Forest | Weighted Avg | 0.92 | 0.92 | 0.92
KNN | Fake | 0.83 | 0.95 | 0.88
KNN | Real | 0.94 | 0.82 | 0.88
KNN | Weighted Avg | 0.89 | 0.88 | 0.88
Bi-LSTM | Fake | 0.93 | 0.93 | 0.93
Bi-LSTM | Real | 0.94 | 0.94 | 0.94
Bi-LSTM | Weighted Avg | 0.94 | 0.94 | 0.94

Fig. 42.2 ROC curve for the proposed Bi-LSTM network for the COVID-19 fake news detection


Fig. 42.3 Confusion matrix for the proposed Bi-LSTM network for COVID-19 fake news detection

Fig. 42.4 Comparison of all the implemented models for the COVID-19 fake news identification

42.5 Conclusion

In a pandemic situation like COVID-19, people are unable to know which news is genuine and which is not, and they panic when they come across false news; this holds not only in such situations but also in other aspects of today's world. In this work, we have proposed a bi-directional long short-term memory (Bi-LSTM) network to identify COVID-19 fake news posted on Twitter. The performance of the proposed Bi-LSTM network was compared to six popular classical machine learning classifiers: Naïve Bayes, KNN, Decision Tree, Gradient Boosting, Random Forest, and AdaBoost. From the comparative investigation of the different machine learning and deep learning models, we found that the Bi-LSTM achieved the highest F1-score of 0.94, outperforming the other implemented models. This means that the use of word embedding vectors with a Bi-LSTM network worked better than the TF-IDF features in the identification of COVID-19 fake news from social media. In the future, other robust ensemble-based deep learning models can be built to achieve better performance. In this work, we trained our system with English-language data only, but during a disaster like COVID-19, people also post messages in their regional languages. Therefore, in the future, a multi-lingual system can also be built to identify COVID-19 fake news.


References

1. Singh, J.P., Kumar, A., Rana, N.P., Dwivedi, Y.K.: Attention-based LSTM network for rumor veracity estimation of tweets. Inf. Syst. Front. 1–16 (2020)
2. Kumar, A., Saumya, S., Singh, J.P.: NITP-AI-NLP@UrduFake-FIRE2020: multi-layer dense neural network for fake news detection in Urdu news articles. In: FIRE (Working Notes), pp. 458–463 (2020)
3. Roy, P.K., Chahar, S.: Fake profile detection on social networking websites: a comprehensive review. IEEE Trans. Artif. Intell. (2021)
4. Kar, D., Bhardwaj, M., Samanta, S., Azad, A.P.: No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection. arXiv preprint arXiv:2010.06906 (2020)
5. Apuke, O.D., Omar, B.: Fake news and COVID-19: modelling the predictors of fake news sharing among social media users. Telemat. Inform. 56, 101475 (2021)
6. Kumar, A., Singh, J.P., Dwivedi, Y.K., Rana, N.P.: A deep multi-modal neural network for informative Twitter content classification during emergencies. Ann. Oper. Res. 1–32 (2020)
7. Pandey, R., Kumar, A., Singh, J.P., Tripathi, S.: Hybrid attention-based long short-term memory network for sarcasm identification. Appl. Soft Comput. 106, 107348 (2021)
8. Kumar, A., Rathore, N.C.: Relationship strength-based access control in online social networks. In: Proceedings of First International Conference on Information and Communication Technology for Intelligent Systems, vol. 2, pp. 197–206. Springer, Cham (2016)
9. Kumar, A., Singh, J.P.: Location reference identification from tweets during emergencies: a deep learning approach. Int. J. Disaster Risk Reduct. 33, 365–375 (2019)
10. Saumya, S., Kumar, A., Singh, J.P.: Offensive language identification in Dravidian code-mixed social media text. In: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pp. 36–45 (2021)
11. Kumar, A., Saumya, S., Singh, J.P.: NITP-AI-NLP@HASOC-FIRE2020: fine-tuned BERT for the hate speech and offensive content identification from social media. In: FIRE (Working Notes), pp. 266–273 (2020)
12. Al-Zaman, M.: COVID-19-related social media fake news in India. J. Media 2(1), 100–114 (2021)
13. Kolluri, N.L., Murthy, D.: CoVerifi: a COVID-19 news verification system. Online Soc. Netw. Media 22, 100123 (2021)
14. Müller, M., Salathé, M., Kummervold, P.E.: COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter. arXiv preprint arXiv:2005.07503 (2020)
15. Felber, T.: Constraint 2021: Machine Learning Models for COVID-19 Fake News Detection Shared Task. arXiv preprint arXiv:2101.03717 (2021)
16. Paka, W.S., Bansal, R., Kaushik, A., Sengupta, S., Chakraborty, T.: Cross-SEAN: a cross-stitch semi-supervised neural attention model for COVID-19 fake news detection. Appl. Soft Comput. 107, 107393 (2021)
17. Raha, T., Indurthi, V., Upadhyaya, A., Kataria, J., Bommakanti, P., Keswani, V., Varma, V.: Identifying COVID-19 Fake News in Social Media. arXiv preprint arXiv:2101.11954 (2021)
18. Zubiaga, A., Liakata, M., Procter, R.: Learning Reporting Dynamics During Breaking News for Rumour Detection in Social Media. arXiv preprint arXiv:1610.07363 (2016)
19. Patwa, P., Sharma, S., Pykl, S., Guptha, V., Kumari, G., Akhtar, M.S., Ekbal, A., Das, A., Chakraborty, T.: Fighting an Infodemic: COVID-19 Fake News Dataset. arXiv preprint arXiv:2011.03327 (2020)

Chapter 43

Effect of Class Imbalanceness in Credit Card Fraud

Adyasha Das and Sharmila Subudhi

Abstract With the expansion of the latest technologies, credit card fraud has increased dramatically, as credit cards are an ideal target for fraudsters. As a result, financial institutions must be proactive in identifying and monitoring unethical behavior. Furthermore, the class imbalance present in credit card usage records poses a great challenge in identifying the genuine patterns. In this article, the authors have performed comprehensive experimental research to address such issues by proposing a novel hybrid fraud detection model that uses machine learning techniques for the identification of forged activities in transactional records. Extensive experiments have been done using a real-world dataset to show the efficiency of the proposed system. The authors have used an oversampling approach to remove the data imbalance present in the dataset. In addition, the efficiency of the proposed model on a balanced dataset over the imbalanced one has been presented.

43.1 Introduction

The rapid growth of the Internet has fueled the generation of huge amounts of data consisting of various social media views, remarks, and consumers' buying habits as well. The analysis of such data reveals the underlying data and behavioral patterns of a user. By exploiting such patterns, an intruder can wreak havoc on many financial institutions and banks. As per a report published by the Federal Trade Commission in 2020, $1.9 billion was lost to financial fraud in the United States of America in 2019 alone [1]. This report also pointed out that, out of the $1.9 billion, credit card fraud amounted to $135 million with 53,763 reported fraudulent activities. Another report has presented that there was a rise of 84% in credit

card fraud attempts during the COVID-19 pandemic [2]. This indicates the severity of such fraud. As a result, relying solely on prevention-based security solutions to safeguard sensitive data against novel threats is frequently insufficient.

From these statistics, the impact of credit card fraud in the finance world is clearly understood. The abuse of a credit card account without appropriate authorization is referred to as credit card fraud [3]. In such a scenario, the cardholder is unaware of the improper handling of the card by a third party. The fraudster may either try to obtain the identity of the original credit card holder to gain control of the card, or try to make every fraudulent transaction look legitimate, making fraud harder to identify [3]. Different types of credit card fraud need to be analyzed to design fraud detection models for effective identification of intruders [4]. Furthermore, the age-old challenges plaguing credit card datasets are discussed below.

• Concept Drift: Any change in the transactional pattern that does not follow the routine behavioral pattern is flagged by the Fraud Detection System (FDS) as an alert [3]. When a genuine user deviates from their own pattern, the FDS also marks the activity as intrusive, resulting in false alarms.
• Noisy Data: Mostly, the behavioral dataset contains some missing or incorrect values, null elements, duplicate records, inconsistent values, and much more. Such an incomplete dataset, when fed to an FDS, tends to lessen its potential for successfully detecting malicious actions [3].
• Skewed Distribution: Fraudsters generally tend to invoke few malicious transactions on any victim's card so that they can go unnoticed. This leads to far fewer fraudulent samples than normal ones. Upon giving this skewed, imbalanced dataset to an FDS, its capability is overwhelmed by the genuine records [3].
• Variation in Fraud Patterns: Intruders keep changing their attacking style to avoid detection by the FDSs. Thus, it becomes vital to update obsolete FDS profiles regularly to catch newer intrusions [3].

The following sections of this article are organized as follows. Section 43.2 deals with an extensive literature survey of the existing research work as well as the technologies used for credit card fraud detection. Section 43.3 elaborately presents a novel anomaly-based Credit Card Fraud Detection System (CFDS) that investigates the credit card transactions of users for any sign of forged actions; the model is based on the hybridization of various supervised classifiers with skewed-class removal strategies. A detailed assessment of the proposed CFDS, employing a real-world large-scale credit card dataset, is presented in Sect. 43.4, together with a comparative analysis of the effectiveness of balanced data over imbalanced data. Section 43.5 briefly summarizes the conclusions of this research along with its achievements.


43.2 Preliminary Work

In paper [4], a meta-learning technique for distinguishing real credit card transactions from fraudulent ones was presented. Due to the sensitive nature of credit card information, the authors anonymized the records and carried out extensive experiments on them. They modeled their system by hybridizing several meta-learning techniques and oversampling methods to handle the skewed class distribution issue. Upon applying the anonymized credit card records of U.S. citizens, their model exhibited a sensitivity score of 80%. Delamaire et al. described varieties of fraudulent activities happening with credit cards, such as bankruptcy, counterfeit, theft, application, and behavioral fraud [5]. They also suggested different classification and clustering techniques to counter those events. Furthermore, five different fraud areas were explored in their article, along with their performances and the types of datasets available in the respective fields.

Another unique approach was proposed by Kundu et al. to segregate illegitimate events in credit card transactions by employing the sequence alignment concept [6]. Initially, the authors created two sequences, one for genuine users and another for fraudsters, by emulating their behavioral spending patterns. They then compared these two sequences with a new incoming behavior sequence; based on the matching score, their model was able to identify the new sequence as normal or illegitimate. Due to the unavailability of a real-world dataset, they simulated an artificial dataset through various Markov chain and Gaussian simulators. Their system was capable of discriminating the fraudulent characteristics successfully 80% of the time. Srivastava et al. suggested the use of a Hidden Markov Model (HMM) to identify malicious transactions in credit card records [7]. They mimicked the spending behavior of consumers and segregated it into three groups based on the type of services availed, such as electronic items, groceries, and other miscellaneous items. They also created a simulated dataset for testing their HMM model, which yielded a sensitivity of 75%. Pozzolo et al. built an FDS that overcomes the issue of concept drift by employing various supervised classifiers [8]. They adopted a racing method to choose the most adaptive and effective strategy to update the behavioral profiles regularly and avoid concept drift, and they used various European datasets to measure the performance of their model. Carta et al. proposed a novel Prudential Multiple Consensus model to detect credit card fraud, which produced a sensitivity score of 85% [9]. Wang et al. developed an ensemble of C4.5, RIPPER, and Naïve Bayes classifiers to successfully identify illegitimate transactions with an accuracy of 82% [10].

From the above discussion, it is evident that the impact of credit card fraud is still at large, although various models have been developed. Moreover, it is also clear that the use of machine learning classifiers is quite effective in detecting fraud, and that their efficiency is strongly affected by skewed distributions. Therefore, we have emphasized removing the class imbalance of the credit card transactional records before applying the machine learning classifiers.


43.3 Proposed Credit Card Fraud Detection Model

This research work addresses the class imbalance problem present in credit card transaction records by using an oversampling approach, namely SMOTE (Synthetic Minority Oversampling Technique). The working of the Credit Card Fraud Detection System (CFDS) is divided into two phases: a training phase and a fraud detection phase. The oversampling module is used to enhance the number of minority-class samples in the training set during the training phase; the symmetrical modified train set is then made by mixing them with the majority class samples. Upon the arrival of an incoming transaction record, the fraud detection phase begins by following the same methodology. The samples identified as suspicious are verified by three different supervised classifiers, Neural Network (NN), Random Forest (RF), and Logistic Regression (LR), individually for classification and comparison purposes. These techniques are well established and are not detailed here. The fraud detection methodology of the proposed system is elaborated in the following subsection.

43.3.1 Proposed SMOTE_ML_CFDS Model

Initially, the proposed system takes the unbalanced dataset and reduces the class imbalance effect by using the SMOTE data generation technique [11]. The resulting modified samples are then processed by various supervised classifiers for classification purposes. The fraud detection mechanism of the intended CFDS is described in the following steps, while the workflow diagram of the proposed SMOTE_ML_CFDS is given in Fig. 43.1; a minimal code sketch of the pipeline follows the list.

• After removing the class imbalance of the raw dataset, 30% of the samples are chosen as the test set. The remaining 70% of the records are segregated into training and validation datasets based on a tenfold cross-validation approach [12].
• The train set is subjected to the three supervised classifiers, NN [4], LR [13], and RF [14], individually to build their respective trained classifier models. The performance of each model depends on its corresponding parameters.
• Each classifier is trained using tenfold cross-validation, resulting in a pool of trained classifier models. Afterwards, the validation set is deployed to decide the best performing model.
• Finally, the fraud detection step begins, in which the test data is supplied as input to each validated model to categorize each credit card transaction as real or fraudulent.
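The sketch below illustrates these steps with the imbalanced-learn implementation of SMOTE [11]; X and y stand for the prepared feature matrix and labels of Sect. 43.4.2, the Random Forest stands in for any of the three classifiers, and cross-validation and classifier settings are left at library defaults.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# X, y: feature matrix and labels of the credit card dataset (see Sect. 43.4.2)
# 30% of the samples are held out as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# SMOTE generates synthetic minority samples on the training portion only
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier()                 # likewise NN (MLP) or LR
print("10-fold CV AUROC:",                     # model selection step
      cross_val_score(clf, X_bal, y_bal, cv=10, scoring="roc_auc").mean())

clf.fit(X_bal, y_bal)                          # final fraud detection step
print("Test AUROC:",
      roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```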

43.4 Results and Discussions

This section examines the experimental procedure and the assessment of the proposed SMOTE_ML_CFDS model.

Fig. 43.1 Flow of events in proposed SMOTE_ML_CFDS: a removal of class imbalance by SMOTE; b fraud detection using LR/RF/NN


43.4.1 Experimental Setup

The implementation of the proposed method has been written in the Python language and run in the Google Colab environment. Extensive experiments were conducted on a large-scale, real-world anonymized credit card dataset for performance analysis.

43.4.2 Description of the Dataset

The real-world anonymized credit card records contain the transactions made by European cardholders in September 2013 [8]. Unfortunately, due to confidentiality issues, the authors have not provided the original features or more background information about the data. A smaller version of this dataset is available in the Kaggle repository (https://www.kaggle.com/mlg-ulb/creditcardfraud) and contains 284,807 samples. The dataset comprises 31 attributes, of which 28 features are principal components obtained with Principal Component Analysis (PCA) and the remaining three features are as follows:

• Time depicts the seconds elapsed between each transaction and the first transaction.
• Amount signifies the transaction amount.
• Class refers to the label, 1 (fraud) or 0 (genuine).

Out of the total 284,807 records, only 492 samples carry the malicious class label, which is 0.172% of all transactions. Figure 43.2 depicts the highly imbalanced nature of the dataset. The efficacy of the current proposition has been proved by experimenting with this credit card dataset. Initially, the credit card transactions are scaled to [0, 1] by using a standard scaling operation [15].

Fig. 43.2 Original imbalanced credit card fraud dataset


Fig. 43.3 Balanced training set after using SMOTE

Once the data pre-processing is over, the dataset is divided into training and test sets and the steps of our method are executed. Figure 43.3 illustrates the efficacy of applying SMOTE on the dataset; it is quite clear that the number of minority-class instances has increased sufficiently.
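A sketch of this preparation step, assuming the Kaggle CSV is named creditcard.csv; the text describes scaling to [0, 1], so MinMaxScaler is used here, which is one reading of the "standard scaling operation".

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("creditcard.csv")        # Kaggle mlg-ulb/creditcardfraud copy
X = MinMaxScaler().fit_transform(df.drop(columns=["Class"]))  # scale to [0, 1]
y = df["Class"].values                    # 1 = fraud, 0 = genuine
print(X.shape, y.sum(), "fraud samples")  # 284,807 rows, 492 frauds expected
```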

43.4.3 Performance Metrics

The basic performance measures, such as sensitivity, precision, and accuracy, are used to assess the proposed system's efficacy. Another performance graph, known as the Receiver Operating Characteristic (ROC) curve, is used to demonstrate how fairly a binary classifier model performs [12]. This plot shows the trade-off between sensitivity and the False Positive Rate (FPR). The closeness of the curve to the top left corner determines the quality of the model in the ROC plot. The Area Under the ROC Curve (AUROC) is a measure of the ability of a classifier to distinguish between the classes and is used as a summary of the ROC curve.

Sensitivity = TP / (TP + FN)    (43.1)

Precision = TP / (TP + FP)    (43.2)

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (43.3)

where TP is the number of forged samples marked as fraudulent by the CFDS and FP signifies the genuine instances labeled as malicious. Similarly, FN refers to the fraudulent instances marked as genuine, and TN is the normal data points identified as genuine by the system.

43.4.4 Result Analysis

Table 43.1 presents the performance of the supervised classifiers on both the unbalanced and the balanced data set (after applying SMOTE). The results shown in the table clearly depict that the LR classifier performed poorly, while the RF classifier produced better efficiency on the imbalanced dataset. Likewise, in the case of the balanced dataset, all classifiers yield effective classification results. However, the AUROC of the RF learner reduced slightly on the balanced dataset compared with the imbalanced one. Figures 43.4 and 43.5 show the ROC curves of the classifiers on the unbalanced and balanced datasets, respectively. From these two figures, Logistic Regression yields the best performance among all the classifiers on the balanced dataset.

Table 43.1 Performance of supervised classifiers on credit card fraud dataset

Metric (in %) | Unbalanced: NN | Unbalanced: RF | Unbalanced: LR | Balanced: NN | Balanced: RF | Balanced: LR
Accuracy | 99.93 | 99.95 | 99.92 | 99.65 | 99.95 | 97.38
Sensitivity | 70.06 | 77.55 | 62.85 | 86.39 | 76.19 | 91.15
Precision | 31.28 | 95.00 | 5.69 | 88.79 | 94.91 | 88.46
AUROC | 85.02 | 88.77 | 81.28 | 93.03 | 88.09 | 94.27

Fig. 43.4 ROC curve of classifiers on unbalanced dataset


Fig. 43.5 ROC curve of classifiers on balanced dataset
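ROC curves like those in Figs. 43.4 and 43.5 can be drawn with scikit-learn's display helper; a sketch, assuming clf_nn, clf_rf, and clf_lr are the three fitted classifiers and X_test, y_test are the held-out split.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# Plot the ROC curves of the three fitted classifiers on one axis
fig, ax = plt.subplots()
for name, clf in [("NN", clf_nn), ("RF", clf_rf), ("LR", clf_lr)]:
    RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)
ax.plot([0, 1], [0, 1], "k--", label="chance")   # diagonal reference line
ax.legend()
plt.show()
```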

43.5 Conclusions

Credit card fraud is a prevalent and costly fraud, and it has caused widespread alarm throughout the world. Several issues make it difficult to discover solutions for credit card fraud detection. In this research, an oversampling-based Credit Card Fraud Detection System (CFDS), namely SMOTE_ML_CFDS, has been developed to generate more artificial minority-class points to minimize the class skewness. The training, validation, and testing samples are then extracted from the balanced dataset, and three different classifiers, viz. Neural Network (NN), Random Forest (RF), and Logistic Regression (LR), have been applied individually to the train set for building trained supervised models. The validation set is used to decide the corresponding best-trained model based on the maximum performance and least error rate. The test set is then used with the chosen models and the final classification is done. Experiments were carried out to evaluate the effectiveness of the proposed system by using a real-world anonymized dataset, and tests were done to estimate the efficiency of the classifiers on both balanced and imbalanced datasets with respect to different performance parameters. From the observations, we can infer that, on the balanced dataset, LR produces the best result with the highest AUROC of 94.27% and sensitivity of 91.15%, while RF generates the lowest sensitivity of 76.19%.

References

1. Sandberg, E.: 15 disturbing credit card fraud statistics. https://www.cardrates.com/advice/credit-card-fraud-statistics/. Accessed 29 June 2021
2. Andriotis, A., McCaffrey, O.: Borrower, beware: credit-card fraud attempts rise during the coronavirus crisis. https://www.wsj.com/articles/borrower-beware-credit-card-fraud-attempts-rise-during-the-coronavirus-crisis-11590571800. Accessed 29 June 2021
3. Abdallah, A., Maarof, M.A., Zainal, A.: Fraud detection system: a survey. J. Netw. Comput. Appl. 68, 90–113 (2016)
4. Stolfo, S., Fan, D.W., Lee, W., Prodromidis, A., Chan, P.: Credit card fraud detection using meta-learning: issues and initial results. In: AAAI-97 Workshop on Fraud Detection and Risk Management, pp. 83–90 (1997)
5. Delamaire, L., Abdou, H., Pointon, J.: Credit card fraud and detection techniques: a review. Banks Bank Syst. 4(2), 57–68 (2009)
6. Kundu, A., Panigrahi, S., Sural, S., Majumdar, A.K.: BLAST-SSAHA hybridization for credit card fraud detection. IEEE Trans. Depend. Secure Comput. 6(4), 309–315 (2009)
7. Srivastava, A., Kundu, A., Sural, S., Majumdar, A.: Credit card fraud detection using hidden Markov model. IEEE Trans. Depend. Secure Comput. 5(1), 37–48 (2008)
8. Dal Pozzolo, A., Boracchi, G., Caelen, O., Alippi, C., Bontempi, G.: Credit card fraud detection and concept-drift adaptation with delayed supervised information. In: 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2015)
9. Carta, S., Fenu, G., Recupero, D.R., Saia, R.: Fraud detection for e-commerce transactions by employing a prudential multiple consensus model. J. Inf. Secur. Appl. 46, 13–22 (2019)
10. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–235 (2003)
11. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
12. Refaeilzadeh, P., Tang, L., Liu, H.: Cross-validation. In: Encyclopedia of Database Systems, pp. 532–538. Springer, Boston, MA (2009)
13. Tolles, J., Meurer, W.J.: Logistic regression: relating patient characteristics to outcomes. JAMA 316(5), 533–534 (2016)
14. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
15. Tax, D., Duin, R.: Feature scaling in support vector data descriptions. Learning from Imbalanced Datasets, pp. 25–30 (2000)

Chapter 44

Effect of Feature Selection on Software Fault Prediction

Vinod Kumar Kulamala, Priyanka Das Sharma, Preetipunya Rout, Vanita, Madhuri Rao, and Durga Prasad Mohapatra

Abstract Software is composed of several modules. Software reliability, quality, and maintenance cost are drastically affected by the existence of software glitches. Achieving bug-free software is hard, as most of the time there are hidden defects. Several techniques have been introduced to overcome the software defect problem. Predicting the rate of faultiness in software modules before commercialization is still a challenge to software engineers. The main objective of the software fault prediction module is not only to detect the faultiness of the software but also to identify the components which are likely to be defective. In this study, we aim to apply and compare different machine learning algorithms, such as Support Vector Machine and Random Forest, to develop software fault prediction models. The developed models are applied to fault datasets collected from different open-source repositories such as the PROMISE repository. We analyze the results of the developed models using different performance measures such as accuracy, AUC, and F1-score.

44.1 Introduction

The existence of faults in software has become unavoidable. The existing models of fault prediction are not optimized for commercial use, and the loss incurred at the commercial level due to the presence of bugs in software is phenomenal. Several prediction approaches are being employed in the software engineering discipline, such as correction cost prediction, test prediction, quality prediction, security prediction, reusability prediction, fault prediction, and effort prediction.

Fig. 44.1 Steps in software fault prediction: data set collection → feature selection → training and testing using ML techniques → measure performance

Software metrics and fault data are used to predict the faulty modules [1, 2]. Therefore, the software fault prediction model plays a vital role in understanding, evaluating, and improving a software system's quality. The simple steps followed in software fault prediction using Machine Learning (ML) are depicted in Fig. 44.1.

The task of software fault prediction is concerned with predicting which software components are likely to be defective, helping to increase testing cost-effectiveness. Though there is much research related to software fault prediction using various classifiers, there is no single model that predicts the faulty components/modules for all types of software. Hence, we are motivated to develop fault prediction models to identify the faulty modules of a software system. The objectives that need to be fulfilled to solve the problem at hand are listed below:

• To study the various techniques and tools available for finding which software modules are frequently faulty.
• To find out the error rate using the predicted and actual values from the fault prediction techniques.
• To develop a more efficient machine learning approach that could boost the process of finding faults using different datasets.

The rest of the paper is organized as follows: Sect. 44.2 contains a literature survey of research results in the field of software fault prediction. Section 44.3 discusses the methodology used to develop the software fault prediction models and details the software fault datasets. Section 44.4 presents the results of the experiments performed, with some discussion of the results. In Sect. 44.5, we conclude the paper and describe the future scope of this work.

44.2 Literature Survey

There are many studies related to software fault prediction using various techniques. Elish and Elish [3] used NASA datasets to perform experiments using SVM for software fault prediction. They compared their results with KNN, Random Forest, Naïve Bayes, and Linear Regression, and from the results they concluded that the performance of SVM is better and that there is no risk of defective modules going undetected.


Rathore and Kumar [4] used PROMISE repository datasets in developing different ensemble techniques for software fault prediction and found that these techniques are effective for building software fault prediction models. Son et al. [1] performed a systematic mapping of Defect Prediction in Software (DeP). They applied this technique to public-domain datasets like NASA, Eclipse, and Apache, using learning methods such as Decision Tree, Regression, Discriminant Analysis, Naïve Bayes, and Neural Networks, and various performance evaluation methods such as accuracy, sensitivity, specificity, and area under the ROC curve. They identified the metrics that were found to be significant predictors of software defects; the most significant predictor of defects was found to be LOC, with a total of 26 studies reporting LOC as a significant predictor [2]. The most frequently used technique for feature selection is correlation-based feature selection (CFS), and the most used extraction technique is Principal Component Analysis (PCA). The findings of this research are useful not only to the software engineering domain but also to empirical studies. Many machine learning methods have been examined for building software fault prediction models [5–7]. When taking into account the overall approach and effectiveness of DeP studies, there are a lot of shortcomings [8]. Overall, the literature available on software defect prediction contains high-quality work, but it lacks in the overall methodology used to construct defect prediction models [9].

44.3 Methodology

In this section, we explain the method followed to develop software fault prediction models using ML techniques. Figure 44.2 presents the proposed architecture of the software fault prediction. The first step is the selection of the data repository; we have taken the datasets from the PROMISE repository. The datasets are extracted for fault prediction, and the extracted datasets are scaled so that the values lie between 0 and 1. The extracted and scaled dataset and the software metrics are combined to train the software fault prediction model.

Fig. 44.2 Architecture of proposed approach


Table 44.1 Datasets used

Data set | Total number of modules | Total number of faulty modules | Total number of non-faulty modules | % of faulty modules | % of non-faulty modules
PC1 | 1109 | 77 | 1032 | 6.94 | 93.05
PC2 | 5589 | 23 | 5566 | 0.41 | 99.58
PC3 | 1563 | 160 | 1403 | 10.23 | 89.76
PC4 | 1458 | 178 | 1280 | 12.2 | 87.79
MC1 | 9496 | 68 | 9398 | 0.71 | 98.96

Feature selection is used to train the model faster and to improve accuracy. Support Vector Machine (SVM) and Random Forest (RF) classifiers are used to build the model. The predicted values of the faults and the actual values of the faults are compared to calculate the performance of the model.
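A sketch of this model-building step under stated assumptions follows: X_raw and y stand for the software-metric matrix and fault labels of one PROMISE dataset, the scaler realizes the 0-to-1 scaling mentioned above, and both classifiers keep the scikit-learn defaults noted in Sect. 44.3.2.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# X_raw, y: software-metric matrix and fault labels (assumed placeholders)
X = MinMaxScaler().fit_transform(X_raw)        # scale metric values to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

for name, clf in [("SVM (RBF)", SVC(kernel="rbf")),
                  ("Random Forest", RandomForestClassifier())]:
    clf.fit(X_train, y_train)
    print(name, "test accuracy:", clf.score(X_test, y_test))
```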

44.3.1 Datasets The software fault prediction models use datasets from the PROMISE data repository [10], which is publicly available for predictive software models (PSMs). We have taken five different datasets for building and evaluating the prediction models, namely PC1, PC2, PC3, PC4, and MC1. Table 44.1 shows the details of the datasets considered.

44.3.2 Classification Techniques Random Forest constructs several decision trees on random subsets and merges them to obtain a stable prediction and to prevent overfitting. It is a fast, simple, and flexible tool and is considered a highly accurate and robust method. SVM constructs a decision plane, known as a hyperplane, in a multi-dimensional space that separates the data into different classes. The kernel functions in SVM are used to map the original dataset into a higher-dimensional space; we have used the Radial Basis Function (RBF) kernel in our SVM model. For feature selection, Pearson correlation is used. We used the Python framework with the Spyder interface to implement the classification models (SVM and RF), on a laptop computer with an Intel Core i5 processor, 8 GB RAM, and a 250 GB HDD. The RBF kernel of the SVM classifier is described by Eq. (44.1):

K(x, x′) = e^(−γ‖x − x′‖²)    (44.1)

where γ is a scalar and has to be > 0.
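A minimal scikit-learn sketch of the two classifiers with their default parameter settings, as the paper describes; in current scikit-learn versions the relevant defaults are gamma="scale" for the RBF kernel, and n_estimators=100 with max_features="sqrt" for Random Forest.

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

svm_clf = SVC(kernel="rbf")          # RBF kernel of Eq. (44.1); default gamma is positive
rf_clf = RandomForestClassifier()    # n_estimators trees; max_features sampled at each split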


In the Random Forest classifier, the important parameters are the number of random features to sample at each split point (max_features) and the number of trees (n_estimators). To implement the SVM with the RBF kernel and the Random Forest classifier, we used the default parameter settings of Python's Scikit-learn library.

Pearson correlation is a filtering method. Correlation is the statistical value that defines how closely one variable is linearly related to another. A high correlation value denotes that two features are strongly linearly dependent and affect the dependent variable almost equally; hence one of the two variables can be dropped. In another approach, the features having a high correlation value (determined by a threshold) with the target variable can be taken into consideration. The correlation coefficient is given by Eq. (44.2):

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √(Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)²)    (44.2)

where r is the correlation coefficient, xᵢ are the values of the x-variable in a sample, x̄ is the mean of the values of the x-variable, yᵢ are the values of the y-variable in a sample, and ȳ is the mean of the values of the y-variable.
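A sketch of this correlation filter, dropping one feature from every pair whose mutual Pearson correlation exceeds the 0.5 threshold used later in Sect. 44.4 (X_scaled is the DataFrame from the earlier sketch):

import numpy as np

corr = X_scaled.corr(method="pearson").abs()                        # pairwise |r|, Eq. (44.2)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # upper triangle only
to_drop = [c for c in upper.columns if (upper[c] > 0.5).any()]      # one feature per correlated pair
X_selected = X_scaled.drop(columns=to_drop)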

44.3.3 Evaluation Criteria The confusion matrix is a tabular approach to analyzing how a prediction model performs on a binary or multi-class classification problem, and it helps to determine mistake patterns. It visualizes the accuracy of a classifier by comparing the true and predicted classes; off-diagonal cells are incorrect predictions. Generally, it contains the true positive, true negative, false positive, and false negative counts. To evaluate the performance of the machine learning techniques, different performance measures based on the confusion matrix have been used, namely accuracy, AUC, and F1-score. A sample confusion matrix is shown in Table 44.2, where TN is True Negative, FP is False Positive, FN is False Negative, and TP is True Positive. The confusion matrix is used to calculate performance measures such as accuracy, precision, recall, and F1-score. Accuracy is defined as the ratio of the number of modules correctly predicted to the total number of modules; it is also known as the correct classification rate.

Table 44.2 Confusion matrix

Observed class | Predicted: No | Predicted: Yes
No             | TN            | FP
Yes            | FN            | TP


Accuracy is calculated using Eq. (44.3):

Accuracy = (TP + TN)/(TP + TN + FP + FN)    (44.3)

F1-score is the weighted harmonic mean of precision and recall. Values range from 0 (bad) to 1 (good); a classifier gets a high F1-score only if both recall and precision are high. It is calculated using Eq. (44.4):

F1-Score = (2 × Precision × Recall)/(Precision + Recall)    (44.4)

where Precision = TP/(TP + FP) and Recall = TP/(TP + FN). The Receiver Operating Characteristic (ROC) curve is a probability curve, and the Area Under the Curve (AUC) represents the degree of separability; an AUC of 1 indicates an excellent model. The higher the AUC, the better the model is at distinguishing between the positive and negative classes.
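A sketch of this evaluation on a held-out split, computing the confusion matrix and the three measures with scikit-learn; the variables come from the earlier sketches, and the 70/30 split is an illustrative assumption (the paper also mentions k-fold cross-validation).

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix

X_tr, X_te, y_tr, y_te = train_test_split(X_selected, y, test_size=0.3, random_state=0)
rf_clf.fit(X_tr, y_tr)
y_pred = rf_clf.predict(X_te)

print(confusion_matrix(y_te, y_pred))              # [[TN, FP], [FN, TP]] as in Table 44.2
print("Accuracy:", accuracy_score(y_te, y_pred))   # Eq. (44.3)
print("F1-score:", f1_score(y_te, y_pred))         # Eq. (44.4)
print("AUC:", roc_auc_score(y_te, rf_clf.predict_proba(X_te)[:, 1]))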

44.4 Results and Discussions We developed fault prediction models using the SVM and Random Forest classifiers. The models were first developed without feature selection and then with feature selection, and were applied to five different datasets. In this section, we present and discuss the results obtained by the proposed fault prediction models. First, we present the results of the Pearson correlation used to select the most suitable metrics. Then we present the performance of the software fault prediction models developed using SVM and RF, analyze the results with and without feature selection, and finally compare the SVM and RF classifiers with the help of three performance measures: accuracy, F1-score, and AUC. The correlation between the different features is represented by the correlation heat map shown in Fig. 44.3. We compared the correlations between features and removed features with a correlation higher than 0.5. In this fashion, we selected features such as cyclomatic density, design density, essential density, parameter count, Halstead level, normalized cyclomatic complexity, and percent comments. Tables 44.3 and 44.4 present the prediction performance measures of the Random Forest (RF) classifier, and Tables 44.5 and 44.6 those of the Support Vector Machine (SVM) classifier. From Tables 44.3 and 44.4 we can observe, on the accuracy measure, that the accuracy of the PC1 dataset without feature selection is slightly better than with feature selection, while the accuracy of the PC4 and MC1 datasets without feature selection is slightly lower than with feature selection; similar behavior holds for the other measures (F1-score and AUC). From Tables 44.5 and 44.6 we can likewise observe, on the accuracy measure, that for the


Fig. 44.3 Correlation heat map

Table 44.3 Performance of RF without feature selection

Dataset | Accuracy (%) | F1-score | AUC
PC1     | 99.54        | 0.97     | 0.97
PC2     | 99.59        | 0.31     | 0.59
PC3     | 100          | 1.0      | 1.0
PC4     | 99.31        | 0.97     | 0.97
MC1     | 99.73        | 0.84     | 0.89

Table 44.4 Performance of RF with feature selection

Dataset | Accuracy (%) | F1-score | AUC
PC1     | 99.39        | 0.95     | 0.95
PC2     | 99.58        | 0.46     | 0.65
PC3     | 100          | 1.0      | 1.0
PC4     | 99.54        | 0.98     | 0.97
MC1     | 99.96        | 0.98     | 0.98

Table 44.5 Performance of SVM without feature selection

Dataset | Accuracy (%) | F1-score | AUC
PC1     | 93.09        | 0.9      | 0.5
PC2     | 99.4         | 0.99     | 0.5
PC3     | 88.91        | 0.84     | 0.54
PC4     | 84.47        | 0.77     | 0.52
MC1     | 99.4         | 0.99     | 0.61

Table 44.6 Performance of SVM with feature selection

Dataset | Accuracy (%) | F1-score | AUC
PC1     | 93.03        | 0.9      | 0.55
PC2     | 99.04        | 0.99     | 0.7
PC3     | 90.61        | 0.86     | 0.52
PC4     | 88.58        | 0.83     | 0.5
MC1     | 99.33        | 0.99     | 0.5

PC1 dataset, the accuracy without feature selection is slightly better than with feature selection, while the accuracy of the PC4 and MC1 datasets without feature selection is slightly lower than with feature selection; similar behavior is observed for the other measures (F1-score and AUC). Hence, from the results of Tables 44.3, 44.4, 44.5, and 44.6, we can conclude that the fault prediction models developed with feature selection perform better than, or similarly to, the models developed without feature selection. Feature selection is therefore useful, because good performance can be obtained even with a smaller number of features. Next, we compare Tables 44.4 and 44.6 to determine which of SVM and RF performs better. From these tables we observe that the RF classifier outperforms SVM in accuracy for all datasets, and similar behavior holds for the other performance measures (F1-score and AUC). Hence, we conclude that the RF classifier performs better than the SVM classifier in classifying the modules of a software product as faulty or non-faulty.

44.5 Conclusion and Future Work Our model improves software quality and testing efficiency by identifying faults early, so that software practitioners can detect bugs at an early stage and solve the problem without wasting much time. The datasets contain the software modules and their features, and state which of the software modules were faulty and which were not. Software fault prediction is a classification machine learning problem. The models are evaluated based on performance metrics such as accuracy, F1-score, and


AUC. To enhance efficiency, feature selection is applied to every dataset. To obtain more reliable results, we also applied k-fold cross-validation while developing the fault prediction models. In this paper, two classification algorithms have been implemented; in the future, new and different machine learning models can be tried in this setting. We will try to use more diverse datasets and other techniques and methods to consolidate and generalize our findings.

References
1. Son, L.H., Pritam, N., Khari, M., Kumar, R., Phuong, P.T.M., Thong, P.H.: Empirical study of software defect prediction: a systematic mapping. Symmetry 11, 21 (2019)
2. Hong, E.: Software fault-proneness prediction using module severity metrics. Int. J. Appl. Eng. Res. 12(9) (2017). ISSN: 0973-4562
3. Elish, K.O., Elish, M.O.: Predicting defect-prone software modules using support vector machines. J. Syst. Softw. 81(5), 649–660 (2008)
4. Rathore, S.S., Kumar, S.: An empirical study of ensemble techniques for software fault prediction. Appl. Intell. 51(6), 3615–3644 (2021)
5. Kalaivani, N., Beena, R.: Overview of software defect prediction using machine learning algorithms. Int. J. Pure Appl. Math. 118(20), 3863–3873 (2018)
6. Yucalar, F., et al.: Multiple-classifiers in software quality engineering: combining predictors to improve software fault prediction ability. Eng. Sci. Technol. Int. J. 23(4), 938–950 (2020)
7. Tran, H.D., Thi My Hanh, L.E., Binh, N.T.: Combining feature selection, feature learning and ensemble learning for software fault prediction. In: 2019 11th International Conference on Knowledge and Systems Engineering (KSE). IEEE (2019)
8. Singh, A., Bhatia, R., Singhrova, A.: Taxonomy of machine learning algorithms in software fault prediction using object oriented metrics. Procedia Comput. Sci. 132, 993–1001 (2018)
9. Kalsoom, A., et al.: A dimensionality reduction-based efficient software fault prediction using Fisher linear discriminant analysis (FLDA). J. Supercomput. 74(9), 4568–4602 (2018)
10. http://promise.site.uottawa.ca/SERepository/datasets-page.html

Chapter 45

Deep Learning-Based Cell Outage Detection in Next Generation Networks Madiha Jamil, Batool Hassan, Syed Shayaan Ahmed, Mukesh Kumar Maheshwari, and Bharat Jyoti Ranjan Sahu

Abstract 5G and beyond wireless networks will support high data rates, seamless connectivity, and a massive number of users compared to 4G networks. It is also expected that the end-to-end latency in transferring data will reduce significantly, i.e., 5G will support ultra-low-latency services. To provide users with all these advantages, 5G utilizes the Ultra-Dense Network (UDN) technique. UDN helps manage the explosive traffic data of users, as multiple small cells are deployed in both indoor and outdoor areas for seamless coverage. However, outage is difficult to detect in these small cells because they have a high density of users. To overcome this hindrance, the Cell Outage Detection (COD) technique is utilized, which aims to detect outage autonomously. This reduces maintenance cost, and outages can be detected beforehand. In this paper, Long Short Term Memory (LSTM) is used for outage detection. The LSTM network is trained and tested on subscriber activity values, which include SMS, Call, and Internet activity. Our proposed LSTM model has a classification accuracy of 85% and an FPR of 15.7303%.

45.1 Introduction The 5G network is expected to provide a higher data rate, ultra-low latency, and a massive number of connected users. It is expected that billions of devices will be connected to the network because of the 5G IoT ecosystem provided by the technology [1]. Ultra-Dense Networks (UDN) is one of the emerging techniques of 5G that will help manage the explosive traffic of data. In UDN, multiple small cells, i.e., femto, pico, and microcells, are combined to accommodate a large number of users. Small cells are deployed in both indoor and outdoor areas to provide seamless

M. Jamil · B. Hassan · S. S. Ahmed · M. K. Maheshwari (B) Bahria University, Karachi Campus, Pakistan e-mail: [email protected] B. J. R. Sahu Siksha O Anusandhan University, Bhubaneswar, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_45



coverage. These small cells encounter heavy traffic, as increased connectivity enables thousands of users to be connected in a single cell [2]. Cell outage is encountered by users when they are unable to access the network. The probability of outage has increased in the 5G network because of the immense traffic of users in a single cell; other causes of outage can be rain, snowfall, other environmental factors, and software or hardware failure [3]. Cell Outage Detection (COD) is a method of detecting outage cells among healthy cells. The existing method of COD is through human monitoring [4]; with the deployment of small cells and the increase in users, it will be difficult to detect outage through this method. The new features of 5G come with a great deal of complexity and many challenges. UDN allows a large number of users to be accommodated in a single cell but also increases the probability of outage. 5G is a hybrid and integrated technology that combines existing technologies. The increasing number of base stations causes handover issues: when several base stations are available during handover, the network has to decide to which base station the call is to be transferred. If the calls of many users are transferred to the same base station, the traffic for that base station increases, and this can result in outage [5]. Figure 45.1 shows a case of outage in a small cell. The base station in the outage cell is not able to receive network services from the core network. The traditional way of performing COD is through visits to base stations, analysis of users' data and statistics, and manual drive tests. These methods are costly and time-consuming, and can make it difficult for operators to detect outage in 5G. The densely situated small cells, i.e., femto, pico, and microcells, make it even harder for outage to be detected. Due to the complexity of these small cells, outage

Fig. 45.1 Cell outage in 5G


may not be detected for hours or even days [4]. This is the reason researchers are now working on deep learning algorithms that detect outage autonomously. In previous research, KPI parameters were used as input in [4] to train an LSTM network for COD in multi-tiered networks; the algorithm had 77% classification accuracy. KPI information was used as input to an RNN deep learning model for small cell outage prediction [6]; the model yielded 96% prediction accuracy. References [7, 8] utilized the Autoencoder technique for outage detection. In [9], sleeping cells were detected in next generation networks, and the results of a Deep Autoencoder and a One-Class SVM were compared. Reference [10] used a Deep Convolutional Autoencoder to detect outage and obtained better accuracy than a Deep Autoencoder. A Hidden Markov Model was used in [11] for outage detection in UDN; the model predicted a cell as outage with 95% accuracy. The K-Nearest Neighbor (KNN) classification algorithm was implemented in [12] for outage detection. Minimization of Drive Tests (MDT) measurements cannot detect outage in small cells; Reference [13] proposes an M-LOF algorithm that overcomes this problem by using handover statistics for outage detection.

45.1.1 Motivations Previous studies have mainly focused on working with KPI parameters for outage detection. KPI parameters are not enough to accurately detect outage in small cells. Most of the studies also focused on detecting outage from the readings of neighboring cells. We were therefore motivated to design an algorithm that uses real-time readings of users' activities, i.e., the Call Detail Record (CDR) data of different cells. The preprocessing of this data is given in Algorithm 1 below.

Algorithm 1: Data Preprocessing
Inputs: CDR dataset: data of 1000 Cell IDs of a single day
Output: Xtrain, Xnew
Method:
1. Import a file from the CDR dataset and store the data of 1000 randomly selected Cell IDs in a matrix.
2. Separate the first 700 Cell IDs for training the LSTM; the remaining 300 Cell IDs will be used for testing the trained model.
3. Remove the columns containing the country code and the time stamp.
4. For each Cell ID: sum each subscriber activity and store it in a matrix. End for.
5. Store each matrix value as one example in Xtrain.
6. Repeat the same procedure for the test data and store its values in Xnew.
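A sketch of Algorithm 1 in Python/pandas, under stated assumptions: the day file is tab-separated, the file name is a hypothetical local copy, and the column names below are illustrative (the public dump is organized by cell/square id, time interval, country code, and the five activities).

import pandas as pd

cols = ["cell_id", "timestamp", "country_code",
        "sms_in", "sms_out", "call_in", "call_out", "internet"]
cdr = pd.read_csv("sms-call-internet-mi-2013-12-01.txt",  # hypothetical local file name
                  sep="\t", names=cols, header=None)

# Remove country code and time stamp, then sum each activity per Cell ID
totals = (cdr.drop(columns=["country_code", "timestamp"])
             .groupby("cell_id").sum())

sample = totals.sample(n=1000, random_state=0)  # 1000 randomly selected Cell IDs
Xtrain = sample.iloc[:700].to_numpy()           # first 700 Cell IDs for training
Xnew = sample.iloc[700:].to_numpy()             # remaining 300 Cell IDs for testing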


45.1.2 Contributions This research paper utilizes the deep learning technique LSTM for outage detection. The neural network is trained and tested on a dataset containing the subscriber activities of different cells; the subscriber activities include SMS in, SMS out, Call in, Call out, and Internet. The rest of the paper is structured as follows: in Sect. 45.2, we briefly discuss LSTM and its functionality; in Sect. 45.3, the preprocessing of the dataset and the simulation results are discussed; finally, the paper is concluded in Sect. 45.4.

45.2 Our Proposal In this paper, we have used the deep learning technique Long Short Term Memory (LSTM). The LSTM network has been previously utilized for outage detection in [4]. LSTM is part of the Recurrent Neural Network (RNN) family. RNNs differ from other neural networks in their ability to remember their previous inputs: they use the memory of previous input values to make decisions regarding the output value and the current state. This makes RNNs a useful deep learning tool for time series data classification [14]. We designed an algorithm that trains the neural network on the subscriber activities of users, i.e., the CDR data of different cells [15]. The reason for using CDR data is that, as [13] suggested, MDT measurements are not enough to detect outages in small cells. The UDN is densely populated with small cells, and each cell carries the traffic of thousands of users. With an algorithm trained on CDR data, it is easier to detect anomalies in subscriber activities. LSTM is especially suited to time series data that requires memorization of long-term patterns; the CDR data consists of data for thousands of Cell IDs, so it is easier to train on the dataset with LSTM since it can memorize the data pattern.

The architecture of LSTM consists of gates, i.e., sigmoid layers and point-wise multiplication operators, a cell state, an input gate, a forget gate, and an output gate. The cell state represents the memory of the LSTM; it changes when a memory is removed from the network through the forget gate or when a new memory is added through the input gate. The input gate decides which information is to be added to the memory, the forget gate discards the information that is not useful to the network, and the output gate produces an output from the memory [16]. The LSTM network is trained in a supervised manner, i.e., the inputs are labeled and the outputs are known; this is known as supervised learning. The dataset that we utilize for training the LSTM network consists of CDR data, which includes SMS, Call, and Internet activity. The LSTM network is trained so that it remembers the pattern of user activities and can detect any anomaly that occurs in the values of those activities. For example, for a given cell, if there is a larger spike in SMS activity than normal, the neural network will detect it. The network is trained in a similar manner for all the subscriber activities; hence any anomaly in


the CDR data can be detected by the network. CDR data contains the data of thousands of users, and the LSTM network is a useful deep learning technique for detecting anomalies in such data. As mentioned above, the LSTM neural network is powerful for time series classification. The CDR data contains data for the Cell IDs at different time intervals. The LSTM deep learning network can be trained on this time series data to learn the pattern of the user activities. If outage is encountered by any cell, the values in the CDR data will deviate from the normal values, and this will be detected by the LSTM network.
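For reference, the gates described above follow the standard textbook LSTM formulation (a generic restatement, not equations taken from the paper): with input x_t, previous hidden state h_{t−1}, and cell state c_{t−1},

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)        (forget gate)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)        (input gate)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)     (candidate memory)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t           (cell state update)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)        (output gate)
h_t = o_t ⊙ tanh(c_t)                      (output)

where σ is the logistic sigmoid and ⊙ denotes element-wise multiplication.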

45.3 Performance Analysis The raw CDR dataset for training and testing the neural network was downloaded from [15]. Table 45.1 shows a sample of the CDR dataset. The dataset contains the subscriber activities of different Cell IDs over 62 days; each day consists of data for 10000 Cell IDs. The dataset was in raw form, so before using it for training we preprocessed it with a technique similar to [17]; the preprocessing is summarized in Algorithm 1 [17]. In the preprocessing algorithm, the raw CDR data is imported and the data of 1000 Cell IDs of one day is extracted. The country code column is removed to ensure security. The data of each subscriber activity is summed and stored in the matrices Xtrain and Xnew, with the data divided into training and testing at a 70/30% ratio, i.e., 700 Cell IDs for training and 300 Cell IDs for testing. As mentioned before, LSTM is based on supervised learning, so it is necessary for the dataset to be labeled for training and testing: supervised learning requires labeled input data, and the output class should be known. Since the data in the CDR dataset was not labeled, labels were created by adapting the equation used in [17]. The labels were created for the Xtrain and Xnew matrices:

‖μ − σ‖ > ‖μ_x(i) − σ_x(i)‖    (45.1)

In Eq. (45.1), σ represents the standard deviation and μ the mean. An example x(i) is a single row vector of Xtrain or Xnew, and the corresponding output label y(i) is created in YTrain or Ynew, respectively, where i represents the ith row of the matrices. A label is marked as outage if the norm of x(i) deviates from the mean (μ ∈ R⁵) by more than the standard deviation (σ ∈ R⁵); otherwise it is marked healthy.
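A sketch of this labeling rule in NumPy, following the prose description (one reading of Eq. (45.1)); Xtrain and Xnew are the matrices from the preprocessing sketch.

import numpy as np

mu = Xtrain.mean(axis=0)        # mean activity vector, mu in R^5
sigma = Xtrain.std(axis=0)      # per-activity standard deviation, sigma in R^5

# Outage (1) when x(i) deviates from the mean by more than the standard
# deviation in norm; healthy (0) otherwise
YTrain = (np.linalg.norm(Xtrain - mu, axis=1) > np.linalg.norm(sigma)).astype(int)
Ynew = (np.linalg.norm(Xnew - mu, axis=1) > np.linalg.norm(sigma)).astype(int)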

Table 45.2 Hyperparameter values utilized for ADAM

Hyperparameter           | Value
No. of iterations/epochs | 1000
Learning rate            | 0.001
Mini-batch size          | 5
Gradient threshold       | 1
No. of hidden units      | 100


The training hyperparameters of the LSTM network are given in Table 45.2. Since the network is trained on the subscriber activities of users, the input size is set to 5. The neural network has a single LSTM layer with 100 hidden units. The mini-batch size is set to 5, with a learning rate of 0.001, and the ADAM optimizer was used for better training results.

Figure 45.2 demonstrates the training loss of the neural network and Fig. 45.3 the training accuracy. It can be seen that with every epoch the training accuracy increases and the training loss decreases. The model was tested and yielded an accuracy of 85% with an FPR of 15.7303%. To check the robustness and accuracy of our model, we conducted random tests on 1000 Cell IDs of other days. Table 45.3 compares the testing of our model with a random test conducted on 1000 Cell IDs of Day 2: the random test yielded an accuracy of 88.8% with an FPR of 12.9159%, and the error rate decreased from 15 to 11.2%.

Figure 45.4 demonstrates the overall performance of the LSTM network. The precision and recall rates show the accuracy of the testing model, and the low error rate demonstrates that the network classifies cells as outage or healthy with a low chance of false alarm. The confusion matrix shown in Fig. 45.5 displays the performance of our testing model: the overall prediction accuracy is 85%, with 50% of the test data correctly classified as healthy cells and 35% correctly classified as outage cells. The RMSE plot is depicted in Fig. 45.6.
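A minimal Keras sketch matching the hyperparameters in Table 45.2 (one LSTM layer with 100 hidden units, ADAM with learning rate 0.001 and gradient clipping at 1, mini-batch size 5, 1000 epochs). Treating each 5-value activity vector as a 5-step sequence is our assumption for illustration; the paper does not state its exact input shaping.

import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.LSTM(100, input_shape=(5, 1)),    # single LSTM layer, 100 hidden units
    keras.layers.Dense(2, activation="softmax"),   # two classes: healthy vs. outage
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001, clipvalue=1.0),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

X_seq = Xtrain.reshape(-1, 5, 1).astype("float32")
model.fit(X_seq, YTrain, epochs=1000, batch_size=5)
pred = model.predict(Xnew.reshape(-1, 5, 1).astype("float32")).argmax(axis=1)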

45.4 Conclusion In this research paper, we designed a deep learning algorithm using LSTM that classifies cells as healthy or outage. The model is trained and tested on subscriber activity values, i.e., SMS in, SMS out, Call in, Call out, and Internet activity. Previous research studies used KPI parameters to train neural networks, which is a drawback, as outage in small cells cannot be detected through KPIs. The LSTM network is trained in such a manner that it can detect any anomaly in the values of the subscriber activities. The model has a testing accuracy of 85% and an FPR of 15.7303%. We also conducted random tests on the Cell IDs of day 2, which increased the testing accuracy from 85 to 88.8%. The model can be utilized in 5G and next generation networks, as it can predict outage in densely populated small cells. The classification accuracy can be further increased by working with larger datasets and considering the subscriber activities of more users.

Table 45.1 Sample of the CDR dataset from 1st December 2013

Cell ID | Time stampᵃ (ms) | Country code | SMS in      | SMS out     | Call in     | Call out    | Internet
10000   | 1385858400000    | 39           | 0.004774318 | 0.171989546 | 0.004774318 | 0.085994773 | 13.39198912
10000   | 1385859000000    | 39           | 0.054231319 | 0.312215638 | 0.213019632 | 0.085994773 | 9.873021891
10000   | 1385859600000    | 39           | 0.085994773 | 0.054231319 | 0.208245313 | 0.434466179 | 18.11289499
10000   | 1385860200000    | 39           | 0.085994773 | 0.280452185 | 0.054231319 | 0.162693958 | 17.87857366

ᵃ Every entry represents the beginning of a 10-minute interval in Unix epoch time. For example, 1385858400000 represents Sunday, December 1, 2013 12:40:00 AM (GMT) [15]


Fig. 45.2 Mini-batch loss

Fig. 45.3 Mini-batch accuracy

Table 45.3 Performance statistics of our COD model

Metric     | Model testing (%) | Random testᵃ (%)
Accuracy   | 85                | 88.8000
Error rate | 15                | 11.2000
Precision  | 78.9474           | 87.0334
Recall     | 89.0656           | 90.5930
FPR        | 15.7303           | 12.9159

ᵃ 1000 Cell IDs of day 2 were utilized for the test

Fig. 45.4 Performance metric

Fig. 45.5 Confusion matrix

Fig. 45.6 Root mean square error



References
1. Agiwal, M., et al.: Next generation 5G wireless networks: a comprehensive survey. IEEE Commun. Surv. Tutor. 18(3), 1617–1655 (2016)
2. 5G Ultra Dense Networks (5G-UDN) [Online]. Available at: https://icc2018.ieee-icc.org/
3. The Top 5 Causes of Network Outages [Online]. Available at: https://segron.com/
4. Oğuz, H.T., Kalaycıoğlu, A., Akbulut, A.: Femtocell outage detection in multi-tiered networks using LSTM. In: 2019 11th International Conference on Electronics, Computers and Artificial Intelligence (ECAI), Pitesti, Romania, 2019, pp. 1–5
5. Handover Problems and Solutions for 5G Communication Technology [Online]. Available at: https://www.ukessays.com/
6. Ming, Y.W., Lin, Y.H., Tseng, T.H., Hsu, C.M.: A small cell outage prediction method based on RNN model. J. Comput. 30(5), 268–278 (2019)
7. Lin, P.C.: Large-scale and high-dimensional cell outage detection in 5G self-organizing networks. In: 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 8–12. IEEE (2019)
8. Asghar, M.Z., Mudassar, A., Khaula, Z., Pyry, K., Timo, H.: Assessment of deep learning methodology for self-organizing 5G networks. Appl. Sci. 9(15), 2975 (2019)
9. Masood, U., Asghar, A., Imran, A., Mian, A.N.: Deep learning based detection of sleeping cells in next generation cellular networks. In: 2018 IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, United Arab Emirates, 2018, pp. 206–212
10. Ping, Y.H., Lin, P.C.: Cell outage detection using deep convolutional autoencoder in mobile communication networks. In: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1557–1560 (2020)
11. Alias, M., Saxena, N., Roy, A.: Efficient cell outage detection in 5G HetNets using hidden Markov model. IEEE Commun. Lett. 20(3), 562–565 (2016)
12. Xue, W.W., Peng, M., Ma, Y., Zhang, H.: Classification-based approach for cell outage detection in self-healing heterogeneous networks. In: 2014 IEEE Wireless Communications and Networking Conference (WCNC), Istanbul, Turkey, 2014, pp. 2822–2826
13. Zhang, T., Lei, F., Peng, Y., Shaoyong, G., Wenjing, L., Xuesong, Q.: A handover statistics based approach for cell outage detection in self-organized heterogeneous networks. In: 2017 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), pp. 628–631 (2017)
14. Understanding LSTM Networks [Online]. Available at: https://colah.github.io/
15. Telecommunications—SMS, Call, Internet—MI [Online]. Available at: https://dataverse.harvard.edu/
16. Basic Understanding of LSTM [Online]. Available at: https://blog.goodaudience.com/
17. Hussain, B., Du, Q., Zhang, S., Imran, A., Imran, M.A.: Mobile edge computing-based data-driven deep learning framework for anomaly detection. IEEE Access 7, 137656–137667 (2019)

Chapter 46

Image Processing and ArcSoft Based Data Acquisition and Extraction System Yanglu, Asif Khan, Amit Yadav, Qinchao, and Digita Shrestha

Abstract Characteristic data acquisition and extraction is one of the core tasks of artificial intelligence application development: it takes the data output by traditional hardware equipment as the basis, couples it with efficient artificial intelligence applications, and then provides accurate and reliable data support for the data analysis stage. Focusing on image filtering and recognition technology, and on framework development for portable monitoring systems, this paper presents the realization of a data acquisition and extraction system. The OpenCV graphics library is used to implement data acquisition on the monitoring terminal, and a bilateral Gaussian filter algorithm with configurable parameters, adapted to the Raspberry Pi hardware, is designed to filter out noise and smooth the image while preserving the edges. With the filtering output as the image sample data basis, a rapid development framework (v1.0) with the ArcSoft artificial intelligence engine at its core is realized to filter out redundant data and extract the facial feature data users care about, with conversion and storage in a proper format; this extraction tool is suitable for face data processing.

46.1 Introduction Data is the cornerstone of the development of computer science and the key to the informatization of traditional industries. Using computer technology and the characteristics of different industries, an intelligent system that meets the needs of an industry will effectively improve the efficiency of data collection in traditional industries, accurately extract reliable, effective, and practical



data, and complete computing tasks beyond human capacity. In the informatization of traditional industries, how to collect data efficiently and extract the useful data that the industry cares about from a large amount of raw data is the top priority [1, 2]. The first goal of the system in this paper is to solve the noise problem in the images collected by the monitoring terminal without improving the hardware performance. The second is to accurately extract facial feature data from a large amount of monitoring data and convert it into an appropriate form. The monitoring terminal is based on video monitoring terminal equipment, on which a simple video acquisition function is implemented that periodically collects video data samples; some frames of these samples are the sample data for image processing, whose first step is filtering. With the Raspberry Pi 3B+ as the main control board, we need to complete the implementation and optimization of the filter from scratch to achieve "edge preservation and noise reduction" on a low-performance device. The data is then processed by the extraction tool, which is based on the self-developed framework v1.0, to filter out redundant data and to convert and store what the user cares about in a specific format [3–10].

46.2 System Analysis 46.2.1 System Function Structure The system can be divided into two parts: data collection and data extraction. Data collection is realized by the monitoring terminal; data extraction is realized by the data extraction tool, built on the custom framework, together with the identification software used for interaction. The functional architecture of the system is shown in Fig. 46.1.
Monitoring Terminal. Video recording and storage, regular collection of sample images, filter processing of sample images, etc.
Customized Framework v1.0. Based on the ArcSoft artificial intelligence engine; the framework is developed only for the processing of human facial data.
Data Extraction Tool. Data extraction operations on sample images, including single-person (or multi-person) feature detection, feature data separation, head-count detection, facial data recognition, age recognition, gender recognition, binary data extraction, conversion, storage, etc.



Fig. 46.1 Functional architecture diagram

46.2.2 System Hardware Requirements The control board used in the monitoring terminal is the Raspberry Pi 3B+ (shown in Fig. 46.2), and the camera used is the Waveshare OV5647 (shown in Fig. 46.3).

Fig. 46.2 Raspberry Pi 3B+


Fig. 46.3 Waveshare OV5647

Raspberry Pi 3B+. The system boots within 10 s, the software system self-starts within 15 s, and data acquisition then begins. The board uses a 64-bit quad-core ARM Cortex-A53 processor, supports WiFi and Bluetooth 4.0 low energy, and has an onboard chip antenna; it comes with the Raspbian system, and a Linux or Windows 10 IoT system can be installed instead. Its performance is well in line with the needs of the monitoring terminal system of this design. It is designed as a mobile low-power device, which can run independently for a long time under non-human factors and uncontrollable forces, and it supports hot start. The hardware configuration is shown below.

• CPU: 1.4 GHz, quad-core, Broadcom BCM2837, 64-bit ARM A53 processor.
• Memory: 1 GB RAM.
• Hard disk: SanDisk 32 GB.
• Interconnection: Gigabit Ethernet, onboard 802.11b/g/n/ac, 2.4/5 GHz dual-band WiFi, low-energy Bluetooth 4.2.
• Main interfaces: USB 2.0 port, pin-expansion GPIO, HDMI and RCA video output, TF card port, double-row pin PoE interface, DSI display port, composite video output.
• OV5647: 1080p, 6 mm focal length, diagonal viewing angle of 60.6°.
Waveshare OV5647. This camera is an external module designed specifically for the Raspberry Pi; its biggest advantage is compatibility with every version of the Raspberry Pi series. The sensor's best video resolution is 1080p, which is more than enough for monitoring functions. The key point is that the still-image resolution is 2592 × 1944; since the sample images in this design are the basis of the extraction function, sufficient pixels are necessary.


Fig. 46.4 System workflow

46.3 System Design 46.3.1 Process Design The main workflow of the system is shown in Fig. 46.4. The camera collects video data and periodically samples an image to be preprocessed, which is transferred to the OS. The OS calls the software system, which applies the bilateral filter to the sample image and saves the result to the designated area. The format is then converted with a conversion tool, and the filtered, format-converted image is transferred to the data extraction tool. The right part of the figure shows the extraction part, whose framework v1.0 is built with the ArcSoft engine at its core and is used for the development of the data extraction tool. The user supplies the data to be extracted and enters the desired extraction method to complete the data extraction operation [11, 12].

46.3.2 Data Extraction Design With the sample data provided by the monitoring terminal as input, the user's instruction determines the appropriate data extraction method, which is realized by calling the face processing routines in the extraction tool (as shown in Fig. 46.5); finally, the extracted data is stored in YUV format or binary format according to the specified method [13].


Fig. 46.5 Structure of data extraction system

The framework v1.0 is designed as a development library centered on the ArcSoft artificial intelligence engine. The library focuses on face data processing, with currently implemented branches including face detection, head-count determination, and face recognition; combining these, users can quickly complete the development of a data extraction tool, because each branch is designed as a separate module. Users only need to provide I/O operations in the main function and then reference the specified module in the code. Framework v1.0 (the development library designed here) takes ArcSys as its core directory and is designed as upper-level packaging (as shown in Fig. 46.6). The self-developed source files are all located in the directory called inc and, following the design shown in Fig. 46.7, the user only needs to include one header file to access all the APIs of the entire face-processing library during secondary development [12–14].

Fig. 46.6 Framework V1.0 design (1)


Fig. 46.7 Framework V1.0 design (2)


46.4 Implementation 46.4.1 Environment Deployment We use VIM and Xshell5 for the development of the monitoring terminal, involving the use of OpenCV and the C++11 standard library. The Raspberry Pi 3B+ main control board is connected with the PC on the same LAN through a router. In the Xshell5 tool, the Raspberry Pi 3B+ is set to the designated IP and port 22 for remote connection to the Raspbian system on the control board. The VIM tool is then used to write the operating instructions and complete the update of the Raspbian system firmware, the installation of the tools for building OpenCV and the image format tool, and the compilation and installation of the OpenCV computer vision library. The Raspbian system is compatible with the mainstream C++ GUI graphics frameworks and supports application development with OpenCV graphics library version 3.0. The framework v1.0 for data extraction and the final extraction tool are developed on the Ubuntu 16.04 system (the programming and testing environment is Ubuntu), using the VIM development tool and the C++11 standard library; first, the ArcSoft artificial intelligence engine needs to be installed and set up.
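The periodic sampling idea can be sketched in a few lines of Python/OpenCV (the terminal itself is implemented in C++ on the Raspberry Pi; the device index, interval, and output path below are illustrative assumptions):

import time
import cv2

cap = cv2.VideoCapture(0)        # camera device 0
interval_s = 60                  # user-set sampling period, in seconds
n = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Save the periodic sample image; it is later bilateral-filtered
    cv2.imwrite("samples/sample_%d.jpg" % n, frame)
    n += 1
    time.sleep(interval_s)
cap.release()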


46.4.2 Function Realization of Monitoring Terminal The realization of video collection and sampling both use the OpenCV processing module. The images, collected regularly according to the time set by the user, are preprocessed by the bilateral filter, whose output is passed in jpg format to the ArcSoft custom framework v1.0 as input.
Filtering Realization. Image filtering relies on an optimized bilateral filtering algorithm, i.e., two Gaussian filters are used in combination: on the basis of the original two-dimensional Gaussian, a Gaussian over spatial proximity and a Gaussian over similarity proximity are optimized, respectively [3, 4]. All code is written in C to ensure efficiency and stability. The first Gaussian filter is based on the principle of spatial proximity and smooths the image to remove noise points. However, the edges of the key parts of an image are also high-frequency signals, and with that filtering method alone the edges would be smoothed away as well. Therefore, a second Gaussian filter is introduced, based on the principle of similarity proximity. When the similarity between a pixel and its surroundings is low (the color gap is large), the pixel lies on an edge, so the second Gaussian weight drives the first Gaussian weight toward zero. The effect is no processing at edges and smoothing in non-edge regions, which is the so-called "edge preservation and noise reduction" [5, 6]. The key logic of the code is shown below.


// Calculate the space weight part (unnormalized)
// The first Gaussian filter (spatial proximity)
double** myBilateralFilter::get_space_Array(int size, int channels, double sigmas)
{
    // [1-1] Initialize the array
    int i, j;
    double **spaceArray = new double*[size];
    for (i = 0; i < size; i++) {
        spaceArray[i] = new double[size];
    }
    // [1-2] Gaussian distribution calculation, centered on the kernel
    int center_i, center_j;
    center_i = center_j = size / 2;
    // [1-3] Gaussian function: weight decreases with spatial distance
    // (loop body reconstructed; this part was lost in extraction)
    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            spaceArray[i][j] = exp(-((i - center_i) * (i - center_i) +
                                     (j - center_j) * (j - center_j)) /
                                   (2.0 * sigmas * sigmas));
        }
    }
    return spaceArray;
}

// Main filtering loop combining the spatial kernel (_spaceArray) with the
// similarity kernel (_colorArray); the function signature was lost in
// extraction and is reconstructed here (name is hypothetical)
void myBilateralFilter::doFilter(Mat *src, int size)
{
    Mat temp = (*src).clone();
    for (int i = 0; i < (*src).rows; i++) {
        for (int j = 0; j < (*src).cols; j++) {
            if (i > (size / 2) - 1 && i < (*src).rows - (size / 2) &&
                j > (size / 2) - 1 && j < (*src).cols - (size / 2)) {
                // [3] Find the image input point and align it with the kernel center.
                // The kernel is the central reference point; x, y are the weight
                // coordinates of the convolution kernel, i, j the image input point.
                // Convolution operator: (f*g)(i,j) = sum f(i-k, j-l) g(k,l),
                // with f the image input and g the kernel.
                int x, y, values;
                double space_x_color = 0.0;
                double space_x_color_sum = 0.0;
                // Bilateral formula: (f1*m1 + f2*m2 + ... + fn*mn) / (m1 + m2 + ... + mn)
                double sum[3] = { 0.0, 0.0, 0.0 };
                for (int k = 0; k < size; k++) {
                    for (int l = 0; l < size; l++) {
                        x = i - k + (size / 2);   // original image x, (x,y) is the input point
                        y = j - l + (size / 2);   // original image y, (i,j) is the current scan point
                        values = abs((*src).at<Vec3b>(i, j)[0] + (*src).at<Vec3b>(i, j)[1] +
                                     (*src).at<Vec3b>(i, j)[2] - (*src).at<Vec3b>(x, y)[0] -
                                     (*src).at<Vec3b>(x, y)[1] - (*src).at<Vec3b>(x, y)[2]);
                        space_x_color = _colorArray[values] * _spaceArray[k][l];
                        space_x_color_sum += space_x_color;
                        // Numerator of the weighted average
                        for (int c = 0; c < 3; c++) {
                            sum[c] += (*src).at<Vec3b>(x, y)[c] * space_x_color;
                        }
                    }
                }
                for (int c = 0; c < 3; c++) {
                    temp.at<Vec3b>(i, j)[c] = sum[c] / space_x_color_sum;
                }
            }
        }
    }
    (*src) = temp.clone();
    return;
}
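For reference, the weighted sum computed by the loop above corresponds to the standard bilateral filter formulation (a textbook restatement, not taken from the paper):

BF[I]_p = (1 / W_p) · Σ_{q ∈ S} G_{σs}(‖p − q‖) · G_{σr}(|I_p − I_q|) · I_q,
W_p = Σ_{q ∈ S} G_{σs}(‖p − q‖) · G_{σr}(|I_p − I_q|),

where G_{σs} is the spatial Gaussian (_spaceArray), G_{σr} is the similarity Gaussian (_colorArray), and W_p normalizes the weights at pixel p.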

Effect Analysis. The purpose of bilateral filtering is to filter out the unclear noise before image processing and retain the edge information of the image to reduce the


workload of the next stage and reduce the influence of unclear data on subsequent analysis results. In order to verify that our algorithm is indeed effective, a concise effect analysis is given here. Figure 46.8 compares Lena's grayscale image before and after filtering. From an intuitive point of view, the image before filtering appears rough due to the influence of noise points, while the image after filtering is clearer and the edges of the image are barely affected. To further illustrate the filtering effect, we plotted a pixel-value histogram for the two images (horizontal axis: pixel value; vertical axis: number of pixels). The pixel values of the image before filtering are concentrated in the range 30–150; most of the image information is concentrated in this peak area, while the noise is generally 0 or above 200. It can be seen from Fig. 46.9 that there is a lot of noise data with a pixel value of 0 or a pixel value greater than 200. Through bilateral filtering, as shown in Fig. 46.10, the values of 0–15 and the values exceeding 200 are directly cleared, the pixel values of the entire image are smoothed, and the edge details are well retained.

Fig. 46.8 Direct effect of filtering: (a) before filtering; (b) after filtering

Fig. 46.9 Pixel distribution before filtering



Fig. 46.10 Pixel distribution after filtering


46.4.3 Function Realization of Data Extraction With the ArcSoft artificial intelligence engine as the core, the following code mainly shows a simple wrapped API to illustrate how the engine is packaged into a library in this design.


/* Face Detection
 * Interface description: detect one face in the picture and output the face result rectangle
 * const char *pic_yuv              -- picture to be detected (YUV, 640*480)
 * struct faceRect *face            -- output parameter (detection result)
 * struct SysEngineInfo *engineInfo -- appid, sdkkey[s] (to start the engine)
 */
void faceDetection(const char *pic_yuv, struct faceRect *face, struct SysEngineInfo *engineInfo)
{
    // Workspace
    MByte *pWorkMem = (MByte *)malloc(FD_WORKBUF_SIZE);
    if (pWorkMem == NULL) {
        fprintf(stderr, "[face_detection] fail to malloc workbuf\r\n");
        exit(0);
    }
    // Engine initialization
    MHandle hEngine = NULL;
    int ret = AFD_FSDK_InitialFaceEngine(engineInfo->appid, engineInfo->fdsdkkey,
                                         pWorkMem, FD_WORKBUF_SIZE, &hEngine,
                                         AFD_FSDK_OPF_0_HIGHER_EXT, 16, 1);
    if (ret != 0) {
        fprintf(stderr, "[face_detection] fail to AFD_FSDK_InitialFaceEngine(): 0x%x\r\n", ret);
        free(pWorkMem);
        exit(0);
    }
    // Engine information
    const AFD_FSDK_Version *pVersionInfo = AFD_FSDK_GetVersion(hEngine);
    printf("%d %d %d %d\r\n", pVersionInfo->lCodebase, pVersionInfo->lMajor,
           pVersionInfo->lMinor, pVersionInfo->lBuild);
    printf("[face_detection] [Version] %s\r\n", pVersionInfo->Version);
    printf("[face_detection] [BuildDate] %s\r\n", pVersionInfo->BuildDate);
    // Image reading
    ASVLOFFSCREEN inputImg = { 0 };
    inputImg.u32PixelArrayFormat = FD_INPUT_IMAGE_FORMAT;
    inputImg.i32Width = FD_INPUT_IMAGE_WIDTH;
    inputImg.i32Height = FD_INPUT_IMAGE_HEIGHT;
    inputImg.ppu8Plane[0] = NULL;
    fu_ReadFile(pic_yuv, (uint8_t **)&inputImg.ppu8Plane[0], NULL);
    if (!inputImg.ppu8Plane[0]) {
        fprintf(stderr, "[face_detection] fail to fu_ReadFile(%s): %s\r\n", pic_yuv, strerror(errno));
        AFD_FSDK_UninitialFaceEngine(hEngine);
        free(pWorkMem);
        exit(0);
    }
    // Only the ASVL_PAF_I420 format is accepted
    if (ASVL_PAF_I420 == inputImg.u32PixelArrayFormat) {
        inputImg.pi32Pitch[0] = inputImg.i32Width;
        inputImg.pi32Pitch[1] = inputImg.i32Width / 2;
        inputImg.pi32Pitch[2] = inputImg.i32Width / 2;
        inputImg.ppu8Plane[1] = inputImg.ppu8Plane[0] + inputImg.pi32Pitch[0] * inputImg.i32Height;
        inputImg.ppu8Plane[2] = inputImg.ppu8Plane[1] + inputImg.pi32Pitch[1] * inputImg.i32Height / 2;
    } else {
        fprintf(stderr, "[face_detection] unsupported image format: 0x%x\r\n", inputImg.u32PixelArrayFormat);
        free(inputImg.ppu8Plane[0]);
        AFD_FSDK_UninitialFaceEngine(hEngine);
        free(pWorkMem);
        exit(0);
    }
    // Start of face detection
    LPAFD_FSDK_FACERES faceResult;
    ret = AFD_FSDK_StillImageFaceDetection(hEngine, &inputImg, &faceResult);
    if (ret != 0) {
        fprintf(stderr, "[face_detection] fail to AFD_FSDK_StillImageFaceDetection(): 0x%x\r\n", ret);
        free(inputImg.ppu8Plane[0]);
        AFD_FSDK_UninitialFaceEngine(hEngine);
        free(pWorkMem);
        exit(0);
    }
    // Fill in the rectangle data of the detected face(s)
    for (int i = 0; i < faceResult->nFace; i++) {
        printf("face %d: (left=%d, top=%d, right=%d, bottom=%d)\r\n", i,
               faceResult->rcFace[i].left, faceResult->rcFace[i].top,
               faceResult->rcFace[i].right, faceResult->rcFace[i].bottom);
        face->left = faceResult->rcFace[i].left;
        face->top = faceResult->rcFace[i].top;
        face->right = faceResult->rcFace[i].right;
        face->bottom = faceResult->rcFace[i].bottom;
    }
    // Release resources
    free(inputImg.ppu8Plane[0]);
    AFD_FSDK_UninitialFaceEngine(hEngine);
    free(pWorkMem);
}

The data extraction tool displays the ArcSoft engine information when started: one reason is to remind users to pay attention to the version during development, and the other is to pay tribute to ArcSoft.


Fig. 46.11 Display of data extraction

The user then puts the sample image to be processed into the specified folder and activates the extraction tool (activation works like any Linux command: the executable is invoked with the image to be processed passed as an argument). Figure 46.11 shows the tool started in face recognition mode. Finally, the matching result is displayed, and users can run batch processes through the shell. On top of the data extraction tool, a simple identification application was built for human-computer interaction.

46.5 Conclusion This paper presents a data collection and extraction system for the monitoring field. The system formulates and implements a bilateral-filtering image processing algorithm for "edge preservation and noise reduction", used for data preprocessing on Raspberry Pi hardware. Based on the artificial intelligence engine of ArcSoft Inc. and its excellent algorithms in the field of artificial intelligence, we independently designed a rapid development framework (v1.0); with this self-developed framework, suitable for face data processing, the data extraction tool was completed, achieving the goals of data collection, processing, and extraction. In follow-up work, the bilateral


algorithm can be optimized, and at the same time, the standardization and engineering of the framework and tool structure can be improved. The purpose is to make the system more efficient and more in line with relevant industry standards.

References
1. Jiakun, S.: System design and implementation of face recognition based on surveillance scenarios. Beijing University of Posts and Telecommunications (2018)
2. Mengjia, X.: Research and implementation of related technology of construction site intelligent monitoring system based on image recognition. University of Electronic Science and Technology of China (2016)
3. Lei, W.: Research on key technologies in bilateral filtering. Harbin Engineering University (2017)
4. Huifen, L., Xiangqian, J., Zhu, L.: Research and improvement on the robustness of Gaussian filtering. Chin. J. Sci. Instrum. (2004)
5. Lei, Z.: Research on image denoising algorithm based on principal component analysis and bilateral filtering. Qufu Normal University (2018)
6. Yangti, H.: A texture removal algorithm for preserving special details based on bilateral filtering. Hunan University (2018)
7. Chaudhury, K.N., Dabhade, S.D.: Fast and provably accurate bilateral filtering. IEEE Trans. Image Process. 25(6), 2519–2528 (2016)
8. Ghosh, S., Chaudhury, K.N.: Fast bilateral filtering of vector-valued images. arXiv preprint arXiv:1605.02164 (2016)
9. Elad, M.: On the origin of the bilateral filter and ways to improve it. IEEE Trans. Image Process. 11(10), 1141–1151 (2002)
10. Weiss, B.: Fast median and bilateral filtering. ACM Trans. Graph. 25(3), 519–526 (2006)
11. Xichao, K.: Embedded video surveillance system based on face recognition. South China University of Technology (2016)
12. Tao, J.: Design of intelligent video surveillance system based on OpenCV. Jiangsu University of Science and Technology (2017)
13. Brunelli, R., Poggio, T.: Face recognition: features versus templates. IEEE Trans. Pattern Anal. Mach. Intell. 15(10), 1042–1052 (1993)
14. Qianxiang, X.: Linux C function library reference manual. China Youth Publishing House (2002)

Part V

Intelligent Computing (eHealth)

Chapter 47

Machine Learning Model for Breast Cancer Tumor Risk Prediction Lambodar Jena, Lara Ammoun, and Bichitrananda Patra

Abstract Cancer is one of the most complex and widespread diseases in the world, especially lung cancer and breast cancer. Breast cancer is among the most common diseases in women and causes an increase in their death rate. Studies have revealed the importance of early detection of breast cancer, as it increases the likelihood of cure and reduces mortality. Because manual diagnosis of the disease is difficult, and because existing automated diagnostic systems lack efficiency, we need to develop an automated system that facilitates the diagnostic process with high efficiency. In this paper, the risk prediction of breast cancer is carried out using machine learning techniques that learn from previous data to train the model and predict new input data through classification algorithms, so that benign and malignant tumors can be classified. The results of multiple machine learning algorithms are compared to obtain high classification accuracy.

47.1 Introduction Cancer is one of the most dangerous and complex diseases; according to global statistics, breast cancer is the disease that causes the most deaths in women after lung cancer [1]. It arises from abnormal growth of cells and may spread rapidly throughout the body, after which it is difficult to control. The best way to avoid the many deaths resulting from this disease is early detection, as early detection leads to faster treatment, spares many difficult surgeries, and increases the survival rate of patients [2, 3]. When searching for the causes of breast cancer, we find no clear evidence, so we can say that the true cause of breast cancer is unknown; however, there are many risk

517

518

L. Jena et al.

factors that may be the cause of its existence, such as genetic factors [4, 5], exposure to chemical rays, dietary fats, exposure to pesticides and solvents, as well as pressure, Psychological and stress [6], we also note that the presence of a lump in the breast may be a sign of breast cancer, but not all lumps are evidence of the presence of this disease, and there are some possibilities that may be evidence of the possibility of developing this disease, such as Lymph node changes, Nipple secretions, Nipple retraction or inverted nipple, Redness, Swelling, Skin texture changes, noting that the main cause of breast lumps are cysts and acute breast cysts. To support the idea of the necessity of early detection of this disease, many researches have been done to be able to detect breast cancer automatically, as there are many machine learning techniques [7, 8] that greatly help in predicting the breast cancer presence and classifying tumors into benign and malignant tumors, which helps in early detection of this disease and thus Reducing mortality, increasing access to treatment, getting proper health care, avoiding unnecessary surgeries, and saving costs. In this paper, a comparison will be made between the different techniques in machine learning that researchers have used with the aim of expanding our knowledge regarding classification of breast cancer by looking at different previous studies in the selection of traits, the use of different data sets, different algorithms and different tools. There is a need for early detection of the presence of cancer to increase the likelihood of obtaining an optimal treatment and thus achieving a cure through patient management according to classification cases. The prime objectives of this work are to implement machine learning classifiers for (1) Early prediction of the tumor for optimal treatment and (2) Accurate diagnosis of the disease to avoid the suffering of the patient resulting from unnecessary surgical procedures. In this work, four classifiers Naïve Bayes, Decision Tree, Linear SVM, and RBF SVM are adopted in experiment for the dataset. These algorithms are relatively suitable to train and test. Thus, use of them became suitable for prediction of breast cancer and the performance is compared based on various parameters such as accuracy, F1-score, precision, support, and recall in results with comparison to existing system.

47.2 Literature Review

A deep study of numerous research papers on the classification of cancer patients using machine learning techniques was carried out; the objectives, advantages, and limitations of several papers are compared in Table 47.1.

Table 47.1 Comparison of research outcomes related to their objective, advantage, and limitation

A novel SVM based CSSFFS feature selection algorithm for detecting breast cancer [9]
Objective: Predict the presence of cancer by using the CSSFFS algorithm to achieve an optimal feature set with the lowest error rate.
Advantage: Determines which feature set has the most impact on classification and removes irrelevant and useless features to improve classification accuracy.
Limitation: The researchers suggest trying the CSSFFS algorithm in other areas to confirm its usefulness in improving classification.

On the scalability of machine-learning algorithms for breast cancer prediction in big data context [10]
Objective: Study the ability to advance cancer prediction with high accuracy for big data.
Advantage: Analyzed big GE and DM datasets, and a combined dataset containing both GE and DM, using the Spark and Weka platforms; the lowest error rate and highest accuracy were obtained with the SVM algorithm on the Spark platform.
Limitation: The paper proposes to improve prediction accuracy through the use of feature selection techniques and the application of the proposed classification algorithms on a balanced dataset.

Breast cancer prediction using machine learning [11]
Objective: Study the possibility of predicting breast cancer with effective accuracy using different techniques as per requirements.
Advantage: Provides a classification into several areas: predicting the presence of cancer before diagnosis, predicting diagnosis and treatment, and predicting the outcome.
Limitation: The paper proposes to reduce error rates while achieving high accuracy through an in-depth study of the features and a reduction of the dimensions.

Detection of malignant and benign breast cancer using the ANOVA-BOOTSTRAP-SVM [12]
Objective: Increase the accuracy of breast cancer prediction by a modified method called ANOVA-BOOTSTRAP-SVM.
Advantage: The proposed algorithm can be used in further research and to increase the accuracy of detection of different types of cancer.
Limitation: The sensitivity of the algorithm varies with the value of the parameter C and the type of kernel used.

Prediction of breast cancer using supervised machine learning techniques [13]
Objective: Develop a system for early detection of cancer using data mining techniques.
Advantage: The techniques provide good accuracy, and the paper claims that SVM is the best classifier among the proposed techniques.
Limitation: Difficulty in choosing the best dimensions and features to improve performance.

Prediction of breast cancer using support vector machine and K-nearest neighbors [14]
Objective: Classify breast cancer with high accuracy using two algorithms, support vector machine and K-nearest neighbors.
Advantage: The proposed algorithms are useful classifiers in both the diagnostic and the medical field.
Limitation: The sensitivity of the algorithm is affected by the kernel used.

Comparison of the performance of machine learning algorithms in breast cancer screening and detection: a protocol [15]
Objective: Build an automated model that uses routine blood analysis data to determine the presence of breast cancer in patients.
Advantage: Features were determined based on correlation, and their impact on different algorithms was studied.
Limitation: The paper suggests using a larger database and thus potentially achieving higher classification accuracy.

Improving the classification efficiency of an ANN utilizing a new training methodology [16]
Objective: Provide a new methodology for training the neural network to obtain a model that efficiently predicts breast cancer.
Advantage: It was observed that, in most cases, setting limits on the neural network weights increases classification accuracy.
Limitation: Determining the optimal limits of the weights is a difficult process and requires further study, research, and experimentation on different datasets.

Determination of the blood, hormone and obesity value ranges that indicate the breast cancer, using data mining based expert system [17]
Objective: Build an automated system that helps diagnose breast cancer from the values of selected features using a Decision Tree.
Advantage: A study is presented to determine the ranges of feature values associated with cancer.
Limitation: The paper proposes using more data and studying the limits of other features to produce reliable results for higher-precision classification.

Development of a model to predict breast cancer recurrence using Decision Tree-based learning algorithms [18]
Objective: Predict the recurrence of breast cancer over three years by building a model that uses a Decision Tree-based learning algorithm.
Advantage: The model can be considered a tool with good accuracy for measuring the likelihood of cancer recurrence.
Limitation: The paper proposes an in-depth study of the patient database, which grows every year, so that the useful patterns of information in it can be revealed and the model improved.

Breast cancer diagnosis by different machine-learning methods using blood analysis data [19]
Objective: Study the effectiveness of routine blood analysis in the detection of cancer using several algorithms.
Advantage: Parameters that help achieve effective accuracy were investigated using four machine-learning techniques.
Limitation: The accuracy rate was not very high, but the usefulness of this type of data in diagnosing breast cancer with ML methods was demonstrated.

A comparative study for breast cancer prediction using machine learning and feature selection [20]
Objective: Study feature selection techniques and compare their effect on the accuracy of several classification algorithms.
Advantage: The results showed that the random-forest algorithm gives the highest accuracy with feature selection; in addition, the F-test gives better results for smaller datasets, while sequential forward selection gives better results for larger datasets.
Limitation: An in-depth study of the importance of the features and their discriminative ability was suggested, by testing them on another dataset and with other algorithms.

A hybrid model to support the early diagnosis of breast cancer [21]
Objective: Diagnose breast cancer by building a Bayesian network and identifying the most influential features for greater accuracy in estimating a person's likelihood of developing breast cancer.
Advantage: Develops a tool that gives an accurate assessment of the overall situation when diagnosing breast cancer and supports decision-making in the most difficult cases.
Limitation: The paper proposes applying the algorithm to a larger database to obtain more consistent and accurate results, as well as to other datasets to study the effect of various other characteristics in increasing decision support.

Using machine-learning algorithms for breast cancer risk prediction and diagnosis [22]
Objective: Compare several algorithms for classifying cancer with high accuracy.
Advantage: High accuracy and a low error rate were obtained using the SVM algorithm.
Limitation: The paper proposes studying the effectiveness of the algorithms on other datasets to confirm their robustness and measure their sensitivity across the data.

Using Resistin, glucose, age, and BMI to predict the presence of breast cancer [23]
Objective: Study the effect of human parameters from routine blood analysis on the classification accuracy of cancer patients.
Advantage: Increased ease of cancer diagnosis.
Limitation: Difficulty in deciding which characteristics achieve better classification accuracy and lower sensitivity.

Abnormality detection using weighed particle swarm optimization and smooth support vector machine [24]
Objective: Improve the classification accuracy of a breast cancer prediction system using a hybrid classifier called WPSO-SSVM.
Advantage: The proposed algorithm outperformed many algorithms and achieved high classification accuracy.
Limitation: The paper proposes to examine the robustness of the algorithm by testing it against another standard dataset.

A novel classification technique for breast cancer diagnosis [25]
Objective: Achieve high accuracy of breast cancer classification using a hybrid classifier called RFSVM.
Advantage: The need to adjust parameters faced when using SVM is overcome, along with the limitations related to the overfitting problem of random forests.
Limitation: The paper suggests searching for the best way to handle the data and exploring the best feature weights, so as to develop the system through an in-depth study of the dataset.

Breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules [26]
Objective: The k-nearest neighbor algorithm was studied with different types of classification rules and distances for breast cancer diagnosis.
Advantage: Effective results using both the Euclidean and Manhattan distance types.
Limitation: Using both Euclidean and Manhattan distances leads to effective predictions but consumes a lot of time.


47.3 Experimental Setup

In the experiment, various classification algorithms, namely RBF SVM, Linear SVM, Decision Tree, and Naïve Bayes, are used, and the results with respect to different performance indicators are closely observed in order to obtain the best classification. This work compares several supervised machine-learning algorithms to obtain high accuracy in the tumor classification of breast cancer patients. The Wisconsin dataset is used for the experiment, with 11 columns ["ID", "Clump", "UnifSize", "UnifShape", "MargAdh", "SingEpiSize", "BareNuc", "BlandChrom", "NormNucl", "Mit", "Class"] and 699 rows representing patient data; a sketch of this setup is given below.
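As a concrete illustration, the following is a minimal sketch of this setup, assuming the Wisconsin data is available as a local CSV with the 11 column names listed above. The file name, the 80/20 split, and the random seed are illustrative assumptions, and scikit-learn's SVC, GaussianNB, and DecisionTreeClassifier stand in for the four classifiers; the UCI copy of this dataset marks a few missing "BareNuc" values with "?", which the sketch drops.

# Minimal sketch of the experimental setup in Sect. 47.3 (assumptions noted above).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("breast_cancer_wisconsin.csv")                   # hypothetical path
df = df[pd.to_numeric(df["BareNuc"], errors="coerce").notnull()]  # drop '?' rows

INCLUDE_ID = True            # True mirrors Sect. 47.4, where the ID column is kept
feature_cols = [c for c in df.columns if c != "Class"]
if not INCLUDE_ID:
    feature_cols.remove("ID")

X = df[feature_cols].astype(float)
y = df["Class"].astype(int)                       # 2 = benign, 4 = malignant
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=4)

classifiers = {
    "RBF SVM": SVC(kernel="rbf"),
    "Linear SVM": SVC(kernel="linear"),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(max_depth=3),  # depth 3 as in the text
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))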

47.4 Result Analysis

All the features, including each patient's ID, were first applied to train the classifiers to predict the class [benign or malignant] of each patient. The ID feature has no association with the disease, so including it affects the training of the classifier and lowers the accuracy. With all features including the patient ID, the observed accuracy is 0.66, i.e., 66% (Table 47.2), for the RBF SVM; 0.69, i.e., 69% (Table 47.3), for the Linear SVM; and 0.80, i.e., 80% (Table 47.4), for Naïve Bayes.

Table 47.2 Results for RBF SVM classifier

                   F1-score   Support   Precision   Recall
4 (malignant)      0.00       47        0.00        0.00
2 (benign)         0.79       90        0.66        1.00
Weighted average   0.52       137       0.43        0.66
Macro average      0.40       137       0.33        0.50
Accuracy           0.66       137

Table 47.3 Results for Linear SVM classifier

                   F1-score   Support   Precision   Recall
4 (malignant)      0.25       47        0.78        0.15
2 (benign)         0.81       90        0.69        0.98
Weighted average   0.62       137       0.72        0.69
Macro average      0.53       137       0.73        0.56
Accuracy           0.69       137


Table 47.4 Results for Naïve Bayes classifier

                   F1-score   Support   Precision   Recall
4 (malignant)      0.60       47        1.00        0.43
2 (benign)         0.87       90        0.77        1.00
Weighted average   0.78       137       0.85        0.80
Macro average      0.73       137       0.88        0.71
Accuracy           0.80       137

Since the Decision Tree implementation in Python selects the best features for training the classifier according to the depth of the tree, and the maximum depth of the tree is set to 3 here, the accuracy of the Decision Tree algorithm is high, at 0.95, i.e., 95% (Table 47.5). The confusion matrices generated for all four classifiers are shown in Figs. 47.1, 47.2, 47.3, and 47.4.

Table 47.5 Results for Decision Tree classifier

                   F1-score   Support   Precision   Recall
4 (malignant)      0.93       47        0.88        0.98
2 (benign)         0.96       90        0.99        0.93
Weighted average   0.95       137       0.95        0.95
Macro average      0.94       137       0.94        0.96
Accuracy           0.95       137

Fig. 47.1 Confusion matrix of RBF SVM


Fig. 47.2 Confusion matrix of linear SVM

Fig. 47.3 Confusion matrix of Naïve Bayes

There are two target classes for breast cancer risk prediction, namely "benign" and "malignant", and the performance indicators precision, recall, and F1-score for each are shown in Tables 47.6 and 47.7. For both classes, precision is plotted in Figs. 47.5 and 47.6, recall in Figs. 47.7 and 47.8, and F1-score in Figs. 47.9 and 47.10. Figures 47.5 and 47.6 show that for the benign class the Decision Tree classifier performs best, with a precision of 0.99, compared to the other three


Fig. 47.4 Confusion matrix of Decision Tree

Table 47.6 Benign class

                Precision   Recall   F1-score
RBF SVM         0.66        1.00     0.79
Linear SVM      0.69        0.98     0.81
Naïve Bayes     0.77        1.00     0.87
Decision Tree   0.99        0.93     0.96

Table 47.7 Malignant class

                Precision   Recall   F1-score
RBF SVM         0.00        0.00     0.00
Linear SVM      0.78        0.15     0.25
Naïve Bayes     1.00        0.43     0.60
Decision Tree   0.88        0.98     0.93

Fig. 47.5 Precision of benign class
Fig. 47.6 Precision of malignant class
Fig. 47.7 Recall of benign class
Fig. 47.8 Recall of malignant class

classifiers. For the malignant class, the Naïve Bayes classifier performs best, with a precision of 1.00, compared to the other three classifiers. Figure 47.11 shows the classification accuracies obtained: 0.66, i.e., 66%, for the RBF SVM; 0.69, i.e., 69%, for the Linear SVM; 0.80, i.e., 80%, for Naïve Bayes; and 0.95, i.e., 95%, for the Decision Tree.


Fig. 47.9 F1-score of benign class
Fig. 47.10 F1-score of malignant class
Fig. 47.11 Accuracy in % of the classification algorithms with ID attributes (RBF SVM 66, Linear SVM 69, Naïve Bayes 80, Decision Tree 95)

47.5 Conclusion

Several researchers have highlighted the use of many algorithms and techniques for diagnosing breast cancer. Motivated by the urgent need for early detection of breast cancer, this research studies several algorithms on a single dataset in order to judge their effectiveness in classifying


breast cancer and obtaining high diagnostic accuracy. In this work, four machine-learning classifiers were used to classify breast cancer tumors into benign and malignant classes. The experimental outcomes show that the Decision Tree algorithm performs better than the other three classifiers, with a classification accuracy of 95% when all features are used. The Decision Tree also achieves the best precision (0.99) for the benign tumor class, whereas Naïve Bayes achieves the best precision (1.0) for malignant tumors. The recall and F1-score values were likewise analyzed for both classes and compared with one another. The observations and outcomes of the machine-learning methods implemented in this work can play a significant role in early-stage risk prediction of breast cancer tumors.

References

1. Jena, L., Swain, R.: Work-in-progress: chronic disease risk prediction using distributed machine learning classifiers. In: 2017 International Conference on Information Technology (ICIT), pp. 170–173 (2017). https://doi.org/10.1109/ICIT.2017.46
2. Patra, B., Jena, L., Bhutia, S., Nayak, S.: Evolutionary hybrid feature selection for cancer diagnosis. In: Mishra, D., Buyya, R., Mohapatra, P., Patnaik, S. (eds.) Intelligent and Cloud Computing. Smart Innovation, Systems and Technologies, vol. 153. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-6202-0_28
3. Patra, B.N., Bisoyi, S.K.: CFSES optimization feature selection with neural network classification for microarray data analysis. In: 2nd International Conference on Data Science and Business Analytics (ICDSBA), 21–23 Sept 2018, pp. 45–50. IEEE (2018), ISBN-13: 978-1-5386-8431-3
4. Jena, L., Nayak, S., Swain, R.: Chronic disease risk (CDR) prediction in biomedical data using machine learning approach. In: Mohanty, M., Das, S. (eds.) Advances in Intelligent Computing and Communication. Lecture Notes in Networks and Systems, vol. 109. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-2774-6_29
5. Patra, B.N., Bhutia, S., Panda, N.: Machine learning techniques for cancer risk prediction. Test Eng. Manage. 83, 7414–7420 (2020), ISSN: 0193-4120
6. Jena, L., Kamila, N.K.: A model for prediction of human depression using Apriori algorithm. In: 2014 International Conference on Information Technology, pp. 240–244 (2014). https://doi.org/10.1109/ICIT.2014.65
7. Arya, J.L., Mohanty, R., Swain, R.: Role of deep learning in screening and tracking of COVID-19. In: Das, S., Mohanty, M.N. (eds.) Advances in Intelligent Computing and Communication. Lecture Notes in Networks and Systems, vol. 202. Springer, Singapore (2021). https://doi.org/10.1007/978-981-16-0695-3_63
8. Rath, S., Mohanty, R., Jena, L.: Machine learning approach for analyzing symptoms associated with COVID-19 risk factors. In: Mishra, S., Mallick, P.K., Tripathy, H.K., Chae, G.S., Mishra, B.S.P. (eds.) Impact of AI and Data Science in Response to Coronavirus Pandemic. Algorithms for Intelligent Systems. Springer, Singapore (2021). https://doi.org/10.1007/978-981-16-2786-6_4
9. Aruna, S., et al.: A novel SVM based CSSFFS feature selection algorithm for detecting breast cancer. Int. J. Comput. Appl. 31, 14–20 (2011)
10. Alghunaim, S., Al-Baity, H.H.: On the scalability of machine-learning algorithms for breast cancer prediction in big data context. IEEE Access 7, 91535–91546 (2019)
11. Rawal, R.: Breast cancer prediction using machine learning. J. Emerg. Technol. Innov. Res. 7(5), 13–24 (2020)


12. Vrigazova, B.P.: Detection of malignant and benign breast cancer using the ANOVA-BOOTSTRAP-SVM. J. Data Inform. Sci. 5(2), 62–75 (2020)
13. Shravya, C., Pravalika, K., Subhani, S.: Prediction of breast cancer using supervised machine learning techniques. Int. J. Innov. Technol. Exploring Eng. (IJITEE) 8(6), 1106–1110 (2019)
14. Islam, Md.M., Iqbal, H., Haque, Md.R., Hasan, Md.K.: Prediction of breast cancer using support vector machine and k-nearest neighbors. In: 2017 IEEE Region 10 Humanitarian Technology Conference (R10-HTC), pp. 226–229. IEEE (2017)
15. Salod, Z., Singh, Y.: Comparison of the performance of machine learning algorithms in breast cancer screening and detection: a protocol. J. Public Health Res. 8(3) (2019)
16. Livieris, I.E.: Improving the classification efficiency of an ANN utilizing a new training methodology. In: Informatics, vol. 6, p. 1. Multidisciplinary Digital Publishing Institute (2019)
17. Akben, S.B.: Determination of the blood, hormone and obesity value ranges that indicate the breast cancer, using data mining based expert system. IRBM 40(6), 355–360 (2019)
18. Dawngliani, S.L.M.S., Chandrasekaran, N.: Development of a model to predict breast cancer recurrence using decision tree based learning algorithms. Think India J. 22(10), 4008–4013 (2019), ISSN: 0971-1260
19. Aslan, M.F., Celik, Y., Sabanci, K., Durdu, A.: Breast cancer diagnosis by different machine learning methods using blood analysis data. Int. J. Intell. Syst. Appl. Eng. 6(4), 289–293 (2018)
20. Dhanya, R., Paul, I.R., Akula, S.S., Sivakumar, M., Nair, J.J.: A comparative study for breast cancer prediction using machine learning and feature selection. In: 2019 International Conference on Intelligent Computing and Control Systems (ICCS), pp. 1049–1055. IEEE (2019)
21. Carvalho, D., Pinheiro, P.R., Pinheiro, M.C.D.: A hybrid model to support the early diagnosis of breast cancer. Procedia Comput. Sci. 91, 927–934 (2016)
22. Asri, H., Mousannif, H., Moatassime, H.A., Noel, T.: Using machine learning algorithms for breast cancer risk prediction and diagnosis. Procedia Comput. Sci. 83, 1064–1069 (2016)
23. Patrício, M., Pereira, J., Crisóstomo, J., et al.: Using Resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer 18, 29 (2018). https://doi.org/10.1186/s12885-017-3877-1
24. Latchoumi, T.P., Parthiban, L.: Abnormality detection using weighed particle swarm optimization and smooth support vector machine. Biomed. Res. 28(11) (2017)
25. Soni, B., Bora, A., Ghosh, A., Reddy, A.: RFSVM: a novel classification technique for breast cancer diagnosis. Int. J. Innov. Technol. Exploring Eng. (IJITEE) 8(12) (2019), ISSN: 2278-3075
26. Medjahed, S.A., Saadi, T.A., Benyettou, A.: Breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules. Int. J. Comput. Appl. 62(1) (2013)

Chapter 48

Comparative Analysis of State-Of-the-Art Classifier with CNN for Cancer Microarray Data Classification

Swati Sucharita, Barnali Sahu, and Tripti Swarnkar

Abstract Cancer is currently one of the leading causes of death in the world. Microarray data has a crucial role to play in lowering the death rate, but its intrinsic complexity, such as high dimensionality, data redundancy, and noise, makes processing it difficult. To address these challenges, machine learning offers solutions such as dimensionality reduction and optimization, but implementing these raises computational complexity. Compared to traditional ML models, deep learning can deliver a solution with excellent performance without dimensionality reduction or optimization. Though both machine learning and deep learning aim to extract patterns from a dataset, deep learning is preferred here because of its efficiency in handling microarray data. In the current research work, a deep learning-based CNN is adopted as the classification model. The proposed CNN model has been evaluated over four benchmark datasets, namely brain cancer, lung cancer, prostate cancer, and colon cancer, and an empirical analysis has been carried out against several ML algorithms, namely SVM, RF, DT, kNN, LR, and NB. The results show that the CNN performs well compared to the other state-of-the-art ML algorithms.

48.1 Introduction

Cancer is a disease caused by several irregularities in cell structure. According to the surveys of the WHO and NCI, in the year 2020 approximately 20 million people were identified as cancer patients, out of which around 10 million died due to the lack of early detection methods [1]. Cancer classification is therefore becoming a prominent area of research. The disease can be properly diagnosed


using microarray datasets, which contain gene expression details. The basic purpose of microarray data analysis, for cancer in particular, is to increase the probability of early detection, which in turn lowers the death rate; its result is the identification of the cells that cause the abnormality. Developing an appropriate model for classifying cancerous cells has therefore become a major avenue of research, and AI leaves room for improving microarray data analysis through machine learning and deep learning [2]. Microarray data suffers from high dimensionality, which presents two contradictory difficulties: processing the whole feature set is not always practicable for a model, while processing only part of the feature set can affect model accuracy [3]. Machine learning (ML) offers dimensionality reduction for the high-dimension issue and optimization for handling redundant and noisy features. The challenges of applying ML to microarray data have opened the way for the deep learning approach: compared to ML, deep learning (DL) performs well in the presence of large datasets, and in microarray data analysis DL considers a larger number of features at processing time than ML, which benefits model accuracy. In addition, DL reduces cost and complexity in terms of expertise and computational time [3, 4]. ML and DL differ in their feature extraction: DL does not require a separate feature extraction method, whereas in ML feature extraction is an important step for boosting performance. In ML, feature extraction entails applying a collection of data preparation procedures to the raw data, aggregating the resulting characteristics into a feature dataset, and then fitting and evaluating a model on this dataset. This technique allows ML to extract prominent features from raw data before they are submitted to the learning algorithm; it should unravel the interrelationships among complicated input variables, allowing the application of simpler modeling algorithms such as linear ML techniques. In the current research work, the Convolutional Neural Network (CNN) is chosen as the instance of the DL approach for cancer microarray data analysis. The main reasons for choosing the CNN as the classifier are its capability to deal with small-sample datasets, which improves accuracy, and its ability to integrate the related features of a cancer dataset by identifying latent characteristics. The DL approach is best known for handling high-definition image data, but the empirical analysis in this work, over four benchmark microarray cancer datasets, shows that the CNN also performs strongly on high-dimensional non-image data. The objectives of the research work are summarized below.

• To propose the CNN as the classification model.
• To analyze the performance of the CNN on four benchmark datasets, namely brain cancer, lung cancer, colon cancer, and prostate cancer.
• To compare the performance of the CNN with state-of-the-art machine learning classification algorithms, using influential parameters such as accuracy, specificity, sensitivity, and F1-score.


The structure of the research work is as follows. Section 48.2 presents the literature survey carried out during the research work. The materials, methods, and dataset description are specified in Sect. 48.3. The proposed work is described in Sect. 48.4, and Sect. 48.5 provides the empirical analysis of the CNN and the performance comparison with the ML classification algorithms. The overall conclusion is drawn in Sect. 48.6.

48.2 Related Work

Xiao et al. [5] adopted a deep learning-based ensemble method for cancer prediction, with a DNN as the classification technique; experimental analysis on three benchmark datasets, LUAD, STAD, and BRCA, showed accuracies of 96.80%, 96.59%, and 95.76%, respectively, under cross-validation. The author in [6] used a multi-task CNN for cancer diagnosis and detection; the method was analyzed on 12 benchmark datasets, and the developed model reached its highest accuracy, 96.66%, on the Leukemia dataset. The author in [7] developed a model for cancer classification using the DNN method; its results were compared with Decision Tree, Naïve Bayes, and MLP, and the proposed model showed the highest accuracy on the Leukemia dataset. Kilicarslan et al. [8] used a CNN as the deep learning approach for cancer classification, with Relief and AE as the feature selection methods; the results were compared with an SVM classifier on three benchmark datasets, Leukemia, Ovarian, and CNS. Their analysis shows that the SVM reaches its highest accuracy, 96.14%, on the ovarian cancer dataset, while the CNN achieves 98.6%, 99.86%, and 83.95% on the Ovarian, Leukemia, and CNS datasets, respectively. Islam et al. [9] introduced a DNN as the classification technique for a breast cancer dataset; compared with state-of-the-art ML classification algorithms, the proposed DNN achieved the highest accuracy, 99.5%, with the top 450 selected features. Wang et al. [10] developed a model based on a stochastic version of the AE, known as the denoising autoencoder (DA), for lung cancer classification, achieving 95.4% classification accuracy. In [11] the author used a DNN on an NSCLC lung cancer dataset with 614 samples, reporting an AUC of 81.63% and a classification accuracy of 75.44%, and mentioned feature selection for improving classification accuracy as future scope for the developed model. The author in [12] proposed an RNN-based cancer classification technique for two benchmark datasets, Leukemia and breast cancer, with the highest accuracy, 95.3%, obtained on the Leukemia dataset. Zhang et al. [13] proposed


an ensemble model based on LDA, AE, and DNN for breast cancer classification. The results were compared with the GPMKL, MLP, and SWK methods to show the difference and the impact of feature selection combined with deep learning; the highest accuracy, 98.27%, was obtained with the proposed model. Doppalapudi et al. [14] proposed a model based on AE and CNN for lung cancer survival prediction, addressing both classification and regression, and showed that the proposed model achieves the highest accuracy compared to other DL approaches such as CNN, ANN, and RNN, with 71.78% classification accuracy and an RMSE of 13.5% when predicting the survival period of lung cancer patients. Arya et al. [15] developed a deep learning-based model for breast cancer survival prediction, using a sigmoid-gated CNN (SiGaAtCNN) over the METABRIC and TCGA-BRCA datasets; compared with other deep learning methods such as DNN and RNN, the proposed model shows the highest accuracy, 91.2%. The author in [16] proposed a deep learning-based model for cancer diagnosis from imbalanced data, using WGAN and RNN models to solve the class imbalance and improve prediction accuracy over the LUAD, STAD, and BRCA datasets; the empirical analysis shows the highest accuracy, 96.67%, on the STAD dataset. The author in [17] utilized a CNN over 11 benchmark cancer datasets for cancer classification, comparing it with the MSVM-RFE and varSelRF models; the proposed system shows the highest accuracy, 98%, on the Leukemia dataset.

48.3 Materials and Methodology

This section discusses the different materials and methods used during the research work.

48.3.1 Dataset Description

For the current research work, four benchmark datasets, namely lung cancer, brain cancer, prostate cancer, and colon cancer, were taken from the UCI public data repository. The datasets are described in Table 48.1.

Table 48.1 Dimensions of the datasets used

Dataset           Dimension    Number of classes
Brain cancer      28 × 1071    2
Lung cancer       181 × 1627   2
Prostate cancer   102 × 340    2
Colon cancer      62 × 2001    2

48.3.2 Convolutional Neural Network

The Convolutional Neural Network (CNN), developed by LeCun in the 1990s, is a variant of the neural network inspired by the structure of the human brain. Its basic building blocks are the input layer, the latent layers, and the output layer. The latent part is a combination of one or more convolutional layers, pooling layers, and fully connected layers. The convolutional layer is the primary layer: specific filters are applied in each convolutional layer and, at the end, the outputs are combined. The pooling layers condense the features obtained from the convolutional layer; in the case of image data, the pooling layer creates various grids from the convolutional layer's output. The fully connected layer nearly completes the CNN by applying a dense function; its main purpose is to let the network receive input transmitted with a predefined vector length while maintaining data integrity.

48.4 Proposed Work

In the current work, the CNN is proposed as the DL classification model for the four benchmark datasets, namely brain cancer, lung cancer, prostate cancer, and colon cancer, with the dimensions specified in Table 48.1. The CNN has two convolutional layers, one pooling layer, and one fully connected layer; Fig. 48.1 shows its architecture. The convolutional layers are one-dimensional, since microarray data are considered here. The main principle of the convolutional layer is to extract features from the input data using filters of a defined dimension. For brain cancer, the input matrix is 28 × 1, with kernel size 2 and the ReLU activation function; the first convolutional layer uses 16 filters with 3 features considered per filter, and the second convolutional layer uses 32 filters with 3 features per filter. After batch normalization, pooling is applied to obtain a better feature set: the max-pool technique is used with pool size 2 and stride 1. To avoid overfitting, a dropout of 20% is applied. The total number of features obtained at the fully connected layer, after flattening the network, is 896. For


Fig. 48.1 Architecture of proposed CNN model

lung cancer, the input matrix of the first convolutional layer is 181 × 1, with 16 filters and 3 features per filter; the input to the second convolutional layer is 180 × 1, with 32 filters. After batch normalization, max pooling is applied with size 2 and stride 1, and the fully connected layer, after pooling and flattening the network, contains 1357 features. For prostate cancer, the input matrix of the first convolutional layer is 102 × 1 with kernel size 2, 16 filters, ReLU activation, and 3 features per filter; the second convolutional layer takes a 101 × 1 input with 32 filters and 3 features per filter. At the output of the second convolutional layer, max pooling is applied with pool size 2 and stride 1, and the fully connected layer contains 266 features after pooling and flattening, with a dropout of 0.2. For colon cancer, the input matrix of the first convolutional layer is 62 × 1 with kernel size 2, 16 filters, and 3 features per kernel; the second convolutional layer has a 61 × 1 input with kernel size 2 and 32 filters. After batch normalization, max pooling is applied with pool size 2 and stride 1, and the fully connected layer contains 1878 features after pooling and flattening, with a dropout of 0.2. To enhance classification accuracy, the classification itself is carried out probabilistically with the sigmoid function: since binary-class datasets are considered in this work, the sigmoid normalizes the obtained values into probabilities. Figure 48.1 shows the basic architecture of the proposed CNN model, and Fig. 48.2 shows the workflow of the research work; a sketch of this topology is given below.
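The following is a minimal Keras sketch of the topology just described, instantiated for the brain cancer input. The kernel size (2), filter counts (16 and 32), max pooling with size 2 and stride 1, 20% dropout, and sigmoid output follow the text; the optimizer, loss, and the literal reading of the 28 × 1 input length are assumptions, not the authors' exact configuration.

# Minimal sketch of the proposed 1-D CNN (assumptions noted above).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv1D, BatchNormalization, MaxPooling1D,
                                     Dropout, Flatten, Dense)

def build_cnn(input_length: int) -> Sequential:
    model = Sequential([
        Conv1D(16, kernel_size=2, activation="relu",
               input_shape=(input_length, 1)),         # first convolutional layer
        Conv1D(32, kernel_size=2, activation="relu"),  # second convolutional layer
        BatchNormalization(),                          # batch normalization step
        MaxPooling1D(pool_size=2, strides=1),          # max pool, size 2, stride 1
        Dropout(0.2),                                  # 20% dropout vs. overfitting
        Flatten(),                                     # flatten for the dense stage
        Dense(1, activation="sigmoid"),                # probabilistic binary output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn(28)   # e.g. the 28 × 1 brain cancer input described above
model.summary()

Note that the exact feature count at the flatten stage depends on the kernel and pool settings, so the sketch will not necessarily reproduce the 896/1357/266/1878 figures reported in the text.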


Fig. 48.2 Workflow of the model developed in the research work

48.5 Result and Analysis

The proposed model was executed in Anaconda with Keras. Initially, the dataset is preprocessed using the standard scaler method; it is then split into training and testing groups, with 10% held out for testing, as sketched below. An HP laptop with a Core i5 processor at a 2.6 GHz clock speed and 8 GB RAM was used to create the deep learning environment and to execute the proposed model; the overall time taken to train and test it is approximately 2 h. For performance analysis, the train–test validation method was adopted. The proposed model was then compared with state-of-the-art ML classification algorithms, namely SVM, Random Forest, Decision Tree, Naïve Bayes, Logistic Regression, and kNN. Table 48.2 shows the results obtained with the proposed model and the ML models, Fig. 48.3 shows the corresponding performance analysis, Fig. 48.4 shows the loss of the CNN per epoch for every dataset, and Table 48.3 compares the proposed work with existing CNN-based models. Figure 48.3 and Table 48.2 show that the proposed CNN performs well compared to the other state-of-the-art machine learning algorithms. Figure 48.4a–d shows the model loss per epoch for the various datasets: the loss is lowest for the brain
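A sketch of the preprocessing and split described above follows; the placeholder arrays stand in for a real microarray matrix, and the random seed is an assumption.

# Standard scaling and a 90/10 train-test split, as described in Sect. 48.5.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = np.random.rand(181, 1627)        # placeholder, e.g. lung cancer dimensions
y = np.random.randint(0, 2, 181)     # placeholder binary labels

X = StandardScaler().fit_transform(X)               # standard scaler method
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=42)
# Conv1D expects a trailing channel axis: (samples, features, 1)
X_train = X_train[..., np.newaxis]
X_test = X_test[..., np.newaxis]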


Table 48.2 Performance analysis of different classifiers over different datasets

Dataset           Methodology   Accuracy   Precision   Specificity   Sensitivity
Brain cancer      CNN           0.998      0.988       0.985         0.979
                  SVM           0.917      0.911       0.911         0.898
                  RF            0.905      0.901       0.892         0.889
                  DT            0.891      0.884       0.879         0.869
                  KNN           0.867      0.875       0.875         0.873
                  LR            0.887      0.869       0.871         0.876
                  NB            0.896      0.880       0.872         0.866
Lung cancer       CNN           0.987      0.976       0.971         0.979
                  SVM           0.903      0.897       0.895         0.900
                  RF            0.891      0.886       0.886         0.893
                  DT            0.900      0.891       0.876         0.895
                  KNN           0.890      0.886       0.881         0.887
                  LR            0.865      0.859       0.862         0.858
                  NB            0.854      0.849       0.831         0.835
Prostate cancer   CNN           0.976      0.961       0.968         0.966
                  SVM           0.912      0.896       0.902         0.890
                  RF            0.886      0.870       0.881         0.876
                  DT            0.875      0.861       0.856         0.856
                  KNN           0.865      0.859       0.855         0.854
                  LR            0.827      0.815       0.806         0.812
                  NB            0.846      0.844       0.836         0.832
Colon cancer      CNN           0.981      0.969       0.962         0.958
                  SVM           0.865      0.859       0.854         0.858
                  RF            0.901      0.900       0.896         0.895
                  DT            0.905      0.892       0.889         0.891
                  KNN           0.783      0.776       0.775         0.781
                  LR            0.796      0.791       0.789         0.786
                  NB            0.813      0.798       0.786         0.792

cancer dataset. In the case of brain cancer, the CNN achieves an accuracy of 99.8%, and on the lung cancer dataset 98.7%. For prostate cancer, the CNN shows 97.6% classification accuracy, and for colon cancer 98.1%.


Fig. 48.3 Performance Analysis for different Classifiers

48.6 Conclusion

With the spread of cancer, the death toll continues to rise, which motivates the development of models for diagnosing the disease at an early stage and thereby reducing the death rate. Microarray data plays a vital role in correctly diagnosing the disease. Various models have been developed using the conventional ML approach, but the shortcomings inherent in microarray data make ML-based modeling challenging; although ML offers solutions such as dimensionality reduction and optimization, implementing them increases computational complexity. In this research work, the DL-based CNN method was therefore adopted for microarray data analysis. The proposed CNN was implemented over four benchmark datasets, and its performance was compared with state-of-the-art ML classification algorithms. The empirical analysis shows that the CNN performs well compared to the other ML models and gives the highest accuracy on the brain cancer dataset.


Fig. 48.4 a Loss per epoch for brain cancer. b Loss per epoch for lung cancer. c Loss per epoch for prostate cancer. d Loss per epoch for colon cancer

Table 48.3 Performance comparison of the proposed model with existing models

Method     CT    Dataset        Dataset type   Accuracy
[6]        CNN   Leukemia       Image          96.6
[14]       CNN   Lung cancer    Image          71.78
[15]       CNN   TCGA-BRCA      Image          91.2
[17]       CNN   Leukemia       Image          98
Proposed   CNN   Brain cancer   Microarray     99.8


References

1. Danaee, P., Ghaeini, R., Hendrix, D.A.: A deep learning approach for cancer detection and relevant gene identification. In: Pacific Symposium on Biocomputing 2017. World Scientific, Singapore (2017)
2. Bolón-Canedo, V., Alonso-Betanzos, A., López-de-Ullibarri, I., Cao, R.: Challenges and future trends for microarray analysis. In: Microarray Bioinformatics, pp. 283–293. Humana, New York, NY (2019)
3. Daoud, M., Mayo, M.: A survey of neural network-based cancer prediction models from microarray data. Artif. Intell. Med. 97, 204–214 (2019)
4. Iqbal, M.S., Ahmad, I., Bin, L., Khan, S., Rodrigues, J.J.: Deep learning recognition of diseased and normal cell representation. Trans. Emerg. Telecommun. Technol. e4017 (2020)
5. Xiao, Y., Wu, J., Lin, Z., Zhao, X.: A deep learning-based multi-model ensemble method for cancer prediction. Comput. Methods Programs Biomed. 153, 1–9 (2018)
6. Liao, Q., Ding, Y., Jiang, Z.L., Wang, X., Zhang, C., Zhang, Q.: Multi-task deep convolutional neural network for cancer diagnosis. Neurocomputing 348, 66–73 (2019)
7. Chandrasekar, V., Sureshkumar, V., Kumar, T.S., Shanmugapriya, S.: Disease prediction based on micro array classification using deep learning techniques. Microprocessors Microsyst. 77, 103189 (2020)
8. Kilicarslan, S., Adem, K., Celik, M.: Diagnosis and classification of cancer using hybrid model based on ReliefF and convolutional neural network. Med. Hypotheses 137, 109577 (2020)
9. Islam, M.M., Huang, S., Ajwad, R., Chi, C., Wang, Y., Hu, P.: An integrative deep learning framework for classifying molecular subtypes of breast cancer. Comput. Struct. Biotechnol. J. 18, 2185–2199 (2020)
10. Wang, J., Xie, X., Shi, J., He, W., Chen, Q., Chen, L., Zhou, T.: Denoising autoencoder, a deep learning algorithm, aids the identification of a novel molecular signature of lung adenocarcinoma. Genomics Proteomics Bioinform. (2020)
11. Lai, Y.H., Chen, W.N., Hsu, T.C., Lin, C., Tsao, Y., Wu, S.: Overall survival prediction of non-small cell lung cancer by integrating microarray and clinical data with deep learning. Sci. Rep. 10(1), 1–11 (2020)
12. Zhu, W., Xie, L., Han, J., Guo, X.: The application of deep learning in cancer prognosis prediction. Cancers 12(3), 603 (2020)
13. Zhang, X., et al.: Deep learning based analysis of breast cancer using advanced ensemble classifier and linear discriminant analysis. IEEE Access 8, 120208–120217 (2020). https://doi.org/10.1109/ACCESS.2020.3005228
14. Doppalapudi, S., Qiu, R.G., Badr, Y.: Lung cancer survival period prediction and understanding: deep learning approaches. Int. J. Med. Inf. 148, 104371 (2021)
15. Arya, N., Saha, S.: Multi-modal advanced deep learning architectures for breast cancer survival prediction. Knowl.-Based Syst. 221, 106965 (2021)
16. Xiao, Y., Wu, J., Lin, Z.: Cancer diagnosis using generative adversarial networks based on deep learning from imbalanced data. Comput. Biol. Med. 104540 (2021)
17. Zeebaree, D.Q., Haron, H., Abdulazeez, A.M.: Gene selection and classification of microarray data using convolutional neural network. In: 2018 International Conference on Advanced Science and Engineering (ICOASE), pp. 145–150 (2018). https://doi.org/10.1109/ICOASE.2018.8548836

Chapter 49

Comparative Study of Machine Learning Algorithms for Breast Cancer Classification

Yashowardhan Shinde, Aryan Kenchappagol, and Sashikala Mishra

Abstract Breast cancer is one of the diseases with a high number of deaths every year and has become very prominent among Indian women in recent times. Cancer must be detected at the early stages of its formation to improve mortality rates. This paper focuses on a comparison of multiple machine learning models, combined with different filters, for the detection of breast cancer. The models used in this experiment include logistic regression, random forest, extreme gradient boosting, Gaussian Naive Bayes, and K-nearest neighbors, and it is observed that their performance improves significantly with the implementation of various data manipulation techniques. The best models for classification are found to be random forest (accuracy = 98.04% and F1-score = 96.27%) and the XGBoost classifier (accuracy = 98.32% and F1-score = 98.31%).

49.1 Introduction

Breast cancer originates in the breast; it begins when tumor cells start to grow rapidly. The tumor formed by breast cancer cells can be detected on an x-ray or when the presence of a lump is felt. It occurs mostly in women, but men can get breast cancer too [1–4]. Benign breast lumps are more common than malignant ones: non-cancerous breast tumors cause abnormal growths in the breast but cannot spread to other parts of the body. Despite their benign nature, some forms of benign breast lumps can increase a woman's chances of developing breast cancer. India now has more cases of breast cancer than of cervical cancer [2–4], which was previously the most common type. The areas that need attention are early detection, primary prevention, diagnostic modalities including treatment, pathology, translational research including biomaterials, and palliative care [2, 3].


Statistically, India has a much lower number of breast cancer patients than Western countries, even after accounting for the age structure of the population: around one-third the incidence in urban areas and one-ninth in rural regions [1]. Population screening is less common in India, but increasing it is unlikely to make up for the health disparity on its own, since other factors, such as lifestyle, reproduction, and diet, also contribute to it [2–4]. The factors that protect Indian women against breast cancer need to be researched, preserved, and promoted in a systematic way. To detect breast cancer at early stages and minimize the death rate from the disease, machine learning can play a crucial role: data can be extracted from mammographic images and fed to machine learning models, which can then predict whether a patient has cancer or not [1].

49.2 Related Work

Miller AB et al. studied whether it was possible to detect breast cancer and reduce the death rate in women aged 40–49 through a combination of methods such as mammography, physical examination, and the teaching of breast self-examination; their method detected more node-negative, smaller tumors than expected but had no impact on the death rate from the disease [5]. Abien Fred et al. performed a comparative study of various machine learning algorithms for the detection of breast cancer, namely GRU-SVM, nearest neighbor (NN) search, support vector machine (SVM), multilayer perceptron (MLP), linear regression, and softmax regression, and concluded that the MLP had the best performance (accuracy = 99.04%) on the Wisconsin diagnostic breast cancer dataset [6]. M. R. Al-Hadidi et al. tried detecting breast cancer through a simulation involving image processing of mammography images and feature extraction, with the extracted features fed as input to two ML algorithms, logistic regression and a back-propagation neural network [7]. Hiba Asri et al. focused on the correctness of the classification of prognostic data, comparing algorithms in terms of precision, specificity, sensitivity, and accuracy; their experimental results show an accuracy of up to 97.1% with the support vector machine [8]. Anji Reddy Vaka et al. proposed an unusual method for detecting breast cancer with machine learning and experimented on a dataset to study the results of their model, which produced efficient and highly accurate results compared to the available methods [9]. A. Osareh and B. Shadgar proposed solving this problem using three machine learning approaches, namely support vector machines and a combination of K-nearest neighbors and probabilistic neural network-based classifiers, which yielded accuracies of 98.80% and 96.33%, respectively [10]. Mohammed S. A. et al. proposed detecting breast cancer using machine learning models such as decision tree (J48), sequential minimal optimization (SMO), and Naïve Bayes (NB), together with resampling filters to optimize the results, which proved effective [11]. Shubham Sharma et al. aimed to compare various machine learning algorithms and techniques


commonly used for breast cancer prediction; the results obtained after training and testing on the Wisconsin breast cancer dataset were very competitive and can be applied in detection and treatment [12]. Chaurasia et al. used the Wisconsin breast cancer dataset for the classification of breast cancer, analyzing and comparing the performance of three machine learning models; the results show that the BF and IBK tree methods had somewhat lower accuracy than sequential minimal optimization (SMO), at 96.2% [13]. Yixuan Li and Zixuan Chen explored data mining methods that provide an effective way of classifying breast cancer using five different models, on two datasets: the Wisconsin breast cancer database (WBCD) and the breast cancer Coimbra dataset (BCCD) [14]. Ganggayah et al. used a hospital-based breast cancer dataset from a Malaysian hospital; the insights from the experiment include features such as tumor size, cancer stage classification, count of positive lymph nodes, type of primary treatment, method of diagnosis, and the total number of axillary lymph nodes removed [15]. Islam, M. M. et al. measured the performance of five supervised machine learning algorithms, namely K-nearest neighbors, support vector machines (SVMs), random forests, logistic regression, and artificial neural networks (ANNs), on the Wisconsin breast cancer dataset, using classification evaluation metrics including accuracy, precision, and F1-score; ANNs performed best, with an F1-score, accuracy, and precision of 98.90%, 98.57%, and 97.82%, respectively [16]. Ahmad LG et al. studied advanced mining techniques that help discover hidden patterns and relationships among the features in a dataset, for the development of predictive models of breast cancer recurrence; the results, after tenfold cross-validation of each model, show accuracies of 93.6%, 95.7%, and 94.7% for the DT, SVM, and ANN models, respectively [17].

49.3 Overview of Machine Learning Models

a. Accuracy

Accuracy is one of the evaluation metrics used for assessing classification. For binary classification problems, accuracy can be determined in terms of the negative and positive predictions made by the model:

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (49.1)

Here, FP = false positives, FN = false negatives, TP = true positives, TN = true negatives.

b. F1-Score

The F1-score is a qualitative and extensively used evaluation metric for classification models; it is calculated from two terms called recall and precision:

F1-score = 2 * (precision * recall) / (precision + recall)    (49.2)

Precision is defined as the ratio of the true positives to the total number of predicted positives. When precision is 0, the model is wrongly classifying all the positives in the dataset:

precision = TP / (TP + FP)    (49.3)

Recall is defined as the ratio of the true positives to the sum of true positives and false negatives:

recall = TP / (TP + FN)    (49.4)

c. Area Under Curve (AUC) Score

The area under the curve (AUC) gives the probability that a random positive example is ranked above a random negative example. The AUC score always lies between 0 and 1: when a model predicts all samples correctly, the AUC score is 1, and when the model makes no correct prediction, the AUC score is 0.
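For reference, the metrics defined in Eqs. (49.1)–(49.4), together with the AUC score, can be computed with scikit-learn as in the following sketch; the label vectors are illustrative placeholders.

# Computing the metrics of Sect. 49.3 with scikit-learn (placeholder labels).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [0, 1, 1, 0, 1, 0, 1, 1]                     # ground-truth classes
y_pred  = [0, 1, 0, 0, 1, 1, 1, 1]                     # hard predictions
y_score = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.95]    # predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))    # Eq. (49.1)
print("f1-score :", f1_score(y_true, y_pred))          # Eq. (49.2)
print("precision:", precision_score(y_true, y_pred))   # Eq. (49.3)
print("recall   :", recall_score(y_true, y_pred))      # Eq. (49.4)
print("auc      :", roc_auc_score(y_true, y_score))    # AUC from probabilities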

49.4 Dataset Description

Two datasets are used in the experiment, namely the Wisconsin breast cancer diagnostic and prognostic datasets, both available in the UCI machine learning repository. The diagnostic dataset contains 569 instances and 32 attributes, with no missing values [21]; the prognostic dataset has 198 samples and 34 attributes, also with no missing values [22]. The first 30 features of both the prognostic and the diagnostic dataset were computed from a digitized image of a fine needle aspirate of a breast mass.

49.5 Proposed Model

Figure 49.1 shows the flow of the proposed model and the steps involved in the process.


Fig. 49.1 Flow diagram of proposed model

The process consists of EDA and data preprocessing, oversampling of the data, handling of multicollinearity, and testing of the models on data with and without multicollinearity. These steps are explained in detail below.

49.5.1 Exploratory Data Analysis and Data Preprocessing

Initially, both datasets had an imbalance between the number of positive and negative data points: in the prognostic dataset, 148 samples were of the non-recurrent class and 46 samples of the recurrent class, while the diagnostic dataset has 357 samples of the B (benign) class and 212 samples of the M (malignant) class. To fix the imbalance, random oversampling was applied, converting the ratio of samples in each class to 1:1. On examining the distribution of the variables in both datasets, it was found that many variables are skewed; to fix this problem, a quantile transform was used to convert the data to a normal distribution. Further, on calculating the correlation between the variables, it was observed that multiple columns are highly correlated with each other, which can have an abnormal effect on the final prediction. In the heat maps plotted in Figs. 49.2 and 49.3, a very high correlation can be seen between the perimeter, area, and radius, as well as between the concave points and concavity, with respect to the mean, worst, and standard deviation values; on further investigation, it was discovered that the "mean", "worst", and "SE" values have a high correlation with their respective attributes. All the variables with


Fig. 49.2 Heat map for correlation of variables in diagnostic dataset

Fig. 49.3 Heat map for correlation of variables in prognostic dataset

All the variables with high correlation are shown in Figs. 49.2 and 49.3, where it is seen that a large number of variables are correlated in both datasets.
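As a rough illustration of these preprocessing steps (not the authors' code), the oversampling, quantile transform, and correlated-column removal could be sketched as follows; X is assumed to be a pandas DataFrame of features with labels y, and the 0.9 correlation cutoff is an assumption:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import QuantileTransformer

# Random oversampling to a 1:1 class ratio
X_res, y_res = RandomOverSampler(sampling_strategy=1.0).fit_resample(X, y)

# Quantile transform: map skewed variables to a normal distribution
X_res[:] = QuantileTransformer(output_distribution="normal").fit_transform(X_res)

# Drop one column from each highly correlated pair (assumed cutoff of 0.9)
corr = X_res.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X_reduced = X_res.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])
```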

49.5.2 Model Evaluation and Results

The models used for training and evaluation are random forest, the XGBoost classifier, and KNN. The models are evaluated using stratified K-fold cross-validation with K = 10, i.e., a training-to-test ratio of 9:1; this gives a better overview of how the models perform in the real world and avoids introducing bias. For comparison, the models are trained over data with the fewer, less-correlated columns as well as over data with all the columns. The results obtained from the evaluation process are shown in Tables 49.1 and 49.2.
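A minimal sketch of this evaluation protocol, assuming a feature matrix X and labels y (scikit-learn plus the xgboost package; hyperparameters beyond K = 10 are library defaults and thus assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # K = 10
models = {
    "Random forest": RandomForestClassifier(),
    "XGBoost classifier": XGBClassifier(),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {100 * scores.mean():.2f}% accuracy")
```

The same loop is run twice: once on the data with correlated columns dropped and once on the full feature set.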


Table 49.1 Diagnostic dataset results

Dropping correlated columns
Model name          Accuracy  F1-score  Precision  Recall  AUC score
Random forest       98.04     98.06     98.61      97.53   99
XGBoost classifier  98.04     98.07     98.88      97.33   99
KNN                 95.51     95.49     94.66      96.42   99

Without dropping correlated columns
Random forest       97.34     97.37     98.33      96.48   99
XGBoost classifier  98.32     98.31     98.61      98.06   99
KNN                 96.49     96.46     95.79      97.20   99

Table 49.2 Prognostic dataset results

Dropping correlated columns
Model name          Accuracy  F1-score  Precision  Recall  AUC score
Random forest       93.54     94.16     97.34      92.14   99
XGBoost classifier  92.54     93        97.34      89.63   99
KNN                 73        75.27     81.14      71.02   80

Without dropping correlated columns
Random forest       96.27     96.46     98.67      94.73   99
XGBoost classifier  91.15     92.2      99.34      86.72   99
KNN                 64.21     66.82     71.62      64.48   73

From the results in Table 49.1, it is visible that all the models perform well on the diagnostic dataset; the model that stands out the most is the XGBoost classifier. It is also observed that there is no significant performance boost or lag after dropping the columns to avoid multicollinearity in any of the models. From the results in Table 49.2, it is visible that all the models perform well on the prognostic dataset; the model that stands out the most is the random forest classifier. It is also observed that the XGBoost classifier achieves a significant performance boost after dropping the columns to avoid multicollinearity (Fig. 49.5). The results obtained show that the random forest and XGBoost classifiers performed well on both the diagnostic and prognostic datasets, whereas KNN performed well on the diagnostic dataset but lagged on the prognostic dataset. Another observation is that multicollinearity is not a problem for the tree-based models, i.e., random forest and XGBoost, but it does affect the performance of KNN on the prognostic dataset. The AUC-ROC curves of the models can be seen in Figs. 49.4 and 49.5, where it is visible that random forest and XGBoost were consistent in their results but KNN was not.


Fig. 49.4 AUC-ROC curves for KNN, random forest, and XGBoost for diagnostic dataset

Fig. 49.5 AUC-ROC curves for KNN, random forest and XGBoost for prognostic dataset


49.6 Conclusion

It can be concluded that the random forest classifier works well with both datasets, with an accuracy of 98.04% and F1-score of 98.06% on the diagnostic dataset and an accuracy of 96.27% and F1-score of 96.46% on the prognostic dataset. This trend is followed by the XGBoost classifier, which also shows good results on both datasets, with an accuracy of 98.32% and F1-score of 98.31% on the diagnostic dataset and an accuracy of 92.54% and F1-score of 93% on the prognostic dataset. Unlike these two classifiers, KNN performs well on the diagnostic dataset, achieving an accuracy of 96.49% and F1-score of 96.46%, but performs poorly on the prognostic dataset, where it gives an accuracy of 64.21% and F1-score of 66.82%. As future scope, an ensemble of multiple models can be created to achieve better overall performance.

References

1. Gupta, S.: Breast cancer: Indian experience, data, and evidence. South Asian J. Cancer 5(3), 85–86 (2016). https://doi.org/10.4103/2278-330X.187552. PMID: 27606287; PMCID: PMC4991143
2. Bhattacharyya, S.G., Doval, D.C., Desai, C.J., Chaturvedi, H., Sharma, S., Somashekhar, S.P.: Overview of breast cancer and implications of overtreatment of early-stage breast cancer: an Indian perspective. JCO Global Oncol. 6, 789–798 (2020). https://doi.org/10.1200/GO.20.00033, PMID: 32511068
3. Mathur, P., Sathishkumar, K., Chaturvedi, M., Das, P., Sudarshan, K.L., Santhappan, S., Nallasamy, V., John, A., Narasimhan, S., Roselind, F.S.: Cancer statistics, 2020: report from National Cancer Registry Programme, India. JCO Global Oncol. 6, 1063–1075 (2020). https://doi.org/10.1200/GO.20.00122, PMID: 32673076
4. Prusty, R.K., Begum, S., Patil, A., et al.: Knowledge of symptoms and risk factors of breast cancer among women: a community based study in a low socio-economic area of Mumbai, India. BMC Women's Health 20, 106 (2020). https://doi.org/10.1186/s12905-020-00967-x
5. Miller, A.B., Baines, C.J., To, T., Wall, C.: Canadian national breast screening study: 1. Breast cancer detection and death rates among women aged 40 to 49 years. CMAJ 147(10), 1459–1476 (1992). Erratum Can. Med. Assoc. J. 148(5), 718 (1993). PMID: 1423087; PMCID: PMC1336543
6. Agarap, A.F.M.: On breast cancer detection: an application of machine learning algorithms on the Wisconsin diagnostic dataset. In: Proceedings of the 2nd International Conference on Machine Learning and Soft Computing (ICMLSC'18), pp. 5–9. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3184066.3184080
7. Al-Hadidi, M.R., Alarabeyyat, A., Alhanahnah, M.: Breast cancer detection using K-nearest neighbor machine learning algorithm. In: 2016 9th International Conference on Developments in eSystems Engineering (DeSE), pp. 35–39 (2016). https://doi.org/10.1109/DeSE.2016.8
8. Asri, H., Mousannif, H., Moatassime, H.A., Noel, T.: Using machine learning algorithms for breast cancer risk prediction and diagnosis. Procedia Comput. Sci. 83, 1064–1069 (2016), ISSN 1877-0509. https://doi.org/10.1016/j.procs.2016.04.224
9. Vaka, A.R., Soni, B., Sudheer Reddy, K.: Breast cancer detection by leveraging machine learning. ICT Express 6(4), 320–324 (2020), ISSN 2405-9595. https://doi.org/10.1016/j.icte.2020.04.009


10. Osareh, A., Shadgar, B.: Machine learning techniques to diagnose breast cancer. In: 2010 5th International Symposium on Health Informatics and Bioinformatics, pp. 114–120 (2010). https://doi.org/10.1109/HIBIT.2010.5478895
11. Mohammed, S.A., Darrab, S., Noaman, S.A., Saake, G.: Analysis of breast cancer detection using different machine learning techniques. In: Tan, Y., Shi, Y., Tuba, M. (eds.) Data Mining and Big Data. DMBD 2020. Communications in Computer and Information Science, vol. 1234. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-7205-0_10
12. Sharma, S., Aggarwal, A., Choudhury, T.: Breast cancer detection using machine learning algorithms. In: 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS), pp. 114–118 (2018). https://doi.org/10.1109/CTEMS.2018.8769187
13. Chaurasia, V., Pal, S.: A novel approach for breast cancer detection using data mining techniques. Int. J. Innov. Res. Comput. Commun. Eng. 2(1) (2014)
14. Li, Y., Chen, Z.: Performance evaluation of machine learning methods for breast cancer prediction. Appl. Comput. Math. 7(4), 212–216 (2018). https://doi.org/10.11648/j.acm.20180704.15
15. Ganggayah, M.D., Taib, N.A., Har, Y.C., et al.: Predicting factors for survival of breast cancer patients using machine learning techniques. BMC Med. Inform. Decis. Mak. 19, 48 (2019). https://doi.org/10.1186/s12911-019-0801-4
16. Islam, M.M., Haque, M.R., Iqbal, H., et al.: Breast cancer prediction: a comparative study using machine learning techniques. SN Comput. Sci. 1, 290 (2020). https://doi.org/10.1007/s42979-020-00305-w
17. Ahmad, L.G., Eshlaghy, A.T., Poorebrahimi, A., Ebrahimi, M., Razavi, A.R.: Using three machine learning techniques for predicting breast cancer recurrence. J. Health Med. Inform. 4, 124 (2013). https://doi.org/10.4172/2157-7420.1000124
18. Patel, J., Tejal, U.D., Patel, S.: Heart disease prediction using machine learning and data mining techniques. Heart Dis. 7(1), 129–137 (2015)
19. XingFen, W., Xiangbin, Y., Yangchun, M.: Research on user consumption behavior prediction based on improved XGBoost algorithm. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 4169–4175 (2018). https://doi.org/10.1109/BigData.2018.8622235
20. Shah, D., Patel, S., Bharti, S.K.: Heart disease prediction using machine learning techniques. SN Comput. Sci. 1, 345 (2020). https://doi.org/10.1007/s42979-020-00365-y
21. UCI Machine Learning Data Repository—Wisconsin breast cancer prognostic dataset
22. UCI Machine Learning Data Repository—Wisconsin breast cancer diagnostic dataset

Chapter 50

miRNAs as Biomarkers for Breast Cancer Classification Using Machine Learning Techniques Subhra Mohanty, Saswati Mahapatra, and Tripti Swarnkar

Abstract MicroRNA (miRNA) is a small noncoding RNA that regulates the expression levels of genes at the post-transcriptional level. The expression of small noncoding miRNAs is altered between cancer tissue and normal tissue, and miRNAs have emerged as key candidates for breast cancer (BC), which accounts for one-fourth of all diagnosed cancers and affects one in eight females. There is a practical need to identify the most influential miRNAs in BC for a clear understanding of cancer prognosis and treatment. In this work, miRNA expression data (GSE58606) has been utilized for identifying key miRNAs in breast cancer. A filter-based feature selection approach has been used for candidate miRNA identification. Further, the predictive performance of the identified miRNAs was validated using the classifiers Naïve Bayes, support vector machine, K-nearest neighbor, and neural networks under various evaluation metrics. For all the classifiers, the nine selected miRNA candidates were observed to have predictive accuracy equivalent to that of all the miRNAs in the dataset. The selected miRNAs were also verified to be significantly differentially expressed and to take part in crucial biological activities in breast cancer.

50.1 Introduction

Breast cancer is the deadliest cancer occurring in females, and it remains a challenge for carcinogenic diagnosis. In 2020, as per data from the NCI and WHO, 2.3 million women were diagnosed with breast cancer and over 685,000 deaths were recorded globally. Breast cancer arises in the lining epithelial cells of the lobules (15%) or the ducts (85%) occurring in the glandular tissue of the breast [1].

S. Mohanty
Department of Computer Science and Engineering, Siksha O Anusandhan Deemed to be University, Bhubaneswar, India

S. Mahapatra (B) · T. Swarnkar
Department of Computer Application, Siksha O Anusandhan Deemed to be University, Bhubaneswar, India
e-mail: [email protected]

T. Swarnkar
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_50


Initially, cancerous tissue growth is confined to the lobule or duct ("in situ"), generally causing no symptoms and having minimal potential for spread (metastasis) [2]. Over time, these in situ cancers progress and spread into the surrounding breast tissue, and may then spread to other organs as well; widespread metastasis may cause a woman's death. Therefore, early diagnosis of cancer is emerging as a major research challenge in the omics era [3]. A combination of different types of therapy and surgical treatment may lead to timely treatment of breast cancer.

MicroRNAs are small noncoding RNAs of ~22 nucleotides which function in post-transcriptional gene regulation in various biological and anatomical processes by binding selective mRNAs. They can cause negative regulation of messenger RNAs (mRNAs), which may lead to either translational repression or mRNA degradation [4]. Aberrant expression of miRNAs is involved in the regulation of breast cancer cells [5]. Different biological parameters such as apoptotic response, cell proliferation, metastasis, chemoresistance, and cancer recurrence are regulated by either tumor suppressor miRNAs (tsmiRs) or oncogenic miRNAs (oncomiRs). Among miRNAs, tsmiRs inhibit the expression of oncogenes that promote breast tumorigenesis [6], while oncomiRs are overexpressed in breast cancer and are attributed to breast malignancy.

Many computational approaches have been proposed in the literature addressing breast cancer diagnosis and treatment using miRNA signatures. Microcalcifications are early breast cancer symptoms; a partially automated segmentation method was proposed in [6] to identify microcalcifications in breast masses, and the predictive accuracy of the proposed deep learning model was evaluated on large datasets. Sathipati and Ho used a support vector machine (SVM)-based classifier, SVM-BRC, for early and advanced-stage diagnosis of breast cancer [1]. SVM-BRC also performed well with an optimal feature selection method, combining a genetic algorithm to identify miRNA biomarkers. Sarkar et al. identified the most significant miRNA biomarkers of breast cancer with the help of NGS data [7].

This work aims at identifying key miRNA signatures with high predictive ability in breast cancer diagnosis and prognostic applications. In this study, we have performed the experiment on the miRNA expression dataset (GSE58606). A correlation-based feature selection has been applied, which produced a reduced set of features. Further, the predictive ability of the selected miRNA features has been assessed and compared using four different classifiers: support vector machine (SVM), Naïve Bayes (NB), K-nearest neighbor (KNN, n = 3), and neural networks (NN). The selected miRNA features were biologically confirmed with respect to their differential expression and functional association in various cancer pathways. The reduced set of features was able to successfully classify subjects with primary breast cancer or normal breast tissue with an accuracy of 100% for the SVM, Naïve Bayes, KNN, and NN models, while the KNN model score on the base features reached 95%.


50.2 Dataset Used

The miRNA expression dataset with GEO accession number GSE58606 from NCBI (http://www.ncbi.nlm.nih.gov) has been utilized in this study. The dataset comprises 1926 miRNA features and 133 samples, among which 122 samples were from patients having primary breast cancer and 11 samples were from patients having normal breast tissue.

50.3 Proposed Model

The following sections elaborate the steps of our proposed work for significant miRNA feature identification. Figure 50.1 illustrates the steps of the proposed work.

Fig. 50.1 Proposed model for identification of significant miRNAs for breast cancer classification


50.3.1 Data Preparation

miRNA features obtained from NCBI suffer from the missing value problem due to various experimental reasons, and the presence of missing values can adversely affect downstream analysis. Hence, before moving ahead, features with more than 30% NULL values across samples were removed [8], which resulted in 1560 miRNA features; the remaining missing values were replaced with the mean value of the observed data. In our dataset, the normal breast tissue samples are considerably fewer than the primary breast cancer tissue samples. Hence, to balance the ratio of tumor to normal samples in the considered dataset, a data augmentation technique drawing on features from our dataset was applied to the input data. Initially, the filtered dataset consisted of 1560 miRNA features with 133 samples; after applying data augmentation, we obtained a dataset with 175 samples and 1560 miRNA features, of which 122 samples were from patients having primary breast cancer and 53 from normal breast tissue. Finally, min-max normalization was applied to bring the data within a specific range.
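A minimal pandas/scikit-learn sketch of the filtering, imputation, and normalization steps described above (the paper's augmentation technique is not specified in detail, so it is omitted here); X is assumed to be a samples-by-features DataFrame:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Drop miRNA features with more than 30% NULL values across samples
X = X.loc[:, X.isna().mean() <= 0.30]

# Replace remaining missing values with the mean of the observed data
X = X.fillna(X.mean())

# Min-max normalization to bring every feature into a fixed range
X[:] = MinMaxScaler().fit_transform(X)
```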

50.3.2 Feature Selection

A correlation-based feature selection (CFS) has been applied to the normalized dataset, which ranks the features based on their correlation coefficient values. This CFS approach assumes that features showing high correlation with the class label are more relevant than features showing low correlation with the class. Pearson correlation has been used to find the correlation between the features and the target, and a correlation coefficient value >0.5 has been considered for identifying key miRNA features. The feature selection step of our approach resulted in 9 miRNA features. Further, the predictive ability of these selected features was examined in the following step.
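As an illustration (the names are ours, not the paper's), the Pearson-correlation filter could be sketched as follows, with X the normalized DataFrame and y a numeric label series:

```python
# Pearson correlation of each feature with the class label
corr_with_target = X.apply(lambda col: col.corr(y))

# Keep features whose correlation with the class exceeds 0.5
selected = corr_with_target[corr_with_target > 0.5].index
X_selected = X[selected]
print(len(selected), "miRNA features retained")  # 9 in this study
```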

50.3.3 Classifier Models

Four different classifiers, namely neural networks (NN), Naïve Bayes (NB), K-nearest neighbor (KNN), and support vector machine (SVM), have been used to examine the predictive performance of the 9 selected features. The classifier performance of the selected features was also compared with that of the base features.

A. Neural Networks

Neural networks are a series of algorithms, or circuits of neurons, consisting of nodes or artificial neurons; made up of perceptron layers, such a network is also known as a "multilayer perceptron." In a neural network architecture, neurons receive information as a set of inputs, and the connections between neurons are known as weights.


Negative weights denote an inhibitory connection, whereas positive weights denote an excitatory connection. As the input values are updated, all weights are modified and their weighted summation is considered. An activation function controls the amplitude of the output; usually, the acceptable range of output is between 0 and 1. The ANN technique was used in [9] for early detection of breast cancer. In this work, a sigmoid activation function with a learning rate of 0.001 has been used to train the NN model.

B. K-Nearest Neighbors

K-nearest neighbor is a supervised machine learning algorithm used to predict a target variable from multiple independent variables [10]. KNN classifies an object by considering the majority vote of its neighbors: the object is assigned to the class most common among its k-nearest neighbors. We ran repeated experiments with different values of k and finally considered the value giving the best classification performance; we have used k = 3 for classification.

C. Support Vector Machine

SVM is a widely used supervised machine learning algorithm based on a statistical framework. The SVM algorithm maps each training sample into an n-dimensional space, where n is the number of features [1]; the value of each feature signifies a particular coordinate. In this way, it constructs a hyperplane or a set of hyperplanes which can be used for classification. It then classifies new samples by mapping them into the same n-dimensional space and predicting the category they belong to based on which side of the hyperplane they fall.

D. Naïve Bayes

Naïve Bayes is a classification algorithm. It calculates the prior probability of the class labels given each attribute for every class, and then puts the values into the Bayes formula to calculate the posterior probability [11]; the input is assigned to the class with the higher probability. Here, Gaussian Naïve Bayes has been used for binary classification; it returns the probabilities of the two classes, taking the label set y as input.
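A sketch of the four classifiers as scikit-learn models; only the hyperparameters stated in the text (sigmoid activation, learning rate 0.001, k = 3, Gaussian NB) are set, and the rest are library defaults and thus assumptions:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "NN": MLPClassifier(activation="logistic", learning_rate_init=0.001),  # sigmoid activation
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(),
}
```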

50.3.4 Evaluation Parameters

The performance of the different classifier models was evaluated based on the parameters sensitivity, specificity, F-score, recall, precision, and accuracy (TP: true positive, TN: true negative, FP: false positive, FN: false negative). The parameters are defined as follows.

Accuracy: Accuracy shows how close a measured value is to the actual value:

Acc = (TP + TN) / (TP + TN + FP + FN)

Recall: Recall is the ratio of correctly predicted positive observations to all observations in the actual class:

Recall = TP / (TP + FN)

Precision: Precision is the ratio between the number of accurate positives and the number of predicted positives:

Precision = TP / (TP + FP)

F-Score: The F-score is defined as the harmonic mean of precision and recall:

F-score = 2 * (Precision * Recall) / (Precision + Recall)
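For completeness, these parameters can be computed directly from confusion-matrix counts; a minimal sketch assuming binary arrays y_true and y_pred:

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f_score = 2 * precision * recall / (precision + recall)
```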

50.4 Results

The data preparation step of our approach resulted in 1560 miRNA features with 175 samples, which were further input to the feature selection step. A correlation-based feature selection approach was adopted to identify the most relevant miRNAs, and a set of nine miRNA features was obtained as a result. Further, four different classifier models (NN, NB, KNN, SVM) were trained using these nine miRNA features, and the predictive ability of the key miRNAs was compared with that of the base miRNA features (1560 input features). The input data was split in the ratio of 75–25% into training and test datasets, respectively, and tenfold cross-validation was used. Precision, recall, F-score, and accuracy were used to measure the performance of the classifiers.

Table 50.1 illustrates the performance of the classifier models NN, NB, KNN, and SVM when taking all the base miRNA features as input. All the classifier models were observed with precision, recall, F-score, and overall accuracy values in the range of 0.94–1, and the SVM and NB classifiers were confirmed with 100% scores. Since our model was well trained with the full human miRNA breast cancer dataset (base features), we wanted to know how it would perform on the filtered human miRNA dataset.

Table 50.1 Classification performance of the base and filtered miRNA features

             Base miRNA features       Filtered miRNA features
Model        Pr    Rec   F-sc  Acc     Pr    Rec   F-sc  Acc
SVM          1.0   1.0   1.0   1.0     1.0   1.0   1.0   1.0
Naïve Bayes  1.0   1.0   1.0   1.0     1.0   1.0   1.0   1.0
KNN (k = 3)  1.0   0.94  0.96  0.95    1.0   1.0   1.0   1.0
NN           1.0   0.93  0.96  0.94    1.0   1.0   1.0   1.0

Pr precision, Rec recall, F-sc F-score, Acc accuracy

Table 50.2 Results obtained from the database of differentially expressed miRNAs in human cancer (dbDEMC)

miRNA ID           logFC   Expression status in BC
hsa-miR-1278       0.55    Up regulated
hsa-miR-154-5p     −1.18   Down regulated
hsa-miR-4668-3p    −0.98   Down regulated
hsa-miR-548w       0.19    Up regulated
hsa-miR-4664-3p    0.62    Up regulated
hsa-miR-3977       0.69    Up regulated
hsa-miR-101-5p     −0.83   Down regulated
hsa-miR-548as-3p   −1.33   Down regulated
hsa-miR-576-5p     0.49    Up regulated

We further ran the same classifier models taking the significant nine miRNA features as input; Table 50.1 also illustrates the predictive ability of the identified miRNA features. To our surprise, the nine selected miRNA features were observed with 100% scores with respect to precision, recall, accuracy, and F-score. This confirms the potential of the identified miRNA features as predictor variables. ROC curves showing the performance of the filtered miRNA features are demonstrated in Fig. 50.2.

50.4.1 Biological Relevance Analysis

In order to authenticate the biological involvement of the identified miRNAs in breast cancer, we validated the differential expression of the selected miRNAs using the database of differentially expressed miRNAs in human cancer (dbDEMC) version 2.0 [12]. dbDEMC is a collection of differentially expressed miRNAs covering 36 human cancer types and 73 subtypes. The results validated from dbDEMC confirm that the nine selected miRNA features show quantitative changes in expression levels between diseased and normal breast tissue. The nine selected miRNAs, with their respective fold change values and expression status in breast cancer, are illustrated in Table 50.2. The association of the nominated miRNAs with various biological processes specific to cancer was analyzed using TAM 2.0, a tool for miRNA set enrichment analysis exploring functional and disease associations [13]. The identified miRNA features are associated with various biological processes, including plasma cell differentiation, DNA damage repair, and hormone-mediated signaling pathways, with significant p-values (Fig. 50.3).


Fig. 50.2 Performance results in terms of a ROC of proposed SVM model b ROC of proposed Naïve Bayes model c ROC of proposed KNN (k = 3) d ROC of proposed NN model

Fig. 50.3 Graphically showing the functionality of miRNAs with corresponding −log10(p-values)

50.5 Conclusion

A very pertinent medical problem is the early prediction of breast cancer, which prevents patients from progressing to a serious stage of the disease. Aberrant expression


levels of miRNAs can cause negative regulation of mRNAs, which may lead to either translational repression or mRNA degradation and finally results in the disrupted development of cancer cells. In this study, miRNA expression has been utilized for breast cancer classification using different machine learning models. The miRNAs identified by our approach showed performance equivalent to the base miRNA features. The filtered human miRNA breast cancer signatures also showed better performance with good prognostic ability. The hand-picked miRNA signatures were also confirmed to show disrupted expression levels between diseased and normal breast tissue, with significant association in crucial biological processes. The candidate miRNA signatures obtained in this study can be considered for further analysis, including their involvement in characterizing the molecular definition of cancer, which may contribute to anticancer therapy.

References

1. Sathipati, S.Y., Ho, S.-Y.: Identifying a miRNA signature for predicting the stage of breast cancer. Sci. Rep. 8(1), 1–11 (2018)
2. Evans, S.C., et al.: MicroRNA target detection and analysis for genes related to breast cancer using MDLcompress. EURASIP J. Bioinform. Syst. Biol., 1–16 (2007)
3. Liu, X., et al.: Construction of a potential breast cancer-related miRNA-mRNA regulatory network. BioMed Res. Int. 2020 (2020)
4. Zhao, Y., et al.: Decrease of miR-202-3p expression, a novel tumor suppressor, in gastric cancer. PLoS One 8(7), e69756 (2013)
5. Sherafatian, M.: Tree-based machine learning algorithms identified minimal set of miRNA biomarkers for breast cancer diagnosis and molecular subtyping. Gene 677, 111–118 (2018)
6. Wang, J., et al.: Discrimination of breast cancer with microcalcifications on mammography by deep learning. Sci. Rep. 6(1), 1–9 (2016)
7. Sarkar, J.P., et al.: Machine learning integrated ensemble of feature selection methods followed by survival analysis for predicting breast cancer subtype specific miRNA biomarkers. Comput. Biol. Med. 131, 104244 (2021)
8. Mahapatra, S., Mandal, B., Swarnkar, T.: Biological networks integration based on dense module identification for gene prioritization from microarray data. Gene Rep. 12, 276–288 (2018)
9. Rani, K.U.: Parallel approach for diagnosis of breast cancer using neural network technique. Int. J. Comput. Appl. 10(3), 1–5 (2010)
10. Medjahed, S.A., Saadi, T.A., Benyettou, A.: Breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules. Int. J. Comput. Appl. 62(1) (2013)
11. Hazra, A., Mandal, S.K., Gupta, A.: Study and analysis of breast cancer cell detection using Naïve Bayes, SVM and ensemble algorithms. Int. J. Comput. Appl. 145(2), 39–45 (2016)
12. Yang, Z., Wu, L., Wang, A., Tang, W., Zhao, Y., Zhao, H., Teschendorff, A.E.: dbDEMC 2.0: updated database of differentially expressed miRNAs in human cancers. Nucleic Acids Res. 45(D1), D812–D818 (2017). https://doi.org/10.1093/nar/gkw1079
13. Li, J., Han, X., Wan, Y., Zhang, S., Zhao, Y., Fan, R., Cui, Q., Zhou, Y.: TAM 2.0: tool for MicroRNA set analysis. Nucleic Acids Res. 46(W1), W180–W185 (2018). https://doi.org/10.1093/nar/gky509

Chapter 51

A Computational Intelligence Approach for Cancer Detection Using Artificial Neural Network Rasmita Dash, Rajashree Dash, and Rasmita Rautray

Abstract In cancer detection, early diagnosis is necessary for cancer research. To arrive at a solution, these problems correlate biomedicine and bioinformatics with the field of computational intelligence. Computational intelligence tools help to identify cancer-affected patients, the level of risk (high or low) in a cancer patient, and the principal features indicating the presence of cancerous cells. Inspired by previous machine learning research, in this diagnostic study the authors have tried to detect cancerous cells in breast cancer data using an artificial neural network.

51.1 Introduction

Continuous efforts are going on for the early and accurate diagnosis of cancer patients. Manual diagnosis has many disadvantages: it requires oncologists to examine the data, it is time-consuming, and the analysis can be inaccurate. Thus, efficient and accurate models and methods are required to come up with solutions for such complex and critical analysis. A neural network is a part of computational intelligence, a technique which is successfully solving many critical issues in many fields; application areas in which neural networks are successful include biomedical analysis, weather forecasting, satellite communication, speech recognition, and natural language processing. Motivated by the tremendous applications of neural networks, in this work the authors have tried to build a neural network-based cancer data classification model.

R. Dash (B) · R. Dash · R. Rautray
Department of Computer Science and Engineering, Siksha 'O' Anusandhan (Deemed to be) University, Bhubaneswar, Odisha, India
e-mail: [email protected]

R. Dash
e-mail: [email protected]

R. Rautray
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_51


As the features in cancer data are huge in number, first a feature selection approach is implemented, and then classification is done implementing various optimizers of the neural network. Thus, a comparative analysis is presented for superior model selection.

51.2 Literature Survey

There exist a lot of real-world situations in which computational intelligence techniques, especially neural networks and their variants, are successful. In this literature survey, a few neural network-based models for breast cancer data analysis are highlighted.

In [1], breast cancer data is analyzed using principal component analysis (PCA) and an artificial neural network. The breast cancer data is of size 718 × 11, i.e., 718 observations and 11 features. In this analysis, out of the 11 features, two principal components are extracted, PC1 and PC2, which significantly increases the performance of the neural network (NN); the work concludes that the PCA-NN combination is a suitable model for large dataset analysis. Convolutional neural networks (CNN) and variants of NN are used in [2], where the authors applied these techniques for effective diagnosis and treatment in a healthcare system; both the multilayer perceptron neural network (MLPNN) and the CNN are used for breast cancer data analysis, and the two models show comparable results. In [3], a neural dynamic algorithm-based NN is used for the first time in the field of pattern recognition for breast cancer data analysis; a voting convergence difference is applied to the neural network, which improves the performance to 100%. In [4], a feature weighting concept is added to improve classification performance: the ant lion optimization algorithm is used for feature weighting, and a back-propagation neural network is implemented for classification. This model is applied to a breast cancer dataset, and the analysis shows that the combination of a wrapper technique and an NN is a suitable model for high-dimensional data classification.

51.3 Proposed Model

Cancer data is high-dimensional data, characterized by a large number of genes and few samples; it is unbalanced data, as the collection of samples is difficult. Noise is unavoidable when the raw data is generated, which may be due to measurement tool error, random error, etc. Thus, the analysis of such data becomes complex [5–7]. Following the existing literature, this research work suggests a simple computational intelligence technique for dealing with high-dimensional breast cancer data. This investigation works in two phases. In the first phase of this model, data preprocessing is done at two levels.


Fig. 51.1 A neural network-based cancer data classification model

First, to bring the data values into a restricted (small) range, min-max normalization is applied. Further, as the data is high-dimensional in nature, it contains both significant features and a few insignificant features; due to the presence of such features, the identification of cancer-affected samples becomes difficult, so these features need to be removed. At this stage, the feature selection technique called linear discriminant analysis (LDA) is applied and the entire dataset is reduced. Thereafter, in the second stage, a neural network model considering different optimizers classifies and analyzes the cancer data. The pictorial representation of the proposed model is presented in Fig. 51.1.
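A hedged sketch of this two-level preprocessing with scikit-learn; note that sklearn's LinearDiscriminantAnalysis caps the number of components at (number of classes − 1), so a binary problem yields a single discriminant, and the "LDA (n = 2)" setting reported in Tables 51.1 and 51.2 is requested here only up to that cap:

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Level 1: min-max normalization into a restricted range
X_scaled = MinMaxScaler().fit_transform(X)

# Level 2: supervised reduction with LDA (components capped at n_classes - 1)
n_comp = min(2, len(set(y)) - 1)
X_reduced = LinearDiscriminantAnalysis(n_components=n_comp).fit_transform(X_scaled, y)
```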

51.4 Result Analysis

51.4.1 Neural Network Model for Proposed Work

For this section, we use an artificial neural network: a sequential model having at least two hidden layers. Unlike the last section, there is only one model (the ANN), but we use multiple optimizers to observe their effect. The stepwise execution of the proposed work is as follows (a code sketch follows the list):

Step 1: Preprocess the data.
Step 2: Split the data into K folds.


Step 3: Create the ANN model with at least one input layer, one output layer, and two hidden layers.
Step 4: Compile the model after choosing an optimizer and a loss function.
Step 5: Use the mean of the K accuracies, with standard deviation, as the final accuracy of the model.
Step 6: Check different optimizers by redoing steps 1–5.
Step 7: Choose the optimizer which provides the highest accuracy.
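A minimal sketch of Steps 1–7 (not the authors' code), assuming the preprocessed features X and binary labels y are NumPy arrays; the layer sizes follow the Wisconsin model specification given below (two hidden layers of 15 ReLU neurons each), and the epoch count is an assumption:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras

def evaluate(optimizer, X, y, k=10):
    accs = []
    for train, test in StratifiedKFold(n_splits=k, shuffle=True).split(X, y):
        model = keras.Sequential([
            keras.Input(shape=(X.shape[1],)),
            keras.layers.Dense(15, activation="relu"),
            keras.layers.Dense(15, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=optimizer, loss="binary_crossentropy",
                      metrics=["accuracy"])
        model.fit(X[train], y[train], epochs=100, verbose=0)  # epoch count assumed
        accs.append(model.evaluate(X[test], y[test], verbose=0)[1])
    return np.mean(accs), np.std(accs)  # mean of the K accuracies with std deviation

for opt in ["adam", "rmsprop", "adadelta"]:  # optimizers compared in Tables 51.1-51.2
    print(opt, evaluate(opt, X, y))
```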

51.4.2 Dataset Description and Result Analysis

The Wisconsin Breast Cancer Dataset (WBCD) happens to be one of the most popular datasets used for machine learning projects, and it performed outstandingly well with all the above-stated models. We have divided the results into two categories: without dimensionality reduction and with dimensionality reduction. The resultant output is binary (Malignant—1 or Benign—0). Using dimensionality reduction leads to a slight improvement in most of the models. The model specification is as follows.

Model specifications:
Input layer: input parameters—29, activation—ReLU
Hidden layers: no. of layers—2, neurons in each layer—15, activation—ReLU
Output layer: loss function—Binary Cross-Entropy, Metrics—accuracy

The model performs significantly well with all optimizers. Using dimensionality reduction does not improve the accuracy by much; all models seem to reach their maximum accuracy with a value of 97–98%. The proposed model output is presented in Table 51.1.

Model specifications (for the OPEN_ML dataset discussed below):
Input layer: input parameters—10, activation—ReLU
Hidden layers: no. of layers—2, neurons in each layer—5, activation—ReLU
Output layer: loss function—Binary Cross-Entropy, Metrics—accuracy

Table 51.1 Accuracies on Wisconsin dataset

NN optimizer  Accuracy (std deviation)          Accuracy (std deviation)
              Without dimensionality reduction  Dimensionality reduction LDA (n = 2)
Adam          97.53 (±2.55)                     97.80 (±1.74)
RMSprop       97.71 (±2.04)                     97.85 (±1.74)
AdaDelta      97.05 (±2.58)                     97.89 (±1.74)

Table 51.2 Accuracies on OPEN_ML dataset

NN optimizer  Accuracy (std deviation)          Accuracy (std deviation)
              Without dimensionality reduction  Dimensionality reduction LDA (n = 2)
Adam          94.56 (±2.07)                     96.02 (±2.05)
RMSprop       96.29 (±2.02)                     98.25 (±1.54)
AdaDelta      97.71 (±1.94)                     97.71 (±1.94)

This is one of the few datasets put forth by the oncology institute that has been used in the machine learning literature several times. There are nine attributes present in this dataset and two classes as the dependent variable; one class has 201 instances and the other 85 instances. On the OPEN_ML dataset, the neural network performs well, with the highest accuracy of 98.25% for the RMSprop optimizer after dimensionality reduction. Results for the rest of the optimizers are shown in Table 51.2.

51.5 Conclusion

This research work is a basic computational intelligence-based data processing approach for breast cancer data analysis. Here, two breast cancer datasets have been taken. The redundant features are removed using LDA, and a comparative analysis has been shown both with and without feature reduction; in every outcome, the result is improved. For this data classification work, a neural network with different optimizers is considered.

References

1. Buciński, A., Bączek, T., Krysiński, J., Szoszkiewicz, R., Załuski, J.: Clinical data analysis using artificial neural networks (ANN) and principal component analysis (PCA) of patients with breast cancer after mastectomy. Rep. Pract. Oncol. Radiother. 12(1), 9–17 (2007)
2. Desai, M., Shah, M.: An anatomization on breast cancer detection and diagnosis employing multi-layer perceptron neural network (MLP) and convolutional neural network (CNN). Clin. eHealth (2020)
3. Zhang, Z., Chen, B., Xu, S., Chen, G., Xie, J.: A novel voting convergent difference neural network for diagnosing breast cancer. Neurocomputing 437, 339–350
4. Dalwinder, S., Birmohan, S., Manpreet, K.: Simultaneous feature weighting and parameter determination of neural networks using ant lion optimization for the classification of breast cancer. Biocybernetics Biomed. Eng. 40(1), 337–351 (2020)
5. Dash, R., Dash, R., Rautray, R.: An evolutionary framework based microarray gene selection and classification approach using binary shuffled frog leaping algorithm. J. King Saud Univ. Comput. Inf. Sci. (2019)


6. Dash, R.: An adaptive harmony search approach for gene selection and classification of high dimensional medical data. J. King Saud Univ. Comput. Inf. Sci. 33(2), 195–207 (2021)
7. Dash, R., Misra, B.B.: Pipelining the ranking techniques for microarray data classification: a case study. Appl. Soft Comput. 48, 298–316 (2016)

Chapter 52

Brain MRI Classification for Detection of Brain Tumors Using Hybrid Feature Extraction and SVM G. Tesfaye Woldeyohannes and Sarada Prasanna Pati

Abstract A brain tumor is one form of brain abnormality that affects the brain tissues. Magnetic resonance imaging (MRI) is the most popular imaging modality, capturing and preserving the highest-quality brain images with rich information on the anatomical structures and internal contents of the brain. It is very challenging on the part of the radiologist to detect abnormal structures of the human brain using clinical expertise and manual image identification methods. Computer-assisted diagnosis (CAD), on the other hand, helps in the early identification of brain disorders much faster and more easily. In this paper, we propose and study a support vector machine (SVM)-based classifier model that implements a hybrid feature extraction and reduction technique to categorize input MRI images as normal or tumorous. The model initially implements the two-dimensional discrete wavelet transform (2D-DWT) to extract the features of a given input image, followed by principal component analysis (PCA) to minimize the dimensions of the extracted features. The experiments were conducted with 400 MR images comprising 160 normal and 240 abnormal MRIs with brain tumors. Classifications were performed with five traditional classifiers, namely SVM, random forest, logistic regression, back-propagation neural network (BPNN), and K-nearest neighbor (KNN). On evaluating the classifier models, the proposed SVM-based model achieved better performance compared to the rest of the classifiers in terms of accuracy, specificity, sensitivity, and the area under curve (AUC) values.

52.1 Introduction

Brain disease exists in many different forms and goes by many different names, such as meningitis, encephalitis, stroke, brain tumors, trauma, epilepsy, dementia, and Alzheimer's, all of which interfere with the normal functioning of the brain. In order to effectively treat any such disease, it is important to identify the brain disease at the earliest stage of its occurrence [1].

G. Tesfaye Woldeyohannes · S. P. Pati (B)
Department of Computer Science and Engineering, Siksha 'O' Anusandhan Deemed to be University, Bhubaneswar, Odisha, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_52


The early detection of the disease is necessary to enhance the survival chances of the patient, and brain imaging techniques are mostly recommended for the diagnosis of brain-related abnormalities. Magnetic resonance imaging (MRI) is the most used and preferred imaging technique due to its high resolution, non-radiation, non-invasive, and painless properties. MRI is also preferred as it clearly depicts different soft tissues and white and gray matter, and can thus produce highly accurate images of the brain and brain stem [2]. The diagnosis of brain abnormalities by specialists and radiologists based on MRI images is a tedious and time-consuming duty. Computer-assisted diagnosis (CAD), on the other hand, helps in the early identification of brain disorders much faster and more easily. CAD helps radiologists and clinical experts in classifying brain MR images, reduces the high workload on clinicians, and improves the accuracy of classification by applying image preprocessing and machine learning techniques. CAD is a diagnostic system that combines different image analysis approaches like feature extraction, segmentation, and classification of MRI images for the identification of brain disorders.

Any medical image diagnosis system involves three major steps, namely image preprocessing, feature extraction, and, finally, the classification of images as normal or diseased [3]. Image preprocessing is the technique of de-noising and enhancing the image data prior to computational processing; generally, preprocessing helps to minimize the complexity and enhance the accuracy of the applied model [4]. Feature extraction is the primary step in computer-aided techniques based on machine learning models; its main objective is to find useful features in the MR image data that identify unique characteristics of the preprocessed MR images [5]. The feature extraction step provides a reduced, non-redundant, and relevant interpretation of the observations. MRI classification is a subcategory of pattern recognition by which image data are classified into one of the predefined categories; the main purpose of this phase is to categorize input brain MRIs as normal or abnormal depending on favorable feature sets.

The objective of the work was to study and propose a hybrid classifier model that can classify and thereby detect tumorous brain conditions with better accuracy. This paper is organized in five sections including the introduction. The second section discusses the proposed model. The third section presents the dataset used and details the experiments conducted. The fourth section covers the result analysis and comparison with other classifier techniques. The final section presents the conclusion of the study.

52.2 Proposed Model

Our proposed model consists of four major steps: image preprocessing, feature extraction using the two-dimensional discrete wavelet transform (2D-DWT), feature reduction using principal component analysis (PCA), and, finally, classification by implementing a kernel-based SVM. Figure 52.1 depicts the schematic diagram of the proposed model and the steps used.


Fig. 52.1 Schematic diagram for the proposed methodology

Preprocessing: In medical image processing, the preprocessing step plays a vital role in removing impurities, stabilizing the intensity, and enhancing the image data prior to computational processing [6]. Preprocessing helps to minimize the complexity and enhance the accuracy of the subsequent processes such as segmentation and classification. In our proposed model, preprocessing tasks like image resizing, conversion of images to grayscale, and standardization have been done on the image data; due to the preprocessing, the contrast of images and regions becomes enhanced.

Feature Extraction: DWT is a potent tool for feature extraction wherein wavelets are applied to analyze the various frequencies of an image at various scales. DWT is implemented to extract wavelet coefficients from the input brain MRI images. The wavelet has the capacity to concentrate the frequency information of the signal, which is necessary for the classification process; it is an effective application of the wavelet transform, implemented at dyadic scales and positions [7]. The Haar wavelet used in this work is one of the efficient wavelets chosen from the wavelet family; it is very fast and supports easy extraction of basic structural information from the given input image [8]. For a 2D image, the DWT is applied in each dimension separately, resulting in four sub-band images (LL, LH, HL, and HH) at every scale. The LL sub-band is used for the next level of transformation and is taken as the approximation coefficient component of the image; the remaining LH, HL, and HH sub-bands are considered the detail components of the image.

Feature Reduction: The main objective in applying PCA is to find the eigenvectors, or orthonormal basis vectors, of the covariance matrix of the image dataset, in which an image can be taken as an individual point in a high-dimensional space. PCA has a high capacity for processing MRI data: once it has found the required pattern in the data, it condenses the data by decreasing the number of dimensions while keeping the original information [9]. The vectors in PCA are based on the eigenvectors of the covariance matrix of the input data. It is important for the exploratory analysis of multivariate data, as the new dimensions are known as the principal components (PCs); a reduced dimension is produced by selecting the PCs associated with the highest eigenvalues.

Classification: SVM is one of the guided (supervised) learning approaches, applied to solve classification problems as support vector classification (SVC) and regression problems as support vector regression (SVR) [10]. As a classification task, it searches for the optimal hyperplane that best separates the features into different domains. Whenever the dataset departs from a linear distribution, it becomes inseparable; in this case, kernels are applied to nonlinearly map the input dataset into a higher dimension. SVM has the potential to manage large feature spaces and has efficient generalization properties compared to other conventional classifiers, since the misclassification problem is reduced in the training phase of the SVM classifier [11]. Selecting the best kernel for the application or problem further enhances the SVM classifier's performance. A linear kernel function is described as in Eq. 52.1:

K(X, X') = <X, X'>    (52.1)

RBF is one of the common kernel functions applied in SVM dataset classification. Supposing X and X' are feature vectors in the input space, it can be described as in Eq. 52.2:

K(X, X') = exp(−||X − X'||² / (2σ²))    (52.2)
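As a rough sketch of this hybrid feature pipeline (assuming the PyWavelets and scikit-learn packages; images is an assumed array of 256 × 256 grayscale MRIs):

```python
import numpy as np
import pywt
from sklearn.decomposition import PCA

def dwt_features(img):
    # Third-level 2D Haar decomposition; coeffs[0] is the LL approximation sub-band
    coeffs = pywt.wavedec2(img, wavelet="haar", level=3)
    return coeffs[0].ravel()  # 32 x 32 = 1024 coefficients for a 256 x 256 image

features = np.array([dwt_features(img) for img in images])
reduced = PCA(n_components=36).fit_transform(features)  # 36 PCs, as used in Sect. 52.3
```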

52.3 Experimental Study Dataset and Experimental Setup: The dataset comprises of brain MR images with different three kinds of tumors (Glioma, Meningioma, and pituitary) and images of normal brains. The dataset contains total 3264 images taken from Kaggle dataset warehouse site [12] collected on 20 February 2021. Only 400 images were selected from the dataset, out of which 160 are images of a normal brain, and 240 images of a tumorous brain. All images are in grayscale colors. We applied 80% of our datasets for training and 20% of it for testing the model out of these images 50% of testing datasets were used for validation. Figure 52.2 shows a typical representation of MRI images as normal, meningioma, glioma, and pituitary tumor. For our study, we applied a machine learning framework using PYTHON 3.9.0 on processor with an Intel(R) Core (TM)-i7-7500U CPU @2.70 GHz and @2.90 GHz,


Fig. 52.2 Schematic diagram for preprocessing, feature extraction, and reduction

Image preprocessing: The first preprocessing step was to adjust the different image dimensions into a uniform 256 × 256 size in JPG format and to convert RGB images into grayscale. The dataset contained T2-weighted axial, sagittal, and coronal views of the brain. Since the brain images taken from the dataset repository contained features in different dimensions and scales, we performed resizing, grayscale conversion, and scaling of the image data in preprocessing before proceeding to build a model. Data standardization was done using the StandardScaler() function in order to transform the data to have a mean value of 0 and a standard deviation of 1; this adjusts the data to have a standard normal distribution.

Feature Extraction using Haar Wavelet (DWT): For our proposed model, we applied the third-level decomposition of the 2D-DWT with the Haar wavelet. It breaks down the input image of dimension 256 × 256 into smaller sub-bands; in Fig. 52.4, the second image, shown on the top left corner, is taken as the wavelet coefficients representing the third-level decomposition of the image with the approximation coefficients, whose size becomes 32 × 32 = 1024. The features at this stage can be taken as the initial features.

Feature Reduction using PCA: The features extracted by the Haar wavelet are still large in size. A high-dimensional feature vector may lead to design complexity of the model and may also result in overfitting/underfitting of the data [13]. PCA was used to extract feature vectors and collect the components with the greatest variance. The desired


number of features needs to preserve above 90% of the variance. For our experiment, we tried repeatedly, taking the number of features in the range of 20–35, but the accuracy obtained was not satisfactory; in our study, we attained a higher success rate only by using 36 principal components.

Classification: In the final phase of the experiment, the reduced feature vectors were applied to build the SVM-based classifier model. In medical image processing, applying the SVM classification algorithm provides a mathematically tractable solution with high accuracy and gives a better geometrical interpretation. The traditional SVM technique uses a hyperplane to classify the labeled data, and the kernel approach in SVM can be applied to transform a nonlinear separation objective into a linear one. The SVM kernel offers advantages such as fewer parameters to tune and practical applicability to various fields, while involving complex quadratic optimization. For our study, a linear SVM kernel was applied to classify our image data into normal and abnormal. In training the model, we first divided our dataset into a training set containing 80% of the data and a testing set containing the remaining 20%; from the testing dataset, we took half for cross-validation. In cross-validation, continuous experiments were carried out based on holdout and fivefold cross-validation, utilizing different thresholds to tune the model. In order to find the best classifier, we trained and tested our dataset on five different classifiers, namely SVM, random forest, logistic regression, BPNN, and KNN, to achieve the best classification performance.
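A hedged sketch of this classification stage (an illustrative pipeline, not the authors' code; reduced comes from the DWT + PCA step above and labels is an assumed binary array):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# 80/20 train-test split as described above
X_train, X_test, y_train, y_test = train_test_split(
    reduced, labels, test_size=0.20, stratify=labels)

# Standardization followed by a linear-kernel SVM
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
print("5-fold CV accuracy:", cross_val_score(clf, X_train, y_train, cv=5).mean())
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```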

52.4 Result Analysis

The classification performance of the proposed classifier and that of the other four classifier models was finally evaluated in terms of the most common performance metrics, namely accuracy, precision, sensitivity (true positive rate), specificity (true negative rate), and AUC score. The values obtained for these metrics are given in Table 52.1. From this table, we infer that the SVM-based classifier outperforms the other classifier models.

Table 52.1 Classification performance of the classifier models

Classifier           Accuracy  Precision  Sensitivity  Specificity  AUC
KNN                  92.5      90.6       100.0        72.7         86.4
Logistic regression  87.5      96.2       86.2         90.9         90.0
BPNN                 87.5      96.2       86.2         90.9         93.0
Random forest        92.5      93.3       96.5         81.8         92.0
SVM                  98.0      100.0      96.5         100.0        96.0


The highest sensitivity result was obtained with KNN, a bit higher than that of our proposed SVM model, but in accuracy, specificity, and precision, the SVM model outperforms the rest of the classifier models. AUC quantifies the entire two-dimensional area under the whole receiver operating characteristic (ROC) curve; it gives a cumulative measure of performance across all possible classification thresholds. Figure 52.3 depicts the comparison between the various classifiers with respect to the area under the ROC curve. The ROC curves clearly indicate that the SVM-based classifier exhibits better classification performance. The performance results obtained with the support vector machine, random forest, logistic regression, BPNN, and KNN classifiers are finally compared in the histogram shown in Fig. 52.4, which clearly indicates that the SVM-based classifier gives the highest overall classification performance.

Fig. 52.3 Comparison of classification performance with respect to ROC and AUC

Fig. 52.4 Comparison of classification performance with respect to common metrics


52.5 Conclusion

Brain tumors are a primary cause of death in most people having brain diseases. Various supervised techniques are employed to classify brain MRIs in order to detect tumors and other abnormal brain conditions. The objective of this work was to study and propose a hybrid classifier model that can classify and thereby detect tumorous brain conditions with better accuracy. A system with better performance needs robust feature extraction and reduction and a well-trained classification algorithm. The study was conducted with 400 MR images comprising 160 normal and 240 abnormal MRIs with brain tumors. DWT was applied for feature extraction from the MRIs, followed by PCA for further feature reduction of the image dataset. Classifications were performed with five traditional classifiers, namely SVM, random forest, logistic regression, BPNN, and KNN. On evaluating the classifier models, the proposed SVM-based model achieved better performance compared to the rest of the classifiers in terms of accuracy, specificity, sensitivity, and the AUC values. Even though the proposed model shows better performance compared to the outcomes of some other research works that applied a similar dataset, working with a larger dataset with various tumor types may be more challenging in this regard.


Chapter 53

Enhancing the Prediction of Breast Cancer Using Machine Learning and Deep Learning Techniques

M. Thangavel, Rahul Patnaik, Chandan Kumar Mishra, and Smruti Ranjan Sahoo

Abstract With the surge of breast cancer, researchers have proposed many prediction methods and techniques. Currently, mammograms and the analysis of biopsy images are the two traditional methods used to detect breast cancer. In this paper, the objective is to create a model that can classify or predict whether breast cancer is benign or malignant. Typically, a pathologist takes several days to analyze a biopsy, while the model can analyze thousands of biopsies in a few seconds. For the numerical data, various supervised machine learning classification algorithms such as random forest (RF), K-nearest neighbor (KNN), Naïve Bayes, support vector machines (SVM), and decision trees (DT) are used. Then, a deep learning convolutional neural network is used to analyze the biopsy images from an image dataset. Accurate prediction results can contribute to saving people's lives.

53.1 Introduction

Breast cancer is a disease with an encroaching nature in women and, in some cases, men. Breast cancer ranks as the most frequently occurring cancer among Indian women. According to statistics, 1 in every 28 Indian women is at risk of being diagnosed with breast cancer. With over 1.5 lakh cases reported in 2018, the incidence has been rising for a long time. Statistics from foreign countries do not convey the entire picture, as awareness of the disease is relatively high there. At the same time, breast cancer remains an underdiagnosed disease in the country. Early detection of the disease, as well as correct prediction, might help a lot. The diagnosis of breast cancer is not very easy, and low awareness of the disease also leads sufferers to stall until it is too late.

M. Thangavel (B) School of Computing Science and Engineering, VIT Bhopal University, Bhopal-Indore Highway, Kothrikalan, Sehore, Madhya Pradesh 466114, India

R. Patnaik · C. K. Mishra · S. R. Sahoo Department of Computer Science and Engineering, Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_53


A woman in her 30s plans her entire life ahead in this age bracket, which makes the whole process hard for her and her family. Every four minutes, one woman is diagnosed with breast cancer, and every thirteen minutes, one woman dies of it, which makes it the most prevalent cancer among Indian women. The risk of being diagnosed with breast cancer is generally high for women, i.e., 1 in 28 women. Looking at the conditions and the challenges the problem poses, researchers assess that proper prediction and the training of a model to predict precise diagnosis results would help thousands of people in the world. As technology keeps increasing and upgrading itself every second, medical science and computer science work in parallel through various innovations and solutions. Artificial intelligence is one such field that mimics the way a human thinks and upgrades itself well beyond its original capacity. Machine learning is a subset of artificial intelligence, while deep learning is a subset of machine learning; together they act as special tools for predicting and detecting diseases in the medical field. Our contribution is to predict breast cancer by analyzing numerical datasets using machine learning algorithms like random forest (RF), K-nearest neighbor (KNN), Naïve Bayes, support vector machines (SVM), and decision trees (DT), and by analyzing a biopsy image dataset using a convolutional neural network (CNN) in deep learning.

53.2 Related Works

Breast cancer refers to the uncontrolled growth of cells within the breast tissues. With the surge in technology, the biomedical field has been among the most rapidly developing, promising to eliminate or minimize the risk of any treatment. Several ways have been developed to identify, diagnose, and nullify (remove) breast cancer. The biopsy is one of these procedures, in which a sample of cells removed from a patient's body is analyzed in laboratories. However, many risks are associated with the procedure as well. The risks involved after a biopsy are bruising and swelling of the breasts, bleeding at the site of the biopsy, altered appearance depending upon the size of tissue removed, and, under unwanted circumstances, the need for additional surgery. A needle biopsy uses a thin hollow needle and a syringe to extract tissue from suspicious lumps in the body. There are two types of needle biopsy: Fine Needle Aspiration Cytology (FNAC) and Core Needle Biopsy (CNB). The less invasive method of the two is FNAC, which uses a thin hollow needle and can be done rapidly and efficiently, while CNB uses a larger needle that removes a small cylinder of tissue. These traditional methods are currently in use. However, researchers need to put forth techniques that can give better accuracy and results than them.


53.2.1 Machine Learning Techniques

Breast cancer diagnosis at early stages with machine learning techniques has been encouraged by researchers. Alghunaim et al. [13] created nine models using three distinct classification algorithms: random forest, support vector machine (SVM), and decision tree; more research should be done utilizing a balanced dataset and feature selection strategies to improve the effectiveness of these classification approaches. Fatima et al. [14] compared breast cancer prediction strategies across data mining, machine learning, and deep learning; researchers can use data augmentation strategies to overcome the problem of restricted dataset availability. In Bayrak et al. [15], two machine learning approaches were utilized for the classification process, and their classification performance was compared using precision, accuracy, ROC area, and recall; after comparing the efficiency of several machine learning approaches in terms of these essential performance measures, SVM (sequential minimal optimization algorithm) outperformed them all. The goal of the Teixeira et al. [16] study is to evaluate machine learning classification models in terms of objectivity, accuracy, and reproducibility in identifying malignant tumors using fine-needle aspiration. The University of California Irvine (UCI) breast cancer Coimbra dataset was used to create efficient ensemble machine learning models employing classifiers [17]; with the ensemble approach, decision trees and KNN provided 100% accuracy. In Sengar et al. [18], the logistic regression and decision tree algorithms were used on the Wisconsin (Diagnostic) dataset to predict the cancer stage, presenting a comparison between decision trees and logistic regression, which is quite unusual. In Thomas et al. [19], multiple machine learning algorithms were compared on the Wisconsin diagnostic dataset for early detection of breast cancer; after testing 20% of the data on all models, it was discovered that the ANN model had the greatest accuracy of 97.85%.

53.2.2 Deep Learning Techniques

In Ismail et al. [1], the solution proposed was to compare breast cancer diagnosis using two deep learning network models, because such an approach automatically selects from the data the characteristics that are essential for classification and then determines which characteristics are accountable for producing excellent outcomes; identifying unusual images as benign or malignant tumors is left for future research. Deep learning algorithms use a variety of activation functions to classify different forms of breast cancer. Mekha et al. [2] found that their classification model had a high accuracy of 96.99%, superior to previous classification models. In Prakash et al. [3], the solution proposed was constructing a deep neural network capable of predicting breast cancer malignancy; the model was improved by early stopping and dropout, which helped to reduce overfitting. According to the neural network model, the F1 score for the benign category was 98%, while the F1 score for the malignant category was 99%. Zou et al. [4] addressed finding out whether a user has breast cancer and how to treat it. According to the results, the convolutional neural network works well in breast cancer image classification, and the Global Average Pooling (GAP) effect is somewhat better than the Spatial Pyramid Pooling effect; the study concludes that GAP is a better fit for breast cancer dataset categorization. Timmana et al. [5] addressed the problem through reasonable examination to cure the malignancy, proposing a deep learning-based approach for analyzing a breast disease dataset. The results of the studies reveal that the suggested model performed well on an unbalanced dataset, with good accuracy, recall, and F1 score, and the dataset can be expanded and tested further. This research proposed a convolutional neural system that diagnoses breast cancer data using a deep learning methodology; the dataset imbalance, skewness, and lack of data were all rectified by this experiment, and the hyperparameters that should be used to train the model for the best accuracy are also presented. Xiang et al. [6] proposed a deep convolutional neural network to classify breast cancer histopathology pictures; the results reveal that the approach is more accurate, and the accuracy obtained in this work updates the baseline for categorization of the BreaKHis database, allowing it to be used in CAD systems. To assess the clinical outcome of breast cancer, an ensemble classification method based on deep learning (DL) [7] was created; deep learning frameworks are reported to produce 98.27% accuracy, which is higher than many other approaches, including those using genetic data and pathological pictures. Zheng et al. [8] proposed a deep learning-assisted AdaBoost formula for breast cancer diagnosis, mathematically projected using sophisticated process approaches; its quick response is owed to its greater capacity to forecast, and its deep learning classifier outperforms other classifiers, with the findings indicating remarkable accuracy in comparison to other current systems. Shahidi et al. [9] investigated various deep learning algorithms for categorizing histopathology images of breast cancer; CNN models are the most popular over traditional learning models, and to increase the accuracy results for the eight categories, the database size and balanced classes are critical. Mahmood et al. [10] surveyed the benefits and hazards of breast multi-imaging modalities, classification of breast disorders, and feature extraction and segmentation methods using state-of-the-art deep learning algorithms; they analyzed medical multi-image modalities to assess the strengths, limits, and performance of contemporary DL and ML systems. ML approaches were found unsatisfactory for accurate segmentation of densities; however, DL approaches minimize the false-positive ratio in the segmentation of masses. Convolutional neural networks may be used to interpret thermograms [11]; when deep learning is used to classify medical images, it achieves greater accuracy than other neural network approaches, although the deep learning method necessitates a vast dataset to reach 100% validation accuracy. These goals will be achieved by using representative datasets, excellent ROIs, assigning excellent kernels, and creating lightweight CNN models, which will reduce convolution calculation time and improve accuracy rates. Yari et al. [12] approached the problem of breast cell morphologies in histopathology images that senior pathologists must study to determine breast cancer. In order to enhance existing progressive systems in binary and multiclass classification, two efficient deep transfer learning-based models are used, based on a pre-trained DCNN and many ImageNet dataset images. The findings in all categories (400x, 200x, 100x, 40x), both magnification dependent and independent, are far better than the current state of the art.

53.3 Proposed System

Figure 53.1 explains the system and working of our machine learning model. The steps are described below:

Fig. 53.1 Flowchart for machine learning model


Loading the dataset and importing necessary libraries
The pandas library was imported, with the help of which the Wisconsin dataset was loaded into our program as a data frame, a 2D data structure where the data is aligned in a table.

Data Preprocessing
The dataset is examined to find whether it has any missing values or categorical values. The first column of the dataset, i.e., the 'id' column, is dropped, as it contains the reference id of the data and is not a useful attribute for predicting the output. Along with it, the standard errors and worst values of all the features are also dropped. Only the significant mean values are taken into account, to work with only relevant data and avoid errors, complex calculations, and overfitting of our model. With all the unnecessary attributes dropped, it is checked whether the dataset has any null or NaN values. Since there are no such missing values, we can proceed to the next step.

Transforming Data
The dataset is now divided into two parts. The first part contains all the independent variables (x), and the second part contains our dependent variable (y), or response. As y contains categorical values, i.e., strings, they should be converted into numerical values. So, they are mapped as malignant (M) = 1, benign (B) = 0.

Splitting the dataset
The dataset is divided into two parts, namely the training set and the testing set, in the ratio of 8:2. So, 80% of the total data is used for training, i.e., our machine learns from this 80% of the data using classification algorithms like KNN, decision tree, random forest, SVM, and Naïve Bayes. After training, these models are tested on the testing set. The different models are then tested for their accuracy, and the results show that the support vector machine classifier produces the maximum accuracy of 93.86%. So, the support vector machine classifier can be used to predict the output for the user's input.

Figure 53.2 explains the system and working of our deep learning model. The steps are explained below:

Importing necessary libraries
Different libraries like 'tensorflow' and 'ImageDataGenerator' were imported before working on the dataset, to implement different functionalities provided by the Python language.

Preprocessing the training set and testing set
Some transformations are applied to the images of the training set only, to avoid overfitting of our model. These transformations include simple geometric operations like shifting some of the pixels, zooming the image, and rotating the image


Fig. 53.2 Flow chart for deep learning model

in different directions. The image is also rescaled by dividing each pixel value by 255; since each pixel takes a value between 0 and 255, this maps all pixel values into the range 0 to 1. When applying all these transformations to the training set, the image size is reduced to 64 × 64 so that it is easy for our convolutional neural network to learn, and the size of the training set is specified. The remaining images are assigned to the testing set.


After the training set preprocessing, the testing set is pre-processed just by rescaling the images, and the image size is resized to that of the training set images.

Initializing the Convolutional Neural Network
Now that the images are properly pre-processed, they can be fed to the CNN. The CNN is initialized as a sequence of layers to which more layers can be added. Since microscopic images are taken into account and the size of the dataset is small, it would be very difficult to achieve a good accuracy score by adding neurons one by one in the hidden layer, so transfer learning was used. Transfer learning can be used in most deep learning cases, because a deep neural network can be trained with comparatively little data when models trained on one problem are used as a starting point for a related problem. So, ResNet152 and VGG16 were added to the CNN.

Flattening the Convolutional layer and adding the ANN
The output from the CNN is flattened, i.e., the 2D matrix is converted into a 1D matrix so that it can be fed to the artificial neural network. The hidden layer of this ANN is known as the fully connected layer. Finally, the output layer is added with the sigmoid activation function so that the output values will be in the range of 0–1.

Compiling the CNN
The CNN is now compiled with the 'adam' optimizer, a stochastic gradient descent algorithm that helps in the proper assignment of weights to the neurons in the hidden layers through multiple forward and backward propagations. The 'ReduceLROnPlateau' callback is also used to adjust the learning rate if the model stops improving.

Training and testing the CNN
Finally, the CNN model goes through training and testing simultaneously, with a specified number of epochs; the CNN trains and tests on the whole dataset per epoch. In the last epoch, we got an accuracy score of 95.71% and a validation accuracy score of 82.93%, which is the final accuracy of our deep learning model using ResNet152, while we got an accuracy score of 90.18% and a validation accuracy score of 78.67% with VGG16.
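A minimal sketch of the transfer-learning pipeline described above is given below, using TensorFlow/Keras. The directory names are assumptions (the images are presumed to be arranged in benign/malignant class subfolders), and the 128-neuron fully connected layer is an illustrative choice, not the exact layer size of the paper.

```python
# Sketch: rescale/resize preprocessing, a frozen-style ResNet152 base,
# flatten + fully connected layers, sigmoid output, adam optimizer, and the
# ReduceLROnPlateau callback, assuming hypothetical directory names.
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,          # map pixel values from [0, 255] to [0, 1]
    rotation_range=20,          # simple geometric transformations
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2)
test_gen = ImageDataGenerator(rescale=1.0 / 255)  # test set is only rescaled

train = train_gen.flow_from_directory(
    "breakhis/train", target_size=(64, 64), batch_size=32, class_mode="binary")
test = test_gen.flow_from_directory(
    "breakhis/test", target_size=(64, 64), batch_size=32, class_mode="binary")

# Pre-trained ResNet152 as the convolutional base (transfer learning).
base = tf.keras.applications.ResNet152(
    include_top=False, weights="imagenet", input_shape=(64, 64, 3))

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),                       # 2D feature maps -> 1D
    tf.keras.layers.Dense(128, activation="relu"),   # fully connected layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output in [0, 1]
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Reduce the learning rate when the validation loss stops improving.
lr_cb = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2)
model.fit(train, validation_data=test, epochs=30, callbacks=[lr_cb])
```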

53.4 Experimental Results

The proposed system was implemented in Python. To be explicit, Anaconda was employed for the machine learning part and Google Colab for the deep learning part. Machine learning and deep learning play an important role in predicting the type of breast cancer; providing such information in advance gives doctors proper insight into the disease and can help in proper diagnosis and treatment on a per-patient basis. Using five machine learning algorithms, namely the KNN classifier, kernel SVM classifier, random forest, decision tree, and Naïve Bayes, we predict the type of breast cancer for the numerical data. For the dataset with images, we have used a deep learning technique, namely a convolutional neural network, for classifying the biopsy images into the type of breast cancer.

Table 53.1 Comparison of classifier results

Classifiers | Precision | Recall/sensitivity | Specificity | F1_score | Accuracy
KNN         | 95.31     | 91.04              | 93.62       | 93.12    | 92.11
DT          | 95.38     | 92.53              | 93.62       | 93.93    | 92.98
NB          | 89.70     | 91.04              | 85.10       | 90.37    | 88.60
RF          | 92.64     | 94.02              | 89.36       | 93.33    | 92.11
SVM         | 92.85     | 97.01              | 89.36       | 94.89    | 93.86

53.4.1 Machine Learning

The machine learning model is trained with the 'Wisconsin Diagnostic Breast Cancer' dataset. There are 32 columns (features) and 569 rows (samples) in total, with no missing values in the entire dataset. The response variable is binary with two classes: malignant with 212 values and benign with 357 values. Because the data contains a lot of irrelevant information, it must be pre-processed; the data is reduced to 10 columns after preprocessing. Then the dependent variable column is separated from all the independent variables. To execute the classification, the output column must be labeled with binary values, i.e., 0s and 1s, because it contains categorical data. The accuracies, sensitivities/recalls, specificities, F1 scores, and precisions were then calculated for all the models; the rounded figures for all the classifiers are shown in Table 53.1, and a sketch of this pipeline follows.
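The following is a hedged sketch of the preprocessing and training steps summarized above; the CSV file name and the column-name suffixes follow the common Kaggle/UCI layout of the Wisconsin Diagnostic dataset and are assumptions.

```python
# Sketch: drop the id column, keep the 10 mean-value features, map the labels
# to 0/1, split 80/20, and compare the five classifiers used in the paper.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("wdbc.csv")                 # hypothetical file name
df = df.drop(columns=["id"])                 # reference id is not a feature
mean_cols = [c for c in df.columns if c.endswith("_mean")]
X = df[mean_cols]                            # keep only the 10 mean features
y = df["diagnosis"].map({"M": 1, "B": 0})    # malignant = 1, benign = 0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(kernel="rbf"),                # kernel SVM, as in the paper
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "accuracy:", clf.score(X_test, y_test))
```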

53.4.2 Deep Learning

The dataset adopted for the deep learning model is the 'BreaKHis' dataset, which is publicly available at the link in [20]. The breast cancer histopathological image classification (BreaKHis) dataset contains a total of 9109 microscopic images of breast tumor tissue from 82 patients, at different magnification factors (400X, 200X, 100X, and 40X). The images are divided into 5429 malignant and 2480 benign samples (700 × 460 pixels, 3-channel RGB, 8-bit depth in each channel, PNG format). The SOB method, also named partial mastectomy or excisional biopsy, was used to collect the samples available in the dataset; compared to any method of needle biopsy, this type of procedure removes a larger tissue sample and is done in a hospital under general anesthesia. The dataset contains four distinct histological types of benign breast tumors: fibroadenoma (F), adenosis (A), tubular adenoma (TA), and phyllodes tumor (PT); and four malignant tumors (breast cancer): lobular carcinoma (LC), ductal carcinoma (DC), papillary carcinoma (PC), and mucinous carcinoma (MC). The whole dataset for the deep learning model was pre-processed. Then the accuracies, validation accuracies, loss, and validation loss were calculated for both VGG16 (Figs. 53.3, 53.4) and ResNet152 (Figs. 53.5, 53.6).

Fig. 53.3 Training and validation loss graph for VGG16

Fig. 53.4 Training and validation accuracy graph for VGG16

Fig. 53.5 Training and validation loss graph for ResNet152

Fig. 53.6 Training and validation accuracy graph for ResNet152

53.5 Conclusion

The Wisconsin breast cancer dataset produced an output in which the new data was subdivided into two groups: malignant (cancerous) and benign (non-cancerous). The data was pre-processed to make it machine-readable and then separated into two portions once again: the first comprises all of the independent variables (x), whereas the second comprises our dependent variable (y), or response. The accuracy of categorizing the data was determined using the supplied dataset and several machine learning approaches such as the support vector machine, decision tree, Naïve Bayes classifier, and KNN classifier. It was noticed that the support vector machine classifier provides the most effective accuracy compared to the other classifiers, with the remaining models coming close to the SVM classifier's accuracy value. The decision tree provides the most effective precision, while the support vector machine provides the most effective sensitivity and F1 score; in the case of specificity, both KNN and the decision tree gave the best values compared to the other classifiers. So, using these machine learning algorithms, the breast cancer data was classified as malignant or benign. As for the deep learning models, VGG16 gave a training accuracy of 90.18% and a validation accuracy of 78.67%, while ResNet152 gave a training accuracy of 95.71% and a validation accuracy of 82.93%. So, we can conclude that ResNet152 stands at a better position than VGG16. This proposed system may be utilized by doctors, as well as ordinary people, to determine whether cells are malignant or benign.

References

1. Ismail, N.S., Sovuthy, C.: Breast cancer detection based on deep learning technique. In: International UNIMAS STEM 12th Engineering Conference (EnCon), pp. 89–92 (2019)
2. Mekha, P., Teeyasuksaet, N.: Deep learning algorithms for predicting breast cancer based on tumor cells. In: Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT-NCON), pp. 343–346 (2019)
3. Prakash, S.S., Visakha, K.: Breast cancer malignancy prediction using deep learning neural networks. In: Second International Conference on Inventive Research in Computing Applications (ICIRCA), pp. 88–92 (2020)
4. Zou, W., Lu, H., Yan, K., Ye, M.: Breast cancer histopathological image classification using deep learning. In: 10th International Conference on Information Technology in Medicine and Education (ITME), pp. 53–57 (2019)
5. Timmana, H.K., Rajabhushanam, C.: Breast malignant detection using deep learning model. In: International Conference on Smart Electronics and Communication (ICOSEC), pp. 383–388 (2020)
6. Xiang, Z., Ting, Z., Weiyan, F., Cong, L.: Breast cancer diagnosis from histopathological image based on deep learning. In: Chinese Control and Decision Conference (CCDC), pp. 4616–4619 (2019)
7. Zhang, X., et al.: Deep learning based analysis of breast cancer using advanced ensemble classifier and linear discriminant analysis. IEEE Access 8, 120208–120217 (2020)
8. Zheng, J., Lin, D., Gao, Z., Wang, S., He, M., Fan, J.: Deep learning assisted efficient AdaBoost algorithm for breast cancer detection and early diagnosis. IEEE Access 8, 96946–96954 (2020)
9. Shahidi, F., Mohd Daud, S., Abas, H., Ahmad, N.A., Maarop, N.: Breast cancer classification using deep learning approaches and histopathology image: a comparison study. IEEE Access 8, 187531–187552 (2020)
10. Mahmood, T., Li, J., Pei, Y., Akhtar, F., Imran, A., Rehman, K.U.: A brief survey on breast cancer diagnostic with deep learning schemes using multi-image modalities. IEEE Access 8, 165779–165809 (2020)
11. Roslidar, R., et al.: A review on recent progress in thermal imaging and deep learning approaches for breast cancer detection. IEEE Access 8, 116176–116194 (2020)
12. Yari, Y., Nguyen, T.V., Nguyen, H.T.: Deep learning applied for histological diagnosis of breast cancer. IEEE Access 8, 162432–162448 (2020)
13. Alghunaim, S., Al-Baity, H.H.: On the scalability of machine-learning algorithms for breast cancer prediction in big data context. IEEE Access 7, 91535–91546 (2019)
14. Fatima, N., Liu, L., Hong, S., Ahmed, H.: Prediction of breast cancer, comparative review of machine learning techniques, and their analysis. IEEE Access 8, 150360–150376 (2020)
15. Bayrak, E.A., Kırcı, P., Ensari, T.: Comparison of machine learning methods for breast cancer diagnosis. In: Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT), pp. 1–3 (2019)
16. Teixeira, F., Montenegro, J.L.Z., da Costa, C.A., da Rosa Righi, R.: An analysis of machine learning classifiers in breast cancer diagnosis. In: XLV Latin American Computing Conference (CLEI), pp. 1–10 (2019)
17. Naveen, Sharma, R.K., Ramachandran Nair, A.: Efficient breast cancer prediction using ensemble machine learning models. In: 4th International Conference on Recent Trends on Electronics, Information, Communication & Technology (RTEICT), pp. 100–104 (2019)
18. Sengar, P.P., Gaikwad, M.J., Nagdive, A.S.: Comparative study of machine learning algorithms for breast cancer prediction. In: Third International Conference on Smart Systems and Inventive Technology (ICSSIT), pp. 796–801 (2020)
19. Thomas, T., Pradhan, N., Dhaka, V.S.: Comparative analysis to predict breast cancer using machine learning algorithms: a survey. In: International Conference on Inventive Computation Technologies (ICICT), pp. 192–196 (2020)
20. Cancer Medical datasets (2021). http://www.inf.ufpr.br/vri/databases/BreaKHis_v1.tar.gz

Chapter 54

Performance Analysis of Deep Learning Algorithms Toward Disease Detection: Tomato and Potato Plant as Use-Cases

Vijaya Eligar, Ujwala Patil, and Uma Mudenagudi

Abstract Tomato and potato plants are affected by diseases caused by fungi, viruses, and bacteria. In India, tomato and potato crop yields are below global standards, and vision-based solutions are in demand for increasing the yield. Manual diagnosis of crop diseases is a primary cause of the decrease in yield in most parts of the world. Unlike the traditional diagnosis process, machine learning algorithms used for diagnosis are semi-automated and provide accurate solutions using hand-crafted features. However, choosing appropriate features for the machine to learn is a challenging task, as it necessitates a large amount of data. Deep learning architectures provide improved classification accuracy on image classification problems. In this paper, the performances of state-of-the-art deep learning architectures like InceptionV3, ResNet50, VGG16, MobileNet, Xception, and DenseNet121 toward plant leaf disease detection using the PlantVillage dataset are analyzed. The dataset consists of 11,370 images of tomato and potato leaves in healthy and non-healthy categories, and the leaves are classified into ten classes. An extensive ablation study, tuning the hyperparameters of the state-of-the-art methods, is presented, and an improved accuracy of 98.83% with the MobileNet architecture is observed. The results of the study are presented using appropriate quantitative scores on the benchmark dataset.

V. Eligar (B) · U. Patil · U. Mudenagudi Center of Excellence in Visual Intelligence, KLE Technological University, Hubballi, India e-mail: [email protected] URL: https://www.kletech.ac.in U. Patil e-mail: [email protected] U. Mudenagudi e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_54

54.1 Introduction

Tomato and potato are among the most important crops grown in India and account for 11% and 10% of global production, respectively [1]. Maharashtra and Karnataka are the world's second-largest producers of tomatoes, while Uttar Pradesh and West Bengal are the major producers of potatoes. India is still behind the global standards in terms of yield of these crops. There are various challenges involved in producing larger quantities of food, owing to population growth, climate change, insufficient resources, and plant diseases. Major problems in the cultivation of these crops are caused by pests and natural calamities. Preventing such diseases is crucial and demands technological solutions for early detection and prevention. The need of the hour is to educate farmers about the available technology and make them acquainted with its usage for increasing production and yield. Traditional methods involve manual diagnosis for disease detection and prevention, thereby limiting the production/yield needed to meet global standards. Typically, in traditional methods farmers observe the shapes and colors of the leaves to identify a particular disease; such manual diagnosis of plant disease is laborious and time consuming. Toward this, a deep learning-based solution for plant leaf disease detection is proposed here. Image processing and machine learning are rapidly evolving areas for crop disease detection and localization. The main drawback of these algorithms is their reliance on hand-crafted features [2]; choosing appropriate features for the machine to learn is a challenging task [3]. Toward this, several deep learning algorithms came into existence. Neural networks are designed to handle complex nonlinear structures and provide a feasible and more accurate solution compared to machine learning algorithms. Many researchers [4–6] provide semi-automated solutions for plant leaf disease detection; however, most of these methods are limited by the underlying dataset chosen. Mokhtar et al. [7] apply the Gabor wavelet transform to extract features of tomato leaf images using 100 images for each class; their approach provides an overall accuracy of 99.5% with an SVM classifier. Mokhtar et al. [5] apply the gray level co-occurrence matrix for feature extraction; using an SVM classifier on two classes, a classification accuracy of 99.83% was achieved. Revathy et al. [8] detect the bacterial infection present on crape jasmine and tomato leaves on a dataset containing 800 images with 2 classes, using the scaled conjugate gradient algorithm; it provides 94% and 86% accuracy for the two types of leaves. However, there is scope to experiment with a heterogeneous dataset and provide insights on accuracy for the same. Deep learning methods may be a solution going ahead. A large dataset, which is a necessity for high-quality decisions, is needed; machine learning techniques cannot train on such large datasets and are not efficient at classifying a large variety of diseases. To address this, decisions from multiple architectures/algorithms can be fused to generate a robust decision, as in [9]. Tm et al. [6] classify 10 different classes using a LeNet model on the PlantVillage dataset; the model obtained an accuracy of 94–95% with different learning rates and optimizers. Elhassouny et al. [4] propose an efficient mobile application using a deep convolutional neural network (CNN), based on MobileNet, to recognize 10 classes of tomato leaf diseases.
The authors have proposed to further expand the fault diagnosis model in order to enhance identification accuracy of tomato diseases


by using more high-quality images of different tomato diseases. Mohanty et al. [10] used the AlexNet and GoogleNet deep learning architectures, obtaining an accuracy of 99.35% on the PlantVillage dataset; the model accuracy reduced substantially to 31.4% on a non-homogeneous background. The choice of dataset is a recurring problem observed in the literature: the majority of proposed models use images with a static background, but in real time it is challenging to capture and process high-quality images of tomato leaves with diseases. However, authors can consider designing architectures for non-homogeneous backgrounds. The organization of the paper is as follows: Sect. 54.1.1 emphasizes the contributions. Section 54.2 explains the proposed methodology and model used, along with the measures taken to achieve the required objectives. The experimental settings are discussed in Sect. 54.3. Section 54.4 deals with the outcomes and analysis of the proposed methodology. The conclusion and future work are discussed in Sect. 54.5.

54.1.1 Contributions

1. InceptionV3, ResNet50, VGG16, Xception, MobileNet, and DenseNet121 are modeled to classify tomato and potato leaf diseases considering the ten classes of the PlantVillage dataset.
2. The results are demonstrated using appropriate quantitative metrics, including accuracy, recall, precision, and F1-score, on the benchmark dataset.
3. The performances of InceptionV3, ResNet50, VGG16, Xception, MobileNet, and DenseNet121 on the benchmark dataset are compared.
4. Extensive experiments are performed, showing that MobileNet performs better than InceptionV3, ResNet50, VGG16, Xception, and DenseNet121 at classifying tomato and potato leaf diseases.

54.2 Performance Analysis of Deep Learning Algorithms

In this paper, extensive experiments are performed using state-of-the-art architectures for tomato and potato leaf disease detection. The pipeline includes four stages: input image, data augmentation, transfer learning, and classification, as shown in Fig. 54.1.

54.2.1 Dataset

The input images are taken from the PlantVillage dataset, an open-access repository that facilitates faster research on disease detection. The potato and tomato leaves subset from PlantVillage, consisting of 10 classes of healthy and non-healthy leaves with a total of 11,370 images, is used. The different classes, along with sample images, are shown in Fig. 54.2. There are 3 classes of potato leaves, 1 healthy and 2 non-healthy, and similarly 7 classes of tomato leaves, 1 healthy and 6 non-healthy. The healthy and non-healthy classes (with image counts) are: Tomato Septoria leaf spot (1771), Tomato bacterial spot (2127), Potato healthy (152), Tomato early blight (1000), Tomato healthy (1591), Tomato leaf mold (952), Tomato target spot (1404), Potato late blight (1000), Tomato mosaic virus (373), and Potato early blight (1000).

Fig. 54.1 Transfer learning approach for leaf disease detection

Fig. 54.2 Tomato and potato leaves dataset from PlantVillage dataset [10]

54.2.2 Data Augmentation

Typically, a higher computational cost is observed if the dataset is used for training the deep learning model without preprocessing. From the literature, it is inferred that the dataset should be customized to meet the specifications of the architecture. The dataset is divided in an 80:20 ratio into training and testing sets, respectively. The state-of-the-art architectures have millions of parameters, which require large datasets; otherwise, the network is overfitted. Since the dataset is skewed, it has to be balanced, and augmentation is carried out with the intention of increasing the accuracy, as sketched below.
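The following is a minimal sketch of the augmentation operations and the 80:20 split described above, using the Keras ImageDataGenerator; the directory name is an assumption, and `class_mode="sparse"` matches the sparse categorical cross entropy loss used later in Sect. 54.3.

```python
# Augmentation (rotation, flips, shifts) plus rescaling and an 80:20
# training/validation split, assuming a hypothetical image directory.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=30,        # rotation
    horizontal_flip=True,     # flipping
    vertical_flip=True,
    width_shift_range=0.1,    # horizontal shifting
    height_shift_range=0.1,   # vertical shifting
    validation_split=0.2)     # 80:20 split of the dataset

train = datagen.flow_from_directory(
    "plantvillage", target_size=(224, 224), batch_size=30,
    class_mode="sparse", subset="training")
val = datagen.flow_from_directory(
    "plantvillage", target_size=(224, 224), batch_size=30,
    class_mode="sparse", subset="validation")
```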

54.2.3 Transfer Learning

From the literature, it is inferred that transfer learning is an appreciated approach, since it uses previous learning as prior knowledge and facilitates faster convergence. Here, the InceptionV3, ResNet50, VGG16, Xception, DenseNet121, and MobileNet pre-trained models are considered; these models have already acquired knowledge from larger benchmark datasets.

• Inception networks: Inception networks are proven to be computationally effective, both with respect to the number of parameters and the economic cost incurred [11].
• Xception networks: Xception stands for "extreme inception". The Xception network can implement a deeper model with faster computation than the conventional convolution method by utilizing depth-wise separable convolution layers with residual connections [12]. In the proposed work, the images are resized to 299 × 299 to meet the Inception and Xception network requirements.
• ResNet: ResNet is a deep residual learning model used for the classification of plant leaf diseases [13].
• VGG16: VGG16 is used as an image-feature extractor in transfer learning. It is an efficient deep learning model for image classification with 16 layers [14], though VGG is deeper and has more parameters. Its classification accuracy is improved by increasing the depth of convolution: it uses multiple convolutional layers rather than a single convolution layer, which better extracts the image features, uses max-pooling for down-sampling, and adds the rectified linear unit (ReLU) as the activation function.
• MobileNet: MobileNet uses depth-wise separable convolutions toward the construction of less expensive deep convolutional neural networks and is appreciated for applications like mobile and embedded vision [15].
• DenseNet121: DenseNet121 has a depth of 121 layers. DenseNets have several compelling advantages: they alleviate the vanishing gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters [16]. Those 121 layers are varied on purpose: some are convolution layers, pooling, concatenation of matrices, densely connected layers, and others.

All state-of-the-art architectures have a stack of convolution and pooling layers at the beginning and fully connected layers at the end with soft-max output layers. In the proposed network, the images are resized to 224 × 224 to meet the requirements of the considered networks.


Fig. 54.3 Overview of plant disease detection using deep learning architecture: Tomato and potato

All models consist of several convolution layers, followed by ReLU, average-pooling, and normalization layers. The fully connected layers have been modified to match the total number of classes, i.e., 8 disease and 2 healthy classes, resulting in a total of 10 classes. The RGB input images given in the PlantVillage dataset are considered, and the models are initialized with ImageNet weights. The fully connected and dense layers are replaced to classify the input test data as healthy or non-healthy. In addition, the architecture is retrained for 30 epochs and the effect on classification accuracy is observed (Fig. 54.3).

54.3 Experimental Settings

A total of 11,370 images of potato and tomato leaves from the PlantVillage dataset are used here. Data augmentation is performed using operations like rotation, flipping, and vertical and horizontal shifting of images. The deep learning models are deployed using the TensorFlow API. In this experiment, only the last fully connected layer is trained. The transfer learning models InceptionV3, ResNet50, VGG16, Xception, DenseNet121, and MobileNet, which are pre-trained on the ImageNet dataset, are used here. RMSprop and Adam optimizers with sparse categorical cross entropy are used for retraining. The implemented model consists of one fully connected layer of 1024 neurons, a dropout layer at the rate of 0.5, and a classification layer with 10 neurons for predicting the 10 different classes. The output of the pre-trained model is given as the base to the classification layer. For extracting the features, the six deep learning models VGG16, InceptionV3, Xception, ResNet50, MobileNet, and DenseNet121 have been used, compiled with the RMSprop and Adam optimizers at various learning rates. The models are trained using the extracted training and validation features for 30 epochs with a batch size of 30.
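The following is a hedged sketch of the classification head described above: a frozen MobileNet base, one fully connected layer of 1024 neurons, dropout at 0.5, and a 10-neuron classification layer, compiled with Adam and sparse categorical cross entropy. The learning rate 0.00005 is the best setting reported in Sect. 54.4; `train` and `val` are the generators from the augmentation sketch in Sect. 54.2.2, and the early-stopping patience value is an assumption.

```python
# Transfer-learning head on a frozen, ImageNet-initialized MobileNet base.
import tensorflow as tf

base = tf.keras.applications.MobileNet(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                     # only the new layers are trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1024, activation="relu"),   # fully connected layer
    tf.keras.layers.Dropout(0.5),                     # dropout at rate 0.5
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 output classes
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

callbacks = [
    # Reduce the learning rate by a factor of 0.2 when the loss plateaus.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2),
    # Stop training once the validation loss starts increasing
    # (patience=3 is an assumed value).
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
]
model.fit(train, validation_data=val, epochs=30, callbacks=callbacks)
```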


54.4 Results and Discussion

Table 54.1 shows the accuracies obtained while employing transfer learning on the PlantVillage dataset for the six state-of-the-art models, with learning rates of 0.001, 0.00001, 0.00002, and 0.00005, using the RMSprop and Adam optimizers. The learning rate is further reduced by a factor of 0.2 on plateaus where the loss stops decreasing. Early stopping has also been used in order to monitor the validation loss and stop the training process once it increases. All the experiments were implemented in Python under Windows 10 using an NVIDIA GTX1050 GPU with 4 GB of video memory. From Table 54.1, it is observed that MobileNet attained the best accuracy of 98.83% with the Adam optimizer at a learning rate of 0.00005. Accuracy versus learning rate/batch size is depicted in Fig. 54.4. The training and validation accuracy is depicted in Fig. 54.5. Finally, the precision, recall, and F1-scores are displayed in Figs. 54.6, 54.7, 54.8, and 54.9.

54.5 Conclusions

This paper demonstrated a performance analysis of the state-of-the-art methods InceptionV3, ResNet50, VGG16, Xception, DenseNet121, and MobileNet for tomato and potato leaf disease detection using the PlantVillage dataset. Various deep learning architectures were experimented with by tuning the hyperparameters, and it was observed that the accuracy improved to 98.83% with the MobileNet architecture. This accuracy was achieved for the chosen dataset, but it dropped to 40% when images with a dynamic background were used for training and testing.

Table 54.1 Validation accuracy of potato and tomato leaf disease detection (accuracy columns are listed per learning rate (LR) and batch size (BS); the F1-score and Precision columns are reported at LR 0.00005, BS 30)

Model-optimizer     | LR 0.001, BS 16 | LR 0.00001, BS 30 | LR 0.00002, BS 30 | LR 0.00005, BS 30 | LR 0.001, BS 64 | F1-score | Precision | Misclassified
InceptionV3-Adam    | 88.28 | 76.97 | 82.01 | 89.27 | 90.31 | 88.97  | 89.00 | 378
InceptionV3-RMSprop | 89.46 | 78.76 | 85.52 | 90.87 | 91.33 | 87.21  | 79.99 | 415
VGG16-Adam          | 97.01 | 96.87 | 96.89 | 97.10 | 97.86 | 94.67  | 94.60 | 31
VGG16-RMSprop       | 87.34 | 93.96 | 94.81 | 97.54 | 92.55 | 94.67  | 94.01 | 48
ResNet50-Adam       | 61.33 | 62.68 | 70.06 | 75.82 | 71.33 | 70.12  | 71.20 | 846
ResNet50-RMSprop    | 61.89 | 65.76 | 74.94 | 76.89 | 70.44 | 69.56  | 71.38 | 860
Xception-Adam       | 90.88 | 91.21 | 94.78 | 95.70 | 94.11 | 90.12  | 91.25 | 345
Xception-RMSprop    | 97.00 | 91.92 | 93.83 | 94.09 | 90.55 | 88.45  | 90.23 | 415
MobileNet-Adam      | 90.71 | 97.47 | 97.86 | 98.83 | 97.99 | 100.00 | 98.41 | 11
MobileNet-RMSprop   | 92.67 | 97.01 | 97.26 | 98.61 | 97.99 | 98.10  | 98.26 | 13
DenseNet121-Adam    | 91.11 | 93.94 | 95.82 | 98.52 | 98.00 | 84.23  | 85.01 | 514
DenseNet121-RMSprop | 97.65 | 94.87 | 94.31 | 97.67 | 97.01 | 83.25  | 86.11 | 520


Fig. 54.4 The plot for accuracy vs learning rate/batch size. Blue color indicates learning rate 0.001 with batch size 16. Orange color indicates learning rate 0.00001 with batch size 30. Gray color indicates learning rate 0.00002 with batch size 30. Yellow color indicates learning rate 0.00005 with batch size 30. Dark blue color indicates learning rate 0.001 with batch size 64. MobileNet attained the best accuracy of 98.83% with the Adam optimizer at a learning rate of 0.00005 and batch size 30

Fig. 54.5 Training and validation accuracy for tomato and potato leaf disease detection of the MobileNet architecture

Fig. 54.6 F1-score, precision, recall of tomato and potato leaf disease detection of MobileNet architecture


Fig. 54.7 The plot of precision vs state-of-the-art models with optimizers. MobileNet architecture has precision of 98.41%

Fig. 54.8 The plot of F1-score vs state-of-the-art models with optimizers. MobileNet architecture has an F1-score of 99.10%


Fig. 54.9 The plot of recall vs state-of-the-art models with optimizers. MobileNet architecture has a recall of 99.80%

References

1. Costa, J., Heuvelink, E.: The Global Tomato Industry, pp. 1–26. Crop Production Science in Horticulture Series, CABI, 2nd edn. (2018)
2. Tabib, R.A., Patil, U., Naganandita, T., Gathani, V., Mudenagudi, U.: Dimensionality reduction using decision-based framework for classification: sky and ground. In: Progress in Intelligent Computing Techniques: Theory, Practice, and Applications, pp. 289–298. Springer, Berlin (2018)
3. Patil, U., Mudengudi, U.: Evidence based image selection for 3D reconstruction. In: Computer Vision, Pattern Recognition, Image Processing, and Graphics: 7th National Conference, NCVPRIPG 2019, Hubballi, India, December 22–24, 2019, Revised Selected Papers, vol. 1249, p. 53. Springer Nature (2020)
4. Elhassouny, A., Smarandache, F.: Smart mobile application to recognize tomato leaf diseases using convolutional neural networks. In: 2019 International Conference of Computer Science and Renewable Energies (ICCSRE), pp. 1–4. IEEE, New York (2019)
5. Mokhtar, U., El Bendary, N., Hassenian, A.E., Emary, E., Mahmoud, M.A., Hefny, H., Tolba, M.F.: SVM-based detection of tomato leaves diseases. In: Intelligent Systems' 2014, pp. 641–652. Springer, Berlin (2015)
6. Tm, P., Pranathi, A., SaiAshritha, K., Chittaragi, N.B., Koolagudi, S.G.: Tomato leaf disease detection using convolutional neural networks. In: 2018 Eleventh International Conference on Contemporary Computing (IC3), pp. 1–5. IEEE (2018)
7. Mokhtar, U., Ali, M.A., Hassenian, A.E., Hefny, H.: Tomato leaves diseases detection approach based on support vector machines. In: 2015 11th International Computer Engineering Conference (ICENCO), pp. 246–250. IEEE, New York (2015)
8. Revathy, R., Roselin, D.: Digital image processing techniques for bacterial infection detection on tomato and crape jasmine leaves. Int. J. Sci. Eng. Res. (2015)
9. Tabib, R.A., Patil, U., Ganihar, S.A., Trivedi, N., Mudenagudi, U.: Decision fusion for robust horizon estimation using Dempster Shafer combination rule. In: 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 1–4. IEEE, New York (2013)
10. Mohanty, S.P., Hughes, D.P., Salathé, M.: Using deep learning for image-based plant disease detection. Front. Plant Sci. 7, 1419 (2016)
11. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
12. Chollet, F.: Xception: deep learning with depth-wise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258 (2017)
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
15. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
16. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)

Chapter 55

Classification of Human Facial Portrait Using EEG Signal Processing and Deep Learning Algorithms

Jehangir Arshad, Saqib Salim, Amna Khokhar, Zanib Zulfiqar, Talha Younas, Ateeq Ur Rehman, Mohit Bajaj, and Subhashree Choudhury

Abstract An electroencephalogram (EEG) is used to evaluate the electrical activity of the brain. When a person sees something, the brain creates a mental percept; this percept can be captured using EEG to get an instantaneous illustration of what is happening inside the brain during that particular process. This study aims to design an automated convolutional neural network (CNN)-based deep learning algorithm that can be employed for the visual classification of a human facial portrait using electroencephalography processing. Moreover, EEG information evoked by visual photograph stimuli has been fed to the convolutional neural network (CNN) to learn a discriminative brain-activity manifold of image categories within the mind-reading process. We have used a 9-channel EEG Mindwave Mobile 2 headset to record the brain activity of subjects while looking at images of four persons from the dataset. The presented results validate the proposed algorithm, as it shows a precision of 80%, which greatly outperforms the existing techniques. Furthermore, this study shows that the capabilities learned by CNN-based deep learning models can be employed for automated visual classification, which, with a few further improvements, can be used for disabled persons and criminal investigation.

J. Arshad Department of Electrical and Computer Engineering, COMSATS University Islamabad, Lahore Campus, Lahore 54000, Pakistan S. Salim · A. Khokhar · Z. Zulfiqar · T. Younas Department of Electrical and Computer Engineering, COMSATS University Islamabad, Sahiwal Campus, Sahiwal, Pakistan A. U. Rehman Department of Electrical Engineering, Government College University, Lahore 54000, Pakistan M. Bajaj Department of Electrical and Electronics Engineering, National Institute of Technology, Delhi, India S. Choudhury (B) Department of Electrical and Electronics Engineering, Siksha O Anusandhan (Deemed to be) University, Odisha 751030, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_55


55.1 Introduction and Related Studies

Electroencephalography (EEG) is an electrophysiological technique used to monitor the electrical activity of the human brain. The human brain creates a mental percept of the object captured by the human eye, which is essentially a mental impression of that thing. It is possible to capture this percept using brain signals to get a direct illustration of what is happening in the brain during a certain process [1]. The important task here is the decoding of the sensory inputs received from brain impressions, which is a fundamental modern technological challenge and a constituent of neuroscience expertise [2]. However, classifying objects of similar categories and identifying distinct human faces from brain activity patterns evoked by visually similar inputs is enormously challenging. The recent resurgence of convolutional neural networks (CNNs) has caused a crucial performance improvement in automatic visual classification; nevertheless, their generalization is not at the human level, since the discriminative feature space they learn depends particularly upon the employed training dataset rather than on more general principles. Natural images, introduced earlier by another deep neural network, are also used, and our model is trained with those natural images. More specifically, the first-layer features of a CNN appear to generalize across different datasets, as they resemble Gabor filters and color blobs, while the final-layer features are specific to the dataset. The proposed scheme is based on EEG signals. The recording of the EEG signals is performed by fixing electrodes on the subject's scalp using the standardized electrode placement scheme [3], which is shown in Fig. 55.1. For this purpose, the Neurosky Mindwave Mobile 2 headset is used to extract and record the EEG signals, which are further processed using deep learning algorithms. Furthermore, human judgment of visualization is used, which provides an effective combination of multiple deep neural network (DNN) layers to improve the visual quality of generated images; the results suggest that hierarchical visual information in the brain can be combined competently to recognize perceptual and subjective images [4]. This paper presents a novel approach to feature extraction and automated visual classification from the perceptual content of brain activity. We have used electroencephalography, a recording process of the electrical activity of the brain that contains invaluable information related to the various physiological states of the brain. In recent research, researchers have worked on the recording and processing of EEG signals in many fields. The concept of reading and analyzing data from the mind of a person while performing specific tasks has been investigated so far, especially for building brain-computer interfaces (BCI). Most brain-computer interface (BCI) studies have carried out binary EEG classification in the absence of specific patterns. For example, researchers implemented a CNN for epileptic seizure prediction from intracranial EEG and highlighted the usefulness of bivariate measures of brainwave synchronization [5]. Similarly, for


Fig. 55.1 Standardized electrode placement scheme

P300 detection, researchers used four single classifiers with different feature sets and three multi-classifiers [6]. Recently, other works have applied deep learning to model more complex cognitive states (e.g., audio stimuli or cognitive load) from brain signals [7–9]. A blend of convolutional and recurrent ANNs has also been suggested to analyze EEG signals in cognitive load classification tasks, with a reported classification accuracy of approximately 90% over four cognitive load levels. These strategies have proved the capability of using brain signals and deep learning for classification; however, they deal with a small number of classes, and none of them is related to visual scene understanding [10, 11]. Much cognitive neuroscience research has established that various object classes are somehow reflected in event-related potential (ERP) amplitudes recorded via EEG (by identifying areas of the visual cortex). However, these scientific indications have not been thoroughly utilized to build classifiers of visual stimuli-evoked EEG [11].

In this work, the aim is to explore a direct form of human involvement, a new vision of the human-based computation approach for automated visual classification. The underlying concept is to learn a brain-signal discriminative manifold of visual classes by classifying EEG signals through thought analysis [12–15]. We define the BCI task as interpreting object class related EEG signals for inclusion into computer vision methods. In our research, real-time EEG-based human visual records are used to identify the picture of the respective person shown to the subject, using convolutional neural networks and other deep learning strategies.


55.2 Methodology

This paper aims to design an algorithm that can record the EEG signals of the human brain and then process the gathered information for visual classification using deep learning algorithms. Neuroscience works exquisitely in approaching the visual scenes evoked by the human brain. Deep learning strategies have the potential to learn features at multiple levels, which makes the machine capable of learning a complex mapping function f : X → Y directly from the given data, without the assistance of human-crafted features. The foremost characteristic of deep learning methods is that their models contain deep architectures; a deep architecture means multiple hidden layers in the network. We have used convolutional neural networks for better training and testing of EEG data. Convolutional neural networks (CNNs) are among the most powerful deep learning models used for object detection and classification. In a CNN, all neurons are connected to a set of neurons in the next layer in a feedforward fashion. Figure 55.2a, b illustrates the block diagram of the proposed model and the internal functionality of the ANN.

The fundamental objective of this paper is to reconstruct an impression of the human picture. An EEG signal is the external manifestation of thinking activity. Since EEG signals were first recorded, people have used an assortment of strategies to employ EEG signals to uncover brain activity. EEG signals evoked by picture stimuli have been widely used in many investigations because of their stable features. The increasing refinement and number of resources available in modern EEG systems and in clinical imaging pipelines will continue to advance the field.
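As a concrete illustration of such a feedforward mapping f : X → Y, the minimal PyTorch sketch below maps a single-channel EEG window to the four portrait classes; the 512-sample window, channel count, and layer sizes are illustrative assumptions rather than the exact architecture used in this work.

```python
# Minimal sketch of a feedforward CNN for EEG classification (illustrative
# layer sizes; not the exact architecture of the proposed model).
import torch
import torch.nn as nn

class EEGCNN(nn.Module):
    def __init__(self, n_classes: int = 4):
        super().__init__()
        # convolution + pooling blocks extract local temporal features
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        # fully connected head maps the pooled features to the portrait classes
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 32, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        # x: (batch, 1, 512) single-channel EEG window
        return self.classifier(self.features(x))

model = EEGCNN()
logits = model(torch.randn(8, 1, 512))   # -> shape (8, 4)
```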

Fig. 55.2 a Schematic of proposed model. b Schematic of proposed model and working of ANN


Fig. 55.3 Raw EEG signal of a subject

EEG data acquisition: Fundamentally, high-resolution EEG is used as a tool in the clinical diagnostic workup and has a bright future in this field. The basic architecture of the CNN has been shown in Fig. 55.3; it incorporates three primary sub-blocks comprising convolution, pooling, and fully connected layers. Seven subjects were shown human facial portraits while their EEG data were recorded. We considered four human facial portraits shown to each subject, and the respective EEG data were recorded. The EEG data of all seven subjects were recorded in the same experimental environment. All subjects were evaluated to exclude possible conditions, such as diseases, interfering with the acquisition process. The dataset used for visual stimuli is a subset of four human facial portraits. During the experiment, each image was shown to the subject for 30 s followed by a 20 s pause, and the EEG signals were recorded. A summary of the experimental and random paradigms is shown in Table 55.1. The experiments were conducted using a Neurosky Mindwave Mobile 2 headset having 9 channels and active low-impedance electrodes. The sampling frequency and data resolution of the headset are 512 Hz and 12 bits, respectively. The raw EEG signals are processed using Hamming and Hanning windows with low band-pass

Table 55.1 Summary of the experimental and random paradigm

1   Total number of images                          4
2   Time for each image                             30 s
3   Total sessions                                  4
4   Running (session) time                          900 s
5   Running (overall) time                          3600 s
6   Pause between images                            20 s
7   Total number of recorded signals per subject    80


and Butterworth filters to remove unwanted noise and distortion from the signal. The raw EEG signal contains the power spectrum of the alpha (8–12 Hz), beta (12–40 Hz), and gamma (40–100 Hz) bands, among others; a sketch of this band-pass filtering step is given after Table 55.2. Figure 55.2 shows an EEG signal of a subject that was extracted using an application and plotted in MATLAB. Humans show brilliant performance, still out of reach for machines, when interpreting visual scenes. For each signal, the timestamp and location are eliminated. The EEG signal of each subject for each image is then labeled to avoid any inconvenience during the testing and training of the CNN. Table 55.2 shows the training dataset of the EEG signal of one subject in .csv file format.

EEG manifold learning for feature extraction: The main examination targets translating an input multichannel temporal EEG sequence into a low-dimensional feature vector summarizing the important content of the input sequence. Past methodologies essentially concatenate the time sequences from numerous channels into a single feature vector, disregarding temporal dynamics, which instead contain principal information for understanding EEG activity [2, 15]. To include such dynamics in our representation, we concatenate the time sequences and employ an LSTM recurrent ANN owing to its ability to track long-term dependencies in its input data. In addition, the EEG signals have been evaluated at different points using different software to carry out the model building.

Table 55.2 The training data of the EEG signal of one subject in .csv format

Time sampling   Poor signal   EEG raw value   EEG raw value
1.575E+12       26            −101            −2.22E−05
1.575E+12       26            −108            −2.37E−05
1.575E+12       26            −111            −2.44E−05
1.575E+12       26            125             2.75E−05
1.575E+12       26            −86             −1.89E−05
1.575E+12       26            −84             −1.85E−05
1.575E+12       26            99              2.18E−05
1.575E+12       26            −124            −2.72E−05
1.575E+12       26            112             2.46E−05
1.575E+12       26            115             2.53E−05
1.575E+12       26            90              1.98E−05
1.575E+12       26            −109            −2.40E−05
1.575E+12       26            121             2.66E−05
1.575E+12       26            88              1.93E−05
1.575E+12       26            −119            −2.61E−05
1.575E+12       26            −118            −2.59E−05
1.575E+12       26            −124            −2.72E−05
1.575E+12       26            −127            −2.79E−05
1.575E+12       26            −124            −2.72E−05
1.575E+12       26            117             2.57E−05
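The band-pass step mentioned before Table 55.2 can be sketched as follows; the 512 Hz sampling rate comes from the acquisition description, while the 4th filter order and the zero-phase filtering are illustrative assumptions.

```python
# Sketch of Butterworth band-pass filtering of a raw EEG trace into the
# alpha, beta, and gamma bands named in the text.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 512  # headset sampling frequency (Hz)

def bandpass(raw, low, high, order=4):
    """Zero-phase Butterworth band-pass between low and high (Hz)."""
    b, a = butter(order, [low, high], btype="band", fs=FS)
    return filtfilt(b, a, raw)

raw = np.random.randn(FS * 30)      # 30 s of (synthetic) raw signal
alpha = bandpass(raw, 8, 12)        # alpha band
beta = bandpass(raw, 12, 40)        # beta band
gamma = bandpass(raw, 40, 100)      # gamma band
```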


Figure 55.3 shows a diagram of the proposed approach. In the upper section, a low-dimensional representation of the temporal EEG signals recorded while subjects looked at pictures is learned by the encoding component, and the computed EEG features are used to train an image classifier. In the lower section, a CNN is trained to estimate EEG features directly from images; then, the classifier trained in the previous stage is used for automated classification without the need for EEG data for new images. The encoder and classifier training is carried out by gradient descent, supplying the class label associated with the picture shown while each EEG sequence was recorded. After training, the encoder can be used to generate EEG features from input EEG sequences, while the classification system is used to predict the person's picture for an input EEG feature representation, which can be computed from EEG signals. Figure 55.4 illustrates the EEG manifold learning and regression with identifier and subject.

The early layers of a CNN attempt to learn general features of the images, which are common across many tasks; we therefore initialize the weights of these layers from pretrained models and then learn the weights of the last layers from scratch in an end-to-end setting. We have used a preprogrammed CNN and modified it by replacing the SoftMax layer with a regression layer for classification, as shown in Fig. 55.5. The CNN-based regressor has different layers, such as convolution, max pooling, and average pooling layers, with different strides and padding [10]. Long short-term memory (LSTM) and CNN models, both deep and shallow, represent the state of the art in deep learning for BCI tasks.
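A minimal sketch of this regression stage follows: a small image CNN trained with a mean-squared-error loss to predict the EEG feature vector of the image shown to the subject, so that new images can later be classified without recording EEG. The 64 x 64 input size and the 128-dimensional target are illustrative assumptions (the latter consistent with the 128 LSTM memory cells reported below).

```python
# Sketch of the CNN-based regressor: it maps an image to the EEG feature
# vector that the encoder produced for that image, trained with MSE loss.
import torch
import torch.nn as nn

class ImageToEEGRegressor(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, feat_dim),
        )

    def forward(self, img):
        return self.net(img)

regressor = ImageToEEGRegressor()
images = torch.randn(8, 3, 64, 64)          # batch of portrait images
eeg_features = torch.randn(8, 128)          # encoder output for the same trials
loss = nn.MSELoss()(regressor(images), eeg_features)
```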

Fig. 55.4 EEG manifold learning and regression with identifier and subject


Fig. 55.5 CNN-based regressors

The flow chart provided in Fig. 55.5 shows the detailed flow of this work. A long short-term memory (LSTM) is a deep learning architecture that avoids the vanishing gradient problem. An LSTM is typically augmented by recurrent gates called forget gates. An LSTM thus prevents back-propagated errors from vanishing or exploding; instead, errors can flow backward through unlimited numbers of virtual layers unfolded in time. LSTM models with up to a few LSTM layers comprising 32 to 256 memory cells were examined. Since it is not feasible to know the best model structure a priori, each configuration was trained and monitored on the validation set [16–22]. The best results were achieved with a single layer and 128 LSTM memory cells, which is consistent with previously obtained results [9]. The concept of studying people's thoughts while they perform specific tasks has long been investigated to build brain-computer interactions.
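A minimal sketch of this encoder is given below: a single LSTM layer with 128 memory cells (the configuration reported as best above), whose final hidden state serves as the compact EEG feature vector; the 9-channel, 512-sample input shape follows the acquisition description and is otherwise illustrative.

```python
# Sketch of the LSTM encoder: one layer, 128 memory cells; the final hidden
# state is used as the low-dimensional EEG feature vector.
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    def __init__(self, n_channels: int = 9, n_cells: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_channels, hidden_size=n_cells,
                            num_layers=1, batch_first=True)

    def forward(self, x):
        # x: (batch, time, channels); h_n is the last hidden state per sequence
        _, (h_n, _) = self.lstm(x)
        return h_n[-1]                          # (batch, 128) feature vector

encoder = LSTMEncoder()
features = encoder(torch.randn(8, 512, 9))      # -> shape (8, 128)
```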


55.3 Performance Analysis and Results

The dataset is divided into training, validation, and test sets, with 80% of the data for training and 20% for testing; the division into training and testing sets was additionally verified using different software. We have ensured that the signals created by each subject for a single picture are all contained in a single split. The entire model design was chosen based only on the validation results, which keeps the test split an uncontaminated and reliable quality index for the final evaluation. The total number of EEG sequences used for training the CNN encoder is 1280. The classifier shown in Fig. 55.6 was used to assess the accuracy of the trained encoder, and the same classifier is applied to CNN-regressed EEG features for automated visual classification. The proposed stacked LSTM encoder approach was able to reach 70% classification accuracy, which considerably outperforms the 29% achieved over the 12 classes of their dataset; the visual scenes were recorded for 30 s with a time interval of 1 ms.
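One way to realize this grouping-aware 80/20 split is scikit-learn's GroupShuffleSplit, sketched below on synthetic arrays; building the groups from (subject, image) pairs is our reading of the constraint described above.

```python
# Sketch of an 80/20 split that keeps every (subject, image) group intact,
# using synthetic stand-ins for the 1280 encoded EEG sequences.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_seq = 1280
X = np.random.randn(n_seq, 128)            # encoded EEG feature vectors
y = np.random.randint(0, 4, n_seq)         # which of the 4 portraits was shown
subject = np.random.randint(0, 7, n_seq)   # which of the 7 subjects
groups = subject * 4 + y                   # one group per (subject, image) pair

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
# No (subject, image) group now appears in both the train and test sets.
```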

Fig. 55.6 Graphical representation of training data


Table 55.3 Comparative analysis based on universal and transfer accuracy

Schemes        Universal accuracy (%)   Transfer accuracy (%)   Increase (%)
Proposed CNN   62.1                     80                      17.9
MLP [3]        55.7                     62.3                    6.6
LSTM [3]       63.6                     67.2                    5.8

Table 55.2 also indicates that the timing may influence the classification performance. So far, it has been established that the feature extraction for individual picture recognition in a human occurs during the first 50–120 ms (stimulus propagation time from the eye to the visual cortex), depending on the cognitive capability of the subject, though less is known about the period after 120 ms. In our investigations, as Table 55.2 shows, each image is displayed for 30 s so that the subject can visualize the picture features of the individual. The best performance was obtained when ideal surrounding conditions were provided to the subject. Thus, the visual classification is activated after the initial visual recognition process in the visual cortex. Table 55.3 provides a comparative analysis of the proposed model and the existing illusion detection scheme proposed in [3]. The comparison has been performed based on universal accuracy and transfer accuracy. Table 55.3 depicts a considerable improvement with the CNN application.

55.4 Conclusion and Future Works

In this paper, a human brain-driven automated visual classification strategy has been proposed. It involves a two-phase CNN-based LSTM model to learn the visual stimuli evoked in EEG data, using a maximally compact representation of such data. Furthermore, a CNN-based regression approach is applied to images to learn the EEG feature representation, thereby enabling automated visual classification on an EEG data manifold evoked from the human brain. The approach shows competitive performance, specifically with respect to other EEG learning representations of different object classes, and is relevant to ongoing study of the human brain processes engaged in the visual recognition of individual pictures. Being able to decipher what somebody experiences visually based on their brain activity opens up many possibilities: it reveals the subjective content of our mind, and it provides an approach to access, investigate, and share the content of our perception, memory, and imagination. Promising outcomes have been accomplished in visual recognition by transferring human visual abilities to machines using computer vision, AI, and neuroscience. Later, with further alterations, the method might not only create a neural-based re-creation of what an individual is seeing, but also of what they remember and imagine. The technique used in this paper can help people in a coma communicate, help stroke patients and those who have suffered brain injuries, and also aid those with learning


disorders and improve people's communication skills. With further improvements, the designed autonomous system using mind-reading technology could be implemented for criminal investigation. Image reconstruction may be admissible in court because, unlike a polygraph, which relies on emotional responses, our technique uses EEG to observe how the brain reacts to pictures related to a crime scene; it could be used to speed up investigations. Recapitulating, later with further modifications, the approach could not only produce a neural-based reconstruction of what a person perceives, but also of what they imagine and remember.

References

1. Shen, G., Horikawa, T., Majima, K., Kamitani, Y.: Deep image reconstruction from human brain activity. PLoS Comput. Biol. 15(1), e1006633 (2019)
2. Spampinato, C., Palazzo, S., Kavasidis, I., Giordano, D., Souly, N., Shah, M.: Deep learning human mind for automated visual classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6809–6817 (2017)
3. Williams, J.M.: Deep learning and transfer learning in the classification of EEG signals (2017)
4. Er, M.B., Çiğ, H., Aydilek, İ.B.: A new approach to recognition of human emotions using brain signals and music stimuli. Appl. Acoust. 175, 107840 (2021)
5. Graser, A., Cecotti, H.: Convolutional neural networks for P300 detection with application to brain-computer interfaces. IEEE Trans. Pattern Anal. Mach. Intell. 33(3), 433–445 (2011)
6. Mirowski, P., Madhavan, D., LeCun, Y., Kuzniecky, R.: Classification of patterns of EEG synchronization for seizure prediction. Clin. Neurophysiol. 120(11), 1927–1940 (2009)
7. Shah, S.M.A., Ge, H., Haider, S.A., Irshad, M., Noman, S.M., Meo, J.A., Younas, T.: A quantum spatial graph convolutional network for text classification. Comput. Syst. Sci. Eng. 36(2), 369–382 (2021)
8. Nanda, P.P., Rout, A., Sahoo, R.K., Sethi, S.: Work-in-progress: analysis of meditation and attention level of human brain. In: 2017 International Conference on Information Technology (ICIT), pp. 46–49. IEEE, New York, Dec 2017
9. Kavasidis, I., Palazzo, S., Spampinato, C., Giordano, D., Shah, M.: Brain2Image: converting brain signals into images. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 1809–1817, Oct 2017
10. Saleem, S., Saeed, A., Usman, S., Ferzund, J., Arshad, J., Mirza, J., Manzoor, T.: Granger causal analysis of electrohysterographic and tocographic recordings for classification of term vs. preterm births. Biocybern. Biomed. Eng. 40(1), 454–467 (2020)
11. Wang, C., Xiong, S., Hu, X., Yao, L., Zhang, J.: Combining features from ERP components in single-trial EEG for discriminating four-category visual objects. J. Neural Eng. 9(5), 056013 (2012)
12. Zhang, X., Yao, L., Wang, X., Monaghan, J.J., Mcalpine, D., Zhang, Y.: A survey on deep learning-based non-invasive brain signals: recent advances and new frontiers. J. Neural Eng. (2020)
13. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)
14. Lotte, F., Bougrain, L., Cichocki, A., Clerc, M., Congedo, M., Rakotomamonjy, A., Yger, F.: A review of classification algorithms for EEG-based brain–computer interfaces: a 10-year update. J. Neural Eng. 15(3), 031005 (2018)
15. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? arXiv preprint arXiv:1411.1792 (2014)


16. Sadiq, M.T., Suily, S., Ur Rehman, A.: Evaluation of power spectral and machine learning techniques for the development of subject-specific BCI. In: Artificial Intelligence Based Brain Computer Interface (BCI) (Chap. 4). Elsevier (2021) (Accepted)
17. Sohail, M.N., Jiadong, R., Uba, M.M., Irshad, M., Iqbal, W., Arshad, J.: A hybrid forecast cost benefit classification of diabetes mellitus prevalence based on epidemiological study on real-life patient's data. Sci. Rep. 9, 10103 (2019). https://doi.org/10.1038/s41598-019-46631-9
18. Akbari, H., Sadiq, M.T., Ur Rehman, A., Ghazvin, M., Naqvi, R.A., Payan, M., Bagheri, H., Bagheri, H.: Depression recognition based on the reconstruction of phase space of EEG signals and geometrical features. Appl. Acoust. 179, 1–16 (2021). https://doi.org/10.1016/j.apacoust.2021.108078
19. Sohail, N., Jiadong, R., Uba, M., Tahir, S., Arshad, J., John, A.V.: An accurate clinical implication assessment for diabetes mellitus prevalence based on a study from Nigeria. Processes 7, 289 (2019). https://doi.org/10.3390/pr7050289
20. Naqvi, R.A., Arsalan, M., Rehman, A., Ur Rehman, A., Loh, W.-K., Paul, A.: Deep learning-based drivers emotion classification system in time series data for remote applications. Remote Sens. 12(3), 587 (2020). https://doi.org/10.3390/rs12030587
21. Sadiq, M.T., Yu, X., Yuan, Z., Fan, Z., Ur Rehman, A., Ullah, I., Li, G., Xiao, G.: Motor imagery EEG signals decoding by multivariate empirical wavelet transform based framework for robust brain-computer interfaces. IEEE Access 7, 171431–171451 (2019). https://doi.org/10.1109/ACCESS.2019.2956018
22. Sadiq, M.T., Yu, X., Yuan, Z., Fan, Z., Ur Rehman, A., Li, G., Xiao, G.: Motor imagery EEG signals classification based on mode amplitude and frequency components using empirical wavelet transform. IEEE Access 7, 127678–127692 (2019). https://doi.org/10.1109/ACCESS.2019.2939623

Chapter 56

A Semantic-Based Input Model for Patient Symptoms Elicitation for Breast Cancer Expert System

Chai Dakun, Naankang Garba, Salu George Thandekkattu, and Narasimha Rao Vajjhala

Abstract Information gathering from patients is a challenging task for physicians. It is also complex for computers to process input in its raw form. Most research that attempts to formalize symptoms elicitation is keyword-based, without considering the semantics and structure of the sentences. The drawback of the keyword-based approach is that it mainly results in redundant recall and poor precision of the symptoms generated. This research aims at developing a semantic-based input model for a medical expert system. User input collected on the GUI is processed using Python, Natural Language (NL) pre-processing, and Normalized Sentence Structure (NSS) to extract relevant terms. The NSS instances generated are mapped to breast cancer terminologies present in a breast cancer lexicon to develop acceptable symptoms in the domain of breast cancer. Results obtained in this paper show that the proposed model had an 18.11% improvement in the precision of the symptoms generated and a 6.3% decrease in the symptoms count. Furthermore, to measure the impact of the symptoms generated by the proposed model, the symptoms were fed into the modified Select and Test (ST) algorithm for the diagnosis of breast cancer, which resulted in a 31.8% and 1.97% enhancement in the precision and accuracy of breast cancer diagnosis, respectively. Finally, the Wisconsin Breast Cancer Database (WBCD) was used, resulting in 27.89% accuracy in breast cancer diagnosed instances, showing an improvement over the current work, which had 26.60%.

C. Dakun
Department of Computer Science, Ahmadu Bello University, Zaria, Nigeria

N. Garba · S. G. Thandekkattu
American University of Nigeria, Yola, Nigeria
e-mail: [email protected]

S. G. Thandekkattu
e-mail: [email protected]

N. R. Vajjhala (B)
University of New York Tirana, Tirana, Albania
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6_56


56.1 Introduction

Cancer is one of the leading causes of death globally and was responsible for 8.8 million deaths as of 2015, indicating that 1 in 6 deaths is due to cancer [1]. A chronic disease is defined as a disease that can only be managed but not completely cured [2]; identified chronic diseases include breast cancer, allergy, asthma, diabetes, epilepsy, glaucoma, heart disease, and obesity. Careful attention to chronic diseases is important to achieving good health, quality of life, and cost-effective care [2–5]. As a result of the damage cancer causes, it has become necessary for AI programmers to speed up the development of trustworthy health applications that can be used to monitor and further diagnose cancer. Over the years, research has been ongoing on bridging the gap in machines' understanding of human language; the fields of the Semantic Web, Artificial Intelligence (AI), and Natural Language Processing (NLP) have been significant in that direction [1, 6, 7]. There is a need to develop expert system applications that can render the human view understandable by machines [4, 8, 9]. It is observed that when impaired and incomplete input data are fed into an expert system, the result is wrong reasoning and result presentation. To close this gap, Oyelade et al. [10] proposed a model that leveraged NLP modules, since NLP can assign meaning to raw text. However, their work elicits symptoms of breast cancer from natural language sentences by extracting tokens without considering the semantics and structure of the sentences.

Clinicians are usually supported by AI systems on problems that rely on data and knowledge manipulation. Such systems are referred to as Clinical Decision Support Systems (CDSS) and are deployed as eHealth applications. Diagnosis Decision Support Systems (DDSS) fall under a category of CDSS that requires a patient to enter information about a particular disease; the system then intelligently carries out a diagnosis. According to Oyelade et al. [10], DDSS remain promising in health informatics because of the knowledge, skills, and tools that enable information to be collected, managed, used, and shared to support health care delivery and promote health.

This paper proposes an input model for a breast cancer expert system that considers the semantics and structure of the sentences a breast cancer patient usually provides. The input data collected by the input model are unstructured. The symptoms generated from the proposed input model are fed into a DDSS, in this case an enhanced ST algorithm for diagnosing breast cancer [11]. This paper proposes reducing the symptoms recall, thereby improving the precision and accuracy of breast cancer diagnosis. It is expected that, at the filtering stage of the raw text, not all sentences are used in the later stages, because of their semantics.

56.2 Review of Literature

A model in Extensible Markup Language (XML), which provides access to the clinical information of patient records for many clinical applications, was implemented using NLP that matches textual reports to a form in line with the model [6]. Dhole and


Uke [12] proposed obtaining names of diseases with the help of classifiers and leveraging NLP to get information related to a particular disease. The Natural Language Web Interface for Database (NLWIDB) was developed in [3]; users of the NLWIDB can query the database in natural language sentences through a friendly interface on the Internet. An NLP engine known as LifeCode™, an AI system, extracts information from the raw text of a clinical report; the output of this engine is the patient's medical condition, the progress of the patient's treatment, and disposition [13]. An improved ST algorithm was proposed for clinical diagnosis and monitoring [5]; an inference engine and its approximate reasoning representation were designed to diagnose and monitor breast cancer. A qualitative semi-structured interview was held with fifteen breast cancer patients who had been confronted with a radiotherapy decision one month to eight years earlier, together with fifteen interviews with a breast cancer specialist, to explore the experiences, decisional attributes, and needs of patients and health care professionals as input for the development of a patient decision aid [2, 11].

A patient input parsing model that takes in unstructured user text on breast cancer was developed by Oyelade et al. [10]. The model maps the input into a list of symptoms that are acceptable as breast cancer symptoms. They adopted Semantic Web and NLP approaches to develop a natural language parsing algorithm to achieve their objectives. They also developed a sizable lexicon with the help of an oncologist for comparison with the generated medical terms. However, their approach was limited to token and keyword extraction, failing to capture the semantics and structure of sentences entered in an open-ended pattern. This lack of sentence semantics is likely to generate redundant symptoms, resulting in false positive outcomes. The model's output negatively affects precision and falls short of the accuracy expected of an expert system's diagnosis. Ontology Learning (OL), a concept of building ontologies from natural language text, was used by Dasgupta et al. [14], who developed a Description Logic Ontology Learning (DLOL) tool that generates a knowledge base from a collection of factual non-negative IS-A sentences in English.

56.3 Materials and Method

The proposed model uses a breast cancer lexicon adopted from Oyelade et al. [10]. The ontology was modeled using Protégé with the assistance of an oncologist. The input that describes natural language symptoms is passed through the Graphical User Interface (GUI). The input entered in the GUI is passed to the Natural Language (NL) pre-processor, which contains two submodules: text pre-processing and part-of-speech (POS) tagging. The text pre-processor is the early stage where the sentences are broken down into tokens and simplified. In the lexical normalizer, variations of the predicate of each sentence are identified; likewise, for the subject and object quantifiers, numerals in words are converted into their number format to ensure an efficient representation. This paper uses the Normalized Sentence Structure method, which takes user input from the NL pre-processor [14, 15]. The sentences


are converted into a canonical representational structure called NSS instances. Therefore, the semantics of these sentences are sorted out at this stage of the model. The NSS instances generated are parsed into an inference engine for inference making. The concepts in the breast cancer lexicon are compared with those developed after parsing the NSS instances into the inference engine to develop acceptable symptoms. The system includes a text-based GUI where the patient can enter input in English-like statements; these statements are trivial and factual sentences. The input is passed to the natural language pre-processor module. This component of the model was implemented using the Java programming language. At the NL pre-processing stage, the model uses Python libraries for implementation. The NSS filters the semantics of the sentences at this stage. Our proposed algorithm is presented below:

Algorithm: Matching User Input to Breast Cancer Lexicon and Generating Symptoms
Input: Spre-processed, QuantifierList, Predicate_List, Negative_AB_List
Output: Diagnosis

1.  Start
2.  Declaration
3.    arrayList of arrayList ptoken_synonyms
4.    arrayList of arrayList ptoken_homonyms
5.    arrayList SentenceSet
6.  end declaration
7.  arrayList SentenceSet ← Spre-processed
8.  for sentence in SentenceSet do
9.    Needed_sentence ← Extract_Sentence (sentence, Negative_AB_List)
10. for sentence in Needed_sentences do
11.   Predicate_lexeme ← Extract_predicate (sentence, Predicate_List)
12.   Insert_in_NSSCell ([Predicate], Predicate_lexeme);
13.   // Subject_Phrase Extraction
14.   Subject_Phrase ← Extract_Before (Sentence, Predicate_lexeme);
15.   for token ϵ Subject_Phrase do
16.     if token ϵ QuantifierList then
17.       Insert_in_NSSCell ([Q], token);
18.     end
19.     else if Extract_POS_tag(token) ϵ [NN, NNP, JJ, RB, CD] then
20.       while Next_token (token) ≠ Predicate_lexeme do
21.         Token ← Extract_next_token (token);
22.         if Extract_POS_tag (token) ϵ [NN, NNP, JJ, RB] then
23.           Insert_in_NSSCell ([Ms], token);
24.         end
25.       end
26.     end
27.     Insert_in_NSS ([S], token);
28.   end
29.   // Object_Phrase Extraction
30.   Object_Phrase ← Extract_After (Sentence, Predicate_lexeme);
31.   for token ϵ Object_Phrase do
32.     if token ϵ QuantifierList then
33.       Insert_in_NSSCell ([Q], token);
34.     end
35.     else if Extract_POS_tag(token) ϵ [NN, NNP, JJ, RB, CD] then
36.       while Next_token (token) ≠ null do
37.         Token ← Extract_next_token (token);
38.         if Extract_POS_tag (token) ϵ [NN, NNP, JJ, RB] then
39.           Insert_in_NSSCell ([Mo], token);
40.         end
41.       end
42.     end
43.     Insert_in_NSS ([O], token);
44.   end
45. end
46. end
47. Term_list ← Extract_terms (Insert_in_NSSCell)
48. Load the wordNetDB
49. for token ϵ Term_list do
50.   arrayList psynonym ← get all synonyms of token from wordNetDB
51.   arrayList phomonym ← get all homonyms of token from wordNetDB
52.   Store psynonym in ptoken_synonyms
53.   Store phomonym in ptoken_homonyms
54.   Match all elements of ptoken_synonyms against thesaurus
55.   Match all elements of ptoken_homonyms against thesaurus
56.   Send all matched items of ptoken_synonyms & ptoken_homonyms to semantic input token storage
57. end
58. Symptoms ← Match_Items (Semantic_Input_storage, Breast_Cancer_Lexicon);
59. Send Symptoms to DDSS for Diagnosis
60. Stop

The algorithm represents the steps taken to achieve the objective of this paper. Lines 2–6 show the variable declarations used in the body of the algorithm. After the input data have undergone the NLP pre-processing steps, the needed sentences are obtained and their predicates are extracted, as shown in lines 8–12. Since sentences are represented as subject-predicate-object triples, the predicate is identified for each sentence. The corresponding subject and object phrases are extracted in lines 14–27 and 29–43, respectively. Each token in the subject and object phrases is inserted into an NSS cell in lines 17, 27, 33, and 43 of the algorithm. All the tokens in the NSS cells are then extracted into storage in line 47. Synonyms and homonyms of the NSS instances are obtained and mapped to the breast cancer lexicon through the inference engine in lines 49–58. The symptoms generated are then fed into the modified ST algorithm to diagnose breast cancer in line 59. Pre-processing and the subject, predicate, and object phrase extractions were implemented using Python.
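To make these steps concrete, the toy sketch below approximates the algorithm using NLTK and WordNet: it detects the predicate, keeps the noun/adjective/adverb tokens of the remaining phrases, expands them with WordNet synonyms, and matches the result against a lexicon. The predicate list and the miniature lexicon are hypothetical stand-ins for the resources used in this work (the NLTK models 'punkt', 'averaged_perceptron_tagger', and 'wordnet' must be downloaded once beforehand).

```python
# Toy sketch of the symptom-generation pipeline using NLTK and WordNet.
# PREDICATES and LEXICON are tiny hypothetical stand-ins.
import nltk
from nltk.corpus import wordnet as wn

PREDICATES = {"have", "has", "feel", "feels", "noticed"}
LEXICON = {"lump", "swelling", "pain", "discharge"}

def extract_terms(sentence):
    """POS-tag the sentence and keep NN/JJ/RB tokens around the predicate."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))
    if not any(tok in PREDICATES for tok, _ in tagged):
        return set()                       # no predicate: sentence not needed
    return {tok for tok, pos in tagged
            if tok not in PREDICATES and pos[:2] in {"NN", "JJ", "RB"}}

def expand(term):
    """The term plus all of its WordNet synonyms (cf. lines 49-53)."""
    return {l.name().lower() for s in wn.synsets(term) for l in s.lemmas()} | {term}

sentence = "I have a painful lump in my left breast"
candidates = set().union(*(expand(t) for t in extract_terms(sentence)))
symptoms = candidates & LEXICON            # cf. line 58: match the lexicon
print(symptoms)                            # e.g. {'lump', 'swelling'}
```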


56.4 Result and Discussion

This section presents the experimental results for the breast cancer live patient data descriptions [10] and the Wisconsin Breast Cancer Database (WBCD). The data served as input to both the existing and proposed models. The impact of the symptoms generated on the existing and proposed models was measured in terms of precision and accuracy of diagnosis when the symptoms were fed into the enhanced ST algorithm [11]. This work sources its data from 20 breast cancer patient data descriptions obtained from Oyelade et al. [10] and from the standard WBCD dataset, which has 699 instances and 11 attributes, namely: id number, clump thickness (1–10), uniformity of cell size (1–10), uniformity of cell shape (1–10), marginal adhesion (1–10), single epithelial cell size (1–10), bare nuclei (1–10), bland chromatin (1–10), normal nucleoli (1–10), mitosis (1–10), and class (2 for benign and 4 for malignant). There are 458 non-breast cancer (benign) diagnosed instances and 241 breast cancer (malignant) instances, representing 65.5% and 34.5%, respectively. The data are entered into the model through the GUI as a set of sentences in an open-ended pattern. These sentences are responses to questions on the patient's age, parity, symptoms presented, family history, histology, biopsy, breast examination, investigation, staging, and treatment, typical of the questions an oncologist asks.
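As a concrete illustration, the sketch below renders one numeric WBCD record as simple English-like sentences of the kind the GUI accepts (this conversion is used later in Sect. 56.4.3); the sentence template itself is a hypothetical illustration, not the exact wording used in this work.

```python
# A sketch of converting a numeric WBCD record into English-like sentences.
# The nine attribute names follow the dataset description above.
ATTRIBUTES = ["clump thickness", "uniformity of cell size",
              "uniformity of cell shape", "marginal adhesion",
              "single epithelial cell size", "bare nuclei",
              "bland chromatin", "normal nucleoli", "mitosis"]

def wbcd_to_sentences(record):
    """record: the nine 1-10 attribute values of one WBCD instance."""
    return [f"The {name} has a value of {value}."
            for name, value in zip(ATTRIBUTES, record)]

for sentence in wbcd_to_sentences([5, 1, 1, 1, 2, 1, 3, 1, 1]):
    print(sentence)
```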

56.4.1 Evaluation Metrics

To evaluate the performance of our model and that of [13], the symptoms count (the number of symptoms generated before and after inference making on both the existing and proposed models) was recorded, and the precision and accuracy of breast cancer diagnosis were computed. Precision is the percentage of generated results that are relevant. To find the value of precision, the following formula is used:

Precision = TP / (TP + FP)

where:
• True Positive (TP): of the items generated as symptoms, how many of them are in the breast cancer lexicon.
• False Positive (FP): of the items generated as symptoms, how many of them are not in the breast cancer lexicon.
• Accuracy: accuracy measures how close a measured value is to the true measurement. In measuring accuracy, the standard classification formula was used:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
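Both metrics reduce to a few lines of code. A minimal implementation is sketched below; the accuracy definition follows the standard formula above, with true negatives (TN) and false negatives (FN) counted analogously to TP and FP.

```python
# Minimal sketch of the two evaluation metrics used in this section.
def precision(tp: int, fp: int) -> float:
    """Fraction of generated symptoms found in the breast cancer lexicon."""
    return tp / (tp + fp)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all decisions (symptom or not) that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# e.g. 62 of 100 generated items appear in the lexicon:
print(precision(62, 38))          # 0.62
print(accuracy(62, 20, 38, 10))   # ~0.63
```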


56.4.2 Comparative Analysis of Symptoms Count

Table 56.1 shows a reduction in the symptoms generated before and after inference making across the 20 breast cancer patient input data descriptions collected on the proposed system. There was a 6.3% decrease in symptoms count on the proposed model compared to the existing model. The reason for this is the irrelevance of the tokens in sentences that were tagged negative. It is therefore observed that the symptoms generated in [10] contained a redundant set of symptoms that are not needed during symptom generation.

Table 56.1 Symptoms generated before and after inference on existing and proposed models

Patient input   Existing model                        Proposed model
                Before inference   After inference    Before inference   After inference
P1              20                 76                 18                 72
P2              15                 75                 12                 70
P3              17                 76                 13                 72
P4              15                 75                 12                 69
P5              15                 75                 12                 70
P6              18                 77                 13                 73
P7              16                 78                 16                 71
P8              18                 76                 14                 70
P9              19                 78                 17                 75
P10             16                 77                 15                 72
P11             19                 79                 17                 76
P12             14                 74                 11                 69
P13             19                 76                 15                 74
P14             16                 75                 16                 71
P15             21                 80                 16                 72
P16             16                 75                 13                 70
P17             17                 76                 11                 70
P18             18                 78                 17                 75
P19             15                 74                 12                 69
P20             16                 77                 12                 72
Average                            76.4                                  71.6


56.4.3 Comparative Analysis of Precision of Symptoms Generated on the Existing and Proposed Models

Table 56.2 depicts the enhancement of the proposed model over the existing model by Oyelade et al. [11] with respect to precision. The average precision indicates that the proposed model had an 18.11% enhancement over the existing model. Table 56.3 shows an average accuracy of breast cancer diagnosis of 77.1% using the proposed system against 75% accuracy of the existing system when tested on the modified ST algorithm. The result shows a 1.97% increase in the accuracy of the diagnosis of breast cancer, which indicates the impact of the symptoms generated by our proposed input model on the modified ST algorithm. To further validate the effectiveness of our proposed input model when the symptoms are fed into the modified ST algorithm, the benchmark WBCD dataset was converted from numerical format into natural language English-like sentences that can be easily read by the input GUI; this concept was adopted from Oyelade et al. [11].

Table 56.2 Precision of symptoms generated on existing and proposed models

Patient   Existing model   Proposed model   Enhancement (%)
P1        0.52             0.61             17.31
P2        0.54             0.63             16.66
P3        0.51             0.62             21.57
P4        0.54             0.59             9.26
P5        0.54             0.60             11.11
P6        0.51             0.61             19.61
P7        0.56             0.65             16.07
P8        0.53             0.62             16.98
P9        0.51             0.69             35.29
P10       0.55             0.65             18.18
P11       0.50             0.58             16
P12       0.59             0.67             13.55
P13       0.54             0.63             16.66
P14       0.55             0.66             20
P15       0.50             0.60             20
P16       0.54             0.65             20.37
P17       0.56             0.64             14.28
P18       0.51             0.62             21.56
P19       0.60             0.70             16.66
P20       0.52             0.63             21.15
Average   0.536            0.63             18.11


Table 56.3 Precision and accuracy of diagnosis using modified ST algorithm

Patient   Existing model            Proposed model            Improvement in   Improvement in
input     Precision   Accuracy (%)  Precision   Accuracy (%)  precision (%)    accuracy (%)
P1        0.50        75            0.62        76.1          24               1.466
P2        0.50        75            0.63        76.7          26               2.666
P3        0.50        75            0.68        76.5          36               2
P4        0.50        75            0.65        76.8          30               2.4
P5        0.50        75            0.65        76.5          30               2
P6        0.50        75            0.61        77.0          22               2.666
P7        0.50        75            0.70        76.9          40               2.533
P8        0.50        75            0.60        76.7          20               2.266
P9        0.50        75            0.75        76.3          50               1.733
P10       0.50        75            0.72        76.0          44               1.33
P11       0.50        75            0.50        75.5          0                0.66
P12       0.50        75            0.64        75.9          28               1.2
P13       0.50        75            0.68        76.8          36               2.4
P14       0.50        75            0.75        77.0          50               2.66
P15       0.50        75            0.60        76.4          20               1.866
P16       0.50        75            0.62        76.7          24               2.266
P17       0.50        75            0.65        76.9          30               2.53
P18       0.50        75            0.70        75.0          40               0
P19       0.50        75            0.77        77.1          55               2.8
P20       0.50        75            0.66        76.5          32               2
Avg       0.50        75            0.659       77.1          31.8             1.97

Table 56.4 Comparative analysis of breast cancer diagnosis on modified ST algorithm using WBCD

Instances   Existing model          Proposed model
            Benign   Malignant      Benign   Malignant
100         67       33             60       40
200         139      61             147      53
300         191      109            182      118
400         266      134            257      143
500         343      157            335      165
600         424      176            414      186
699         513      186            504      195


Table 56.4 shows the results of positive breast cancer diagnosis (malignant) and non-breast cancer diagnosis (benign) when the dataset was tested on the medical expert system.

56.4.4 Comparative Analysis of Precision and Accuracy of Diagnosis Using Modified ST Algorithm

Furthermore, the experimental results obtained when the symptoms generated from each patient's data were fed into the modified ST algorithm for breast cancer diagnosis show a 31.8% increase in the precision of diagnosing breast cancer, which indicates a significant improvement over the existing system, and a 1.97% increase in the accuracy of the diagnosis of breast cancer. The proposed model had 195 out of 699 instances indicating breast cancer, representing 27.89%, against 186 out of 699 instances for the existing model, representing 26.60%.

56.5 Conclusion

In this paper, a semantic-based input model for a breast cancer expert system was developed. The input-generating model collects patient input on how they feel. These inputs are processed to outline the breast cancer symptoms present in the patient, and the symptoms generated are further used as input to an existing diagnostic system. This paper compares the symptoms count and the precision of the symptoms generated by the proposed input model to determine the improvement in the proposed model. The precision and accuracy of breast cancer diagnosis are also computed on an existing breast cancer expert system. The proposed model and the existing model were tested using the same dataset, and the results show an improvement of the proposed input model over the existing model. Future directions of this paper include incorporating speech recognition and multi-lingual components into the input model, and generating symptoms of ailments other than breast cancer together with the associated diagnostic algorithms.

References

1. Genitsaridi, I., Marias, K., Tsiknakis, M.: An ontological approach towards psychological profiling of breast cancer patients in pervasive computing environments. In: Proceedings of the 8th ACM International Conference on Pervasive Technologies Related to Assistive Environments, Article 28. Association for Computing Machinery, Corfu, Greece (2015)
2. Isac, C., Viterbo, J., Conci, A.: A survey on ontology-based systems to support the prospection, diagnosis and treatment of breast cancer. In: Proceedings of the XII Brazilian Symposium on Information Systems: Information Systems in the Cloud Computing Era, vol. 1, pp. 271–277. Brazilian Computer Society, Florianopolis, Santa Catarina, Brazil (2016)
3. Hamine, S., et al.: Impact of mHealth chronic disease management on treatment adherence and patient outcomes: a systematic review. J. Med. Internet Res. 17, e52 (2015). http://doi.org/10.2196/jmir.3951
4. Meystre, S.M., Haug, P.J.: Comparing natural language processing tools to extract medical problems from narrative text. In: AMIA Annual Symposium Proceedings, pp. 525–529 (2005)
5. Oyelade, O.N., Obiniyi, A.A., Junaidu, S.B.: ONCODIAG select and test (ST) algorithm: an approximate clinical reasoning model for diagnosing and monitoring breast cancer. Curr. Res. Bioinf. 9(1) (2020)
6. Friedman, C., et al.: Representing information in patient reports using natural language processing and the extensible markup language. J. Am. Med. Inform. Assoc. 6(1), 76–87 (1999)
7. Altıparmak, H., Nurçin, F.V.: Segmentation of microscopic breast cancer images for cancer detection. In: Proceedings of the 2019 8th International Conference on Software and Computer Applications, pp. 268–271. Association for Computing Machinery, Penang, Malaysia (2019)
8. Khoulqi, I., Idrissi, N.: Breast cancer image segmentation and classification. In: Proceedings of the 4th International Conference on Smart City Applications, Article 59. Association for Computing Machinery, Casablanca, Morocco (2019)
9. Idri, A., Chlioui, I., Ouassif, B.E.: A systematic map of data analytics in breast cancer. In: Proceedings of the Australasian Computer Science Week Multiconference, Article 26. Association for Computing Machinery, Brisbane, Queensland, Australia (2018)
10. Oyelade, O.N., et al.: Patient symptoms elicitation process for breast cancer medical expert systems: a semantic web and natural language parsing approach. Future Comput. Inf. J. 3(1), 72–81 (2018)
11. Oyelade, O.N., et al.: A modified select and test (ST) algorithm for medical diagnosis in an ad-hoc network environment. In: 2017 IEEE 3rd International Conference on Electro-Technology for National Development (NIGERCON) (2017)
12. Dhole, G., Uke, N.: NLP-based retrieval of medical information for diagnosis of human diseases. Int. J. Res. Eng. Technol. 3(10), 243–248 (2014)
13. Heinze, D., et al.: LifeCode™: a natural language processing system for medical coding and data mining. In: AAAI/IAAI (2000)
14. Dasgupta, S., et al.: Formal ontology learning from English IS-A sentences (2018)
15. Yang, Y., et al.: Corpus-guided sentence generation of natural images. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 444–454. Association for Computational Linguistics, Edinburgh, United Kingdom (2011)

Appendix

This book features a collection of high-quality research papers presented at the 2nd International Conference on Intelligent and Cloud Computing (ICICC 2021), held at Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, India, on October 22–23, 2021. Including contributions on system and network design that can support existing and future applications and services, it covers topics such as cloud computing system and network design, optimization for cloud computing, networking, and applications, green cloud system design, cloud storage design and networking, storage security, cloud system models, big data storage, intra-cloud computing, mobile cloud system design, real-time resource reporting and monitoring for cloud management, machine learning, data mining for cloud computing, data-driven methodology and architecture, and networking for machine learning systems.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 D. Mishra et al. (eds.), Intelligent and Cloud Computing, Smart Innovation, Systems and Technologies 286, https://doi.org/10.1007/978-981-16-9873-6


Author Index

A Abhilash Pati, 39 Abhinandan Roul, 3 Abhinav Kumar, 463 Abinash Nayak, L. S., 419 Adyasha Das, 471 Akshara Preethy Byju, 77 Alekhya Viswanath, 77 Ambati Divya, 353 Amit Yadav, 501 Amiya Kumar Rath, 317 Amna Khokhar, 607 Apoorva Tyagi, 17 Archana Sarangi, 187, 243 Arnab Mitra, 111 Arup Kumar Mohanty, 363 Aryan Kenchappagol, 545 Asheesh Balotra, 97 Asif Khan, 501 Ateeq Ur Rehman, 607 Avinash Maurya, 227

B BAbhi Teja, 77 Barnali Sahu, 533 Batool Hassan, 491 Bharat Jyoti Ranjan Sahu, 419, 491 Bichitrananda Patra, 255, 517 Binita Kumari, 401 Binod Kumar Pattanayak, 39, 145, 393 Biswaranjan Jena, 55, 363 Biswa Ranjan Senapati, 133 Bravish Ghosh, 3

C Chai Dakun, 619 Chandan Kumar Mishra, 581 Chitaranjan Tripathy, 29 Chukhu Chunka, 227, 371 D Debahuti Mishra, 65, 121, 187, 243, 269, 327, 341, 363, 419 Debasish Swapnesh Kumar Nayak, 215 Deepak Kumar Patel, 29 Dibya Ranjan Das Adhikary, 295 Digita Shrestha, 501 Dong Ryeol Shin, 177 Durga Prasad Mohapatra, 481 G Gayatri Nayak, 255 Gilbert Rozario, S., 165 Gopinath Pranav Bhargav, 77 Gouri Prasad Sahu, 145 Gubbala Srilakshmi, 439 Gunseerat Kaur, 17 H Hyungi Jeong, 177 I Inderpreet Kaur, 121, 341 J Jagdeep Kaur, 121


634 Jayashankar Das, 215 Jehangir Arshad, 607 K Kaberi Das, 157, 419 Kancharla Shridhar Reddy, 77 Khder Essa, 157 Kirandeep Kaur, 341 L Lambodar Jena, 517 Lara Ammoun, 517 M Madhuri Rao, 481 Madhusmita Sahu, 305 Madiha Jamil, 491 Mamata Nayak, 277 Mandakini Priyadarshani Behera, 243 Manoranjan Parhi, 3, 39 Manpreet Singh Manna, 121, 341 Minu, R. I., 65, 269, 327 Millee Panigrahi, 317 Miranji Katta, 439 Mitali Madhusmita Nayak, 199 Mitrabinda Ray, 89, 199, 255 Mohit Bajaj, 607 Mukesh Kumar Maheshwari, 491 N Naankang Garba, 619 Nagarajan, G., 65, 269, 327 Narasimha Rao Vajjhala, 619 Nawab Muhammad Faseeh Qureshi, 177 Nilima Das, 277 P Pabitra Mohan Khilar, 133 Parashjyoti Borah, 227, 371 Paulos Bekana, 187 Pawan Singh, 295 Prabira Kumar Sethy, 317 Prakash Chandra Sahoo, 393 Pranati Satapathy, 363 Preetipunya Rout, 481 Priyanka Das Sharma, 481 Q Qinchao, 501

Author Index R Rahul Chakraborty, 97 Rahul Patnaik, 581 Rajashree Dash, 207, 429, 455, 565 Rakesh Ranjan Swain, 133 Ramkumar Adireddi, 439 Ranjan Phukan, 371 Rasmiranjan Mohakud, 455 Rasmita Dash, 207, 305, 565 Rasmita Rautray, 207, 565

S Sachin Umrao, 157 Salu George Thandekkattu, 619 Samarjeet Borah, 65, 269, 327 Sahu, Bharat J. R., 157 Sampa Chau Pattnaik, 89, 199 Sandanalakshmi, R., 439 Sanika Agrawal, 97 Santi Kumari Behera, 317 Saqib Salim, 607 Sarada Prasanna Pati, 571 Saravanan, T. R., 65, 269 Sarbeswara Hota, 363 Sashikala Mishra, 97, 545 Saswati Mahapatra, 555 Saumendra Pattnaik, 145 Sayan Banerjee, 145 Sharmila Subudhi, 471 Shatarupa Dash, 157 Shruti, B., 353 Shubhendu Kumar Sarangi, 187 Sidharth Samal, 429 Simerpreet Singh, 121, 341 Smita Prava Mishra, 287 Smruti Ranjan Sahoo, 581 Srikanta Kumar Mohapatra, 55, 243 Srikanta Patnaik, 385 Srilekha Hota, 419 Sreenidhi, A., 353 Subhashini, N., 353 Subhashree Choudhury, 607 Subhra Mohanty, 555 Sukant Kumar Sahoo, 55 Sumant Kumar Mohapatra, 385 Suprakash Samantaray, 463 Suprava Ranjan Laha, 145 Susmita Bag, 317 Swadhin Kumar Barisal, 255 Swati Sucharita, 533 Syed Shayaan Ahmed, 491



T Talha Younas, 607 Tanishq Ige, 97 Tejashwa Kumar Tiwari, 17 Tesfaye Woldeyohannes, G., 571 Thangavel, M., 581 Tripti Swarnkar, 215, 401, 533, 555 Tulip Das, 287

Vijaya Eligar, 595 Vinod Kumar Kulamala, 481 Vishek Singh, 295

U Ujwala Patil, 595 Uma Mudenagudi, 595

Y Yahya Daood, 89 Yanglu, 501 Yashowardhan Shinde, 545

V Vanita, 481 Vasanthi, V., 165

W Wanying Guo, 177

Z Zanib Zulfiqar, 607