Intelligent Technologies: Concepts, Applications, and Future Directions, Volume 2 (ISBN 9789819914814, 9789819914821)

This book discusses automated computing systems powered largely by intelligent technologies such as artificial intelligence (AI).


English · 260 pages · 2023


Table of contents:
Preface
Contents
Editors and Contributors
Clinical Decision Support System for Diagnosis and Treatment of COPD Using Ensemble Methods
1 Introduction
1.1 COPD Symptoms
1.2 Clinical Decision Support System (CDSS)
2 Problem Statement
3 Literature Survey
4 Objectives of the Proposed Research
5 Methodology Followed
5.1 Architecture for Constructing CDSS for COPD
6 Outcome of the Proposed Research
6.1 Objective 1
6.2 Experimental Results of Objective 1
6.3 Objective 2
6.4 Experimental Results of Objective 2
6.5 Objective 3
6.6 Experimental Results of Objective 3
6.7 Objective 4
6.8 Experimental Results of Objective 4
6.9 Objective 5
7 Conclusion
References
Designing of Fault-Tolerant Models for Wireless Sensor Network-Assisted Smart City Applications
1 Introduction
1.1 Chapter Background
2 Overview
2.1 Fault in WSN
3 Chapter-wise Work
3.1 Energy Balanced Cluster Formation for Uniform Load Distribution
3.2 Partitioned-Based Energy-Efficient LEACH
3.3 Uniform Energy Clustering and Population-Based Clustering
3.4 Applications of Smart Cities: Case Study
4 Conclusion
References
Test Scenarios Generation and Optimization of Object-Oriented Models Using Meta-Heuristic Algorithms
1 Introduction
2 Literature Review
3 Generation of Test Scenarios Using Combined Object-Oriented Models
3.1 Process of Test Scenarios Generation
3.2 Results and Discussions
4 Test Scenarios Optimization Using Fractional-SMO in Object-Oriented Systems
4.1 Proposed Approach
4.2 Results and Relative Study
4.3 Comparative Assessment Using User Login System Case Study
5 SMPSO: Spider Monkey Particle Swarm Optimization for Optimal Test Case Generation in Object-Oriented System
5.1 Proposed Approach (SMPSO)
5.2 Proposed SMPSO
5.3 Results and Comparative Analysis
5.4 Competing Techniques
5.5 Analysis Based on the Case Study of the Online-Trading System
5.6 Comparative Discussion
6 Conclusion and Future Work
6.1 Generation of Test Scenarios Using Combined Object-Oriented Models
6.2 Test Scenarios Optimization Using Fractional-SMO in Object-Oriented Systems
6.3 Spider Monkey Particle Swarm Optimization for Optimal Test Case Generation in OO System
6.4 Future Scope
References
Logical Interpretation of Omissive Implicature
1 Introduction
1.1 Problem Statement and Hypothesis
2 Theoretical Basis
2.1 Implicature
2.2 Answer Set Programming
3 Definitions and Methodology
3.1 Definitions
3.2 Methodology
4 Experimental Environments
4.1 The Testimonials of Logical-Linguistic Puzzles
4.2 Dialogical Interactions
5 Results
6 Conclusions
7 Derived Publications
8 Code for Criminal Puzzle (Clingo 4.5.4)
9 Prototype for Program Update in Logic (Python 3.7)
References
Loss Allocation Techniques in Active Power Distribution Systems
1 Introduction
2 Loss Allocation Analysis with Method-1 With/Without DGs
2.1 Methodology
3 Analysis of Loss Allocation with Respect to Load/DG Power Factor Variation
4 Analysis of Loss Allocation with Different Load Modeling
5 Analysis of Loss Allocation with Network Reconfiguration
6 Conclusion and Future Scope
References
Detection of Brain Abnormalities from Spontaneous Electroencephalography Using Spiking Neural Network
1 Introduction
2 Epileptic Seizure Detection Using Traditional ML and Convolution Neural Network Methods
2.1 Dataset
2.2 Approaches
2.3 Implementation and Result of Traditional and Deep Machine Learning Approaches to Classify EEG Signals
3 Schizophrenia Detection from EEG Signals Using Probability Spiking Neural Network
3.1 Data Set
3.2 Approach
3.3 Result of Implementation of Probability Spiking Neural Network on EEG Signals of Schizophrenia Patients
4 Depression Psychosis Detection from EEG Signals Using Fuzzy-Based NeuCube Spiking Neural Network
4.1 Dataset
4.2 Approach
4.3 Result from Analysis and Discussion on the Implementation of NeuCube Spiking Neural Network on Depression Dataset
5 Comparative Analysis of Three Experiments
6 Summary and Future Scope
References
QOS Enhanced Energy Aware Task Scheduling Models in Cloud Computing
1 Introduction
2 Literature Review
3 Problem Statement
4 Objectives
5 Methodology
5.1 Energy Aware Multi-objective Genetic Algorithm for Task Scheduling in Cloud Computing
5.2 Optimized Resource Scheduling Using the Meta-Heuristic Algorithm in Cloud Computing
5.3 An Optimized Resource Allocation Model Using Ant Colony Auction-Based Method for Cloud Computing
5.4 Multi-objective Dynamic Resource Scheduling Model for User Tasks in the Cloud Computing
6 Results and Discussion
7 Conclusion
References
Power Quality Improvement Using Hybrid Filters Based on Artificial Intelligent Techniques
1 Introduction
2 Power Quality Improvement Using Hybrid Filters in PV Integrated Power System
2.1 Case Study: PQ Improvement in Three Phase System Using PV Integrated Conventional VSI Based Series HAPF Designed by Robust Extended Complex Kalman Filter (RECKF) and Perturb and Observe Fuzzy (PO-F)
2.2 System Configuration and Modelling
2.3 Control Strategies for PV Integrated HAPF
2.4 Results Analysis of the Case Study
3 Artificial Intelligent Methods for PQ Improvement in DC Microgrid Integrated Power System
3.1 Case Study: PQ Improvement in Three Phase System Using PV Integrated Conventional VSI Based Series HAPF Designed by Robust Extended Complex Kalman Filter (RECKF) and Perturb and Observe Fuzzy (PO-F)
3.2 System Configuration and Modelling
3.3 Control Strategies for PV Integrated HAPF
3.4 Results and Discussions
4 Artificial Intelligent Methods for PQ Improvement in Hybrid Microgrid System
4.1 System Configuration and Modelling
4.2 Control Strategies for HMG Integrated with HAPF
4.3 Results and Discussions
5 Conclusion
References
Predictive Analytics for Advance Healthcare Cardio Systems
1 Introduction
1.1 Artificial Intelligence Playing a Major Role in Health Sector
1.2 Role of Deep Learning in Preventive Care
2 Literature Review
2.1 Review of Classification Methods for Heart Disease
2.2 Review of Lifestyle Factors Affecting Heart Disease
3 Comparison of Various Classifiers for Identification of the Disease
3.1 Data Set Description
3.2 Results
3.3 Summary
4 Role of Feature Selection in Prediction of Heart Disease
4.1 Dataset Description
4.2 Results
4.3 Summary
5 Enhancing the Performance of Extreme Learning Machines Using FS with GA for Identification of Heart Disease of Fetus
5.1 Dataset
5.2 Genetic Algorithm (GA)
5.3 ELM as a Classifier
5.4 Results
5.5 Summary
6 COPD and Cardiovascular Diseases: Are They Interrelated?
6.1 Dataset
6.2 Results
6.3 Summary
7 Conclusion and Future Work
References
Performance Optimization Strategies for Big Data Applications in Distributed Framework
1 Performance Improvements in Big Data and SDN
1.1 Introduction
1.2 Open Issues
1.3 Counter Based Reducer Placement
1.4 Intelligent Data Compression Policy
1.5 Resource Aware Task Speculation
1.6 Results-Counter Based Reducer Placement
1.7 Results-Intelligent Compression
1.8 Results-Resource Aware Task Speculation
2 Topology Discovery in Hybrid SDN
2.1 Introduction
2.2 Open Issues
2.3 Indirect Link Discovery (ILD)
2.4 Broadcast Based Link Discovery (BBLD)
2.5 Indirect Controller Legacy Forwarding (ICLF)
2.6 Extended Indirect Controller Legacy Forwarding (E-ICLF)
2.7 Evaluation Platform
2.8 Result Analysis-ILD
2.9 Result-Analysis-BBLD
2.10 Result Analysis-ICLF
2.11 Result Analysis-E-ICLF
3 Traffic Classification and Energy Minimization in SDN
3.1 Introduction
3.2 Open Issues
3.3 Traffic Classification Using Intelligent SDNs
3.4 Clonal Selection Based Energy Minimization
3.5 Dataset Description and Experimental Analysis
3.6 Result Analysis Traffic Classification
3.7 Simulation Setup
3.8 Results Energy Minimization
4 Traffic Engineering in SDN
4.1 Introduction
4.2 Open Issues
4.3 Intelligent Node Placement (INP)
4.4 Simulation Platform
4.5 Result Analysis
5 Conclusions and Future Work
References

Studies in Computational Intelligence 1098

Satya Ranjan Dash · Himansu Das · Kuan-Ching Li · Esau Villatoro Tello, Editors

Intelligent Technologies: Concepts, Applications, and Future Directions, Volume 2

Studies in Computational Intelligence Volume 1098

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

Satya Ranjan Dash · Himansu Das · Kuan-Ching Li · Esau Villatoro Tello Editors

Intelligent Technologies: Concepts, Applications, and Future Directions, Volume 2

Editors

Satya Ranjan Dash, School of Computer Applications, KIIT Deemed to be University, Bhubaneswar, Odisha, India

Himansu Das, School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, India

Kuan-Ching Li, Department of Computer Science and Information Engineering (CSIE), Providence University, Taichung, Taiwan

Esau Villatoro Tello, Universidad Autónoma Metropolitana Unidad Cuajimalpa (UAM-C), Mexico City, Mexico

ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-981-99-1481-4 ISBN 978-981-99-1482-1 (eBook) https://doi.org/10.1007/978-981-99-1482-1 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Intelligent systems are technologically advanced machines that perceive and respond to the world around them. They take many forms, including natural language generation, speech recognition, machine learning platforms, virtual agents, decision management, AI-optimized hardware, and deep learning platforms. Intelligent automation (IA) combines robotic process automation (RPA) with advanced technologies such as artificial intelligence (AI), analytics, optical character recognition (OCR), and intelligent character recognition (ICR). Under one common classification, there are four types of AI or AI-based systems: reactive machines, limited-memory machines, theory of mind, and self-aware AI. Intelligence itself has been defined in many ways: the capacity for abstraction, logic, understanding, self-awareness, learning, emotional knowledge, reasoning, planning, creativity, critical thinking, and problem-solving. Intelligent document processing is technology that can automatically recognize and extract valuable data from diverse documents such as scanned forms, PDF files, and emails, and transform it into the desired format.

The book is organized into ten chapters. A brief description of each chapter follows.

In the chapter "Clinical Decision Support System for Diagnosis and Treatment of COPD Using Ensemble Methods", the authors propose an architecture for constructing a CDSS for COPD using ML algorithms. The CDSS for COPD is supported by basic information about the patient, followed by medical history and then a spirometry test. This helps in staging COPD, based on the spirometry results, as mild, moderate, or severe.

In the chapter "Designing of Fault-Tolerant Models for Wireless Sensor Network-Assisted Smart City Applications", the authors propose a framework for creating smart cities; smart water is one of the more sensitive and attention-demanding applications, and a water conservation plan is put forth in this procedure. India has plenty of water, but poor management of the country's water resources has resulted in severe water shortages in some areas.

In the chapter "Test Scenarios Generation and Optimization of Object-Oriented Models Using Meta-Heuristic Algorithms", the authors propose a hybrid approach called Spider Monkey Particle Swarm Optimization (SMPSO) to optimize the test cases produced from the developed models. The proposed algorithm efficiently produces the best test cases from UML by framing a control flow graph, attains a maximum coverage of 85%, and is capable of generating the maximum number of test scenarios.

In the chapter "Logical Interpretation of Omissive Implicature", the authors propose a procedure developed to make decisions, based on answers and a record (knowledge base) of the occurrences of omission, while maintaining the communication process. The procedure was oriented to psychotherapy interviews, where the Beck Inventory was extended to include silence, to assess the degree of depression of a person.

In the chapter "Loss Allocation Techniques in Active Power Distribution Systems", the authors develop an active power loss allocation technique (Method-1) for fair allocation of losses among the network participants, derived from the direct relationship between the voltage drop across a branch and its subsequent load currents in terms of node-injected complex powers. The LA results are found to be consistent with topology, with and without DG penetration. The proposed DG remuneration technique awards the entire benefit of network loss reduction (NLR) to the DG owners after analyzing their exact contribution towards NLR.

In the chapter "Detection of Brain Abnormalities from Spontaneous Electroencephalography Using Spiking Neural Network", the authors analyse a depression dataset with a fuzzy-based NeuCube spiking neural network approach. The spiking neural network gives better performance, with the data analysed deeply using a suitable mathematical model for extracting appropriate features.

In the chapter "QOS Enhanced Energy Aware Task Scheduling Models in Cloud Computing", the authors propose a cost-based model for resource selection and an optimization technique for resource scheduling that negotiates the cost between the parties, the client and the Cloud service provider.

In the chapter "Power Quality Improvement Using Hybrid Filters Based on Artificial Intelligent Techniques", the authors consider several artificially intelligent (AI) and adaptive strategies for generating the reference currents of the HAPF, enhancing the PV system's performance and increasing the stability of the DC link voltage in the HAPF.

In the chapter "Predictive Analytics for Advance Healthcare Cardio Systems", the authors aim to find the lifestyle factors that affect heart disease and to identify the most efficient classification techniques that can assist healthcare experts in predicting the disease in less time. The classification techniques used are Support Vector Machine, Decision Tree, Naïve Bayes, K-Nearest Neighbours, Random Forest, Extra Trees, Logistic Regression, and Extreme Learning Machines (ELM).

In the chapter "Performance Optimization Strategies for Big Data Applications in Distributed Framework", the author provides novel schemes that address topology discovery with fewer messages and gather link information of all devices in both single- and multi-controller environments (useful when scalability issues are prevalent in hSDN). Traffic engineering problems in h-SDN are addressed by proper placement of SDN nodes, analyzing the key criteria of traffic details and node degree while lowering link utilization in real-time topologies.

Topics presented in each chapter of this book are unique to this book and are based on the summarized Ph.D. works of the authors. In editing this book, we have attempted to bring together the innovative trends and experiments in intelligent technologies. We believe this book will serve as a reference for a larger audience, including students, research scholars, developers, and researchers.

Dr. Satya Ranjan Dash, Bhubaneswar, India
Dr. Himansu Das, Bhubaneswar, India
Dr. Kuan-Ching Li, Taichung, Taiwan
Dr. Esau Villatoro Tello, Mexico City, Mexico

Contents

Clinical Decision Support System for Diagnosis and Treatment of COPD Using Ensemble Methods
Sudhir S. Anakal and P. Sandhya (page 1)

Designing of Fault-Tolerant Models for Wireless Sensor Network-Assisted Smart City Applications
Hitesh Mohapatra and Amiya Kumar Rath (page 25)

Test Scenarios Generation and Optimization of Object-Oriented Models Using Meta-Heuristic Algorithms
Satya Sobhan Panigrahi and Ajay Kumar Jena (page 45)

Logical Interpretation of Omissive Implicature
Alfonso Garcés-Báez and Aurelio López-López (page 75)

Loss Allocation Techniques in Active Power Distribution Systems
Ambika Prasad Hota, Sivkumar Mishra, and Debani Prasad Mishra (page 95)

Detection of Brain Abnormalities from Spontaneous Electroencephalography Using Spiking Neural Network
Rekha Sahu and Satya Ranjan Dash (page 123)

QOS Enhanced Energy Aware Task Scheduling Models in Cloud Computing
G. B. Hima Bindu, Kasarapu Ramani, and C. Shoba Bindu (page 145)

Power Quality Improvement Using Hybrid Filters Based on Artificial Intelligent Techniques
Soumya Ranjan Das, Prakash Kumar Ray, and Debani Prasad Mishra (page 165)

Predictive Analytics for Advance Healthcare Cardio Systems
Debjani Panda and Satya Ranjan Dash (page 187)

Performance Optimization Strategies for Big Data Applications in Distributed Framework
Mir Wajahat Hussain and Diptendu Sinha Roy (page 221)

Editors and Contributors

About the Editors

Dr. Satya Ranjan Dash is currently working as an associate professor at KIIT University, India. His current research includes epileptic seizure detection from EEG signals through spiking neural networks (SNN), classification of schizophrenia patients from EEG and fMRI using SNN, fetal heart rate signal classification through extreme learning machines (ELM), mammogram analysis with local binary patterns (LBP), and generative adversarial network (GAN) models.

Dr. Himansu Das works as Associate Professor in the School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT), Deemed to be University, Bhubaneswar, India. He received his Ph.D. in Engineering (Computer Science and Engineering) from Veer Surendra Sai University of Technology (VSSUT), Odisha, his M.Tech. degree in Computer Science Engineering from the National Institute of Science and Technology, Odisha, and his B.Tech. degree from the Institute of Technical Education and Research, Odisha, India. He has published several research papers in various international journals and has presented at conferences. He has also edited several books published by IGI Global, Springer, CRC Press, and Elsevier, and has served on many journals and conferences as an editorial or reviewer board member. He is proficient in the field of Computer Science Engineering and has served as an organizing chair, a publicity chair, and a member of the technical program committees of many national and international conferences. He is also associated with various educational and research societies such as IET, IACSIT, ISTE, UACEE, CSI, IAENG, and ISCA. His research interests include Data Mining, Soft Computing, and Machine Learning, and he has more than twelve years of teaching and research experience in various engineering colleges and universities.

Dr. Kuan-Ching Li is currently a Professor in the Department of Computer Science and Information Engineering at Providence University, Taiwan. Dr. Li is the Editor-in-Chief of the technical publications International Journal of Computational Science and Engineering (IJCSE), International Journal of Embedded Systems (IJES), and International Journal of High Performance Computing and Networking (IJHPCN), all published by Interscience, and also serves on a number of journal editorial boards and guest editorships. His topics of interest include networked computing, GPU computing, parallel software design, and performance evaluation and benchmarking. Dr. Li is a Fellow of the IET and a senior member of the IEEE.

Dr. Esau Villatoro Tello is currently a full-time professor-researcher at the Universidad Autónoma Metropolitana Unidad Cuajimalpa (UAM-C) in Mexico City. From September 2019 to date he has been a visiting professor at the Idiap Research Institute in Martigny, Switzerland. His main research interests are associated with Natural Language Processing (NLP) and Computational Linguistics; specifically, he has done research on authorship analysis and authorship attribution, thematic and non-thematic text classification, plagiarism detection, information retrieval, and NLP applied in psycho-linguistics for mental health support.

Contributors Sudhir S. Anakal Faculty of Computer Applications, Sharnbasva University, Kalaburagi, India Soumya Ranjan Das Department of Electrical Engineering, IIIT Bhubaneswar, Bhubaneswar, India Satya Ranjan Dash School of Computer Applications, KIIT Deemed to be University, Bhubaneswar, Odisha, India Alfonso Garcés-Báez Instituto Nacional de Astrofísica, Óptica y Electrónica, Computational Sciences Department, Sta. Ma. Tonantzintla, Puebla, México G. B. Hima Bindu Department of CSE, School of Technology, The Apollo University, Saketa, Murukambattu, Chittoor, Andhra Pradesh, India Ambika Prasad Hota Department of Electrical and Electronics Engineering, Gandhi Engineering College (GEC, Bhubaneswar), Bhubaneswar, India Mir Wajahat Hussain Department of Computer Science and Engineering, Alliance University, Anekal, Karnataka, India Ajay Kumar Jena School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, India Aurelio López-López Instituto Nacional de Astrofísica, Óptica y Electrónica, Computational Sciences Department, Sta. Ma. Tonantzintla, Puebla, México


Debani Prasad Mishra Department of Electrical and Electronics Engineering, IIIT-Bhubaneswar, Bhubaneswar, India Sivkumar Mishra Department of Electrical Engineering, CAPGS BPUT, Rourkela, Odisha, India Hitesh Mohapatra KIIT (Deemed to be) University, Bhubaneswar, OR, India Debjani Panda Indian Oil Corporation Ltd., Odisha State Office, Bhubaneswar, India Satya Sobhan Panigrahi School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, India Kasarapu Ramani School of Computing, Mohan Babu University, Tirupati, Andhra Pradesh, India Amiya Kumar Rath Veer Surendra Sai University of Technology, Burla, OR, India; National Assessment and Accreditation Council (NAAC), Bangalore, KA, India Prakash Kumar Ray Department of Electrical Engineering, OUTR Bhubaneswar, Bhubaneswar, India Diptendu Sinha Roy Department of Computer Science and Engineering, National Institute of Technology Meghalaya, Shillong, Meghalaya, India Rekha Sahu School of Computer Engineering, KIIT Deemed to Be University, Bhubaneswar, India P. Sandhya Department of Computer Science and Engineering (MCA), VTU CPGS, Mysuru, India C. Shoba Bindu Department of CSE, JNTUA College of Engineering, Ananthapuramu, Andhra Pradesh, India

Clinical Decision Support System for Diagnosis and Treatment of COPD Using Ensemble Methods Sudhir S. Anakal and P. Sandhya

Abstract Chronic Obstructive Pulmonary Disease (COPD) is a long-term condition, a chronic inflammatory lung disease that causes obstructed airflow from the lungs. COPD is an umbrella term for two conditions: emphysema and chronic bronchitis. Emphysema slowly destroys the air sacs in the lungs, obstructing the outward airflow, while chronic bronchitis causes narrowing and inflammation of the bronchial tubes, allowing mucus to build up. Long-term exposure to matter that irritates the lungs is the most common cause of COPD. COPD is a fast-progressing disease that may lead to many complications and exacerbations, such as worsening respiratory problems and heart problems. Against this background, the Clinical Decision Support System (CDSS) has revolutionized healthcare and its delivery by enhancing the clinical decisions made by clinicians, supported by medical knowledge, patient and health data, and other relevant health information. In the present study, an architecture has been proposed for the construction of a CDSS for COPD using ML algorithms. The CDSS for COPD is supported by basic information about the patient, followed by medical history and then a spirometry test; this helps in staging COPD, based on the spirometry results, as mild, moderate, or severe. The CDSS also contains treatment and management strategies along with pulmonary rehabilitation, and a drug-drug interaction checker that helps physicians examine interactions among COPD medications or comorbidity treatments. COPD patients above the age of 50 years often suffer from mental health problems such as depression and dementia, which are assessed through a dedicated module, and a Quit Smoking test is also provided for patients with COPD. The proposed methodology for the CDSS comprises three phases: phase 1, data inputs; phase 2, ML models; and phase 3, possible outcomes. The ML approaches used in the present study are Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), and Gradient Boosting (GB). These ML algorithms are ensemble methods that integrate the experience and knowledge of physicians in the diagnosis of the disease (COPD) and generalize from observed evidence to make predictions about unseen data in diagnosis, segmentation, and detection. Finally, in phase 3 the possible outcomes are defined, such as initial investigation, COPD staging and classification, treatment, prediction of comorbidities, identification of drug-drug interactions, psychological comorbidities, and modules for quitting smoking and disease management.

Keywords COPD · CDSS · Machine learning · AI · Spirometer

1 Introduction

Chronic Obstructive Pulmonary Disorder (COPD) is mainly a persistent inflammatory lung disorder that obstructs airflow from the lungs. It is non-infectious, mainly affects the lungs, and is one of the significant health issues gaining attention because of its detrimental nature. The disease prevails across the globe and has a high mortality rate, making it one of the four major causes of death. Though the disease is highly lethal, it is still underdiagnosed. As per the World Health Organization (2012), approximately 210 million people throughout the world are estimated to suffer from COPD [1]. The mortality rate is persistently increasing, and it is estimated that by 2030 COPD might become one of the top leading causes of death. About 65 million or more people worldwide have severe or moderate COPD, and according to experts this number may continue to rise globally over the next six decades. Due to the gradual progression of the disease, it is usually not diagnosed until the patient reaches the age of around 40. Smoking (both active and passive) is identified as the prime cause of COPD. Apart from smoking, other factors such as air pollution and exposure to harmful chemicals also contribute majorly to the occurrence of COPD. The disease is mainly induced by prolonged exposure to particulate matter or irritating gases, mostly from air pollution or smoking, and is characterized by difficulty in breathing, with symptoms aggravating over time. Though there is no cure for COPD at present, it is preventable and treatable. In COPD patients, it was observed that the airflow is reduced significantly due to the following reasons:

• Deterioration in the elastic quality of the air sacs and airways.
• Thinning of air sac walls.
• Inflammation in the walls of the airways.
• Clogging of airways due to excess mucus formation.

The two chief conditions contributing to COPD are Emphysema and Chronic Bronchitis [2]. Chronic Bronchitis and Emphysema generally occur together but may vary in terms of severity among COPD patients. Chronic Bronchitis is linked with inflammation of the bronchial tube’s lining or air pathways which transport air to and from the alveoli (air sacs) of the lungs. This condition is mainly characterized by mucus and cough production.


In 2016, about 8.9 million people in America were detected with bronchitis, among which approximately 75% of conditions involved people above age 45 and the proportion of women suffering from this condition was nearly twice the rate of men [2]. Emphysema refers to the condition wherein the alveoli of the lungs are harmed because of exposure to particulate matter and irritating gases. In Emphysema cases, walls of injured air sacs become outstretched and lungs get larger, making it tedious to transport air in and out. In 2016, nearly 3.5 million people in America were detected with Emphysema, among which greater than 90% of conditions involved people above the age of 45 [2].

1.1 COPD Symptoms Often, COPD symptoms won’t appear until predominant lung deterioration has occurred. The symptoms typically worsen over time, specifically if exposure to smoke continues. Initial symptoms of COPD include the feeling of suffocation or difficulty in breathing, tiring (particularly during performing physical activities), chest tightness, persistent wheezing, lack of activeness/energy, frequent respiratory disorders, inflammation in legs or feet and ankles, blueness of fingernail beds or lips, inadvertent weight loss and chronic cough which may produce sputum (mucus) which may be white, greenish or yellowish [3]. These symptoms typically turn worse eventually and make routine activities increasingly hard. Sometimes COPD symptoms may get abruptly worse compared to general day-to-day changes and may persist for several days. This condition is called exacerbation or flare-up.

1.2 Clinical Decision Support System (CDSS)

A CDSS is a comprehensive software- and hardware-based system that aims at improving healthcare delivery by enhancing clinical decisions with the desired medical knowledge, health data, patient data, and other information pertinent to health [4]. CDSSs today allow clinicians to integrate their own knowledge with the suggestions or information offered by the CDSS. These systems exploit knowledge and theories from diverse areas [5] to support sophisticated decision-making and problem-solving. They allow decision-makers to build and explore the implications of their judgments [6], and they provide evidence-informed recommendations to support clinical diagnoses [7]. Nowadays, computer-based systems are increasingly adopted by healthcare professionals to record patient data, encouraged more by the carrot-and-stick policies of governments and organizations than by the inherent benefits of these systems [8]. A number of these tools are in use [9] and are useful in administration, data management, and clinical experiments. These systems synthesize and integrate patient-specific information, perform complex evaluations, and present the results to clinicians in a timely fashion [10]. CDSSs are information systems developed to support and enhance medical/healthcare decision-making [11].

2 Problem Statement COPD has been known to be one of the major poorly reversible diseases related to the lungs causing an increase in morbidity and mortality across the globe. The individuals suffering from COPD are mainly the middle-aged and elderly population that is majorly being infected from the risk factors related to respiratory infections. No further cure for COPD has been identified and can only be reduced by reducing the impact of smoking among the population. The major risk factor for the emergence of COPD is smoking habits. Hence, the management for controlling the impact of COPD to reduce the mortality rate has evolved through the regular use of inhaled bronchodilators that helps in preventing and relieving symptoms of COPD. But these management tools can only be utilized when the symptoms are identified, but COPD symptoms often take a longer time to identify which may lead to severe complexities among COPD patients. Hence, there is a need for diagnosing COPD at an early level. The use of spirometer devices has been proven to be effective in the diagnosis of COPD but it does not provide appropriate results for young and aged patients. Hence, the integration of CDSS has been adopted that proved to be effective in appropriate diagnosis and helps the patients in improving their quality of life while reducing the involvement of risk factors. The utilization of spirometers in the diagnosis of COPD is quite expensive and requires skilled staff for its operation. Hence, the implementation of CDSS would help physicians with the appropriate diagnosis of COPD while providing appropriate treatment and diagnosis for COPD patients.

3 Literature Survey

Gava et al. reviewed the global evidence on the misdiagnosis of COPD and identified large variations in the prevalence of COPD across geographical regions, ranging between 3% and 21%. Estimating the true burden of COPD will remain a challenge unless a more uniform approach to the diagnosis of COPD is adopted by all regions where the associated risk factors are common. One of the main reasons for the huge variation in COPD prevalence is the different criteria used to define COPD. The vast majority of data on under- and over-diagnosis of COPD has been produced in the developed world, while very limited or low-quality data are available from LMICs. Access to healthcare and the under-use of spirometry remain enormous global challenges. Although spirometry is recommended for confirmatory diagnosis of COPD, consistency in its use with respect to test procedure, regular maintenance, and calibration remains a worldwide difficulty. Moreover, the use of different reference values for lung function, for example country-specific or Global Lung Function Initiative (GLI) values, remains a hotly debated area, though one beyond the scope of that review. Additional risk factors for misdiagnosis include age, gender, ethnicity, self-perception of symptoms, concurrent illness, and educational awareness of risk factors, both for patients and for their physicians.

According to [13], advancements made in biomedical and biological technologies have provided enormous volumes of physiological and biological data (biomedical data) such as genomic sequences, medical images, electroencephalography, and protein sequences and structures. Deep Learning (DL) algorithms, a part of Machine Learning (ML), show promising results in feature extraction and in learning complex data patterns. In general, DL has two properties: unsupervised or supervised learning of feature representations in each layer, and multiple layers of non-linear processing units. The deluge of big biomedical data necessitates efficient and effective computational tools to analyze, interpret, and store such data. DL models abstraction from the enormous data by employing multi-layered deep neural networks (DNNs) that make sense of data such as images, texts, and sounds. Artificial Neural Networks (ANNs) mimic the perception of objects and connect artificial neurons within layers to aid feature extraction from the objects; ANNs are now facilitated with efficient backpropagation (BP) for pattern recognition, and an ANN with more hidden layers provides a higher resolution for feature extraction. Analysis of medical images is one of the important applications in biomedical informatics, wherein CNN architectures (DL) are used for the diagnosis of diseases. Notably, DL methods have helped in object recognition, segmentation, and localization of natural images in medical imaging. Segmentation of organs and tissues is crucial in the quantitative and qualitative assessment of medical images; for instance, brain tumor segmentation (BRATS) based on DNNs helps with imaging obtained through Magnetic Resonance Imaging (MRI) [14].

[15] conducted a study for modeling structural features of RNA-binding protein (RBP) targets in gene expression and genomic sequencing studies using DL methods. The framework comprises three phases: data encoding, training, and application. In the data-encoding phase, RNA sequences are truncated around the bound site, which is identified using the RNAshapes tool and CLIP-based experiments, and primary and secondary sequence structures are encoded with a replicated softmax architecture. In the training phase, a multimodal deep belief network (DBN) is built to integrate the encoded structural profiles and sequences. Finally, in the application phase, the trained DL architecture detects novel RBP binding sites on the genome.

[16] conducted a review analyzing the application of ML in sensing technology for acquiring and analyzing signals from patients for the monitoring of heart condition, mental state, and disease diagnosis. An important feature of DL models is that they achieve high-performance, accurate models that depend on feature selection and feature extraction. These algorithms are successful at handling sensor data and modeling physiological signals with better accuracy compared to conventional ML techniques. Various physiological signal tasks are analyzed using DL techniques, such as prediction of mental state, classification of sleep stages, human activity, interaction and recognition, cardiac arrhythmias, heart rate monitoring, fall detection, detection of epileptic seizures, emotions, and blood pressure.

4 Objectives of the Proposed Research The major aim of the research is to study the Clinical Decision Support System (CDSS) for the diagnosis and treatment of COPD using ensemble methods. The following are the major objectives that have been framed according to the research study: • Initial investigation of COPD by integrating Spirometry, and ensemble techniques. • To diagnose and classify the various stages of COPD based on the Soft Computing approach & MIL (Machine Intense Learning) approach. • To enable improvement in treatment outcomes with knowledgebase systems. • To assess the possible interactions of drugs and their possible side effects associated with them employing prediction techniques. • To investigate the comorbidities associated with COPD by using an ensemble approach. • To analyze and detect cognitive dysfunction in a patient with severe COPD by conducting various neuropsychological tests which are a part of the Clinical Decision Support System.

5 Methodology Followed

The present study utilizes the concept of a clinical decision support system (CDSS), which aims to improve healthcare delivery by enhancing clinical decisions with medical knowledge, patient information, and related health information [12]. A CDSS is software designed to aid clinical decision-making: the characteristics of an individual patient are matched against a computerized clinical knowledge base, and patient-specific assessments are then presented to the clinician for appropriate decision-making. Clinicians use a CDSS while providing care, integrating their own knowledge with the information supplied by the CDSS, and the CDSS itself is built to hold the data and related observations. In the present era, CDSSs are used in digitized form through web applications, electronic health records (EHR), and computerized provider order entry (CPOE) systems. These CDSS applications can be used on a tablet, desktop, or smartphone, along with other advanced devices such as wearable health technology and biometric monitoring, whose outputs are either directly visible on the device or linked to EHR databases. CDSS interventions are characterized by their timing and by an active or passive form of delivery [12]. Knowledge-based and non-knowledge-based are the two major classifications of CDSS. A knowledge-based CDSS evaluates IF–THEN rules against data retrieved within the system to produce its output, whereas a non-knowledge-based CDSS makes decisions using machine learning (ML), artificial intelligence (AI), or statistical pattern recognition. CDSSs have contributed to alarm systems, diagnostics, prescription (Rx), disease management, drug control, and several other clinical analyses concerning diseases. The outputs are obtained in the form of computerized alerts, reminders, computerized guidelines, order sets, patient data reports, documentation templates, and clinical workflow tools.
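As a concrete illustration of the knowledge-based (IF-THEN) style of CDSS described above, the sketch below shows a minimal rule engine. It is not the chapter's implementation; the field names (for example fev1_fvc_ratio and smoking_years) and the thresholds are hypothetical placeholders chosen only to show the IF-THEN pattern.

```python
# Minimal sketch of a knowledge-based (IF-THEN) CDSS rule engine.
# Field names and thresholds are illustrative placeholders, not the
# chapter's actual rule base.

def evaluate_rules(patient):
    """Return a list of alerts produced by simple IF-THEN rules."""
    alerts = []

    # IF post-bronchodilator FEV1/FVC is below 0.70 THEN flag airflow obstruction.
    if patient.get("fev1_fvc_ratio", 1.0) < 0.70:
        alerts.append("Airflow obstruction: consider COPD work-up (spirometry review).")

    # IF the patient is a long-term smoker THEN recommend the quit-smoking module.
    if patient.get("smoking_years", 0) >= 10:
        alerts.append("Long smoking history: offer the Quit Smoking module.")

    # IF age is above 50 THEN schedule depression/dementia screening.
    if patient.get("age", 0) > 50:
        alerts.append("Age > 50: run depression (Hamilton) and dementia (MMSE) screening.")

    return alerts


if __name__ == "__main__":
    example = {"age": 62, "smoking_years": 25, "fev1_fvc_ratio": 0.61}
    for alert in evaluate_rules(example):
        print(alert)
```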

5.1 Architecture for Constructing CDSS for COPD

In the CDSS architecture for COPD, shown in Fig. 1, basic patient information is collected first, followed by medical history items such as alcohol consumption, smoking history, TB history, allergy, asthma, cardiac issues, breathlessness, hypertension, wheezing, and COPD symptoms. A spirometry test is then conducted, and the spirometry result (FEV1/FVC) is used to stage the COPD the individual is suffering from. Once the individual is confirmed as having COPD (mild, moderate, or severe stage), management and treatment strategies are applied: yoga videos, exercises, pulmonary rehabilitation, medications, and breathing techniques are available. The CDSS also has a module that gives details on inhaler devices, inhalers, and nebulizers, and a Drug-Drug Interaction (DDI) checker module that helps patients or physicians check for interactions among the drugs prescribed for COPD or related comorbidities. COPD patients above the age of 50 years may suffer from two psychological comorbidities, dementia and depression. Depression is a common mental illness that affects many people worldwide, and recognizing its symptoms is a daunting task; the study has therefore developed a module for the investigation of depression in a COPD patient. The depression test is conducted with the Hamilton scale, in which the physician asks the patient questions and a scorecard is generated from the answers given. Dementia is a deterioration in mental ability, such as memory loss, that interferes with daily life. Six examinations recommended by the National Institute on Aging, USA, are considered in the present study: the Functional Activities Questionnaire (FAQ), Blessed Orientation-Memory-Concentration (BOMC), Mini-Mental State Examination (MMSE), Blessed-Dementia Information-Memory-Concentration Test (BIMC), Short-Term Memory Questionnaire (STMQ), and Short-Term Memory Recall Test (STMT). Moreover, the CDSS proposed in the study has a module called the Quit Smoking test for helping patients quit smoking after being diagnosed with COPD by physicians (Fig. 1).

Fig. 1 Architecture for constructing CDSS for COPD

The proposed methodology, shown in Fig. 2, consists of three major phases: phase 1, data inputs; phase 2, ML models; and phase 3, possible outcomes. The data inputs in phase 1 comprise electronic medical reports, laboratory investigations, spirometer reports, and physician interpretations, and together they constitute the data pre-processing stage. The output of phase 1 feeds into phase 2, which involves the integration of the machine learning models over the warehoused COPD data. The machine learning approaches used are described below.

Fig. 2 A proposed methodology using ML algorithms

Random Forest algorithm: RF is an easy and flexible ML algorithm capable of producing good results even in the absence of hyper-parameter tuning most of the time. It is considered one of the most effective algorithms because of its simplicity and versatility, being applicable to both classification and regression tasks. It is a supervised learning algorithm.

Logistic Regression: Logistic regression has been used in the biomedical domain as well as in social science applications. It models the probability of a discrete outcome as a function of input variables. The most common LR model handles a binary outcome with two values, such as true/false or yes/no. It is an important model of analysis for classification problems, where it determines the category into which a new sample fits.

Decision tree: A decision tree covers both classification and regression. It is effective for visualizing and representing decisions, supporting decision-making through tree-like models. The characteristics of a data set become branches of the tree, yet the simplicity of the algorithm is preserved: the important features are made clear and their relations are clearly visible. This methodology is known as learning a decision tree from data; a tree that classifies features is called a classification tree, while regression trees aim to predict values. A DT is an effective and reliable decision-making technique that provides high accuracy in classification using simple, interpretable knowledge, and when DTs are used an expert can validate the decision-making process itself. Thus, DTs are quite appropriate for supporting clinical decision-making.

Gradient Boosting: GB is a technique that converts weak learners into stronger ones. The boosting process fits each new tree to a modified version of the data. The GB algorithm starts by training a DT in which each observation has an equal weight; after evaluating the first tree, the weights of observations that are difficult to classify are increased and the weights of those that are easy to classify are lowered. GB thus trains many models in an additive, gradual, and sequential manner.


GB is able to fit a weak learner to the residual recursively and improves model performance with a gradual increase in the number of iterations. It can discover automatically the complex structure of data that includes high-order interactions and nonlinearity in thousands of potential predictors. Data collection is one of the toughest jobs in research work. We have consulted the local Pulmonologist to finalize the attributes for collecting the data. For collecting data, we visited local hospitals and interviewed the patients. Once the data was collected the data was cleaned using Data Mining techniques. Missing values, and typing mistakes while entering the data in the excel sheet were handled using Python language. Dataset attributes for collecting data were: Age, Gender, Occupation, Height(cm), Weight(kg), Smoker (in the year(s), EX-Smoker (in the year(s)), Sticks per day, Type of smoke, Asthma, Cardiac Problem, Any health problems, FEV1/ FVC, Any Allergy, COPD Symptoms, Staying near Industries, Wood cooking (in the year(s)), COPD stage. There were some reasons behind choosing these attributes for the current study. Starting with “age”, is one of the most important factors because COPD can only be diagnosed after the age of 40. Knowing respondents’ gender will reveal who is more exposed to COPD, male or female. Occupation will define more clarity if one is working in dust places, or has been working for a long time in dust places occupation hazards may lead to COPD. Smoking will add convenience to knowing whether the person who’s smoking gets more symptoms of COPD or not. Furthermore, the count of their smoking sticks per day will let us know better about the factors. Also, the one who is already having asthma will end up showing some COPD symptoms first. Asthma, allergy, diabetes, or any other health issues are also important factors to consider while conducting such surveys. Symptoms of COPD will also depend on the locality of respondents whether they stay near industrial areas or not. Lastly, spirometer results will talk more about the FVC, FEV1, and FEV1/FVC.
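A minimal sketch of how the four classifiers named above (RF, LR, DT, GB) could be trained and compared on a table holding the listed attributes. The CSV file name, the column handling, and the evaluation split are assumptions for illustration, not the study's actual pipeline.

```python
# Illustrative sketch: comparing the four classifiers on a COPD table with the
# attributes listed above. File name, column names and preprocessing choices
# are assumptions, not the study's actual code.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("copd_survey.csv")                   # hypothetical data file
df = df.dropna()                                      # simplest missing-value handling
X = pd.get_dummies(df.drop(columns=["COPD stage"]))   # one-hot encode categorical columns
y = df["COPD stage"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")
```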

6 Outcome of the Proposed Research

6.1 Objective 1

Initial investigation of Chronic Obstructive Pulmonary Disease (COPD) by integrating spirometry and ensemble techniques. Data for the present study were collected by applying ML techniques together with a purpose-built spirometer device for conducting PFT tests. Figure 3 shows the kind of data recorded for the initial investigation of COPD, and Fig. 4 shows the PFT test.

Fig. 3 Medical history

Fig. 4 PFT test

The proposed spirometer device is designed to work out the vital capacity of the individual blowing into the tube and to display the results on an LCD display. This low-cost spirometer allows the air volume expelled by the lungs over an interval of time to be computed, and it can be used to detect irregularities in lung volume. It uses the principle of the Venturi tube, in which the difference in pressures sampled at two different tube diameters is used to compute the flow rate via Bernoulli's principle. The device is designed so that the results obtained after conducting the spirometry test are sent to the cloud; results stored in the cloud can be retrieved only by authorized persons, anywhere in the world. The results do not need to be sent by mail or other means, as the proposed framework uses the cloud to store the PFT results of each patient, to be used when required. The device uses a Wi-Fi connection to connect to the cloud and then uploads the PFT results for storage. The results can then be accessed by an authorized person using their user credentials when required.
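The Venturi-tube calculation referred to above can be sketched as follows. The tube diameters, the air density, and the sampled pressure drops are placeholder values, not the device's real calibration.

```python
# Sketch of the Venturi-tube flow calculation mentioned above (Bernoulli's
# principle). Tube diameters, air density and the sampled pressure drops are
# placeholder values, not the device's real calibration.
import math

RHO_AIR = 1.2            # kg/m^3, approximate density of air
D1, D2 = 0.030, 0.015    # m, wide and narrow tube diameters (assumed)

A1 = math.pi * (D1 / 2) ** 2
A2 = math.pi * (D2 / 2) ** 2

def venturi_flow(delta_p):
    """Volumetric flow rate (m^3/s) from the measured pressure drop delta_p (Pa)."""
    return A2 * math.sqrt(2 * delta_p / (RHO_AIR * (1 - (A2 / A1) ** 2)))

# Integrating the flow over the exhalation gives the exhaled volume, i.e. the
# kind of quantity a spirometer reports (FVC, FEV1).
samples = [120, 110, 90, 60, 30]     # Pa, pressure drops sampled every 0.1 s (assumed)
dt = 0.1
volume = sum(venturi_flow(p) * dt for p in samples)
print(f"Exhaled volume over {len(samples) * dt:.1f} s: {volume * 1000:.2f} L")
```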

6.2 Experimental Results of Objective 1

When the spirometer device is turned on, a message reading "IoT-based low-cost Spirometer" is displayed on the LCD. The newly developed device also has a facility that allows it to be switched on without external power; it uses a battery that can provide backup for up to three hours. Once turned on, the device connects to the cloud using its Wi-Fi module. The Android application ThingView, a ThingSpeak viewer, is used to connect to the cloud; ThingSpeak is an open-source IoT platform used to store and retrieve information from the cloud using HTTP over the web. After connecting to the cloud, the device is ready to be used for conducting the PFT. To conduct the PFT, the patients are instructed to inhale and exhale. When the PFT is done, the results are uploaded to the cloud (Figs. 5 and 6). In this work, we have designed and developed a low-cost spirometer device. We have used a silicon pressure sensor that gives 95% accurate results, and the device can easily be used by untrained staff. The aim of the proposed system was to design a handheld spirometer device that can be used without interfacing with a PC. After the PFT, the results are sent to the cloud using the Wi-Fi connection of the device; they are stored in the cloud and can be retrieved only by an authorized individual providing user credentials. This proposed device is very useful in rural regions: if a pulmonologist is not available, the device can be used to conduct the PFT, the results are stored in the cloud, and they can then be viewed by a physician using the Android application.
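ThingSpeak accepts channel updates over a simple HTTP endpoint; the sketch below shows one way a PFT result could be pushed to the cloud. The write API key and the mapping of measurements to channel fields are placeholders, not the system's actual configuration.

```python
# Sketch of pushing a PFT result to a ThingSpeak channel over HTTP.
# The write API key and the field-to-measurement mapping are placeholders.
import requests

THINGSPEAK_UPDATE_URL = "https://api.thingspeak.com/update"
WRITE_API_KEY = "YOUR_WRITE_API_KEY"   # placeholder

def upload_pft_result(fev1, fvc):
    """Send FEV1, FVC and their ratio as three channel fields."""
    payload = {
        "api_key": WRITE_API_KEY,
        "field1": round(fev1, 2),
        "field2": round(fvc, 2),
        "field3": round(fev1 / fvc, 3),
    }
    response = requests.get(THINGSPEAK_UPDATE_URL, params=payload, timeout=10)
    # ThingSpeak answers with the new entry id, or 0 if the update failed.
    return response.text

if __name__ == "__main__":
    print(upload_pft_result(fev1=1.8, fvc=3.1))
```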

6.3 Objective 2 To diagnose and classify the various stage of COPD based on the Soft Computing approach & MIL (Machine Intense Learning) approach COPD alludes to being in the last phases of the illness. At this stage, you can hope to encounter huge windedness in any event, while resting. Due to the level of lung harm at this stage, you are in high danger of lung contamination and respiratory


Fig. 5 Spirometer

failure. According to GOLD, there are four distinct stages of COPD with corresponding ranges of PFT values:
• Stage I: Mild COPD. Lung function is beginning to decline but the patient may not notice it.
• Stage II: Moderate COPD. Symptoms progress, with breathlessness developing upon exertion.
• Stage III: Severe COPD. Breathlessness becomes worse and COPD exacerbations are common.
• Stage IV: Very severe COPD. Quality of life is severely impaired. COPD exacerbations can be life-threatening.
Each stage is characterized by the spirometry measurement of FEV1 (the volume of air breathed out in the first second after a forced exhalation). End-stage COPD is considered stage IV, or very severe COPD, with an FEV1 of less than or equal to 30%. Various factors influence COPD life expectancy, including smoking history, the degree of dyspnea (breathlessness), fitness level, and nutritional status. Certain individuals in stage IV are still able to function reasonably well with few limitations. On the other hand, there are also many individuals at this stage who are very ill.


Fig. 6 Spirometer results

6.4 Experimental Results of Objective 2
The classification module was designed using a Random Forest classifier coded in Python. Random forests, or random decision trees (DTs), are ensemble learning methods used for regression and classification tasks; they operate by constructing a multitude of DTs at training time and outputting the class that is the mode of the classes, or the mean prediction, of the individual DTs.


RF is a supervised learning algorithm wherein the "forest" that is built is an ensemble of DTs trained using the method of "bagging". The general idea of this method is that a combination of learning models increases the overall result. Confusion matrices visualize vital predictive analytics such as specificity, recall, precision, and accuracy. In contrast, a single classification metric such as accuracy provides less useful information: it is simply the number of correct predictions divided by the total number of predictions. Confusion matrices are useful because they give direct value comparisons such as false positives, true positives, false negatives, and true negatives, and they represent counts from actual and predicted values. Accuracy can be misleading on imbalanced datasets, and thus other metrics based on the confusion matrix can be useful for performance evaluation.
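A minimal sketch of such a classification module is shown below. It assumes scikit-learn and pandas are available, and the spirometry feature columns and records are hypothetical placeholders, so the snippet illustrates the bagging-based workflow and confusion-matrix evaluation rather than the actual CDSS code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical PFT records: features and a GOLD-stage label (illustrative only).
df = pd.DataFrame({
    "fev1":  [2.8, 1.9, 1.2, 0.8, 3.1, 1.5, 1.0, 2.5],
    "fvc":   [3.5, 3.0, 2.6, 2.2, 3.8, 2.9, 2.4, 3.3],
    "ratio": [0.80, 0.63, 0.46, 0.36, 0.82, 0.52, 0.42, 0.76],
    "age":   [45, 58, 63, 71, 40, 66, 69, 50],
    "stage": [0, 1, 2, 3, 0, 2, 3, 0],
})

X, y = df.drop(columns="stage"), df["stage"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

# Bagging of decision trees: each tree sees a bootstrap sample of the data.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("confusion matrix:\n", confusion_matrix(y_test, pred))
```

On an imbalanced data set, the per-class counts in the confusion matrix (and metrics derived from them, such as per-stage recall) are more informative than accuracy alone, which is the point made above.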

6.5 Objective 3
To enable improvement in treatment outcomes with a knowledge-base system.
In the present research, prior to the design and development of the CDSS for COPD diagnosis and treatment, the GOLD criteria have been used for prescribing inhalers, drugs, and medications to COPD patients. The GOLD criteria are a pocket guide to the diagnosis, prevention, and management of COPD, maintained while working with public health officials and healthcare professionals across the globe to raise awareness about COPD and improve the lives of people living with COPD. The GOLD report published in 2021 recommended that the COPD diagnosis should be based on the presence of airflow obstruction and symptoms, demonstrated by the post-bronchodilator FEV1/FVC ratio, which should be less than 0.70.

3.2.3 Result Assessment

The devised approach is executed by generating the XMI (XML Metadata Interchange) code of the sequence and state machine diagrams of the Library book issue system. A Java parser is written to which the XMI code is passed as input. It is an interpreter that takes the XMI code and breaks it into parts to generate the sequence of test scenarios. The nodes, edges and all the transitions are recognized from it, and a graph is automatically constructed from them. A snapshot of the XMI code of the sequence diagram is illustrated in Fig. 3.7. Finally, the test scenarios are produced from the graph through this Java parser (a minimal Python sketch of the idea is given below). In Table 3.3, the observed test cases generated from the combined model are listed. The same approach is also applied to four other case studies to measure its efficiency, coverage and the number of faults detected. Coverage analysis of the different case studies using our proposed approach is portrayed in Fig. 3.8. The four additional real-life case studies, namely the Temperature warning system, Online trading system, E-Commerce system and ATM withdrawal system, are summarized in Table 3.4.
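The parser used in this chapter is written in Java; the following Python sketch only illustrates the same idea, and the XMI tag and attribute names (node, transition, source, target) are assumptions, since real exports from tools such as RSA or StarUML use tool-specific element names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def xmi_to_graph(xmi_path):
    """Parse an XMI file and return an adjacency list {node_id: [successor ids]}.
    Element and attribute names here are placeholders for the tool-specific ones."""
    root = ET.parse(xmi_path).getroot()
    graph = defaultdict(list)
    for node in root.iter("node"):
        graph.setdefault(node.get("id"), [])
    for edge in root.iter("transition"):
        graph[edge.get("source")].append(edge.get("target"))
    return graph

def test_scenarios(graph, start, end, path=None):
    """Enumerate simple paths from start to end; each path is one test scenario."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    scenarios = []
    for nxt in graph.get(start, []):
        if nxt not in path:               # skip nodes already on the path (loops)
            scenarios.extend(test_scenarios(graph, nxt, end, path))
    return scenarios
```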


Fig. 3.7 Snapshot of XMI code of sequence diagram

Table 3.3 Observed test cases from the combined model

Test case no | User id | Book issued | Book name | Final book issued | Actual result | Expected result
TS1 | 2,009,286 | 1 | Advance operating system | 1 | Invalid student | Invalid student
TS2 | 2,019,876 | 5 | Operating system | 5 | Allocation complete | Allocation completed
TS3 | 2,019,987 | 4 | Data science | 4 | Unable to find Book | Error message
TS4 | 2,018,654 | 3 | Machine learning | 4 | Issued successfully | Issued successfully
TS5 | 2,018,325 | 2 | Software testing | 2 | Student not found | Student not found

4 Test Scenarios Optimization Using Fractional-SMO in Object-Oriented Systems
A meta-heuristic technique named Fractional-SMO is applied for selecting the optimal test scenarios with respect to fitness measures such as fault and coverage. The proposed model is implemented by synthesizing five UML diagrams of different case studies. The devised Fractional-SMO (Fractional Spider Monkey Optimization) is also compared with existing techniques to determine the effectiveness of the approach.


Fig. 3.8 Coverage analysis of different case studies

Table 3.4 Coverage analysis of case studies using our proposed approach

Sl. no | Case study | Total number of nodes | Total number of edges | Total number of decisions | % of Coverage
1 | BILIS | 34 | 33 | 3 | 100
2 | ATMWS | 24 | 23 | 4 | 98.63
3 | ULS | 22 | 21 | 3 | 99.33
4 | OSS | 46 | 45 | 5 | 97.23
5 | OTRS | 43 | 42 | 8 | 98.5

4.1 Proposed Approach
In this section, we propose a tactic for test scenario optimization using Fractional-SMO. Figure 4.1 depicts the proposed model of our approach. The following steps are followed for the execution of the proposed method. The proposed approach is divided into three major steps: conversion of the diagram into a graph, generation of test case sequences, and optimization of the generated test sequences. In the proposed technique, the behavioral UML diagram of a model is taken as input for test case optimization, and it is then converted into a graph based on the activities performed by the different associated objects. The proposed approach yields superior performance and maximum coverage with a minimum count of test cases, along with less cost. The UML diagrams utilized in this technique determine the structure and behavior of the devised methodology. The optimal test cases are derived with the help of the Fractional-SMO algorithm.


Fig. 4.1 Proposed fractional-SMO based test case optimization scheme

4.1.1 Proposed Fractional-SMO Based Test Case Optimization with UML Diagrams

A detailed description of the proposed test case optimization technique using Fractional-SMO is given in this section. The behavioral UML diagrams are converted into a graph: the desired software system is modeled with different UML diagrams, which are subsequently passed to the Rational Software Architect (RSA) tool that generates the XMI code. The XMI code is used to generate the desired graph. A Java parser is used to traverse the different nodes and edges and record all potential test sequences from the graph. After test sequence generation, the test sequences covering the most visited nodes are selected and given as input to the proposed Fractional-SMO, which also considers fitness factors such as fault and coverage. Here, the Fractional-SMO is developed by adapting SMO [12] using the concept of fractional calculus [23]. The proposed Fractional-SMO chooses the most appropriate test cases that pose maximal coverage. Figure 4.1 portrays the schematic representation of the proposed Fractional-SMO.

4.1.2 Graph Generation Using the Case Study

The primary process that has to be executed in the process of test case optimization is the construction of a graph, wherein graphs are created by allocating the nodes, weights, as well as edges using the UML diagrams. Assume G represents the graph obtained from the UML diagram comprising of edges as well as vertices, which is expressed by,

G = {V, T};  (1 ≤ V ≤ P);  (1 ≤ T ≤ U)    (4.1)

Here, the terms U and P indicate the overall count of the edges and vertices, respectively. The weights of edges in the graph depend upon different parameters, like time consumed, cost of labor, etc.

4.1.3 Generation of Test Case Sequence

The test case sequences generated are expressed in Eq. (4.2):

S = {S_1, S_2, ..., S_e, ..., S_u}    (4.2)

Here, u corresponds to the overall count of sequences and S_e indicates the eth sequence.

4.1.4 Optimal Test Cases Selection with the Devised Fractional-SMO Algorithm

The introduced Fractional-SMO, which is used for selecting the optimal sequences, is elaborated in this section. The Fractional-SMO algorithm is created by adapting the SMO algorithm [12] in accordance with fractional calculus [23], thereby bringing the benefits of fractional theory into SMO. SMO is a swarm-intelligence algorithm inspired by the fission–fusion social behavior of spider monkeys. The spider monkeys split into multiple groups and forage in the home area of the bigger group. A female monkey guides the group and is accountable for locating the sources of food. If the leader is unsuccessful in locating food, the group is again partitioned into small foraging groups which in turn search for food sources separately. Communication is carried out inside as well as outside of the group by the members of the subgroups, depending on the availability of food. The SMO algorithm is highly effective in handling issues such as stagnation or premature convergence. On the other hand, fractional calculus [23] contributes smoother variation and a longer memory effect to the position updates. Determination of the optimal solution is performed by computing the fitness function, in which the solution with the minimum fitness value is considered the best solution. The following expression is used to compute the fitness value:

Fitness = (1/|S|) Σ_{j=1}^{|S|} [(1 − C) + f(j)]    (4.3)

Here, C denotes the coverage, |S| designates the overall count of sequences and f ( j) denotes the fitness function depending on the sequences.
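Eq. (4.3) translates directly into a few lines of Python; the coverage value and per-sequence scores below are illustrative placeholders, and coverage is treated here as a fraction in [0, 1].

```python
def fitness(coverage, seq_scores):
    """Eq. (4.3): mean over all selected sequences of (1 - C) + f(j).
    coverage is C; seq_scores holds f(j) for each sequence.  Lower values are
    better, so high coverage and low per-sequence scores are rewarded."""
    s = len(seq_scores)
    return sum((1.0 - coverage) + fj for fj in seq_scores) / s

# Illustrative call: 90% coverage, three selected sequences with scores f(j).
print(fitness(0.90, [0.2, 0.4, 0.1]))   # -> 0.333...
```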


Algorithmic steps
The following section details the algorithmic steps that are employed for performing the devised Fractional-SMO algorithm.
(i) Initialization
The primary step is to initialize the population of spider monkeys, which is represented as X and is given by

X = {X_st};  1 ≤ s ≤ b;  1 ≤ t ≤ u    (4.4)

Here, b indicates the total number of spider monkeys.
(ii) Fitness function computation
The value of fitness is computed for all solutions with the help of Eq. (4.3). If the fitness value of the newly found location is better than that of the previous location, then the location of the spider monkey is updated. The best solution is obtained by considering the spider monkey with the lowest value of fitness.
(iii) Local Leader Phase (LLP) location update
The spider monkeys update their present position in accordance with the group leader and the local leader, which can be represented as

X_uv^(τ+1) = X_uv^τ + U(0, 1) × (EE_gv − X_uv^τ) + U(−1, 1) × (X_hv^τ − X_uv^τ)    (4.5)

X_uv^(τ+1) − X_uv^τ = U(0, 1) × (EE_gv − X_uv^τ) + U(−1, 1) × (X_hv^τ − X_uv^τ)    (4.6)

wherein the LHS of Eq. (4.6) is treated as the discrete form of a derivative whose order is denoted as β, which gives

D^β[X_uv^(τ+1)] = U(0, 1) × (EE_gv − X_uv^τ) + U(−1, 1) × (X_hv^τ − X_uv^τ)    (4.7)

Here U(0, 1) designates an arbitrary number with values between 0 and 1, and X_hv^τ and X_uv^(τ+1) designate the vth dimension of the hth and uth spider monkey, respectively. The fractional calculus is employed for smoother variation as well as an elongated memory effect. For analyzing the behavior, a simulation setup is utilized considering different values of β from 0 to 1. Hence, the first four terms of the fractional derivative are taken into account, which is represented as

X_uv^(τ+1) − β X_uv^τ − (1/2) β X_uv^(τ−1) − (1/6)(1 − β) X_uv^(τ−2) − (1/24) β(1 − β)(2 − β) X_uv^(τ−3)
    = U(0, 1) × (EE_gv − X_uv^τ) + U(−1, 1) × (X_hv^τ − X_uv^τ)    (4.8)
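A sketch of the fractional LLP update of Eq. (4.8), solved for the new position, is given below; the array shapes, the choice of β, and the bookkeeping of the last three positions are illustrative assumptions rather than the authors' MATLAB implementation.

```python
import numpy as np

def fractional_llp_update(history, local_leader, partner, beta=0.6, rng=None):
    """One Local Leader Phase step of Fractional-SMO, i.e. Eq. (4.8) solved for
    X at iteration tau+1.  history = [X_t, X_t1, X_t2, X_t3] holds the monkey's
    last four positions (1-D arrays); local_leader is EE_g, partner is X_h."""
    rng = rng or np.random.default_rng()
    x_t, x_t1, x_t2, x_t3 = history
    u1 = rng.uniform(0.0, 1.0, size=x_t.shape)
    u2 = rng.uniform(-1.0, 1.0, size=x_t.shape)
    # Fractional memory terms: the first four terms of the order-beta derivative.
    memory = (beta * x_t
              + 0.5 * beta * x_t1
              + (1.0 / 6.0) * (1.0 - beta) * x_t2
              + (1.0 / 24.0) * beta * (1.0 - beta) * (2.0 - beta) * x_t3)
    return memory + u1 * (local_leader - x_t) + u2 * (partner - x_t)
```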


(iv) Global Leader Phase (GLP) location update
After the LLP is completed, the GLP is performed, wherein the spider monkeys update their positions depending on the global as well as the local leaders; the following equation represents the position update in the GLP phase:

X_uv^(τ+1) = X_uv^τ + U(0, 1) × (KE_gv − X_uv^τ) + U(−1, 1) × (X_hv^τ − X_uv^τ)    (4.9)

Here, KE_gv indicates the global leader position at the vth dimension, wherein the index v ∈ {1, 2, ..., N} is chosen arbitrarily. The position of the spider monkeys is updated in this phase based on the probability G_u, which is computed with respect to the fitness function using the expression

G_u = 0.9 × (Υ_u / max Υ) + 0.1    (4.10)

Here, the highest fitness value of the group is represented by max Υ and Υ_u specifies the fitness of the uth spider monkey.
(v) Global Leader Learning (GLL) phase
In the GLL phase, the location of the spider monkey is updated by considering greedy selection, and the global leader is confirmed by altering its position. If no position update is performed, then the global limit count R is incremented by 1.
(vi) Local Leader Learning (LLL) phase
The position of the local leader is updated by considering the spider monkey in the population X with the optimal fitness value, using greedy selection. Once the update is completed, the obtained position is compared with the earlier position. If no update is carried out in the position of the local leader, then the local limit count R is incremented by 1.
(vii) Local Leader Decision (LLD) phase
If the position of the local leader is not updated within the threshold referred to as the local leader limit, the positions of the whole group are updated by combining the local and global leaders or by arbitrary initialization, depending on d. This process is represented by

X_uv^(τ+1) = X_uv^τ + U(0, 1) × (KE_gv − X_uv^τ) + U(0, 1) × (X_gv^τ − EE_gv)    (4.11)

(viii) Global Leader Decision (GLD) phase In this phase, the global leader location is analyzed. If the position of the global leader is not updated till the preset iteration count called the global leader limit, the group will be partitioned by the global leader into two subgroups. The entire population


Table 4.1 Explanation of UML diagrams of different case studies

UML diagrams | Nodes | Connections
1 | 11 | 56
2 | 21 | 203
3 | 32 | 475
4 | 41 | 791
5 | 51 | 1272

is split into two groups initially; later it is divided into three units, and the process of partitioning continues until the maximum number of groups, I, is reached. Each time, the local leader is selected in the GLD phase by using the LLL procedure. In case the global leader's location is not updated even after reaching the maximal group count, the global leader unites all the subgroups into a single group. Thus, the processes of fission and fusion are utilized by the Fractional-SMO in achieving the best solution.
(ix) Solution feasibility evaluation
The feasibility of the obtained solution is evaluated considering the objective function. If the recently attained solution is better than the earlier one, then the new one replaces the old one, thus enhancing the performance of the approach.
(x) Termination
The above process is re-iterated until the maximal iteration count is reached or the global optimal solution is achieved.

4.2 Results and Relative Study
4.2.1 Experimental Arrangement

The developed Fractional-SMO is implemented in MATLAB. The technique is evaluated for its effectiveness with the help of UML diagrams, which are generated synthetically. For the analysis, five different case studies, namely the User login system, ATM withdrawal system, Library management system, Temperature warning system and Online E-Commerce system, are taken into account and labeled as UML diagrams 1, 2, 3, 4, and 5, respectively. These diagrams comprise multiple connections and nodes, wherein the overall count of edges and nodes varies from one UML diagram to another, as displayed in Table 4.1.

4.2.2 Comparative Methods

The effectiveness of the developed approach is examined by performing its comparison with other optimization techniques, like SMO proposed by Bansal et al. [12],


Particle swarm optimization (PSO) proposed by Wang et al. [24], Hybrid Bee Colony proposed by Sahoo et al. [25], and Cuckoo Search Algorithm proposed by Srivastava et al. [11].

4.3 Comparative Assessment Using User Login System Case Study The performance of the devised Fractional-SMO algorithm for test case optimization is examined by considering coverage and test case count using User login system. (i) Assessment using coverage The devised Fractional-SMO algorithm approach is assessed with respect to coverage in this section and the same is depicted in Fig. 4.2 by considering different iterations. When 150 iterations are considered, the devised approach attained a value of coverage at 49, whereas the conventional techniques attain a value of coverage at 45 for SMO, 12 for PSO, 2 for Hybrid Bee Colony and 45 for Cuckoo Search. When 300 iterations are considered, the developed Fractional-SMO computed coverage at 49, and the prevailing techniques, such as SMO, PSO, Hybrid Bee Colony and Cuckoo Search calculated coverage values of 45, 24, 12, and 45. (ii) Assessment based on test case count The assessment of the introduced Fractional-SMO by considering the test case count is depicted in Fig. 4.3. The analysis is carried out by considering various iterations. For 10 iterations, the test cases count of the introduced Fractional-SMO, and the prevailing SMO, PSO, Hybrid Bee Colony and Cuckoo Search is 3050, 3576, 4396, 4396, and 4188. When the iteration count is increased to 150, the test case count attained by the existing schemes is 2563 for Cuckoo Search, 3272 for SMO, 4173 for PSO, 4396 for Hybrid Bee Colony and 2652 for the introduced Fractional-SMO (Table 4.2).

5 SMPSO: Spider Monkey Particle Swarm Optimization for Optimal Test Case Generation in Object-Oriented System
This section proposes an effective hybrid approach called spider monkey particle swarm optimization (SMPSO) to optimize the test cases created from the developed model. The designed algorithm efficiently creates the optimal test cases from UML behavioral models by using a control flow graph. The proposed model obtained a coverage of more than 87% and a maximum number of generated test sequences compared to other existing work.


Fig. 4.2 Comparative assessment of the Fractional-SMO based on coverage

Fig. 4.3 Comparative analysis of the devised Fractional-SMO based on test case count

Table 4.2 Assessment of the proposed Fractional-SMO technique

Techniques | Number of test cases | Coverage
SMO | 3384 | 48
PSO | 4379 | 47
Hybrid Bee Colony | 4396 | 32
Cuckoo search | 4045 | 45
Devised fractional-SMO | 2562 | 49

5.1 Proposed Approach (SMPSO)
In this section, we propose an approach for test case optimization using spider monkey particle swarm optimization (SMPSO). This algorithm is newly devised through the integration of Spider Monkey Optimization (SMO) and Particle Swarm Optimization (PSO). The proposed methodology for test case generation is illustrated in Fig. 5.1. Solution encoding: the solution vector comprises n test sequences, such that each sequence consists of k test cases chosen from among the best s sequences selected; the solution vector is therefore characterized by [s < n < k]. For instance, if the first element is 5, then the successive five sequences are selected; if the first component is four, then the successive four sequences are chosen. The following steps are followed in executing the proposed method.

Fig. 5.1 Illustration of proposed SMPSO for optimal test case generation

5.1.1 Process Flow UML Diagram Creation from PNML Data

D = {L_1, L_2, ..., L_i, ..., L_m};  1 ≤ i ≤ m    (5.1)

The PNML data L_i are used to create a process flow UML representation, where D denotes the database of PNML files. The fundamental elements used to construct the process flow UML diagram are the activities connected to it and the series of actions coordinated on these activities. UML representations consist of three diverse groups, namely the class, sequence, and activity descriptions. The core idea of a UML representation is the objects and their categories, which are utilized to describe physical elements and logical ideas. Typically, the rules are classified into semantic, syntactic, and pragmatic.

5.1.2 Construction of Control Flow Graph

The UML representation of the model is converted into a control flow graph. The weights of the edges are assigned according to predefined criteria based on the in-degree and out-degree of the nodes of the graph. The higher the weight, the higher the activity, i.e., the cost and time involved in performing the activity.

5.1.3 Generation of Test Case Sequences

Let us represent the test case sequences using the equation

Q = {Q_1, Q_2, ..., Q_l, ..., Q_v}    (5.2)

Here, Q represents the set of test case sequences, Q_l denotes the lth sequence and v specifies the overall number of sequences.

5.2 Proposed SMPSO
After the creation of the test sequences, it is necessary to determine the best test cases, and this is achieved by exploiting the designed SMPSO. The designed SMPSO is devised by incorporating SMO [12] and PSO [24]. This part describes the algorithmic steps of the designed model with the upgraded solution. The upgraded solution of the developed algorithm is devised by upgrading the standard expression of SMO with that of PSO, in terms of the particle location and the movement vector. The fitness measure is estimated as follows,


F = (1/|K|) Σ_{h=1}^{|K|} [(1 − R) + q(h) + K]    (5.3)

Here, K specifies the total number of test cases chosen, which represents the dimension of a test suite, and R implies the coverage.

q(h) = Σ (W_dy × W_st)    (5.4)

W_dy represents the dynamic weight, and W_st the static weight. The dynamic weight is employed to discriminate the edges between two nodes in diverse sequences. It is formulated as

W_dy(B_l to B_(l+1)) = I(B_(l+2)) + O(B_l)    (5.5)

where I(B_(l+2)) represents the in-degree of B_(l+2), O(B_l) the out-degree of B_l, and B_l, B_(l+1), and B_(l+2) signify three successive nodes in the route. The static weight is allocated to each individual edge by the user at the testing module and remains a fixed value.
(c) Algorithmic procedure of the designed SMPSO
The procedure of the proposed SMPSO is as follows:
(i) Initialization: Let us consider the swarm dimension C, in which each particle has a location vector in the N-dimensional search area, specified as Q_r = (Z_r1, Z_r2, ..., Z_rk, ..., Z_rN), and a movement vector G_r = (x_r1, x_r2, ..., x_rk, ..., x_rN). The individual best location is represented as λ_r = (y_r1, y_r2, ..., y_rk, ..., y_rN) and the best location of the swarm as λ_d = (y_d1, y_d2, ..., y_dk, ..., y_dN).
(ii) Evaluate objective parameter: It is the measure utilized to determine the best solution for creating the test cases more effectively; the expression used to formulate the fitness parameter is given in Eq. (5.3).
(iii) Update solution: The location of the swarm from PSO is given as

Z_rk(g + 1) = Z_rk(g) + x_rk(g + 1)    (5.6)

Accordingly, the expression of SMO is represented as

z_u = [Z_rk(g + 1) − Z_rk(g)(1 − β(0, 1) − β(−1, 1)) − β(−1, 1) Z_jk(g)] / β(0, 1)    (5.7)
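One way to read the hybrid update of Eqs. (5.6)–(5.7) is sketched below; the PSO coefficients (inertia w and acceleration constants c1, c2), the random draws, and the bounding of the U(0, 1) factor away from zero are illustrative assumptions, since the chapter does not list its parameter settings.

```python
import numpy as np

rng = np.random.default_rng(7)

def smpso_step(z, v, p_best, g_best, z_partner, w=0.7, c1=1.5, c2=1.5):
    """One SMPSO step: PSO movement and position update (Eq. 5.6), followed by
    the rearranged SMO expression (Eq. 5.7) evaluated at the new position.
    z, v: current position and movement vectors; p_best, g_best: personal and
    swarm best positions; z_partner: a randomly chosen partner monkey Z_j."""
    r1, r2 = rng.uniform(size=z.shape), rng.uniform(size=z.shape)
    v_new = w * v + c1 * r1 * (p_best - z) + c2 * r2 * (g_best - z)   # movement
    z_new = z + v_new                                                 # Eq. (5.6)
    b1 = rng.uniform(0.1, 1.0)    # U(0,1), bounded away from 0 in this sketch
    b2 = rng.uniform(-1.0, 1.0)   # U(-1,1)
    z_u = (z_new - z * (1.0 - b1 - b2) - b2 * z_partner) / b1         # Eq. (5.7)
    return z_new, v_new, z_u
```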

(iv) Evaluating feasibility: The objective of individual solution is determined, such that the solution that has low fitness measure is accepted as the optimal solution.

Table 5.1 Explanation of UML representation

UML diagrams | Vertices | Edges
1 | 42 | 101
2 | 44 | 115
3 | 54 | 162
4 | 72 | 195
5 | 74 | 215

(v) Termination: The aforementioned phases are continued until an optimal solution is achieved.

5.3 Results and Comparative Analysis
This section enumerates the simulation results of the proposed SMPSO technique with respect to the evaluation indicators for diverse UML diagrams. The introduced strategy is implemented in Python. Test generation in the object-oriented model is simulated using the ICPM dataset [26]. The implementation is carried out with UML representations created from PNML data. Five different UML representations are considered here, and they are described in Table 5.1.

5.4 Competing Techniques The performance of the anticipated SMPSO approach is analyzed and compared with the conventional models, like Dolphin echolocation (DE) proposed by Lohmor et al. [27], Ant Colony Optimization (ACO) proposed by Li et al. [28], Cuckoo Search (CS) proposed by Srivastava et al. [11], Hybrid Bee Colony proposed by Sahoo et al. [25], PSO proposed by Wang et al. [24], SMO proposed by Bansal et al. [12], and Fractional-SMO proposed by Panigrahi et al. [3].

5.5 Analysis Based on the Case Study of the Online-Trading System Figure 5.2 shows the assessment of devised SMPSO with respect to coverage. By maximizing the iteration to 500, coverage achieved by classical models, such as DE is 26, ACO is 8, CS is 20, hybrid bee colony is 37, PSO is 29, SMO is 33, and Fractional-SMO is 38. However, the devised SMPSO technique has attained the


coverage measure of 40. Figure 5.3 depicts the assessment of the designed approach with respect to the test case count. On increasing the iterations to 500, the test case count attained by the existing models is 46,431 for DE, 20,635 for ACO, 62,884 for CS, 25,726 for hybrid bee colony, 57,356 for PSO, 26,737 for SMO, and 19,299 for fractional-SMO, respectively, whereas the proposed technique achieved a test case count of 10,079.

Fig. 5.2 Analysis based on coverage

Fig. 5.3 Analysis based on test case


5.6 Comparative Discussion
Table 5.2 shows the comparative discussion of the designed SMPSO. From the discussion, it is evident that the designed SMPSO model has attained a maximum coverage of 70 and a test case count of 82,413 for UML diagram-5. The SMPSO algorithm thus provides the optimal solution when considering UML diagram-5.

6 Conclusion and Future Work In this section, we briefly summarize the contributions of our research and future work.

6.1 Generation of Test Scenarios Using Combined Object-Oriented Models In our first contribution, we have offered a model-based testing approach to generate the test scenarios from a combined model that consists of sequence and state machine diagrams. Firstly, the sequence and state machine diagrams of a case study are drawn using StarUML. Subsequently, the diagrams are converted to their corresponding control flow graphs IG and SMG respectively. The two graphs are combined together to generate a new graph named as SSIG. The test scenarios are generated from this combined SSIG.

6.2 Test Scenarios Optimization Using Fractional-SMO in Object-Oriented Systems In the second contribution, a method is designed for test sequence optimization by exploiting behavioral UML representations. The sequences that enclose all the test probabilities are chosen using the proposed Fractional-SMO. The developed Fractional-SMO is designed by consolidating the Fractional calculus with SMO for an optimal solution. Hence, the optimal test sequences are selected depending on the algorithm that uses objective factors, such as coverage and test case count.

Table 5.2 Comparative discussion

UML diagram | Metric | DE | ACO | CS | Hybrid Bee colony | PSO | SMO | Fractional-SMO | Proposed SMPSO
1 | Coverage | 6 | 5 | 12 | 25 | 21 | 33 | 36 | 38
1 | Test cases count | 75,367 | 40,257 | 75,218 | 70,123 | 66,312 | 30,741 | 29,364 | 15,416
2 | Coverage | 13 | 5 | 15 | 1 | 8 | 24 | 38 | 40
2 | Test cases count | 67,511 | 46,347 | 71,377 | 67,263 | 65,607 | 48,590 | 36,452 | 29,949
3 | Coverage | 45 | 2 | 17 | 45 | 49 | 12 | 50 | 52
3 | Test cases count | 198,122 | 77,428 | 215,551 | 216,031 | 155,810 | 121,646 | 50,023 | 34,147
4 | Coverage | 2 | 17 | 44 | 1 | 25 | 45 | 49 | 60
4 | Test cases count | 287,123 | 168,123 | 300,137 | 200,261 | 210,364 | 51,421 | 50,073 | 49,994
5 | Coverage | 45 | 4 | 1 | 4 | 24 | 43 | 65 | 70
5 | Test cases count | 253,322 | 229,873 | 102,233 | 203,798 | 253,826 | 120,003 | 99,472 | 82,413


6.3 Spider Monkey Particle Swarm Optimization for Optimal Test Case Generation in OO System Finally, we have presented a hybrid method called spider monkey particle swarm optimization (SMPSO) for optimal test sequence generation in an Object-Oriented model. The hybridization of PSO and SMO helped to overcome some of the issues associated with the existing research. In order to speed up computing, hybridization would help to expand the solution search space with each iteration. The test sequence generation validates the consistency of the application over improved test coverage.

6.4 Future Scope
The developed combined object-oriented model is a semi-automatic approach; a fully automatic approach can therefore be taken up as future work. By prioritizing the test scenarios, one can also reduce some of the still unexplored redundant scenarios, which can be taken as another line of future work. Test case minimization, with the aim of identifying the minimal test suite satisfying the requirements, is a well-known NP-hard problem and could be further analyzed. Other meta-heuristic algorithms like GA, DE, ACO, etc., can be used to optimize the test sequences without sacrificing the performance of the system under test.

References 1. Anand, S., Burke, E.K., Chen, T.Y., Clark, J., Cohen, M.B., Grieskamp, W., Harman, M., Harrold, M.J., McMinn, P., Bertolino, A., et al.: An orchestrated survey of methodologies for automated software test case generation. J. Syst. Softw. 86(8), 1978–2001 (2013) 2. Potts, C.: Software-engineering research revisited. IEEE Softw. 10(5), 19–28 (1993) 3. Panigrahi, S.S., Jena, A.K.: Optimization of test cases in object-oriented systems using fractional-smo. Int. J. Open Sour. Softw. Proc. (IJOSSP) 12(1), 41–59 (2021) 4. Baluda, M., Braione, P., Denaro, G., Pezzè, M.: Enhancing structural software coverage by incrementally computing branch executability. Softw. Qual. J. 19(4), 725–751 (2011) 5. Pandita, R., Xie, T., Tillmann, N., De Halleux, J.: Guided test generation for coverage criteria. In: 2010 IEEE International Conference on Software Maintenance, pp. 1–10. IEEE (2010) 6. Zhang, C., Duan, Z., Yu, B., Tian, C., Ding, M.: A test case generation approach based on sequence diagram and automata models. Chin. J. Electron. 25(2), 234–240 (2016) 7. Khandai, M., Acharya, A.A., Mohapatra, D.P.: A novel approach of test case generation for concurrent systems using UML sequence diagram. In: 2011 3rd International Conference on Electronics Computer Technology, vol. 1, pp. 157–161. IEEE (2011) 8. Pradhan, S., Ray, M., Swain, S.K.: Transition coverage based test case generation from state chart diagram. J. King Saud. Univ.-Comput. Inf. Sci. (2019) 9. Khurana, N., Chhillar, R.S., Chhillar, U.: A novel technique for generation and optimization of test cases using use case, sequence, activity diagram and genetic algorithm. J. Softw. 11(3), 242–250 (2016)


10. Arora, V., Bhatia, R., Singh, M.: Synthesizing test scenarios in UML activity diagram using a bio-inspired approach. Comput. Lang. Syst. Struct. 50, 1–19 (2017) 11. Srivastava, P.R., Sravya, C., Ashima, K., S., and Lakshmi, M.: Test sequence optimisation: an intelligent approach via cuckoo search. Int. J. Bio-Inspir. Comput. 4(3), 139–148 (2012) 12. Bansal, J.C., Sharma, H., Jadon, S.S., Clerc, M.: Spider monkey optimization algorithm for numerical optimization. Memet. Comput. 6(1), 31–47 (2014) 13. Kamonsantiroj, S., Pipanmaekaporn, L., Lorpunmanee, S.: A memorization approach for test case generation in concurrent UML activity diagram. In: Proceedings of the 2019 2nd International Conference on Geoinformatics and Data Analysis, pp. 20–25 14. Kamath, P., Narendra, V.: Generation of test cases from behavior model in UML. Int. J. Appl. Eng. Res. 13(17), 13178–13187 (2018) 15. Minj, J., Belchanden, L.: Path oriented test case generation for UML state diagram using genetic algorithm. Int J. Comput. Appl. 82(7) (2013) 16. Swain, R.K., Behera, P.K., Mohapatra, D.P.: Minimal testcase generation for object-oriented software with state charts (2012). arXiv:1208.2265 17. Arora, P.K., Bhatia, R.: Mobile agent-based regression test case generation using model and formal specifications. IET Softw. 12(1), 30–40 (2018) 18. Mani, P., Prasanna, M.: Test case generation for embedded system software using UML interaction diagram. J. Eng. Sci. Technol. 12(4), 860–874 (2017) 19. Arora, P.K., Bhatia, R.: Agent-based regression test case generation using class diagram, use cases and activity diagram. Procedia Comput. Sci. 125, 747–753 (2018) 20. Shah, S.A.A., Shahzad, R.K., Bukhari, S.S.A., Humayun, M.: Automated test case generation using UML class & sequence diagram. British J. Appl. Sci. Technol. 15(3) (2016) 21. Hooda, I., Chhillar, R.: Test case optimization and redundancy reduction using ga and neural networks. Int. J. Electr. Comput. Eng. 8(6), 5449 (2018) 22. Hashim, N.L., Dawood, Y.S.: Test case minimization applying firefly algorithm. Int. J. Adv. Sci. Eng. Inf. Technol. 8(4–2), 1777–1783 (2018) 23. Bhaladhare, P.R., Jinwala, D.C.: A clustering approach for the-diversity model in privacy preserving data mining using fractional calculus-bacterial foraging optimization algorithm. Adv. Comput. Eng. (2014) 24. Wang, D., Tan, D., Liu, L.: Particle swarm optimization algorithm: an overview. Soft. Comput. 22(2), 387–408 (2018) 25. Sahoo, R.K., Nanda, S.K., Mohapatra, D.P., Patra, M.R.: Model driven test case optimization of UML combinational diagrams using hybrid bee colony algorithm. Int. J. Intell. Syst. Appl. 9(6) (2017) 26. ICPM dataset taken from (2022). https://icpmconference.org/2020/process-discovery-contest/ downloads/. Aaccessed June 2022 27. Lohmor, S., Sagar, B.: Estimating the parameters of software reliability growth models using hybrid deo-ann algorithm. Int. J. Enterp. Netw. Manag. 8(3), 247–269 (2017) 28. Li, K., Zhang, Z., Liu, W.: Automatic test data generation based on ant colony optimization. In: 2009 Fifth International Conference on Natural Computation, vol. 6, pp. 216–220. IEEE (2009) 29. Panigrahi, S.S., Shaurya, S., Das, P., Swain, A.K., Jena, A.K.: Test scenarios generation using UML sequence diagram. In: 2018 International Conference on Information Technology (ICIT), pp. 50–56. IEEE (2018) 30. Panigrahi, S.S., Jena, A.K.: Test scenarios generation using combined object-oriented models. In: Automated Software Engineering: a Deep Learning-Based Approach, pp. 55–71. 
Springer, Cham (2020) 31. Panigrahi, S.S., Sahoo, P.K., Sahu, B.P., Panigrahi, A., Jena, A.K.: Model-driven automatic paths generation and test case optimization using hybrid FA-BC. In: 2021 International Conference on Emerging Smart Computing and Informatics (ESCI), pp. 263–268. IEEE (2021) 32. Panigrahi, S.S., Jena, A.K.: Spider monkey particle swarm optimization (SMPSO) with coverage criteria for optimal test case generation in object-oriented systems. Int. J. Open Sour. Softw. Proc. (IJOSSP) 13(1), 1–20 (2022)


33. Jena, A.K., Swain, S.K., Mohapatra, D.P.: A novel approach for test case generation from UML activity diagram. In: 2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), pp. 621–629. IEEE (2014) 34. Jena, A.K., Swain, S.K., Mohapatra, D.P.: Test case creation from UML sequence diagram: a soft computing approach. In: Intelligent Computing, Communication and Devices, pp. 117– 126. Springer, New Delhi (2015) 35. Jena, A.K., Swain, S.K., Mohapatra, D.P.: Model based test case generation from UML sequence and interaction overview diagrams. In: Computational Intelligence in Data Mining, vol. 2, pp. 247–257. Springer, New Delhi (2015)

Logical Interpretation of Omissive Implicature
Alfonso Garcés-Báez and Aurelio López-López

Abstract Implicature is a linguistic concept allowing inferences about what is said during interaction. However, it differs from an implication in that it does not involve a definition or truth tables in a logic. In particular, an omissive implicature leads to inferences about what is omitted or not said. Omission in linguistic terms points to the intention of remaining silent about something for whatever reason; that is, omission is the word that is not uttered. In this research, a semantics was formulated to explain omission in testimonies, as well as in the context of dialogues, where its role is common. For testimonies, we achieved a logic-based knowledge representation allowing reasoning through Answer Set Programming, which made it possible to generate models illustrating the implications of silence in several logical-linguistic puzzles. Puzzles were taken as a case study given that they state, in simple or everyday language, common situations requiring the use of arithmetic, geometry, or logic for their solution. In dialogues, a procedure was developed to make decisions, based on answers and a record (knowledge base) of the occurrences of omission, while maintaining the communication process. The procedure was oriented to psychotherapy interviews, where the Beck Inventory was extended to include silence, to assess the degree of depression of a person.

A. Garcés-Báez (B) · A. López-López
Computational Sciences Department, Instituto Nacional de Astrofísica, Óptica y Electrónica, Sta. Ma. Tonantzintla, Puebla, México
e-mail: [email protected]
A. López-López
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. R. Dash et al. (eds.), Intelligent Technologies: Concepts, Applications, and Future Directions, Volume 2, Studies in Computational Intelligence 1098, https://doi.org/10.1007/978-981-99-1482-1_4

1 Introduction
Silence can have different meanings in specific contexts. For example, in some communities, such as that of the North American Indians of the Apache reservation, a kind of quarantine of silence is maintained for those who return to the community after having been outside. There is an intimate relationship between silence and music;



it has been said that music expresses what cannot be said with words but about which it is impossible to remain silent (Victor Hugo). The music is not in the notes, but in the silence between them (W. Amadeus Mozart). After silence, that which comes closest to expressing the inexpressible is music (Aldous Huxley). Quiet people have the loudest minds (Stephen Hawking). Norms and social distance influence the interpretation of silence; as far as we know, in Japanese society no inferences are made from condescending silence, known in ours as "he who is silent grants". Silence can also be frightening; as Pascal said: The silence of infinite spaces frightens me. Here is a fragment of a text that narrates the shipwreck of the ship El Tritón [16]: …But there it was: there were the throatless howls of the cyclone. The radio operator leaned gently toward the set. His voice was suddenly flat, professional. -Veracruz. Veracruz. Veracruz. Over! They responded, from who knows what point, from who knows what corner of the cosmos, some inhuman screams, throats slashed, a dentist's electric drill, dogs with hydrophobia, snoring, someone scraping glass with sand. The operator pushed the lever. SILENCE. - There's a lot of static. They don't hear me," he said calmly. He wiped his sweaty hands on his legs. - Are you afraid?—asked the boatswain, without knowing why he was asking this question. Perhaps because of the hands soaked in sweat. The telegraph operator smiled. "Yes," he replied with the same calm. He leaned over the apparatus again: - Veracruz! Veracruz! Veracruz! …

Silence is the sign of a mysterious message whose apparent emptiness feeds on the reality of those who live it and devours their temporal space, far, far away from the possibility of being occupied by words. As Steiner asks: How can speech justly convey the form and vitality of silence? The silence was before the word. Man, for Aristotle (384 BC–322 BC), is the being of the word. How did the word reach man? It is something that, as Socrates warns in the Cratylus, is an enigma; it is not a question whose sure answer is within the reach of humans. Will man also be the being of silence as he is of the word? Whereof we cannot speak, thereof we must remain silent, said Wittgenstein. Language can only meaningfully deal with a particular and restricted segment of reality. The rest—and, presumably, most of it—is silence. Most conceptualizations of silence treat it as a relatively passive behavior. However, not every manifestation of silence represents passive behavior, nor is silence simply the opposite of voice. Speech and silence are two dialectical ingredients for achieving effective communication.

1.1 Problem Statement and Hypothesis Omission or intentional silence is a phenomenon barely studied from a computational point of view whose interpretation can benefit communication processes, particularly in the interaction during the dialogue, and can help decision-making.


Fig. 1 Some contexts where silence appears

Problem Statement: The problem consists of automating the interpretation of the omission or intentional silence in written interactions under the contexts of testimonies and dialogue to make inferences without breaking the communication. Hypothesis: The logical interpretation of the omission implicature contributes elements to the communicative process and helps decision-making.

2 Theoretical Basis
Now, we detail some definitions for implicature and omission, also considering concepts provided previously. Given that everybody regularly resorts to silence or omission, the possible interpretations can multiply; nevertheless, researchers have tried to fully understand the meaning of silence, sharing their advances. Given that silence is a human behavior, it appears in music, art, philosophy, literature, architecture, and a wide variety of disciplines (Fig. 1). In this section, we also present some concepts around silence that are useful for our research.


2.1 Implicature Our definition of omission implicature is based on Grice’s definition of conversational implicature, formalized in the Stanford Encyclopedia of Philosophy [20].

2.2 Answer Set Programming
Answer Set Programming (ASP) is a computing approach in which different computational problems are formulated as logic programs whose answer sets constitute the solutions. This paradigm has been used to solve diverse tasks, from configuring computer systems or programming decision support systems for the space shuttle to tackling problems that arise in linguistics and bioinformatics [13]. ASP builds on deductive databases, logic programming, knowledge representation, and satisfiability testing. The model generation-based approach is: 1. Give a representation of the problem at hand. 2. Reach a solution by computing a model of the given representation.
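The models referred to in step 2 are obtained in this work with the clingo solver (see the command used later in the testimony-analysis strategy and the program in Appendix 8). A minimal Python sketch of driving that workflow is shown below, assuming only that the clingo executable is on the PATH and that the knowledge base is in a file such as kb.pl.

```python
import subprocess

def answer_sets(program_path):
    """Run clingo on a logic program and return its raw output; the argument
    '0' asks clingo to enumerate every answer set (model) of the program."""
    result = subprocess.run(["clingo", "0", program_path],
                            capture_output=True, text=True)
    return result.stdout

# Example: enumerate the models of a knowledge base such as kb.pl.
print(answer_sets("kb.pl"))
```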

3 Definitions and Methodology
3.1 Definitions
The definitions of the Cooperative Principle and of Conversational Implicature are those of Grice [11]. Formalizing the concept of Grice, we have:
Definition 4.1 Says(X, Y, T|F) states that agent X asserts that predicate Y is either True (T) or False (F).
The Stanford Encyclopedia of Philosophy [20] includes a formal definition of implicature; this, the natural-language definition of omissive implicature, and our formal definition of omissive implicature can be found in [8]. What this definition encompasses is the possibility of drawing linguistic inferences from silence or omission, without interrupting the communicative interaction, given certain contexts. In [5], we have defined our semantic rules for the five types of silence, namely:
1. Total Defensive Silence (TDS).
2. Partial Defensive Silence (PDS).
3. Acquiescent Silence (AS).


4. Prosocial+ Silence (Pro+). 5. Prosocial− Silence (Pro–). After the semantics of five types of silence were stated, we move to elaborate how to evaluate the consequences of silence in certain situations, allowing this to do implicatures. Our definitions of intentional silence and unintentional silence for dialogic interactions can be found in [8].

3.2 Methodology The methodology for the study of silence in the testimonies is illustrated in Fig. 2. The methodology for the study of dialogues is detailed in Fig. 3.

Fig. 2 Strategy for testimonies

Fig. 3 Methodology for dialogues


4 Experimental Environments
The possibilities of interpreting natural language are very varied for each form of expression, and this characteristic makes it difficult to formalize statements for their logical interpretation and analysis. But there is a shortcut: puzzles, which, beyond a simple pastime, can lead us down interesting paths, accepting the challenge of one of their main promoters, Martin Gardner, who argued that no one can define exactly what words mean because there is no exact way to define something that is outside mathematics and logic [9]. Logical-linguistic puzzles allow the use of natural language to be limited, giving way to logic. These puzzles facilitate the formalization of statements, and their solution can have repercussions in practice or help in solving daily problems. Because of this, we use them as case studies to model omission in the testimonial context. In all kinds of interviews a dialogue is established, and a role is played in turns by exchanging statements. Taking turns is used to order the moves in games, to assign political positions, to regulate traffic at intersections, to serve customers in commercial establishments, and to speak in different situations (interviews, meetings, debates, ceremonies, conversations, and so on), the latter also referred to as voice exchange systems. In these cases, the study of silence is important considering that it has meaning. In the field of social psychology, silence is a path that opens distance in conversation. The source of information that silence represents may contain findings that help psychotherapy specialists to intervene in a timely manner in risk situations for patients with critical diagnoses such as depression. This is the context studied in dialogical interactions.

4.1 The Testimonials of Logical-Linguistic Puzzles
As noted above, logical-linguistic puzzles limit the use of natural language, facilitate the formalization of statements, and their solution can help in solving everyday problems. To address the solution of the puzzle The Criminal [19], we first have to think of a form of representation of the problem that facilitates the analysis. In this case, the matrix representation (Fig. 4-i) is recommended to identify contradictory statements, whereas the statements of the suspects are:
Brown: b1: I didn't do it. b2: Jones didn't do it.
Jones: j1: Brown didn't do it. j2: Smith did it.


Fig. 4 Representation and solutions for puzzle The criminal

Smith: s1: Brown did it. s2: I didn't do it.
The pairs of contradictory statements are the following:
1. j2 with s2.
2. b1 with s1.
3. j1 with s1.
That is, such pairs of statements cannot be held true at the same time, because an inconsistent system would be reached. To find the solution, we have to analyze case by case, testing the possible assignments of certainty values. Several of them will lead to a contradiction, forcing us to go back and find another possible assignment. This type of testing is known as trial and error. We can start in an orderly way by looking for the one who tells two lies or two truths, to see if all the certainty values can be accommodated as required by the riddle. This is done using a possibility matrix associated with the testimonial matrix.
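The trial-and-error search over the possibility matrix can be mechanized with a short script. The sketch below enumerates each candidate culprit and reports how many of each suspect's statements come out true; the puzzle's completion condition (who lies or tells the truth) is then checked against these counts. It is only an illustration, not the ASP encoding of Appendix 8.

```python
SUSPECTS = ["Brown", "Jones", "Smith"]

# Each statement is a function of the actual culprit: it evaluates to True when
# the statement holds under that assumption.
STATEMENTS = {
    "Brown": [lambda culprit: culprit != "Brown",    # b1: I didn't do it
              lambda culprit: culprit != "Jones"],   # b2: Jones didn't do it
    "Jones": [lambda culprit: culprit != "Brown",    # j1: Brown didn't do it
              lambda culprit: culprit == "Smith"],   # j2: Smith did it
    "Smith": [lambda culprit: culprit == "Brown",    # s1: Brown did it
              lambda culprit: culprit != "Smith"],   # s2: I didn't do it
}

for culprit in SUSPECTS:
    truth_counts = {r: sum(stmt(culprit) for stmt in stmts)
                    for r, stmts in STATEMENTS.items()}
    print(f"If {culprit} did it, true statements per suspect: {truth_counts}")
```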


We will put the testimonies of those involved in a matrix with values of false (F) and true (T), using the innocence property instead of who did it, to differentiate what each one holds about himself and about the others. We will use the predicate says(r, innocent(c)), whose meaning is that the person in line r declares the person in column c to be innocent or not. The puzzle The Criminal [19] allows us to study and explore the interpretations of silence (see Appendix 8). The analysis of this puzzle and the models generated for TDS, PDS, and AS can be observed in [7]. Each type of silence has its implementation as metaprogramming in Python (see Appendix 9). Figure 4-i includes the suspects' testimonies expressed with the predicate Says(x, innocent(y), T/F), defined above. Figure 4-ii presents the solution for the original puzzle, found by trial and error, based on the preconditions of the original formulation. As shown, the solution turns out to be that Brown is the culprit. Another example is the puzzle The Mystery, taken from [10]: Vinny has been murdered, and Andy, Ben, and Cole are suspects. Andy said: He did not do it. Ben was the victim's friend. Cole hated the victim. Ben said: He was out of town the day of the murder. He didn't even know the guy. Cole said: He is innocent. He saw Andy and Ben with the victim just before the murder. Tables 1, 2 and 3 show some models where the presumed culprit of the crime depends on the type of silence interpreted (TDS, AS, or Pro+) and on who resorts to silence (Andy, Ben, or Cole) [5]. We have modeled and analyzed seven logical-linguistic puzzles.
False Statements or Silence
We explored whether there is a relation between silence and false statements, since this puzzle had that feature. As we reveal in Table 4, in the context of this case, additional information was hidden behind silence, opening more possibilities.

Table 1 Total Defensive Silence (TDS) models for the different agents

Silent agent(s) | Presumable culprit
{} | {ben}
{andy} | {ben, cole}
{ben} | {cole, andy, ben}
{cole} | {andy, ben}
{andy, ben} | {cole, ben, andy}
{ben, cole} | {cole, andy, ben}
{andy, cole} | {cole, ben, andy}
{andy, ben, cole} | {cole, ben, andy}


Table 2 Acquiescent Silence (AS) for the different agents

Silent agent(s) | Presumable culprit
{} | {ben}
{andy} | Unsatisfiable
{ben} | {ben}
{cole} | Unsatisfiable
{andy, ben} | {ben, andy}
{ben, cole} | {ben, cole}
{andy, cole} | {cole, andy}
{andy, ben, cole} | {cole, ben, andy}

Table 3 Analysis of Pro+ silence for different agents

Silent agent | Regarding | Presumable culprit
andy | ben | {cole}
andy | cole | {ben}
ben | andy | {cole, ben, andy}
ben | cole | {andy, ben, cole}
cole | andy | {ben}
cole | ben | {andy}

Table 4 Analysis of combined silence

# | TDS | PDS | Presumable culprit
1 | Brown | j1 | {Smith, Jones}
2 | Brown | j2 | {Smith, Jones}
3 | Brown | s1 | {Smith, Jones}
4 | Brown | s2 | {Smith, Jones}
5 | Jones | s1 | {Smith, Brown}
6 | Jones | s2 | {Smith, Brown}
7 | Jones | b1 | {Smith, Brown}
8 | Jones | b2 | {Smith, Brown}
9 | Smith | b1 | {Smith}
10 | Smith | b2 | {Smith, Jones}
11 | Smith | j1 | {Smith, Brown}
12 | Smith | j2 | {Smith}


4.2 Dialogical Interactions In [8] we can find an application of the procedures (Figs. 5 and 6) that we proposed, that can be included in systems with dialogic interactions. In the same work, we reported the use of the prototype named Psychotherapeutic Virtual Couch (PVC) of which an interaction is included in Fig. 7 and an example of the records it generates in Fig. 8.

Fig. 5 Silence detection


Fig. 6 Silence management

5 Results
The testimonials of logical-linguistic puzzles. Figure 9 shows a summary of the puzzles and the properties that each one has with respect to the proposed semantics, as well as the behavior of its knowledge base. Total Defensive Silence (TDS) always has a solution in all cases, showing that silence protects those who use it and providing logical support for the right to remain silent. Partial Defensive Silence (PDS), in the cases where it was used, allowed us to test the weight or importance of each statement in the consequences of the testimony of a witness or agent.


Fig. 7 PVC prototype

Fig. 8 Record of dialogic interaction of PVC

With condescending silence (SC), the last three puzzles show that solutions are not always found, but where they are, the one who resorts to it is pointed out, confirming that he who is silent grants. The positive (Pro+) and negative (Pro−) pro-social silences, tested in the last three riddles, can be used in organizations and show that it is possible to induce decision-making based on the support or detriment that the silence of one person or agent can give with respect to another. It is important to know which arguments or statements we can do without, without altering the logical result of the interpretation. In two puzzles (Mystery and The


Fig. 9 Semantics and relationships of the case studies

Criminal), we could see that it is possible to reduce the size of the knowledge base and obtain several models within which the solution to the problem is found, these two cases being examples of weak testimonial reduction, or weak equivalence. Two other puzzles (Poisoning and Fraud) allowed us to prove that, in some cases, by reducing the size of the knowledge base it is possible to obtain the unique solution, an example of strong testimonial reduction, or strong equivalence, thereby demonstrating the non-monotonicity of the knowledge base and the obtaining of equivalent logic programs, since we obtain the same answer set with them [15, 17]. With the strong testimonial reduction, the existence of superfluous statements in the testimonial context is verified; in other words, it is confirmed that, given two logic programs where in one of them some statements are silenced (an agent is fully or partially muted), such a program will be a reduction of the other. This is formally expressed in [8]. Finally, in two puzzles (Poisoning and The Criminal) it was possible to show that there is a relationship between silence and false statements.
Procedure
Taking for granted that the testimonies of all people involved are already at hand, a strategy to analyze them can be as follows (Fig. 10):
1. Determine agents and relations (predicates).
2. Express agent statements employing the Says() predicate.
3. Supply definitions and common sense rules pertinent to the problem under consideration.


Fig. 10 A proposed strategy for testimony analysis

According to the kind of silence displayed by the agents, one or more of the programs modeling them have to be applied, so that the corresponding agent affects the knowledge base accordingly:
t_def_silence(kb.pl, agent)
p_def_silence(kb.pl, agent, predicate)
acq_silence(kb.pl, agent)

6. Obtain the models with Answer Set Programming, considering the respective kinds of silence in the knowledge base: clingo 0 kb.pl

7. Examine the scenarios obtained after the simulation.
The critical step in the proposed strategy is number 4. For instance, the obvious and common case is when one of the parties uses his/her right to keep quiet. So, we can proceed to contemplate the types of defensive and condescending silence, one by one, for that person. The cases that are not obvious but possible are the silences of the pro-social type in defense of a guild or organization, although it is not obvious whether there are elements to explore one or the other. For example, through third parties it may be known that there is a close friendship between some of those involved (Pro+), or animosity or enmity between two of them (Pro-), where someone can remain silent. Nevertheless, some other situations can arise, for example, when two declarers A and B match individually in declarations p and q, but A also asserts r. This can lead to hypothesizing condescending silence from B, or even a partial defensive silence, since r is being omitted. One can then proceed to represent and analyze the problem accordingly.
Dialogical Interactions
For dialogic interactions between two human or non-human agents, we proposed to include a new dimension for silence, which can provide relevant information in some contexts, an example of which is the area of psychology. In [8] we can find some possible lines of research.


6 Conclusions
During the development of the doctoral project, we were able to confirm that intentional omission or silence in the communication process has not been sufficiently investigated. In our scrutiny and exploration of the scientific literature of our area, we did not find evidence of the subject in Computational Sciences; therefore, the present work could be one of the first records related to the subject. In the computational discipline, the closest thing to the occurrence of silence is the default value, i.e., the value assigned "by omission" to some variables in interactive systems. We have begun the study of areas of opportunity for the interpretation of omission in each of the following aspects:
1. Testimonial or logical-linguistic puzzles. We were able to realize that natural language testimonial puzzles provide the opportunity to do qualitative research and produce logic-based knowledge representation models to analyze the consequences of omission in a linguistic setting. We show some important properties such as the equivalence of logic programs, reasoning on non-monotonic knowledge bases, and the relationship between omission and false statements.
2. Dialogical interactions. For experimental and quantitative research purposes, we show the use of a procedure for managing silence, generating records with information that can help decision-making. We formally define the omissive implicature and the concept of dialogue that includes it as a possibility.
With the experiments carried out in the testimonial context, we were able to realize the power of omission, since its logical interpretation can point to any of the agents (human or not) involved in the process as the presumed culprit. In dialogical interactions, we find areas such as psychotherapy, where the timely interpretation of the information omitted in interviews could save lives. Thus, we have confirmed the hypothesis: the logical interpretation of the omissive implicature contributes elements to the communicative process and helps decision-making. Knowing the possible logical consequences of silence, one can resort to it voluntarily, consciously, and with intention, according to the circumstances in which it occurs. It is important to reflect on the power of silence, studied from a computational perspective, since by incorporating this dimension into interactive systems, relevant and vital information can be obtained in certain contexts. Intentional silence, intentionally interpreted, is contextual, clear, interactive, and completely concise.
Future Work
One of the short-term tasks is to put into practice the adoption or development of systems that include the interpretation of silence. In the case of testimonies, some applications of the proposed semantics of silence can be found in judicial proceedings, law, and police interviews [18], and probably


with them, models can be generated by using sensor technologies to detect silence [2]. According to our formal definition of omissive implicature and the context, predictions can be made about what could be hidden (p) behind the silence, asking the question: what or who could p be? The solution could be part of a base of assertions or terms in the style of the Herbrand base. In the case of dialogues, the proposed methodology could be useful in psychotherapeutic consultations [12]; for example, it could prevent depression from putting people's integrity at risk by making a timely detection of the mood state or degree of depression [1]. Some possible lines of research (threads) that could be developed are:
1. Design and solution of testimonial puzzles with various logic programming paradigms. Solving puzzles can have practical implications for doing everyday tasks.
2. Definition of agents that perform omissive implicatures in dialogical interactions, i.e., agents that extend their intelligence with the interpretation of intentional silence.
3. The semantics of omission used in testimonies and the components used in dialogues could have application in the theory of argumentation with logic programming and negation as failure [14].
4. Development of a theory or axiomatization of the omissive implicature, formally defining linguistic inferences for omissive conversational implicatures.

7 Derived Publications
During the development of the doctoral program, advances were presented in various forums and the following publications were reported:
1. A semantics of intentional silence in omissive implicature. Garcés-Báez, Alfonso, López-López, Aurelio. Journal of Intelligent & Fuzzy Systems, vol. 39, no. 2, pp. 2115–2126, 2020. DOI: 10.3233/JIFS-179877. Q3 JCR 2018, Q2 Engineering SJR 2019 [5].
2. Towards a Semantic of Intentional Silence in Omissive Implicature. Garcés-Báez, Alfonso, and Aurelio López-López. Digitale Welt 4.1 (2020): 67–73. Springer. https://doi.org/10.1007/s42354-019-0237-0 [6].
3. First Approach to Semantics of Silence in Testimonies. Garcés-Báez, Alfonso, López-López, Aurelio (2019). International Conference of the Italian Association for Artificial Intelligence, LNAI 11946. Springer, pp. 73–86. https://doi.org/10.1007/978-3-030-35166-3 [3].


4. Reasoning in the Presence of Silence in Testimonies: A Logical Approach. Garcés-Báez A., López-López A. (2021). In: Arai K. (ed) Intelligent Computing. Lecture Notes in Networks and Systems, vol 284. Springer, Cham, pp. 952–966. https://doi.org/10.1007/978-3-030-80126-7 [7].
5. A Logical Interpretation of Silence. Alfonso Garcés Báez, Aurelio López López. Computación y Sistemas, 2020, vol. 24, no. 2, pp. 613–623. https://doi.org/10.13053/CyS-24-2-3396 [4].
6. Chapter 5. Pandemic, depression and silence. Garcés-Báez A., López-López A., Moreno-Fernández Ma. Del Rosario & Eva Mora-Colorado. In: Women in science: academic and research experiences in upper secondary and higher education during the state of the pandemic, Carmen Cerón Garnica (Coordinator and Compiler), Universidad Tecnocientífica del Pacífico S.C., 2021, pp. 80–100. ISBN 978-607-8759-19-4.
7. Silence in Dialogue: A Proposal and Prototype for Psychotherapy. Garcés-Báez A., López-López A. (2022). In: Science and Information Conference. Springer, pp. 266–277 [8].

8 Code for Criminal Puzzle (Clingo 4.5.4)

%% Puzzle 51 of Wylie [19]
%% for the natural solution.
%
suspect(brown;jones;smith).
%
%% Brown says:
says(brown,innocent(brown),1).
says(brown,innocent(jones),1).
%
%% Jones says:
says(jones,innocent(brown),1).
says(jones,innocent(smith),0).
%
%% Smith says:
says(smith,innocent(smith),1).
says(smith,innocent(brown),0).
%%%%%%%%%%%%%%%%%%%%%%
%
%% Everyone, except possibly for the criminal, is telling the truth:
holds(S) :- says(P,S,1), -holds(criminal(P)).
-holds(S) :- says(P,S,0), -holds(criminal(P)).
%
%% Normally, people aren't criminals:


-holds(criminal(P)) :- suspect(P), not holds(criminal(P)).
%
%% Criminals are not innocent:
:- holds(innocent(P)), holds(criminal(P)).
%
%% For display:
criminal(P) :- holds(criminal(P)).
%
%% The criminal is either Brown, Jones or Smith (exclusively):
holds(criminal(brown)) | holds(criminal(jones)) | holds(criminal(smith)).
#show criminal/1.

9 Prototype for Program Update in Logic (Python 3.7)

# Definition of Total Defensive Silence
# Input: knowledge base or logic program, and agent to silence
# Output: new knowledge base or logic program named 'kb'-'tds'-'agent'.lp
def t_def_silence(kb, agent):
    f = open(kb, 'r')
    g = open(kb[0:len(kb)-3] + '-' + 'tds-' + agent + '.lp', 'w')
    for line in f:
        # Comment out every statement made by the silenced agent
        if 'says(' + agent == line[0:5+len(agent)]:
            line = '%' + line
        g.write(str(line))
    f.close()
    g.close()

# Definition of Partial Defensive Silence
# Input: knowledge base or logic program, agent, and predicate to silence
# Output: new knowledge base or logic program named 'kb'-'pds'-'agent'-'predicate'.lp
def p_def_silence(kb, agent, predicate):
    f = open(kb, 'r')
    g = open(kb[0:len(kb)-3] + '-' + 'pds-' + agent + '-' + predicate + '.lp', 'w')
    for line in f:
        # Comment out only the agent's statements about the given predicate
        if 'says(' + agent + ',' + predicate + '(' == line[0:5+len(agent)+len(predicate)+2]:
            line = '%' + line
        g.write(str(line))
    f.close()
    g.close()

# Definition of Acquiescent Silence
# Input: knowledge base or logic program, and agent to silence
# Output: new knowledge base or logic program named 'kb'-'as'-'agent'.lp
def acq_silence(kb, agent):
    f = open(kb, 'r')
    g = open(kb[0:len(kb)-3] + '-' + 'as-' + agent + '.lp', 'w')
    for line in f:
        if 'says(' + agent == line[0:5+len(agent)]:
            # The silent agent's own statements are commented out
            line = '%' + line
        elif 'says(' == line[0:5]:
            # The silent agent tacitly assents to what the other agents say
            i = line.index(',')
            line_new = 'says(' + agent + line[i:len(line)]
            g.write(str(line_new))
        g.write(str(line))
    f.close()
    g.close()
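As a hypothetical usage sketch of how these operators can be chained with the clingo call from step 6 of the strategy (the file name criminal.lp and the silenced agent smith are illustrative assumptions, not taken from the original):

# Hypothetical driver combining the silence operators with clingo; names are assumed.
import subprocess

t_def_silence('criminal.lp', 'smith')               # writes criminal-tds-smith.lp
p_def_silence('criminal.lp', 'smith', 'innocent')   # writes criminal-pds-smith-innocent.lp

# Enumerate all answer sets of the updated program, as in: clingo 0 kb.pl
result = subprocess.run(['clingo', '0', 'criminal-tds-smith.lp'],
                        capture_output=True, text=True)
print(result.stdout)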


References
1. Beck, J.S., Beck, A.T.: Cognitive Therapy: Basics and Beyond. Guilford Press, New York (1995)
2. Gaddy, D., Klein, D.: Digital voicing of silent speech (2020). arXiv preprint arXiv:2010.02960
3. Garcés-Báez, A., López-López, A.: First approach to semantics of silence in testimonies. In: International Conference of the Italian Association for Artificial Intelligence, LNAI 11946, pp. 73–86. Springer (2019). https://doi.org/10.1007/978-3-030-35166-3_6
4. Garcés Báez, A., López López, A.: A logical interpretation of silence. Computación y Sistemas 24(2) (2020)
5. Garcés-Báez, A., López-López, A.: A semantics of intentional silence in omissive implicature. J. Intell. Fuzzy Syst. 39(2), 2115–2126 (2020). https://doi.org/10.3233/JIFS-179877
6. Garcés-Báez, A., López-López, A.: Towards a semantic of intentional silence in omissive implicature. Digitale Welt 4(1), 67–73 (2020). https://doi.org/10.1007/s42354-019-0237-0
7. Garcés-Báez, A., López-López, A.: Reasoning in the presence of silence in testimonies: a logical approach. In: Intelligent Computing, pp. 952–966. Springer (2021). https://doi.org/10.1007/978-3-030-80126-7_67
8. Garcés-Báez, A., López-López, A.: Silence in dialogue: a proposal and prototype for psychotherapy. In: Science and Information Conference, pp. 266–277. Springer (2022)
9. Gardner, M.: Science, Good, Bad, and Bogus. Prometheus Books (1981)
10. Gelfond, M., Kahl, Y.: Knowledge Representation, Reasoning, and the Design of Intelligent Agents: The Answer-set Programming Approach. Cambridge University Press (2014)
11. Grice, H.P.: Logic and conversation. Syntax Semant.: Speech Acts, Cole et al. 3, 41–58 (1975)
12. Levitt, H.M.: Sounds of silence in psychotherapy: the categorization of clients' pauses. Psychother. Res. 11(3), 295–309 (2001)
13. Lifschitz, V.: What is answer set programming? In: AAAI, vol. 8, pp. 1594–1597 (2008)
14. Nieves, J.C., Osorio, M., Zepeda, C.: A schema for generating relevant logic programming semantics and its applications in argumentation theory. Fund. Inform. 106(2–4), 295–319 (2011)
15. Osorio, M., Navarro, J.A., Arrazola, J.: Equivalence in answer set programming. In: International Workshop on Logic-Based Program Synthesis and Transformation, pp. 57–75. Springer (2001)
16. Revueltas, J.: Los días terrenales, vol. 15. Editorial Universidad de Costa Rica (1996)
17. Van Harmelen, F., Lifschitz, V., Porter, B.: Handbook of Knowledge Representation. Elsevier (2008)
18. Walton, D.: Witness Testimony Evidence: Argumentation, Artificial Intelligence, and Law. Cambridge University Press (2008)
19. Wylie, C.R.: 101 Puzzles in Thought and Logic, vol. 367. Courier Corporation (1957)
20. Zalta, E.N., Nodelman, U., Allen, C., Perry, J.: Stanford Encyclopedia of Philosophy (2003)

Loss Allocation Techniques in Active Power Distribution Systems
Ambika Prasad Hota, Sivkumar Mishra, and Debani Prasad Mishra

Abstract Along with several opportunities, the deregulation of the modern power system has brought several difficulties in the field of power system operation and has a significant impact on electricity markets as well. Since this modernized structure is consumer-centric, the cost of each component of power system operation must be justified without any ambiguity. Service cost comprises many components, and network loss is one of them. As it plays a major role, its allocation should be transparent and clearly explained. In order to meet the ever-rising power demand, continuous penetration of distributed generator (DG) units is going on at the consumer premises. However, the increasing penetration of DGs has made the power distribution system (PDS) more complex and challenging, as its presence changes the nature of the network from passive to active. Keeping this in view, this thesis first develops an active power loss allocation technique (Method-1) for fair distribution of losses among the network users from the direct correlation existing between the potential difference across a branch and its subsequent load currents in terms of node injected complex powers. The loss allocation (LA) results are found to be as per topology with/without DG penetration. The proposed DG remuneration technique awards the entire benefit of network loss reduction (NLR) to the DG owners (DGOs) after investigating their real participation towards NLR. Hence, the performance of the developed Method-1 is analyzed here under two distinct scenarios with deviation of the load power factors. However, some inconsistencies are noticed in the LA results of Method-1. To overcome this issue, Method-2 has been developed, where the impedance parameter of the concerned branch is taken into consideration. The efficiency of the developed APLAs is also verified at different voltage-dependent load models

A. P. Hota (B)
Department of Electrical and Electronics Engineering, Gandhi Engineering College (GEC, Bhubaneswar), Bhubaneswar, India
e-mail: [email protected]

S. Mishra
Department of Electrical Engineering, CAPGS BPUT, Rourkela, Odisha, India
e-mail: [email protected]

D. P. Mishra
Department of Electrical and Electronics Engineering, IIIT-Bhubaneswar, Bhubaneswar, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
S. R. Dash et al. (eds.), Intelligent Technologies: Concepts, Applications, and Future Directions, Volume 2, Studies in Computational Intelligence 1098, https://doi.org/10.1007/978-981-99-1482-1_5


(LMs) by extending them for Energy LA with various types of DG units. The impact of proposed branch exchange (BE) based heuristic network reconfiguration on PDS loss allocation is also verified with/without DG penetration.

Keywords Active power · Distributed generation · Energy loss allocation · Load modeling · Network reconfiguration · Power factor · Power loss allocation

1 Introduction
Most of the power loss allocation procedures developed so far are oriented to transmission systems (TSs), and some of them have been further modified for use in PDSs. However, TSs are referred to as primary distribution systems (DSs), whereas radial/mesh systems are referred to as secondary DSs. As a result, both systems exhibit structural and behavioral differences. The modified versions of methodologies developed for transmission systems may not be well suited for PDSs, as the response of the root bus in the two networks has vast operational variations. There are also new LA methods designed specifically for RDNs that address this problem by assigning active power losses to all consumer points excluding the root node. Attempts to resolve LA-related issues have produced different approaches, such as pro-rata (PR), proportional sharing (PSM), incremental, direct loss coefficient (DLC), and circuit theory based methods. Recently, a current summation procedure was developed in [1], where mutual losses are distributed among the PDN users utilizing a logarithmic approach of LA. However, this approach is restricted to PDNs having positive participation factors within the range [0–2]. An injected power based APLA methodology is developed in [2] to get around these restrictions. However, this scheme allocates "zero" losses to all DG-linked users while penalizing all DGOs even when DG power injections reduce system loss, so it is not fair to the nodes with DGs [3–5]. In order to provide justice to the DG-connected load points, a mutual-term disintegration procedure (CTDM) is developed in [6], where the DG owners receive all profits of NLR caused by the DGs utilizing a remuneration technique based on superposition. However, a small difference between the computed and real amounts of DG compensation is seen here. For analyzing the impact of uncertainties on LA, a stochastic method of loss allocation is employed in [7], where allocations to DG units are finalized with the help of the Latin hypercube sampling technique. Although this method is considered adequate in terms of network topology, the random nature of the DG units causes larger variances in the voltage profiles. The approach discussed in [8–11] does not have the aforementioned issues. This method presents an effective means to handle the distorting character of PDSs in a multi-phase power system, in addition to information on the neutral conductor's LA. As it distributes mutual losses using contractual powers rather than exact demands, the participation matrix-based APLA [12] fails to provide precise allocations. The power sharing based LA technique [13], which distributes losses to its network members while taking into account their actual


power consumptions/injections, does not have this issue. However, there are several assumptions and approximations in this procedure [14].

2 Loss Allocation Analysis with Method-1 With/Without DGs
The developed power loss allocation technique distributes losses across network users while taking deregulated electricity environments into account. The method executes the system load flow (LF) using a numerical technique based on the forward–backward sweep (FBS) and accomplishes power loss allocation using an exact formulation. Without making any assumptions or approximations, it eliminates the complexity associated with the sharing of mutual losses in power loss allocation. The suggested LA provides a direct link between the potential difference across a branch and its succeeding node currents. It allocates losses to end-users while taking into account their load requirements and geographic locations, with or without DGs. A power distribution system's loss may rise or decrease depending on the penetration of DGs. Therefore, to give fairness to the network users, this study suggests a method for DG compensation that, after determining each DG unit's precise contribution to network loss reduction, delivers either incentives or penalties to the DG units. Utilizing an IEEE 33-bus PDS, the performance of the suggested method is examined while taking into account a variety of load factors, DG capacities, and DG power injections. The outcomes demonstrate the effectiveness and adaptability of the current strategy in comparison to other well-established methods.
Three sets of buses with identical demand but at different locations are chosen in order to assess the ability to differentiate in equitable loss allocation. For comparison with the existing methods, Table 1 shows the LA differences between these customers estimated at different load factors. First, the two nearby nodes 9 and 10 are chosen. At each load level, the differences in losses with the suggested technique are shown to be extremely close to those of Method [1], BCDLA [15], and Method [16]. In contrast to the other procedures now in use, the PSMLA procedure does a very poor job of capturing this discrimination, which is highlighted by the current strategy. Similarly, to examine the response with regard to consumers placed away from each other, two sets of customers (the first distance set consisting of nodes 9 and 28, and the second distance set comprising nodes 6 and 28) are selected for comparison. While examining the gap in LA between the customers situated at buses 9 and 28, unexpectedly, the LA to the customer at 28 was greater than the loss to the customer at 9 (near the substation) in every discussed situation of the LF without DGs by all existing approaches. However, the suggested strategy takes care of this.


Table 1 Difference in APLA (kW) between customers of equal loads at different load level

Procedures            At 60%                    At 100%                   At 140%
                      Without DGs  With DGs     Without DGs  With DGs     Without DGs  With DGs

Difference in LA between consumer 28 and 6
Proposed procedure    0.38         4.80         1.15         13.72        2.47         27.32
LA procedure [1]      0.19         0.12         0.58         0.38         1.27         0.83
LA procedure [16]     0.19         0.12         0.58         0.32         1.27         0.65
PSMLA                 0.07         0.07         0.22         0.19         0.47         0.38
BCDLA                 0.21         0.33         0.56         0.10         1.21         0.74

Difference in LA between consumer 28 and 9
Proposed procedure    0.08         -0.01        0.26         -0.03        0.56         -0.06
LA procedure [1]      -0.05        -0.12        -0.17        -0.34        -0.33        -0.71
LA procedure [16]     -0.05        -0.11        -0.17        -0.33        -0.33        -0.65
PSMLA                 -0.11        -0.11        -0.35        -0.32        -0.74        -0.63
BCDLA                 -0.04        -0.11        -0.30        -0.59        -0.33        -0.62

Difference in LA between consumer 10 and 9
Proposed procedure    0.12         0.11         0.38         0.32         0.82         0.65
LA procedure [1]      0.11         0.10         0.33         0.33         0.73         0.70
LA procedure [16]     0.11         0.10         0.33         0.29         0.73         0.59
PSMLA                 0.09         0.08         0.27         0.24         0.60         0.51
BCDLA                 0.11         0.10         0.35         0.29         0.76         0.60

2.1 Methodology
The active power loss of any branch can be computed with its branch resistance R(b) and branch current I(b) as:

PLoss(b) = R(b) · |I(b)|²   (1)


It can be distributed among each node point as discussed in [5] as:

ploss(b, i) = Ai · PLi + Bi · QLi   (2)

where PLi + jQLi is the net complex power injection at node i. This method is observed to respond abruptly to variations in load power factors (LPFs); hence, Method-2 is developed with the inclusion of the impedance matrix in the power loss equation, as discussed in [4]:

PLoss(b) = |I(b)|² · Z(b) = I(b) · I(b)* · Z(b)   (3)
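As a minimal numerical sketch of Eqs. (1) and (3), the branch data below are illustrative assumptions and are not taken from the IEEE 33-bus test system:

# Illustrative branch-loss calculation per Eqs. (1) and (3); all values are assumed.
R, X = 0.2, 0.1                     # branch resistance and reactance in ohm (assumed)
Z = complex(R, X)
I = complex(30.0, -10.0)            # branch current in A (assumed)

p_loss_active = R * abs(I) ** 2     # Eq. (1): active power loss in W
s_loss_complex = (abs(I) ** 2) * Z  # Eq. (3): complex loss |I|^2 * Z

print(round(p_loss_active, 1))      # 200.0 W
print(s_loss_complex)               # (200+100j)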

The proposed method's algorithm is represented through a flowchart (Fig. 1). The proposed strategy exhibits very little discrimination when compared with other well-established techniques, demonstrating the usefulness of the current approach with regard to the positions of the customers in the PDS. From the second set of distant consumers (between nodes 6 and 28) it can be seen that node 6 is nearer to the reference or root bus than node 28. The allocation is noticed to be adequate by all the discussed methods in the active/passive condition of the PDS. Further, the proposed scheme provides the maximum benefit to the DG-connected consumer placed at node 6 compared with the others. The results of Method [1], Method [16], and BCDLA show that the level of discrimination in LA is quite similar. The proposed approach has the biggest difference, whereas PSMLA has the lowest. Among the various techniques discussed, the discrimination between end-users of identical demands is better with the developed procedure than with PSMLA, and moderate with the other procedures. As a result, the current scheme is able to distinguish between nodes that are close to one another, with or without DGs, as well as between participants that are placed far apart from one another.
The said RDN is evaluated not only at three distinct LFs (i.e., 0.6, 1, and 1.4) but also with variation of DG capacities from 20 to 200%. Table 2 shows that all techniques pay compensation to DGOs within 20–80% of DG capacity, but the proposed method gives them the greatest benefit. After 40%, the compensation provided by BCDLA drops off, and at 100% of DG capacity it is found to be 0.05 kW. Although the system loss lowers, this strategy penalises DGOs at greater DG capacities (100–200%). Furthermore, the system's overall loss rises from 43.44 to 74.44 kW when the DG capacity is increased to 150%. In this situation, DG penetration still results in a loss drop of 128.23 kW. However, the BCDLA technique imposes significant fines (99.45 kW) on the DGOs, whereas the other discussed procedures do not provide DGOs with a precise loss reduction reward. In the absence of DGs, the total power loss allocation is 202.67 kW at 200% of DG capacity, whereas it is 166.54 kW with DGs. As a result, an NLR of 36.13 kW happens as a result of the integration of DGs into the PDS. However, it is confirmed that all approaches excluding the suggested approach have penalized all DGs. Hence, it can be concluded from the foregoing investigation that when DG capacities rise at a constant load, their performance in terms of loss


Fig. 1 Flowchart of the proposed loss allocation algorithm (read the RDN and DG input data; construct the adb[ ], mf[ ], mt[ ], pb[ ], SN[ ], nsb[ ], mfs[ ] and mts[ ] arrays from the network data; set all branch losses to zero and all bus voltages to 1 p.u. at 0° phase angle)

then the PRj continues the auction process with CPi, otherwise withdraws from the auction. Algorithm 2 shows the auction process mechanism in the Cloud Environment.


5.4 Multi-objective Dynamic Resource Scheduling Model for User Tasks in Cloud Computing
Minimizing makespan, task execution time, and resource cost is the primary goal of the proposed approach. To create effective task scheduling algorithms, we cannot ignore any of these goals without jeopardizing the others. The scheduling technique considers a pool of m available virtual machines VM = {VM1, VM2, ..., VMm} and a pool of n available tasks T = {t1, t2, ..., tn}. Based on a modified version of the non-dominated sorting algorithm, this study adopts a multi-objective scheduling model. In the first stage, the user initiates the cloud environment by submitting the tasks. The jobs are received by the cloud service broker and then forwarded to the scheduler. The jobs are divided into non-dominated sets by the multi-objective scheduler.

5.4.1 Selecting Virtual Machines
Eliminating unused or underutilized VMs is essential for saving money and making better use of available time in the virtual infrastructure. The number of VMs chosen is based on the workloads that have been requested. The number of virtual machines, as calculated by Eq. 24, is proportional to the number of tasks received.

C(VM) = Load(Total_Tasks) / Max(VM_MIPS)   (24)

Load(Total_Tasks) = Σ (m = 1 to n) Load(Task_m)   (25)
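A minimal sketch of Eqs. (24)-(25); the task lengths and the VM rating below are assumed values, not the chapter's experimental settings:

# Number of VMs from total task load (Eqs. 24-25); all numbers are assumed.
import math

task_lengths_mi = [2000, 4500, 3000]          # task lengths in million instructions (assumed)
max_vm_mips = 1000                            # rating of the fastest VM in the pool (assumed)

total_load = sum(task_lengths_mi)             # Eq. (25)
vm_count = math.ceil(total_load / max_vm_mips)  # Eq. (24), rounded up to whole VMs

print(vm_count)   # 10 for this toy example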

5.4.2 Sorting the Virtual Machines
In order to schedule effectively, it is necessary to first determine how many virtual machines are needed and then sort them. User expectations must be measured, and quality of service (QoS) is the metric of choice. When deciding which cloud services to employ, the user has specific needs. In this study, we examine how different measures of quality of service, including bandwidth, pricing, and processing time, impact the overall definition of those services. The quality of service (QoS) function for a given service, derived from the QoS vectors, is shown in Eq. 26.

QoS(s) = Σ (m = 1 to r) (Qmax_j,m − q_m(s)) / (Qmax_j,m − Qmin_j,m)   (26)


where r represents the number of QoS attributes selected for the virtual machine, q_m(s) denotes the value of the mth attribute for service s, and Qmax_j,m and Qmin_j,m are the maximum and minimum values of the mth attribute selected. Once the QoS function has been calculated, the virtual machines are sorted and placed in descending order. The more intensive computations are distributed among the most powerful virtual machines.
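A minimal sketch of the normalization in Eq. (26); the attributes, bounds and values are illustrative assumptions:

# Normalized QoS score per Eq. (26); attribute bounds and values are assumed.
def qos_score(values, q_max, q_min):
    # Each term is (Qmax - q) / (Qmax - Qmin), so "smaller is better" attributes score higher.
    return sum((mx - v) / (mx - mn) for v, mx, mn in zip(values, q_max, q_min))

vm_a = qos_score(values=[0.02, 120.0],   # price per second, processing time (assumed)
                 q_max=[0.05, 300.0], q_min=[0.01, 60.0])
vm_b = qos_score(values=[0.04, 90.0],
                 q_max=[0.05, 300.0], q_min=[0.01, 60.0])

# VMs are then sorted in descending order of their QoS score.
ranking = sorted({'vm_a': vm_a, 'vm_b': vm_b}.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)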

5.4.3 Non-dominated Sorting Approach for Tasks
A multi-objective evaluation has to take at least two objectives into account. There will be cases where the two goals are equally important and cannot be separated using the dominance relation. In this sorting method, the tasks are prioritized according to their minimum size, which reflects the execution cost and makespan. The two objective functions defined for scheduling are shown in Eqs. 27 and 28.

Min f(T(size_m)) = T(size_m),  ∀j ∃i: f(T(size_i)) ≤ f(T(size_j))   (27)

Min f(T(cost_m)) = T(cost_m),  ∀j ∃i: f(T(cost_i)) ≤ f(T(cost_j))   (28)
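A minimal sketch of the dominance test behind this sorting, assuming each task is scored by the two objectives (size, cost) of Eqs. (27)-(28); the numbers are illustrative only:

# Pareto-dominance check and first non-dominated front; task scores are assumed.
def dominates(a, b):
    # a dominates b if it is no worse in every objective and strictly better in at least one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

tasks = {'t1': (250, 4.0), 't2': (300, 3.5), 't3': (300, 6.5), 't4': (280, 5.0)}  # (size, cost)

front = [name for name, score in tasks.items()
         if not any(dominates(other, score)
                    for o_name, other in tasks.items() if o_name != name)]
print(front)   # ['t1', 't2'] : t3 and t4 are dominated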

To illustrate, let T(size) represent the overall size of the tasks, and let T(cost) represent the total cost of completing the jobs. The cost of task execution in a virtual machine is calculated as shown in Eq. 29.

T(cost) = Σ (m ∈ P(VM)) VM(cost_m) × T(Exec_m)   (29)

where P(VM) is the total number of service providers involved in delivering the VMs. The price of a virtual machine rises in proportion to the number of processors it uses. Equation 30 gives the cost of the virtual machine with respect to the service provider, and Eq. 31 gives the time the job takes to run on the assigned virtual machine.

VM(cost) = cost_per_second() / MIPS_to_execute_PE()   (30)

T(Exec_m) = Σ (i = 1 to n) (PE × Length + Size_of_the_Output) / (MIPS × PE)   (31)
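A small worked sketch of Eqs. (29)-(31); the VM ratings, prices and task parameters below are assumptions, not the chapter's experimental values:

# Execution time and cost per Eqs. (29)-(31); all parameters are assumed for illustration.
def exec_time(length_mi, output_mb, mips, pe):
    # Eq. (31) for a single task on one VM
    return (pe * length_mi + output_mb) / (mips * pe)

def task_cost(assignments):
    # Eq. (29): sum of VM cost-rate times execution time over all used VMs
    return sum(rate * exec_time(length, out, mips, pe)
               for rate, length, out, mips, pe in assignments)

# (cost per second, task length in MI, output size in MB, VM MIPS, processing elements)
assignments = [(0.02, 25000, 250, 500, 1), (0.03, 45000, 300, 500, 4)]
print(round(task_cost(assignments), 2))   # about 3.71 for these assumed numbers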

5.4.4 Calculation of Crowding Distance
The crowding distance (CD) of an individual is obtained from the differences between its neighbouring individuals in each objective. Equation 32 shows how the crowding distance over the two objective functions is determined.

CD = | f(i+1)_T(size) − f(i−1)_T(size) | + | f(i+1)_T(cost) − f(i−1)_T(cost) |   (32)

where f(i+1) and f(i−1) are the following and preceding individuals of the current individual in each objective.
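A minimal sketch of this crowding-distance computation for a sorted front; the objective values below are illustrative assumptions:

# Crowding distance per Eq. (32) for individuals sorted by one objective; values assumed.
def crowding_distance(front):
    # front: list of (size, cost) tuples sorted by the size objective
    cd = [0.0] * len(front)
    cd[0] = cd[-1] = float('inf')      # boundary individuals keep the extreme positions
    for i in range(1, len(front) - 1):
        cd[i] = (abs(front[i + 1][0] - front[i - 1][0]) +
                 abs(front[i + 1][1] - front[i - 1][1]))
    return cd

front = [(250, 6.0), (280, 5.0), (300, 4.0), (340, 3.5)]   # (T(size), T(cost)), assumed
print(crowding_distance(front))   # [inf, 52.0, 61.5, inf]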

5.4.5 Scheduling Tasks to Virtual Machines
Both the VMs and the tasks are sorted. Next, the tasks are sorted by priority and added to the execution queue, where the first job is assigned to the first available virtual machine. In a typical state, the virtual machine runs at the rate given by Eq. 33. The normal execution rate of the virtual machine is measured against a predetermined limit: if the execution rate is below the threshold, the VM is given the next available task; if it is greater than the threshold, a new virtual machine is assigned. This method is favored since it allows tight deadlines to be met.

VM(N_Rate) = Current_load(VM) / Max_MIPS(VM)   (33)

5.4.6 Penalty Function

With cloud computing, users can access their resources whenever they need them. In this study, a resource allocation strategy is presented to cut down on missed deadlines and overall expense. The full scheduling cost, calculated from the price of the virtual machines and the penalty for missing deadlines, is shown in Eq. 34.

Min(Tot_cost) = Cost(VM) + Pen_cost   (34)

Cost(VM) = Σ (i = 1 to I) Cost(VM_i)   (35)

Pen_cost = Σ (i = 1 to r) pen_cost_i   (36)


where I and r stand for the total number of virtual machines and the total number of tasks, respectively. The penalty cost associated with a task is given by Eq. 37.

Pen_cost_i = Missed(deadline) × Pen_rate   (37)

where Pen_rate is the per-unit cost of the delay and Missed(deadline) is the amount of time taken to finish the work after the deadline has passed.
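A small sketch of the penalty accounting in Eqs. (34)-(37); the per-VM prices, finish times, deadlines and penalty rate are assumed values:

# Total scheduling cost with deadline penalties per Eqs. (34)-(37); numbers are assumed.
def total_cost(vm_costs, finish_times, deadlines, pen_rate):
    cost_vm = sum(vm_costs)                                   # Eq. (35)
    pen_cost = sum(max(0.0, f - d) * pen_rate                 # Eqs. (36)-(37)
                   for f, d in zip(finish_times, deadlines))
    return cost_vm + pen_cost                                 # Eq. (34)

print(total_cost(vm_costs=[1.2, 0.8, 2.5],
                 finish_times=[95.0, 130.0],
                 deadlines=[100.0, 120.0],
                 pen_rate=0.05))   # 4.5 + 10 s late * 0.05 = 5.0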

6 Results and Discussion
The suggested model is simulated in CloudSim [22]. The genetic algorithm and the simulator are set up using the NSGA-II package [23]. The bindcloudletToVm() function in the Datacenter Broker class is tweaked so that tasks and VMs can be scheduled. The suggested approach uses a fitness function that prioritizes energy efficiency, makespan, and data transfer time as its primary selection criteria for chromosomes. The cloud configuration settings can be found in Tables 1 and 2; Table 2 gives task length in instructions and file size in megabytes. Table 3 contains the algorithm parameters used in the suggested method. For optimal results, the crossover rate is set between 0.6 and 0.9 and the mutation rate is assumed to be 0.1. We compared the suggested algorithm's performance (energy, data transfer time, and makespan) to that of HEFT [12] and DVFS-MODPSO [21]. The makespan of DVFS-MODPSO, HEFT, and the proposed algorithm is displayed in Fig. 2. This

Table 1 VM configuration parameters VM IDs

VMMname

MIPS

Memory(KB)

CPUs

1–5

Xen

500

512

4

1000

6–10

Xen

300

256

1

10,000

11–15

Xen

500

512

1

1000

16–20

Xen

200

512

2

1000

21–25

Xen

500

256

1

10,000

Table 2 Task characteristics

Bandwidth

Task IDs

Length

File size (MB)

Required CPUs

1–50

25,000

250

1

51–100

45,000

300

1

100–150

45,000

300

1

151–200

45,000

300

1

201–250

25,000

250

1

250–300

25,000

250

1

QOS Enhanced Energy Aware Task Scheduling Models in Cloud Computing Table 3 Configuration simulation

161

Parameter

Value

Selection of parent

Roulette Wheel

Size of population

100

Recombining

Single point crossover with probability 0.6−0.9

Total number of tasks in graph

300

Total VMs

25

Convergence criteria

Twenty generations

demonstrates that our suggested approach consistently beats DVFS-MODPSO and HEFT in various random scenarios. Scheduling issue is addressed by the proposed approach by applying the evolution technique. That process favors chromosomes with shorter expected lifespans. To put it simply, HEFT outperforms the DVFS-MODPSO. Even though the proposed algorithm isn’t significantly better than HEFT, it can still win in most circumstances. Data transfer times for DVFS-MODPSO, HEFT, and technique proposed are displayed in Fig. 3. In contrast to HEFT and the other methods, DVFS-MODPSO algorithm will not provide a means of minimizing the time spent passing data between tasks. Both HEFT and the proposed method were able to reduce data transmission time by over 10% compared to DVFS-MODPSO. The proposed approach includes a procedure for choosing optimal pairings of jobs and virtual machines. HEFT and DVFS-MODPSO do not include this step. So, it can be shown that the proposed strategy effectively shortens the duration of the data transfer. To compare the suggested method’s energy consumption to that of already known methods, see Fig. 4. Improved makespan is directly responsible for a lower energy bill as it relates to scheduling. Here, we think about using 25 virtual machines to carry out the duties. The longevity of the proposed approach is increased. At full load, with all 25 virtual machines operational, the suggested technique consumes 64 J of energy. 3000

Fig. 2 Makespan of HEFT, DVFS-MODPSO, and proposed algorithm

2750 2500

HEFT DVFS-MODPSO Proposed

Makespan (sec)

2250 2000 1750 1500 1250 1000 750 500 250 50

100

150

200

Number of Tasks

250

300

162

G. B. Hima Bindu et al. 200

Fig. 3 Data transfer time of HEFT, DVFS-MODPSO, and proposed algorithm

180

HEFT DVFS-MODPSO Proposed

Data Transfer Time (sec)

160 140 120 100 80 60 40 50

100

150

200

250

300

Number of Tasks

80

Proposed DVFS-MODPSO HEFT

70

Energy Consumption in (J)

Fig. 4 Energy consumption of HEFT, DVFS-MODPSO, and proposed algorithm

60

50

40

30

20 5

10

15

20

25

Number of VMs

When stacked up against other methods like HEFT and DVFS-MODPSO, it excels. Comparatively, DVFS-MODPSO used 69 J of energy whereas the HEFT used 76 J. Three algorithms’ overall performance for energy usage, data transfer time, and makespan is compared in Fig. 5. Makespan for HEFT was 4.3% on average. Ten percent less power was used and sixteen percent less data transferred time. Saving 12.8% on energy consumption, 2.3% on makespan, and 6% on data transfer time were all attained using DVFS-MODPSO. There was a 20.4% decrease in energy consumption, a 23.9% decrease in data transfer time, and a 12.5% decrease in makespan with the proposed technique. The study shows that the simulation environment yields optimal makespan value and data transmission time while minimizing energy consumption.

7 Conclusion The results of the study described in this thesis have led to a number of innovative successes and contributions, including the following: In this chapter, we focused on an optimization model to allocate the tasks to available servers efficiently. For the purpose of scheduling tasks in the cloud, a novel technique named energy aware multi-objective genetic algorithm was presented. The

QOS Enhanced Energy Aware Task Scheduling Models in Cloud Computing 25

HEFT DVFS-MODPSO Proposed

20

Improvement (%)

Fig. 5 Performance comparison of HEFT, DVFS-MODPSO, and proposed algorithm

163

15

10

5

0 Energy Savings

Makespan

Data Transfer Time

Objectives

suggested algorithm balances competing concerns such power consumption, communication latency, and network range. It generates a variety of secondary outcomes, allowing users to pick their chosen objective and plan their activities as needed. The second major contribution is the ACO auction model that was created for cutting down on expenses when picking resources and maximizing efficiency when allocating them. The client had the awareness of the resource cost than the resource provider. In the meantime, resource provider concentrates on the profit. Hence the resource provider has the additional benefit in the auction method. The auction model helps the clients in selecting the resources intelligently. In order to make the most efficient use of cloud resources, the ACO model is implemented. Finally, a multiobjective algorithm that makes use of the non-dominated technique and the crowding distance approach was presented. In order to meet the needs of the users, the suggested method calculates the QOS for the VM before allocating them to tasks.

References 1. Huang, C.J., Guan, C.T., Chen, H.M., Wang, Y.W., Chang, S.C., YuLi, C., Weng, C.H.: An adaptive resource management scheme in cloud computing. Eng. Appl. Artif. Intell. 26, 382– 389 (2013) 2. Arunarani, A.R., Manjula, D., Sugumaran, V.: Task scheduling techniques in cloud computing: a literature survey. Future Gener. Comput. Syst. 91, 407–415 (2019) 3. Houssein, E.H., Gad, A.G., Wazery, Y.M., Suganthan, P.N.: Task scheduling in cloud computing based on meta-heuristics: review, taxonomy, open challenges, and future trends. Swarm Evol. Comput. 100841 (2021) 4. Panda, S.K., Nanda, S.S., Bhoi, S.K.: A pair-based task scheduling algorithm for cloud computing environment. J. King Saud Univ. Comput. Inf. Sci. 1–12 (2018) 5. Bittencourt, L.F., Goldman, A., Madeira, E.R.M., da Fonseca, N.L.S., Sakellariou, R.: Scheduling in distributed systems: a cloud computing perspective. Comput. Sci. Rev. 30, 31–54 (2018) 6. Attiya, I., Abd Elaziz, M., Xiong, S.: Job scheduling in cloud computing using a modified harris hawks optimization and simulated annealing algorithm. Comput. Intell. Neurosci. (2020) 7. Abazari, F., Analoui, M., Takabi, H., Fu, S.: MOWS: multi-objective workflow scheduling in cloud computing based on heuristic algorithm. Simul. Modell. Pract. Theory 1–19 (2018)

164

G. B. Hima Bindu et al.

8. Ismayilov, G., Topcuoglu, H.R.: Neural network based multi-objective evolutionary algorithm for dynamic workflow scheduling in cloud computing. Future Gener. Comput. Syst. 102, 307– 322 (2020) 9. Juarez, F., Ejarque, J., Badia, R.M.: Dynamic energy-aware scheduling for parallel task-based application in cloud computing. Future Gener. Comput. Syst. 78, 257–271 (2018) 10. Boveiri, H.R., Khayami, R., Elhoseny, M., Gunasekaran, M.: An efficient Swarm-Intelligence approach for task scheduling in cloud-based internet of things applications. J. Ambient Intell. Hum. Comput. 10(9), 3469–3479 (2019) 11. Orgerie, A.-C., de Assuncao, M.D., Lefevre. L.: A survey on techniques for improving the energy efficiency of large-scale distributed systems. ACM Comput. Surv. (CSUR) 46(4) (2014) 12. Cao, B., Zhang, J., Liu, X., Sun, Z., Cao, W., Nowak, R.M., Lv, Z.: Edge-cloud resource scheduling in space-air-ground integrated networks for internet of vehicles. IEEE Internet of Things J (2021) 13. Paya, A., Marinescu, D.C.: Energy-aware load balancing and application scaling for the cloud ecosystem. In: IEEE (2015) 14. Dabbagh, M., Hamdaoui, B., Guizani, M, Rayes, A.: Towards energy-efficient cloud computing: prediction, consolidation, and over commitment. IEEE (2015) 15. Topcuoglu, H., Hariri, S., Wu, M.-Y.: Task scheduling algorithms for heterogeneous processors. In: Heterogeneous Computing Workshop (HCW’99). San Juan (1999) 16. Wilczy´nski, A., Kołodziej, J.: Modelling and simulation of security-aware task scheduling in cloud computing based on Blockchain technology. Simul. Model. Pract. Theory 99, 102038 (2020) 17. Tayal, S.: Tasks scheduling optimization for the cloud computing systems. Ijaest Int. J. Adv. Eng. Sci. Technol. 1(5), 111–115 (2011) 18. Li, J.F., Peng, J., Cao, X., Li, H.Y.: A task scheduling algorithm based on improved ant colony optimization in cloud computing environment. Energy Procedia 13, 6833–6840 (2011) 19. Genez, T.A.L., Bittencourt, L.F., Madeira, E.R.M.: Workflow scheduling for SaaS/PaaS cloud providers considering two SLA levels. In: IEEE/IFIP NOMS (2012) 20. Yu, J., Buyya, R., Tham, C.K.: Cost-based scheduling of scientific workflow applications on utility grids. In: International Conference e-Science and Grid Computing, pp. 140–47 (2005) 21. Lin, J., Cui, D., Peng, Z., Li, Q., He, J., Guo. M.: Virtualized resource scheduling in cloud computing environments: an review. In: IEEE Conference on Telecommunications, Optics and Computer Science (TOCS), pp. 303–308 (2020) 22. Avinaash, M.R., Kumar, G.R., Bhargav, K.A., Prabhu, T.S., Reddy, D.I.: Simulated annealing approach to solution of multi-objective optimal economic dispatch. In: 7th International Conference on Intelligent Systems and Control (ISCO), pp. 127–132. IEEE (2013) 23. Kamjoo, A., Maheri, A., Dizqah, A.M., Putrus, G.A.: Multi-objective design under uncertainties of hybrid renewable energy system using NSGA-II and chance constrained programming. Int. J. Electric. Power Energy Syst. 74, 187–194 (2016)

Power Quality Improvement Using Hybrid Filters Based on Artificial Intelligent Techniques Soumya Ranjan Das, Prakash Kumar Ray, and Debani Prasad Mishra

Abstract To provide a cost-effective and uninterruptible power supply in commercial and industrial applications, power quality (PQ) is a crucial concern. PQ problems are primarily brought on by power electronic equipment, including switch mode power supplies (SMPS), personal computers, fluorescent lighting, and other interface converters. These loads cause nonlinearity and generate harmonics, which have a considerable influence on the effectiveness and performance of the system. By using custom power devices (CPDs) such passive, active, and hybrid filters, harmonics in the utility network can be reduced. But as harmonic orders rise, the design of passive filters (PFs) becomes more intricate and weighty, and the performance of active power filters (APFs) suffers when higher order harmonics are present. In order to provide cost-effective solutions while overcoming the limitations of PFs and APFs, combinations of these filters are devised as hybrid APFs (HAPF). The combination of distributed generation technologies, like as solar photovoltaic (PV), wind turbines (WT), and battery energy storage systems (BESS), has changed the electricity landscape recently. For generating the reference currents of HAPF, several artificially intelligent (AI) and adaptive strategies are taken into consideration. Utilising model predictive control (MPC) and hysteresis current control (HCC), HAPF gating signals are generated (MPC). Additionally, maximum power point tracking (MPPT) approaches like Perturb and Observe (P&O), Perturb and Observe (PO)Fuzzy (PO-F), and Adaptive Fuzzy Logic Controls (AFLC) are taken into consideration for enhancing the PV system’s performance likely to increase the stability of the DC link voltage in HAPF. The power system and HAPF model are constructed using MATLAB/Simulink tool. S. R. Das (B) · D. P. Mishra Department of Electrical Engineering, IIIT Bhubaneswar, Bhubaneswar, India e-mail: [email protected] D. P. Mishra e-mail: [email protected] P. K. Ray Department of Electrical Engineering, OUTR Bhubaneswar, Bhubaneswar, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 S. R. Dash et al. (eds.), Intelligent Technologies: Concepts, Applications, and Future Directions, Volume 2, Studies in Computational Intelligence 1098, https://doi.org/10.1007/978-981-99-1482-1_8


Keywords Active filters · Adaptive linear neuron · Adaptive fuzzy logic control · Artificial intelligence · Maximum power point tracking · Microgrid · Power quality · Recursive least square · Total harmonics distortion

1 Introduction
The diversity of generation and load demand, the utilisation of distributed energy sources (DES) [1], and grid integration with the aid of interface converters have all presented significant design issues for the current power system. The load is divided into linear and nonlinear types [2], the latter consisting primarily of electronic devices such as computers, battery chargers, and switch mode power supplies that are extremely sensitive to power quality (PQ) issues like brief power outages, voltage surges and sags, harmonics, transients, and other waveform distortions that result in losses in revenue. It has been noted that the above-mentioned sensitive equipment, which is increasing daily, accounts for a sizeable portion of the load [3, 4]. It is crucial to reduce harmonics in order to improve operation and control in the context of changing scenarios in the modern power system. The DES, which include solar, wind, fuel cells, tidal energy, mini-hydro, and micro-turbines, are currently taking the lead [5] in power system generation. More focus is needed on addressing PQ concerns as DES penetration into the main grid gradually increases. The concept of microgrids (MGs) [6] was developed to address the issues with a single DES system. MGs incorporate several energy sources and some storage technologies to increase the system's dependability. In order to increase system stability, technology for PQ upgrades in the microgrid is crucial. To address these issues, several types of custom power devices (CPDs) [7] are used to improve PQ, including passive filters (PFs), active power filters (APFs), hybrid active power filters (HAPFs), and unified power quality conditioners (UPQCs). Recently, considerable emphasis has been placed on the integration of DES, particularly with photovoltaic (PV) [8] systems. The nonlinear devices in the power system cause PQ issues such as harmonic current and voltage generation, VAR consumption, and more.

2 Power Quality Improvement Using Hybrid Filters in PV Integrated Power System Recent years have seen a remarkable improvement in the use of renewable energy sources (RES) [9], particularly solar photovoltaic (PV), which is utilised for a variety of purposes including battery charging, home supply, pumping systems, etc. However, factors like solar irradiation, temperature, dust level, and load affect how well PV systems perform. Due to changes in solar irradiation and atmospheric


conditions, the PV system’s connection between power and voltage is not linear and changes over time. Therefore, to continually watch the system parameters and get the maximum power from PV under changing climatic situations, an effective maximum power point tracking (MPPT) method is needed. Before being connected to the DC link, the electricity produced by the PV arrays needs to be power-conditioned [10]. The PV-based HAPF is used as a result. In order to enhance the PQ, the PVbased HAPF is used as a compensatory device. Benefits including improved power factor, load balancing, compensating harmonic, and reactive power are all provided by the operation of PV generation systems in conjunction with HAPF. In addition to supporting home loads, PV systems have the potential to eliminate undesirable (harmonic/imbalanced) currents from a utility grid [11]. In light of these observations, this part presents the case study that was used to examine the effectiveness of the PV-integrated HAPF utilising AI-based methods. The following subsequent subsection presents the PQ improvement with a PV-integrated traditional series HAPF built with the Robust Extended Complex Kalman Filter (RECKF) and Perturb and Observe Fuzzy (PO-F).

2.1 Case Study: PQ Improvement in Three Phase System Using PV Integrated Conventional VSI Based Series HAPF Designed by Robust Extended Complex Kalman Filter (RECKF) and Perturb and Observe Fuzzy (PO-F)
This case study presents a RECKF-based series HAPF with an integrated PV system to generate the compensating current for lowering harmonics in the distribution network. The proposed RECKF is used for creating the reference signal in the HAPF based on an exponentially weighted function under grid perturbations and uncertainties. The switching pulses are produced using hysteresis current control (HCC) and model predictive control (MPC). Additionally, the PV system tracks the maximum power using a PO-F-based MPPT approach. The proposed MPPT offers a fast transient response and fast tracking under various atmospheric circumstances and is robust to variations in PV system parameters. The comparative enhancement of the proposed RECKF with MPC over the traditional HCC approach is given under various operating conditions. The harmonic improvement case study is evaluated using the MATLAB/Simulink tool.

2.2 System Configuration and Modelling In order to increase PQ in a three-phase system, the configuration of HAPF with PV integration is suggested in this chapter. The PV-integrated traditional VSI-based series HAPF is depicted in Fig. 1.

168

S. R. Das et al. Vs

VL

Za Zb

isa

Zc

i sb

R L

Tr

isc

Nonlinear Loads

Lr Cr

Ripple filter

Vpv

Photovoltaic array

C pv

Cvsc

ifabc v DC dc V Bus

5th 7th

Pulse

IPV MPPT

VPV

VSI

DBC

Duty Cycle

Control Technique

* _ Vdc-PV

Fig. 1 Configuration PV integrated conventional VSI-based series HAPF

2.3 Control Strategies for PV Integrated HAPF
In this chapter, PO-F is used for MPPT in the PV system. Again, RECKF is employed for designing the HAPF.

2.3.1

Perturb and Observe Fuzzy (PO-F) MPPT Technique

PO-F is an improved and adaptive form [12] of the traditional P&O algorithm. Compared to traditional MPPT techniques, PO-F offers benefits including simpler and faster operation and the capacity to react quickly under rapidly changing weather, partial shading, and noisy operating conditions. The proposed PO-F algorithm is presented in Fig. 2. Seven membership functions are taken into account for building the fuzzy inputs/outputs in this work: negative extra-large (NEL), negative large (NL), negative (N), zero (Z), positive (P), positive large (PL), and positive extra-large (PEL). To create the appropriate output value of D, the crisp values of the inputs P and V are translated into linguistic variables, which are then converted back to crisp values. A Mamdani-type fuzzy inference system is used to design the output variable for MPPT in the PV system, and the variables are defuzzified using the Centroid of Area (COA) approach.
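Since PO-F builds on the classical P&O loop, a minimal sketch of the underlying perturb-and-observe step is given below; the function name, step size and sampling are assumptions, and the fuzzy adaptation of the step size described above is not shown:

# Classical P&O voltage-reference update that PO-F adapts; names and values are assumed.
def perturb_and_observe(v_prev, p_prev, v_now, p_now, step=0.5):
    # If the last perturbation increased PV power, keep moving in the same direction;
    # otherwise reverse the perturbation.
    if (p_now - p_prev) * (v_now - v_prev) >= 0:
        v_ref = v_now + step
    else:
        v_ref = v_now - step
    return v_ref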

2.3.2

Robust Extended Complex Kalman Filter (RECKF)

When grid perturbations and uncertainties are present, the proposed RECKF is used to generate the reference signal [13] in HAPF based on an exponential weighted function. The measurement of the fundamental voltage component in phase and the estimation of the fundamental amplitude of the load current are used to produce the

Power Quality Improvement Using Hybrid Filters Based on Artificial …

169

Fig. 2 Control diagram of PO-F method

Fig. 3 Block diagram of RECKF

reference current. The advantage is that it minimises peak undershoots and overshoots problems, which could compromise the stability and security of the switching devices. The RECKF delivers reliable performance under parametric variation and uncertainties and guarantees efficient tracing of the reference signals. Model predictive control (MPC) and hysteresis current control (HCC) are used to produce the switching pulses (MPC). Figure 3 displays the RECKF block diagram.
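For the HCC stage mentioned above, a minimal sketch of a single-phase hysteresis current comparator is shown below; the band width and signal sources are assumptions, and the chapter's MPC alternative is not shown:

# Hysteresis current control: switch the inverter leg when the current error leaves the band.
def hcc_gate(i_ref, i_meas, gate_prev, band=0.5):
    error = i_ref - i_meas
    if error > band:
        return 1       # actual current too low: turn the upper switch on
    if error < -band:
        return 0       # actual current too high: turn the upper switch off
    return gate_prev   # inside the band: keep the previous switching state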

2.4 Results Analysis of the Case Study The harmonic analysis simulation results and discussion for a grid-connected PV system are presented in this subsection. An established series HAPF based on VSI is used to create a solar PV with grid connections. Under various nonlinear loading scenarios, a comparison between RECKF with HCC and RECKF approach with MPC is conducted. Using the PO-F MPPT approach, the maximum power delivered

170

S. R. Das et al.

il

0

-20 0

0.05

0.1

Time (seconds)

Mag (% of Fundamental)

FFT analysis

20

THD= 2.36%

15 10 5 0

0

500

1000

Frequency (Hz)

(a)

(b)

Fig. 4 Performance of PV-HAPF in three-phase system under balanced load using RECKF-HCC technique a supply current, b THD value

by the PV system is monitored. Using the MATLAB/SIMULINK programme, the suggested PV-integrated series HAPF system is created. After employing RECKFHCC to connect the PV-integrated HAPF in three-phase system, a significant increase in harmonic distortion mitigation is seen in Fig. 4. The calculated THD value is 2.36%. The suggested system is tested simultaneously under an unbalanced nonlinear load, and the simulated outcomes are displayed in Fig. 5. THD is discovered to be 4.20%. Designing the RECKF-MPC-based HAPF under balanced and unbalanced nonlinear loading situations further develops the improvement in PQ. There is a significant reduction in harmonics. Figures 6 and 7 show, for a balanced and an unbalanced load, respectively, how the suggested system performs. The THD value was 0.40%, which is significantly better than the results from traditional approaches. Reactive power compensation is proven to be substantially more effective than RECKF-HCC in suppressing harmonics in voltage and current at the same time under an unbalanced load. THD has improved since the THD value of 1.76% was discovered. The results indicate that the proposed RECKF-MPC-based HAPF will perform better in terms of PQ than the RECKF-HCC controller. FFT analysis

0 -2 0

0.05

Time (seconds)

(a)

0.1

Mag (% of Fundamental)

is

2

THD= 4.20%

15 10 5 0

0

500

1000

Frequency (Hz)

(b)

Fig. 5 Performance of PV-HAPF in three-phase system under unbalanced load using RECKF-HCC technique a supply current, b THD value

Power Quality Improvement Using Hybrid Filters Based on Artificial …

171


Fig. 6 Performance of PV-HAPF in three-phase system under balanced load using RECKF-MPC technique a supply current, b THD value


Fig. 7 Performance of PV-HAPF in three-phase system under unbalanced load using RECKF-MPC technique a supply current, b THD value

3 Artificial Intelligent Methods for PQ Improvement in DC Microgrid Integrated Power System

The current power system has undergone a revolutionary transition in its use of renewable energy sources and energy storage technologies. Recent years have seen a considerably higher penetration of DES into the electricity system, allowing it to achieve sustainability and guarantee resiliency [14]. Additionally, DES addresses the gradually rising energy consumption while reducing the consequences of climate variation and the ongoing depletion of energy resources, which helps to meet the demands of regional economic and social development. The DES creates a direct current (DC) microgrid by integrating unconventional sources such as solar (PV) and wind turbines with battery energy storage systems (BESS). Reduced line losses and improved system efficiency are benefits of the DC MG [15] over the AC MG. Additionally, unlike conventional grid-tied inverters, tracking of the phase and frequency of the AC voltage is not required in a DC MG, whereas this tracking has a significant impact on the controllability and reliability of an AC MG. Therefore, a DC MG is better suited for the integration of dispersed energy sources. However, connecting the MG to the grid is undoubtedly difficult and calls for greater attention


to the control methods employed for the utilised converters. Shunt hybrid active power filters (HAPF), which are typically combined with MG and are developed with various intelligent control schemes, are used to handle various PQ issues.

3.1 Case Study: PQ Improvement in Three Phase System Using PV Integrated Conventional VSI Based Series HAPF Designed by Robust Extended Complex Kalman Filter (RECKF) and Perturb and Observe Fuzzy (PO-F)

A PV and BESS-based DC MG coupled with the HAPF is considered for the enhancement of PQ in the first case study. The proposed model utilises the dual tree-complex wavelet transform (DT-CWT), a state-of-the-art signal processing technique, to extract the frequency data from the voltage and current affected by PQ issues. In order to increase performance under various operating conditions, the PO-F method is proposed for MPPT in PV systems. ZA-LMS is used in the HAPF for measuring the reference current and gating signals. The performance of ZA-LMS and of LMS with DT-CWT is compared under various operating conditions with nonlinear loadings and PV system uncertainty.

3.2 System Configuration and Modelling

To increase the PQ in a three-phase system, the DC MG integration with shunt HAPF topology is suggested in this chapter. The DC MG integrated shunt HAPF is shown in Fig. 8 and is powered by PV and BESS.

3.3 Control Strategies for PV Integrated HAPF

Perturb and Observe Fuzzy (PO-F), a method for MPPT in PV systems, is employed in this subsection. Utilising the ZA-LMS with HCC, the DT-CWT is used to produce the reference and switching signals of the VSI.

3.3.1 Zero Attracting-LMS (ZA-LMS)

In order to remove load current distortions, this technique [16] integrates a zero attractor with the LMS framework. The configuration of the control algorithm based on the ZA-LMS is illustrated in Fig. 9. By analysing the distorted load current, the suggested controller can be used to measure the reference grid currents.
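The core of the method is the zero-attracting modification of the LMS weight recursion, w(n+1) = w(n) + mu*e(n)*x(n) - rho*sign(w(n)). A minimal per-sample sketch is given below; the step size mu and attractor strength rho are illustrative values, and the full per-phase structure of Fig. 9 (in-phase/quadrature templates, DC-link PI loop) is omitted.

import numpy as np

def za_lms_step(w, x, d, mu=0.01, rho=1e-4):
    # w: weight vector, x: regressor (e.g. unit templates), d: sensed load current sample
    e = d - np.dot(w, x)                        # a priori estimation error
    # The extra -rho*sign(w) term is the zero attractor: it pulls small, noisy
    # weights towards zero, which is what distinguishes ZA-LMS from plain LMS.
    w_next = w + mu * e * x - rho * np.sign(w)
    return w_next, e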


Fig. 8 Shunt HAPF integrated with the PV- and BESS-based DC MG

3.3.2 DT-CWT

This method analyses signals using time–frequency resolutions, which are very useful for finding PQ problems in the power system. For the purpose of estimating the fundamental components, the load currents are processed separately by the DT-CWT [17]. The complex form of each individual phase current is computed using two different DWT decompositions: one produces the real coefficients, while the other produces the imaginary coefficients. The two trees in the DT-CWT are therefore referred to as the real tree and the imaginary tree. Each phase's active load current component is depicted using the imaginary tree in quadrature with the phase's observed load current. The decomposition structure of the phase 'a' load current is illustrated in Fig. 10. The relevant active power components are computed using the estimated quadrature fundamental load current components.
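For readers who want to experiment with the decomposition, the open-source dtcwt Python package (an assumption here; the chapter does not state which implementation the authors used) exposes a one-dimensional dual-tree transform whose complex detail coefficients correspond to the real and imaginary trees described above.

import numpy as np
import dtcwt  # open-source dual-tree complex wavelet transform package

fs, f0 = 10_000, 50
t = np.arange(1024) / fs            # 1024 samples keeps the length divisible by 2**nlevels
i_la = 10 * np.sin(2 * np.pi * f0 * t) + 2 * np.sin(2 * np.pi * 5 * f0 * t)  # distorted phase-'a' current

transform = dtcwt.Transform1d()
pyramid = transform.forward(i_la, nlevels=4)

# pyramid.lowpass holds the coarse (approximate) coefficients; pyramid.highpasses is a
# tuple of complex detail coefficients per level, whose real and imaginary parts come
# from the two DWT trees (the real and imaginary trees).
for level, coeffs in enumerate(pyramid.highpasses, start=1):
    print(level, coeffs.shape, np.iscomplexobj(coeffs))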


Fig. 9 ZA-LMS-based control algorithm

3.4 Results and Discussions

In this part, a case study using the HAPF integrated with the DC MG and artificial intelligence approaches is described for active and reactive power management and improvement in PQ. The performance of the DC MG-based, three-phase grid-connected HAPF for harmonic compensation is examined under various loading scenarios. The power system model is built in the MATLAB/Simulink environment.

3.4.1 Dynamic Loading Condition

Here, the three-phase source current waveform remains balanced and almost sinusoidal; the waveform is displayed in Fig. 11a. The load is reconnected to the system at



Fig. 10 Decomposition of phase ‘a’ load current with DT-CWT

0.2 s, and as a result the waveform is seen to be continuous from that point on. In contrast to the dynamics in phase 'c', the load current waveforms in phases 'a' and 'b' are stable. Similarly, Fig. 11b shows the dynamics of the supply and load currents in the ZA-LMS case; the load switching is again reflected in the fluctuations of the source and load currents. When using ZA-LMS instead of the LMS technique, the source current in phase 'c' is observed to improve greatly. Figure 12a and b show the THD values of the source current using LMS at 500 W/m2 and 1000 W/m2, respectively. In Fig. 13, the THD values of the supply current employing the ZA-LMS are shown at (a) 500 W/m2 and (b) 1000 W/m2. Under various loading situations and


Fig. 11 Simulation results of three-phase source currents using HAPF during the dynamic state with a using LMS and b ZA-LMS



Fig. 12 THD values of source current using LMS a 500 W/m2 b 1000 W/m2


Fig. 13 THD values of source current using ZA-LMS a 500 W/m2 b 1000 W/m2

PV system uncertainties, the DT-CWT with LMS and with ZA-LMS based HAPFs are proposed for PQ enhancement in a PV and BESS-based DC MG. It has been found that the load receives a consistent power supply and that the DC link voltage is better regulated with the incorporation of the DC MG.

4 Artificial Intelligent Methods for PQ Improvement in Hybrid Microgrid System

Distributed generation (DG) sources are incorporated into the existing power network, which is where PQ issues arise. Microgrids, which are adequate and flexible platforms, can be used to integrate DGs and energy storage systems in order to meet the energy demand. Major power networks are of the AC type and continue to hold a disproportionate share of the market, so it is unlikely that fully DC MGs [18] will arise. To fully demonstrate the benefits of AC and DC distribution networks, a hybrid MG (HMG) [19] should be envisaged in light of faster renewable energy integration, improved power conversion efficiency, reduced demand for energy storage, and other aspects.


Because of the presence of nonlinear loads, harmonic pollution is a serious concern in HMG. These harmonics, particularly the low-order ones, have the potential to increase losses, interact with grid-connected equipment, and possibly induce MG resonances. CPDs can be used to provide low-impedance paths for harmonics, hence improving the PQ of the other nodes in the MG. This lessens the influence of the harmful harmonics. Without incurring additional costs, hybrid active power filters (HAPF) may provide harmonic correction in AC and DC MG.

4.1 System Configuration and Modelling

In this chapter, an HMG with interfaced DC and AC local loads as well as a shunt HAPF is recommended in order to increase PQ. Figure 14 depicts the HMG with a number of DGs interconnected across the DC and AC buses using DC/DC, DC/AC, and AC/DC/AC converters. By turning on the DC integration mode, the suggested solution seeks to enhance the PQ of the MGs while protecting them from any adverse effects. The AC integration is turned on to boost MG efficiency when utility power is steady. The integration of the AC/DC MG is used to analyse the performance of the HAPF. The proposed methodologies' functionality is tested in both grid-linked and islanded modes.


Fig. 14 Control structure of HMG


4.2 Control Strategies for HMG Integrated with HAPF

In this sub-section, three different control techniques are presented, which are explained in brief in subsequent sections.

4.2.1 Fuzzy Adaptive Grasshopper Optimization Algorithm (FAGOA)

FAGOA is a modified version of the Grasshopper Optimization Algorithm (GOA), which was created based on the traits of grasshoppers in their ecological habitat [20]. Grasshoppers naturally congregate into very large swarms, a behaviour that the algorithm mimics for exploration and exploitation. In the proposed FAGOA, the update factor for each candidate solution is computed using fuzzy logic.
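For reference, one position update of the plain GOA (without the fuzzy adaptation layer of FAGOA, which would replace the fixed coefficient c) can be sketched as follows; the social-function constants f and l and the clipping to the bounds are commonly used defaults, assumed here rather than taken from the chapter.

import numpy as np

def s_func(r, f=0.5, l=1.5):
    # Grasshopper social interaction: attraction/repulsion as a function of distance
    return f * np.exp(-r / l) - np.exp(-r)

def goa_step(positions, target, c, lb, ub):
    # positions: (n_agents, dim) array, target: best solution found so far,
    # c: decreasing coefficient (the quantity a fuzzy block would adapt in FAGOA)
    n, dim = positions.shape
    new_pos = np.empty_like(positions)
    half_range = (ub - lb) / 2.0
    for i in range(n):
        social = np.zeros(dim)
        for j in range(n):
            if i == j:
                continue
            dist = np.linalg.norm(positions[j] - positions[i])
            unit = (positions[j] - positions[i]) / (dist + 1e-12)
            social += c * half_range * s_func(dist) * unit
        new_pos[i] = np.clip(c * social + target, lb, ub)
    return new_pos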

4.2.2 Modified Recursive Gauss–Newton (MRGN)

The traditional RGN is an intricate algorithm that needs a lot of memory for repeated calculations. In contrast, the MRGN approach is far less complex than the original method. The MRGN approach is further examined in this section [21]. Figure 15 shows the control block of the suggested MRGN approach.


Fig. 15 Control block of the proposed MRGN technique


4.2.3 Proposed AFL-MPPT Method

Fuzzy logic (FL) control is widely implemented because of its simplicity, its capacity to handle system nonlinearity, and its tolerance of incomplete mathematical models. Because of uncertainty characteristics and meteorological conditions, tracking the maximum power point of solar PV panels is particularly difficult. The operation and design of the FL control can be summed up in three primary stages: fuzzification, rule assessment, and defuzzification. Variations in PV output voltage and current observed in the first stage are used to choose the input membership functions (MFs) for the FL-based MPPT; the number of input MFs determines the accuracy of the controller. In the second stage, the control action is selected using the FLC linguistic rules, and various fuzzy membership functions are used here to assign the FL controller's inputs and output. In the third stage, defuzzification determines the output MF. The FL MPPT controller accepts the error and the change in error as inputs; the error signal can be measured directly. When employing the AFL-MPPT method, two input MFs are considered to evaluate the error and the change in error. From the various input and output MFs, seven fuzzy subsets are formed. The rules that connect the input and output MFs for the proposed AFL-MPPT technique are displayed in Table 1, and Fig. 16 shows the circuit diagram of the proposed AFL-MPPT approach.

Table 1 Fuzzy rule base for AFL-MPPT technique

E \ dE   NEL   NEM   NES   Z     POS   POM   POL
NEL      NEL   NEL   NEL   NEL   NEM   NES   Z
NEM      NEL   NEL   NEL   NEM   NES   Z     POS
NES      NEL   NEL   NEM   NES   Z     POS   POM
Z        NEL   NEM   NES   Z     POS   POM   POL
POS      NEM   NES   Z     POS   POM   POL   POL
POM      NES   Z     POS   POM   POB   POL   POL
POL      Z     POS   POM   POL   POL   POL   POL

Fig. 16 Circuit diagram of the proposed AFL-MPPT method for PV systems
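To make the rule base concrete, the sketch below transcribes Table 1 into a lookup structure and evaluates it for crisp inputs; the crisp thresholds used to assign linguistic labels are invented purely for illustration, whereas a real AFL-MPPT controller would use overlapping membership functions and defuzzification.

# Linguistic labels in the order used in Table 1.
LABELS = ["NEL", "NEM", "NES", "Z", "POS", "POM", "POL"]

# Rule base transcribed from Table 1 (rows: E, columns: change in E).
RULES = {
    "NEL": ["NEL", "NEL", "NEL", "NEL", "NEM", "NES", "Z"],
    "NEM": ["NEL", "NEL", "NEL", "NEM", "NES", "Z",   "POS"],
    "NES": ["NEL", "NEL", "NEM", "NES", "Z",   "POS", "POM"],
    "Z":   ["NEL", "NEM", "NES", "Z",   "POS", "POM", "POL"],
    "POS": ["NEM", "NES", "Z",   "POS", "POM", "POL", "POL"],
    "POM": ["NES", "Z",   "POS", "POM", "POB", "POL", "POL"],  # 'POB' as printed in Table 1
    "POL": ["Z",   "POS", "POM", "POL", "POL", "POL", "POL"],
}

def classify(value, thresholds=(-0.6, -0.3, -0.05, 0.05, 0.3, 0.6)):
    # Crude crisp classification of a normalised input into one linguistic label.
    for label, th in zip(LABELS[:-1], thresholds):
        if value < th:
            return label
    return LABELS[-1]

def rule_output(error, delta_error):
    # Look up the linguistic output for crisp (E, dE) inputs.
    return RULES[classify(error)][LABELS.index(classify(delta_error))]

print(rule_output(0.4, -0.1))   # e.g. -> 'POS'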


4.3 Results and Discussions

The behaviour of the intended MG in the grid-linked and islanded modes, and during the transition between them, is determined using simulation tests conducted in Simulink. The DC link capacitor of the grid interface inverter is linked to the DC bus of the AC/DC MG. Grid-linked mode is used to connect the main grid to the AC MG. In islanded mode, when the main grid is separated, the purpose of the HAPF is to provide both active and reactive power to the load connected at the PCC in order to compensate for harmonics. Two different types of loads, an inductive load and a capacitive load, are assessed in the simulation analysis; in both cases, a resistor and an inductor or capacitor are connected in series. To introduce nonlinearity into the system for the RL load, a diode bridge rectifier (DBR) is connected in series; here, L is 100 mH and R is 40 Ω. To handle the capacitive load, the DBR and an RC circuit with R = 40 Ω and C = 100 µF are coupled in parallel.

4.3.1 Grid Linked Mode with Inductive Load

The grid-connected AC/DC MG with an inductive load is first operated without any filtering device. Due to the nonlinear load behaviour, the load current is distorted and its THD is estimated to be 17.06%. The harmonics of the distorted load current and PCC voltage are shown in Fig. 17a and b, respectively. The various power system parameters are displayed in Fig. 18. It is clear from the simulation results that the HAPF provides the filtering current needed to improve the grid current and PCC voltage. The waveforms under the inductive load are shown in Fig. 18a, b, and c. The THD values of the grid current and PCC voltage are calculated to be 2.28% and 2.32%, respectively.

Fig. 17 FFT analysis with inductive load a load current b PCC voltage



Fig. 18 Performance of the system using inductive load a various waveforms (supply voltage and current, load current, injected APF and PF current, DC link and PCC voltage, active and reactive power drawn by the load) under grid-linked mode, b FFT of supply current and c FFT of compensated PCC voltage

4.3.2 Transient Study

The dynamic response of the system implementing the suggested HAPF is examined by switching from an inductive to a capacitive load at 0.5 s. Different waveforms from the transient analysis are displayed in Fig. 19. When the switching takes place, a reactive power exchange occurs between the grid and the load, and the DC link voltage increases during the transient phase; this exchange is clearly visible in the figure. At the end of the transient, the grid no longer supplies reactive power, leaving the HAPF as the only source of reactive power.


Fig. 19 Characteristics of supply voltage and current, load current, injected APF and PF current, DC link and PCC voltage, active and reactive power, when switching occurs from inductive to capacitive load in grid mode

4.3.3 Islanded Mode

In islanded mode, the supply system is isolated and the compensating device delivers both active and reactive power to the connected load in order to improve the sinusoidality of the voltage there. A diode bridge rectifier (DBR) is employed in the analysis under a nonlinear inductive load with R = 40 Ω and L = 100 mH. The inductive load does not draw any grid current in this mode since the grid is isolated, and once more no grid power is used for active or reactive purposes; this behaviour is illustrated in Fig. 20.

4.3.4 Transition from Grid Connected to Islanded Mode of Operation

The system is simulated in order to investigate how the proposed HAPF performs when changing from grid-connected to islanded mode of operation under nonlinear inductive loading conditions. Here, the main grid is purposefully cut off at 0.5 s. A few of the waveforms are shown in Fig. 21. When switching from grid to islanded mode, the grid current drops to zero. Additionally, the DC voltage, active power, and reactive power are all affected at the time of switching


Fig. 20 Characteristics of supply voltage and current, load current, injected APF and PF current, DC link and PCC voltage, active and reactive power for inductive load in islanded mode

before stabilising. The robustness of the proposed HAPF is evident from the way the harmonics in the PCC voltage are corrected while operating in grid-connected, transition-period, and islanded modes. The performance of the suggested MRGN-based HAPF is evaluated in an HMG integrated power system, with different loading scenarios and inverter switching states used in the simulations. The DC MG effectively stabilises the DC voltage of the VSI of the proposed HAPF under the aforementioned operating conditions. The suggested MRGN-based HAPF significantly improved its compensating performance with the combination of AC and DC MGs under varied operating situations.

5 Conclusion

This thesis describes the enhancement of PQ in a three-phase distribution system utilising compensating devices to reduce harmonics and reactive power produced by nonlinear loads under various operating conditions. The research topics included in this thesis begin with current power system scenarios and their impact owing to the presence of harmonics, their causes, and effects. It is discussed how power filters, both passive and active, might help eliminate harmonics. However, the hybrid active


Fig. 21 Characteristics of supply voltage and current, load current, injected APF and PF current, DC link and PCC voltage, active and reactive power during the transition from grid to islanded mode

power filters (HAPFs) have been chosen for their effective harmonics compensation and reactive power management in order to overcome the unavoidable shortcomings of the aforementioned two filters. An efficient and precise controller is needed in HAPFs to produce the compensatory reference current and switching signals in order to reduce harmonics. In order to construct the control methods, a thorough literature review on various control strategies based on time- and frequency-domain techniques has been discussed in Chap. 1. Chapter 2 discusses the harmonics analysis in a three-phase system under the influence of PV integration in the HAPF. The HAPF uses methods such as the RECKF for generating the reference currents, while the switching pulses for the HAPF are produced using HCC and MPC; in comparison to HCC, MPC offers a superior switching action for correcting the harmonics in the system. Chapter 3 contains the PQ analysis for a three-phase system that integrates a DC MG into the HAPF. By enhancing the DC link voltage control, the effect of the DC MG, which consists of PV, WT, and BESS employing PO-F MPPT, complements the harmonics compensation. The reference current and gating signals are created using the DT-CWT combined with the LMS and ZA-LMS methods, together with HCC. Wavelet techniques are used to extract the frequency information of the voltage and current affected by PQ disturbances. Under various operating circumstances, a comparison of


the proposed ZA-LMS with DT-CWT and the LMS reveals that the former performs better in terms of reactive power management, power factor, and THD. PQ analysis in a three-phase distribution network using the HAPF coupled with AC and DC MGs is carried out in Chap. 4. Different DG units are used in this hybrid microgrid (HMG), which is connected to the DC and AC buses through various power converters to improve the DC link voltage control. The HMG with HAPF uses the FAGOA, HCC, and the MRGN approach to generate the switching pulses and reference current in order to reduce harmonics. The suggested HAPF in the MG is seen to be resilient in correcting the harmonics in the PCC voltage for grid-linked, transition-period, and islanded modes of operation. With the proposed HAPF, the THD values and power factors both improve significantly.

References
1. Singh, M., Khadkikar, V., Chandra, A., Varma, R.K.: Grid interconnection of renewable energy sources at the distribution level with power-quality improvement features. IEEE Trans. Power Deliv. 26(1), 307–315 (2010)
2. Tuyen, N.D., Fujita, G.: PV-active power filter combination supplies power to nonlinear load and compensates utility current. IEEE Power Energy Technol. Syst. J. 2(1), 32–42 (2015)
3. Ward, D.J.: Power quality and the security of electricity supply. Proc. IEEE 89(12), 1830–1836 (2001)
4. Akagi, H.: New trends in active filters for power conditioning. IEEE Trans. Ind. Appl. 32(6), 1312–1322 (1996)
5. Karimi, M., Mokhlis, H., Naidu, K., Uddin, S., Bakar, A.A.: Photovoltaic penetration issues and impacts in distribution network - a review. Renew. Sustain. Energy Rev. 53, 594–605 (2016)
6. Mariam, L., Basu, M., Conlon, M.F.: Microgrid: architecture, policy and future trends. Renew. Sustain. Energy Rev. 64, 477–489 (2016)
7. Das, S.R., Ray, P.K., Sahoo, A.K., Ramasubbareddy, S., Babu, T.S., Kumar, N.M., Elavarasan, R.M., Mihet-Popa, L.: A comprehensive survey on different control strategies and applications of active power filters for power quality improvement. Energies 14(15), 4589 (2021)
8. Das, S.R., Ray, P.K., Mohanty, A.: Power quality improvement using grid interfaced PV with multilevel inverter based hybrid filter. In: Proceedings of the 1st International Conference on Advanced Research in Engineering Sciences, pp. 1–6 (2018)
9. Devassy, S., Singh, B.: Control of a solar photovoltaic integrated universal active power filter based on a discrete adaptive filter. IEEE Trans. Ind. Inf. 14(7), 3003–3012 (2017)
10. Das, S.R., Ray, P.K., Mishra, A.K., Mohanty, A.: Performance of PV integrated multilevel inverter for PQ enhancement. Int. J. Electron. 108(6), 945–982 (2021)
11. Schonardie, M.F., Martins, D.C.: Application of the dq0 transformation in the three-phase grid-connected PV systems with active and reactive power control. In: Proceedings of the 2008 IEEE International Conference on Sustainable Energy Technologies, pp. 18–23 (2008)
12. Mohd Zainuri, M.A., Mohd Radzi, M.A., Che Soh, A., Mariun, N., Abd Rahim, N., Teh, J., Lai, C.M.: Photovoltaic integrated shunt active power filter with simpler ADALINE algorithm for current harmonic extraction. Energies 11(5), 1152 (2018)
13. Ray, P.K., Das, S.R., Mohanty, A.: Fuzzy-controller-designed PV-based custom power device for power quality enhancement. IEEE Trans. Energy Convers. 34(1), 405–414 (2018)
14. Hossain, M.A., Pota, H.R., Hossain, M.J., Haruni, A.M.O.: Active power management in a low-voltage islanded microgrid. Int. J. Electr. Power Energy Syst. 98, 36–47 (2018)
15. He, J., Li, Y.W., Blaabjerg, F.: Flexible microgrid power quality enhancement using adaptive hybrid voltage and current controller. IEEE Trans. Ind. Electron. 61(6), 2784–2794 (2013)


16. Ding, G., Gao, F., Zhang, S., Loh, P.C., Blaabjerg, F.: Control of hybrid AC/DC microgrid under islanding operational conditions. J. Modern Power Syst. Clean Energy 2(3), 223–232 (2014)
17. Das, S.R., Mishra, A.K., Ray, P.K., Mohanty, A., Mishra, D.K., Li, L., Hossain, M.J., Mallick, R.K.: Advanced wavelet transform based shunt hybrid active filter in PV integrated power distribution system for power quality enhancement. IET Energy Syst. Integr. 2(4), 331–343 (2020)
18. Das, S.R., Mishra, A.K., Ray, P.K., Salkuti, S.R., Kim, S.C.: Application of artificial intelligent techniques for power quality improvement in hybrid microgrid system. Electronics 11(22), 3826 (2022)
19. Gupta, A., Doolla, S., Chatterjee, K.: Hybrid AC–DC microgrid: systematic evaluation of control strategies. IEEE Trans. Smart Grid 9(4), 3830–3843 (2017)
20. Asl, R.M., Palm, R., Wu, H., Handroos, H.: Fuzzy-based parameter optimization of adaptive unscented Kalman filters: methodology and experimental validation. IEEE Access 8, 54887–54904 (2020)
21. Chittora, P., Singh, A., Singh, M.: Gauss–Newton-based fast and simple recursive algorithm for compensation using shunt active power filter. IET Gener. Transm. Distrib. 11(6), 1521–1530 (2017)

Predictive Analytics for Advance Healthcare Cardio Systems

Debjani Panda and Satya Ranjan Dash

Abstract This research work aims at identifying the lifestyle factors that affect heart disease and the most efficient classification techniques that can assist healthcare experts in predicting the disease in less time. The classification techniques used are Support Vector Machine, Decision Tree, Naïve Bayes, K-Nearest Neighbors, Random Forest, Extra Trees, Logistic Regression, and Extreme Learning Machines (ELM). Embedded feature selection mechanisms have been used, and a set of features has been identified which are of utmost importance and are responsible for causing the disease. When the in-built activation functions were used in studying the ELM, the model suffered from overfitting; hence, the model was studied with a novel activation function called "roots", and the results obtained were consistent and better than those of the available activation functions. The best set of features was selected using a genetic algorithm, and three different regression methods, namely lasso, ridge, and linear regression, were used for cross-validation. While studying the causal factors of COPD-affected patients from a novel data set, the prime factors responsible for the disease were identified to be age, smoking, and cor pulmonale. A strong correlation was found between cor pulmonale and COPD, thus interlinking their causal factors.

Keywords Heart disease · COPD · Supervised classifiers · Ridge · Lasso · GA · KNN · Lifestyle factors · ELM

D. Panda (B) Indian Oil Corporation Ltd., Odisha State Office, Bhubaneswar, India e-mail: [email protected] S. R. Dash School of Computer Applications, KIIT Deemed to be University, Bhubaneswar, Odisha, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 S. R. Dash et al. (eds.), Intelligent Technologies: Concepts, Applications, and Future Directions, Volume 2, Studies in Computational Intelligence 1098, https://doi.org/10.1007/978-981-99-1482-1_9


1 Introduction

Healthcare is one of the major contributors to a country's welfare and, in turn, its economic development, as measured through the HDI. The sector has become extremely important over the past few decades as improving technology has boosted its capabilities. Institutional healthcare can be effective only with supportive and efficient information technology [1], and the involvement of technology plays a major role in improving the efficacy of the healthcare sector [2]. Key concerns include the privacy, veracity, and volume of the data, as well as the cost incurred by individuals for undertaking diagnostic procedures.

The healthcare sector consists of hospitals, medical professionals, supporting staff, medical devices, and patients. As technology has evolved, the sector has focused on various devices which assist professionals and patients in monitoring their health condition and give them a clear report of their underlying conditions. Whenever a patient visits a hospital or a medical centre, disease diagnosis becomes a daunting task, requiring several medical tests to be undertaken to provide a valid prognosis. The process of detecting the underlying condition of the patient takes a lot of time and also affects the finances of the patients. Huge medical expenses [3] and the criticality of delayed diagnosis affect the mental health of the patients and become a vicious cycle.

The advent of emerging healthcare technologies has brought forth greater awareness of the challenges and risks faced by individuals [4] in the course of their daily lifestyle and environmental exposure. As the adage goes, "Prevention is better than cure"; preventive care has found renewed emphasis with the onset of big data and technologies involving artificial intelligence and deep learning. As per studies, cardiac disease has been identified as one of the major risks and causes of fatalities among individuals. As per reports of the WHO, in 2016, 17.6 million people died from heart disease, and 28.3 million people are expected to suffer from this disease worldwide by 2030 [5]. The increasing trend is alarming, and it has become essential to realise the importance of maintaining a healthy heart.

With the advancement of technology, it has become possible for individuals to be equipped with continuous, real-time monitoring devices, enabling early detection of any affecting health condition. Biomedical devices come with abilities to monitor lifestyle activities and various health parameters. These devices can even warn the person wearing or using them whenever the selected parameters reach or exceed their prescribed limits. Though cardiovascular disease turns out to be the major cause of fatalities, research has also shown that a healthy lifestyle can avoid its occurrence. Huge expenses can be avoided if a person takes adequate care of self and can self-monitor their health conditions.

The use of deep learning comes into the picture when the concern is storing the huge data generated from medical devices and systems, meaningfully deriving the underlying patterns, and suggesting remedies. The various methods of storing and comparing the results, and the role of artificial intelligence


in making machines learn and unlearn by themselves, have made possible what seemed impossible only a couple of years ago. It is now possible to monitor, store, and predict the patient's condition with the use of deep learning methods and artificial intelligence. The huge data generated from databases, medical records, biomedical devices, healthcare systems, and hospital management systems can now be stored and retrieved to study the interesting and important correlations between factors and hidden patterns. Their importance has now made these techniques an inevitable part of the healthcare sector.

1.1 Artificial Intelligence Playing a Major Role in Health Sector

For more than five decades, the role of machines has become indispensable in almost all sectors of our day-to-day life. With the passage of time, human beings have advanced with technology, and the more technology they have adopted, the more inevitable the usage of machines has become. Machines have contributed to various sectors, and this contribution has kept evolving over the period. Machines which once needed human intervention now come with built-in intelligence: the more the machines are trained, the better is their efficiency. So the role of artificial intelligence comes into the picture, in which machines learn to use their previous data to predict future outcomes based on patterns, data, trends, etc. In almost every sector, artificial intelligence is playing a major role in deploying models and machines which assist day-to-day activities, and it has been helping health experts to opine on the conditions of their patients. Healthcare experts are flooded with the test reports of their patients, which consist of various attributes. With the paucity of time and the overburdening volume of data, it becomes impossible for an expert to investigate all the relevant attributes, and as a result important factors tend to be missed from their observation [3]. When a machine is fed with this huge data, it analyses all the attributes and remembers which ones to focus on, and hence it can aid the experts in opining correctly without overlooking even the smallest of details.

1.2 Role of Deep Learning in Preventive Care

Deep learning [6] has gained a lot of importance in the past 6–7 years and has become one of the major sub-areas of study under machine learning. As the computational efficiency of machines has increased and huge storage databases have come into place, this computational power and the very large data sets are making deep learning even more important. The advancements can be seen in machines being


able to store, retrieve, and manipulate data using methods of deep learning. Huge datasets generated from speech [7], language [8], and images [9] obtained from various healthcare devices, sensory units, etc. can be understood by the machine and interpreted easily using deep learning. The healthcare sector has benefited greatly from deep learning, and the datasets resulting from digital healthcare systems and biomedical devices are easily interpreted by machines. It has been projected that in the United States alone, digital health data is expected to grow by 48% annually.

Machine learning uses data to train the models, learns by itself using statistical methods and rules, and evolves as a self-learning system. Instead of being programmed by humans, machine learning matches the learning rules to transform the input to output using various examples and algorithms. In earlier days, designing a system using machine learning required a lot of expertise and domain knowledge for identifying the important features and constructing the system to transform the unprocessed data into a meaningful interpretation without loss of relevant information. Moreover, the process of learning involved mapping the outputs from the previous inputs using a set of rules and methods.

Unlike classical machine learning, deep learning involves representation learning: a machine is supplied with raw data and develops its own representations necessary for pattern recognition. It is composed of multiple layers of representation [10], typically arranged sequentially and composed of a large number of primitive, nonlinear operations, such that the representation of one layer (beginning with the raw data input) is fed into the next layer and transformed into a more abstract representation. As the data circulates across the layers of the system, the input space becomes iteratively distorted until the data points become distinguishable from each other; in this way, very complex functions can be learned. The scalability of deep learning gives it an edge over other ML techniques, and with its ability to run on specialised computational systems, deep learning becomes very useful for handling large-scale data.

In the healthcare industry, the variance and volume of data are huge, and each patient presents almost unique features at some point in time, which makes it difficult for healthcare experts to diagnose the disease exactly. Deep learning can handle such data with ease and helps in designing a model which can effectively predict the outcome depending upon its inputs. Reinforcement learning also plays an important role in the healthcare sector, where the involvement of a physician is necessary to demonstrate a condition. The biggest advantage of deep learning is its ability to read images from various radiographic reports, X-rays, ECGs, echocardiograms, etc. to identify an object and classify it into groups, detect the object, and also aid in segmenting the object for further


processing [11]. The identification of classes in images using deep learning helps in detecting whether the individual is affected by a certain disease [12]. Convolutional neural networks (CNN) are one type of deep learning algorithm which have gained immense popularity owing to their natural spatial invariance. Medical imaging has greatly benefited from deep learning models, and almost every area, such as cardiology, dermatology, ophthalmology, and pathology, has reaped the benefits of these models. These models have become an aiding tool for healthcare experts, helping to diagnose and determine the disease of the individual without much loss of time and money. Deep learning models have accurately predicted diseases in patients and have correctly diagnosed the underlying cause. Their success is established in detecting moles, identifying cardiovascular risk, tumour detection in the breast, diabetic retinopathy, and several others [13, 14]. Sometimes a single deep learning model has been effective in diagnosing multiple medical complexities in patients. Our study utilises deep learning methods and feature selection techniques to design a model which can effectively predict heart disease in patients. The lifestyle factors which are associated with the cause of the disease have also been brought out, to aid medical experts in determining the disease at its early onset so that precautions can be taken to avoid fatalities.

2 Literature Review

Identification of the causal factors of heart disease and avoiding its occurrence is extremely important for patients and healthcare experts. The behavioural and lifestyle factors which enhance the risk of cardiac disease are important and need to be figured out. Several research works have been reviewed to bring out the important factors responsible for the disease. For the identification of the critical factors responsible for heart disease, various classifiers and feature selection methods are included under the scope of our study. The review is conducted in the following areas:
• Classification methods used for heart disease prediction,
• Lifestyle factors responsible for causing heart disease.
The related works mainly focus on how feature selection plays a vital role in improving correctness, significantly reducing the search space, and lessening the processing time and cost. The lifestyle factors responsible for the cause of the disease have also been brought under the scope of this study. Different methods have been reviewed to evaluate their performance based on various factors and metrics. Several papers have been included in this study, filtered to focus on our area of study.


2.1 Review of Classification Methods for Heart Disease

Related works from roughly the past 10–12 years have been reviewed to study the use of various classification techniques for predicting coronary heart disease in patients [15]. The proposed models vary from vanilla classifiers to techniques combining feature selection methods and optimization techniques for increasing the prediction accuracy of these models. Table 1 provides a year-wise summary of the predictive models which used classification methods for predicting the occurrence of heart ailments in individuals, together with the research gaps identified.

2.2 Review of Lifestyle Factors Affecting Heart Disease

Li et al. [47] identified education level, gravidity, family history of CHD, maternal chronic disease, upper respiratory tract infection, environmental pollution, exposure to occupational hazards, mental stress, and intake of essential food as risk factors that contributed majorly to CVDs in women. The study was conducted on 119 heart disease cases, and the majority of these cases showed stress, environmental factors, and chronic infection of the mother as major factors for the cause of the disease. They used a standard feed-forward back-propagation neural network (BPNN) model and achieved scores of 0.91 and 0.86 on the training and testing sets.

Ornish et al. [48] observed patients with heart disease for 5 years and found that making intensive changes in lifestyle, including a whole-foods vegetarian diet, aerobic exercise, stress management training, and smoking cessation, brought improvement in the heart condition of patients. Their condition improved more than that of patients who did not follow the lifestyle change, and their state of coronary atherosclerosis was reduced compared with the other group. Another work by Hu et al. [49] establishes dietary changes, along with smoking cessation, controlling obesity, and physical inactivity, as the major contributory factors for persons suffering from heart disease. Whole food with fibre can add substantial benefits for an individual and can lead to the improvement of heart health. Also, a person needs to quit smoking, become physically active, or lose weight to maintain a healthy heart. A healthy diet with lifestyle changes, along with proper medications, can prevent the occurrence of the disease.

Chiuve et al. [50] conducted a study to find out the major lifestyle factors responsible for heart disease in men and women. They stressed that men and women who had lower smoking levels, less consumption of alcohol, and almost 30 min of physical activity in a day, with a body mass index less than 25 kg/m2, were less prone to heart disease than others. Sin et al. [51] focused on depression, smoking, physical activity, sleep quality, and obesity in terms of waist-hip ratio as major contributory factors


Table 1 List of classification models for predicting heart disease (Year; Reference; FS used; Classifier and results; Research gap)

2008
[16] FS: Yes. Classifiers and results: DT, NB, NN; NB gave the best results. Research gap: Designed for categorical data; continuous data needs to be tested along with other data mining techniques; data mining and text mining integration is another challenge.

2009–2011
[17] FS: Yes. Classifiers and results: Simple SVM (SSVM); accuracy of 90.57% achieved for the 2-class problem, 72.55% for the 5-class problem. Research gap: Testing needs to be done for larger data sets and multi-class data sets.
[18] FS: Yes. Classifiers and results: NB, MLP, SVM, DT; SVM gave the best accuracy, and NB had less computation time. Research gap: Accuracy reduces while removing redundant features; the model needs to be tested with large and varied data sets and reduced training sample size.
[19] FS: Yes. Classifiers and results: SVM; the CSO + SVM method selected 12.5 features on average, with an average accuracy of 82.2%. Research gap: CSO + SVM needs to be extended to industry applications like quality control and the steel industry; the termination criterion is set to 100 generations, which needs to be varied and the results verified.
[20] FS: No. Classifiers and results: DT, SVM, NN, BN; DT gave 89% accuracy. Research gap: Other data mining techniques like time series, clustering, and association rules need to be incorporated; continuous data needs to be checked; text mining integration to be done to store unstructured data.
[21] FS: Yes. Classifiers and results: SVM with linear, polynomial, RBF, and sigmoid kernels; 77.63% accuracy achieved. Research gap: Application to be explored in other complex diseases using other common variables; the SVM needs to be compared with LR, BN, and NN.
[22] FS: Yes. Classifiers and results: MLP with BP and IG; accuracy of 89.56% and 80.99% for training and test data. Research gap: Information gain does not contribute much, so alternate methods are to be studied; accuracy needs to be enhanced using other methods.
[23] FS: Yes. Classifiers and results: SVM with linear, polynomial, and radial basis kernels; 100% specificity and 98.63% accuracy. Research gap: Results need to be established by testing various other data sets.

2012–2016
[24] FS: Yes. Classifiers and results: NB, DT, NN; NN gave 100% accuracy with 15 attributes, and DT gave 99.2% accuracy with 6 attributes. Research gap: Other data sets with reduced training samples and fewer features to be tested; the model examined 15 attributes.
[25] FS: No. Classifiers and results: NB and WAC (weighted associative classifier); in lift charts, WAC gave 84% and NB gave 78% correct predictions. Research gap: Only categorical data used; other data mining techniques to be explored; results to be verified with doctors for correctness.
[26] FS: Yes (feature subset). Classifiers and results: KNN + SU, ANN + PCA, ANN + χ2; accuracy achieved is 100%. Research gap: Dietary advice to be integrated for younger people along with active physical activity; other methods can also be tested.
[27] FS: Yes. Classifiers and results: NB, clustering, and DT; DT gave 99.2% accuracy with 6 attributes. Research gap: Classification using clustering gave poor results; inconsistency of data and missing values remain a challenge in a real-time environment.
[28] FS: Yes. Classifiers and results: K-nearest neighbour; accuracy of 92.4% achieved. Research gap: No research gaps were found.
[29] FS: Yes. Classifiers and results: GA + NB, NN + GA, ANN + GA; 100% accuracy with NB + GA. Research gap: Crossover rate and mutation rate have been fixed randomly, which needs optimization; the model gave poor performance for the breast cancer data set.
[30] FS: No. Classifiers and results: KNN with varying K; Fuzzy-KNN achieved 97% accuracy with K = 1 and 80% with K = 9. Research gap: Model testing is to be done with more attributes, a greater number of records, and unstructured data.
[31] FS: No. Classifiers and results: C4.5, CBA, CMAR, L3, and stream associative classification (SACHDP); SACHDP gave 96.6% accuracy. Research gap: Model performance to be improved with a smaller number of rules.
[32] FS: Yes. Classifiers and results: Probabilistic Neural Network (PNN); PSO + PNN gave 95% accuracy. Research gap: The highest accuracy achieved is only 79.2%, which needs to be increased by examining other techniques.
[33] FS: Yes (feature elimination). Classifiers and results: Stacked SVM, Adaboost, RF, and Extra Trees; accuracy of 92.22% using L1-linear SVM. Research gap: The parameters and datasets considered were very small in size to claim high accuracy.

2017–2019
[34] FS: Yes (feature selection and extraction). Classifiers and results: Eigenvectors for feature importance; PCA performed better than others. Research gap: Outliers present at class boundaries need to be addressed, along with data sets missing class labels.
[35] FS: Yes (feature selection). Classifiers and results: Fuzzy AHP and a feed-forward NN; the method resulted in 83% accuracy. Research gap: Backpropagation networks to be tested and error minimization to be addressed.
[36] FS: No. Classifiers and results: KNN, DT, and NB; accuracy not measured. Research gap: Other data sets are to be tested to prove the results; more classification methods can be compared to find the best one.
[37] FS: Yes (feature extraction). Classifiers and results: Ensemble classifier using the C4.5 decision tree, Naïve Bayes, and Bayesian neural networks. Research gap: The optimal parameter for ReliefF (the number of nearest neighbours) is to be identified, and the weight threshold is not stable.
[38] FS: Yes. Classifiers and results: LR, KNN, ANN, SVM, DT, and NB; SVM with RBF gave 86% accuracy in 15.23 s. Research gap: Accuracy is reduced with irrelevant and redundant features; other feature selection algorithms and optimization methods to be explored.
[39] FS: Yes. Classifiers and results: ANN; accuracy achieved was 90.9%. Research gap: The parameter-tuned ANN requires different optimal parameters for different data sets, which is not possible in real-life cases.
[40] FS: Yes. Classifiers and results: DT, SVM, and KNN; SVM gave 93.2% before and 96.8% after the adaptive filter. Research gap: Testing of various other data sets and classifiers; smart wearable technology integration can become beneficial.
[41] FS: Yes. Classifiers and results: KNN; 90% accuracy with 6 attributes. Research gap: An optimized number of features has been calculated, which can be tested on more samples or on high-dimensional data sets.
[42] FS: Yes. Classifiers and results: RF, ET, Adaboost, SVM (RBF and linear); accuracy of 93.33% achieved. Research gap: More data sets and other classification methods to be experimented with to improve accuracy further.
[43] FS: Yes. Classifiers and results: NB, RF, Bayes Net, C4.5, MLP, and PART; 85.48% accuracy with majority voting. Research gap: The maximum accuracy achieved is 85.48%, which can be enhanced further.

2020
[44] FS: Yes. Classifiers and results: CNN, LSTM, GRU, BiLSTM, and BiGRU; Adadelta gave AUC scores of 0.97 for Target 1, 0.98 for Target 2, 0.99 for Target 3, and 0.96 for Target 4. Research gap: Generative Adversarial Networks (GAN) or attention-based recurrent neural networks to be studied.
[45] FS: Parameter optimization of ANN with GA. Classifiers and results: ANN with GA, NB, KNN, C4.5; ANN-GA achieved 95.82% accuracy, 98.11% precision, 94.55% recall, and 96.30% F-measure. Research gap: Other data sets with higher dimensions need to be tested.
[46] FS: Yes. Classifiers and results: SVM with RBF kernel; the mean Fisher score gave better accuracy, trailed by PCA. Research gap: Only two feature selection algorithms have been used; other feature selection methods to be evaluated for comparison of results.

to heart disease. They also tried to establish the correlation between depression and poor lifestyle and found that patients suffering from severe depression also suffered from heart disease due to poor lifestyle, and vice versa. Liu et al. [52] also brought out physical inactivity, smoking, alcohol consumption, and dietary intake as major contributory factors for causing diabetes in patients who are also affected by CHD. The lifestyle factors identified in the above-mentioned papers have been studied to identify their effect on cardiovascular diseases. From these research works, the factors related to the lifestyle of an individual that are related to cardiac disease include depression, smoking, anxiety, alcohol consumption, obesity, sleep quality, physical inactivity, and educational awareness.

3 Comparison of Various Classifiers for Identification of the Disease

For forecasting the occurrence of cardiac ailments, a lot of studies have been carried out with deep learning and classification methods. Every study intended to assist early diagnosis of the disease and to aid healthcare experts in identifying the disease at its onset. Providing timely assistance to individuals suffering from such a disease may result in saving precious lives and avoiding unnecessary hospitalization costs. Data mining aids in storing important and useful relationships between various attributes of the data, attribute selection, identification of trends and hidden patterns within the data, and retrieving useful data for socio-commercial use.

Table 2 Classifiers accuracy with two test cases: Case 1 and Case 2

Classifiers             Test data Case 1 (%)   Test data Case 2 (%)
Gaussian Naïve Bayes    58.33                  91.66
Random forest           63.33                  85.00
Decision tree           56.66                  76.66
Logistic regression     60.00                  86.66
Extra trees             61.66                  86.66
KNN                     48.33                  70.00
SVM                     61.66                  90.00

Using various data mining techniques, the enormous volume of information generated from the digital world is collected and analysed to explore the association between various attributes and their relevance. The intent here is to compare seven supervised classification methods on data obtained from publicly available datasets from the UCI repository; the Cleveland data set has been considered for the purpose of our study. The data set has been studied and the classifiers compared using the metrics of accuracy, training time, and quality of output. After training the models with an 80% split of the data set, the remaining 20% of the data is tested, and the results have been tabulated comparing the prediction accuracy of these classifiers (Table 2).
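A minimal sketch of this comparison using scikit-learn is shown below; the file name heart.csv and the column name target are placeholders for the UCI data, and the default hyper-parameters are assumptions rather than the exact settings used to produce Table 2.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# df is assumed to hold the UCI heart data with the label in a column named 'target'.
df = pd.read_csv("heart.csv")                       # hypothetical file name
X, y = df.drop(columns="target"), df["target"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Random forest": RandomForestClassifier(),
    "Decision tree": DecisionTreeClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Extra trees": ExtraTreesClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)                           # 80% training split
    acc = accuracy_score(y_te, model.predict(X_te)) # 20% held-out test split
    print(f"{name}: {100 * acc:.2f}%")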

3.1 Data Set Description

For conducting our experiment, data was obtained from UCI. The Statlog dataset, which has no missing values, has been taken for experimentation [53]. It has 13 attributes and 270 patient records. The details are given in Table 3.

3.2 Results

The experiment was conducted with 270 records and 13 attributes that predict cardiac disease in patients. The important attributes were determined using the built-in feature-importance functions in Python, and then a weighted average was calculated for each feature. According to these values, the features were ranked based on their correlation with the target variable (Table 4). The best 8 attributes were considered for further experiments with the classifiers to measure their efficiency after feature selection. The accuracy of classifiers such as Random Forest, Decision Trees, Logistic Regression, and Extra Trees has been studied before and after applying FS methods. For getting better results, 10-fold cross-validation was also performed on the data set. The results are presented in Table 5.
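The weighted-average ranking of Table 4 can be reproduced along the following lines; the equal weights given to the three tree-based models are an assumption, since the chapter does not state the exact weighting used, and X is assumed to be a pandas DataFrame of the 13 predictors with y as the target.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

def rank_features(X, y, weights=(1/3, 1/3, 1/3)):
    # Fit DT, RF, and ET, then combine their feature_importances_ by a weighted average.
    models = [DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0),
              ExtraTreesClassifier(random_state=0)]
    importances = np.array([m.fit(X, y).feature_importances_ for m in models])
    weighted = np.average(importances, axis=0, weights=weights)
    return (pd.DataFrame({"feature": X.columns, "weighted_importance": weighted})
              .sort_values("weighted_importance", ascending=False)
              .reset_index(drop=True))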


Table 3 Statlog dataset description

S.No   Features                           Feature type                Allowed values
1      Patient age                        Real
2      Patient sex                        Binary                      0; 1
3      Type of chest pain                 Nominal                     1; 2; 3; 4
4      BP at rest                         Real
5      Cholesterol                        Real
6      Fasting blood sugar (FBS)          Binary                      1: FBS > 120 mg/dl
7      ECG at rest                        Nominal                     0; 1; 2
8      Maximum HR                         Real
9      Exercise induced angina (EIAng)    Binary                      0; 1
10     ST at rest                         Real
11     SlopeEx                            Ordered
12     Mvcol                              Real                        0–3
13     Thal                               Nominal                     3: normal; 6: fixed defect; 7: reversible defect
14     Target value                       Predicted/output variable   Absence: 1, Presence: 2

Table 4 Ranking of features based on a weighted average of feature importance

Attributes        Decision tree (DT)   Random forest (RF)   Extra trees (ET)   Weighted average   Rank
Mvcol             0.154                0.142                0.123              0.154              1
Thal              0.270                0.121                0.168              0.144              2
Chest-Pain-Type   0.082                0.169                0.130              0.127              3
Age               0.092                0.078                0.068              0.080              4
MaxHR             0.046                0.090                0.100              0.079              5
STRest            0.050                0.121                0.061              0.078              6
RestBP            0.083                0.077                0.064              0.075              7
Chol              0.085                0.063                0.068              0.072              8
SlopeEx           0.028                0.053                0.063              0.048              9
ExIAng            0.046                0.021                0.070              0.046              10
Sex               0.046                0.031                0.037              0.038              11
RECG              0.018                0.025                0.034              0.029              12
FBS               0.000                0.009                0.014              0.008              13

Table 5 Classifiers: measurement of performance (accuracy) before and after FS

Classifiers           Accuracy before FS (%)   Accuracy after FS (%)
Decision tree         74.81                    100.00
Random forest         81.11                    99.62
Extra trees           78.52                    100.00
Logistic regression   83.70                    100.00

3.3 Summary

From Table 2, it is observed that a greater number of output classes decreases the efficiency of the models: the prediction accuracy for binary classification is greater than that for multi-class classification. The Random Forest classifier, with an accuracy of 63.33%, is better than the other methods in multi-class classification, whereas Gaussian Naïve Bayes, with 91.66% accuracy, performs best in the binary classification problem. In Table 5, the classifiers' efficiency was studied after eliminating irrelevant features, and there has been a huge difference in accuracy, with most of them showing an improvement of almost 20% or more in their performance.

4 Role of Feature Selection in Prediction of Heart Disease
Feature selection plays an important role in eliminating redundant and irrelevant features and reduces the training cost and time of predictive models. The classification algorithms analyzed here include Naïve Bayes, Random Forest, Extra Trees, and Logistic Regression, each provided with features selected using LASSO and Ridge regression. The accuracy of the classifiers shows remarkable improvement after feature selection, and LASSO gave better results than Ridge.

4.1 Dataset Description
The experiment uses the publicly available Cleveland heart dataset obtained from UCI [54]. It contains 76 attributes and 303 records in total, and the last column denotes the target variable. The columns containing patient identification details (attributes 1, 2, 75, and 76) have been dropped, and 72 attributes have been considered for the study. Records with missing values have been removed from the dataset, leaving a total of 115 instances.


4.2 Results
Accuracy has been used as the metric for this experiment. The results after selecting the best features from LASSO and Ridge regression are listed in Table 6. Among the four classifiers, the GNB classifier performed best, giving 94.92% accuracy with the best features obtained from both LASSO and Ridge regression [55]. Dimensionality reduction produced more accurate predictions with the same dataset, and LASSO regression produced better results in most cases. Figure 1 illustrates the performance of the classifiers before and after feature selection.

Table 6 Predicting heart disease with best features obtained using regression models
Classifiers | Original features before FS (Accuracy %) | After FS using Lasso regression (Accuracy %) | After FS using Ridge regression (Accuracy %)
Random forest | 47.02 | 84.98 | 85.31
Extra trees | 55.83 | 90.32 | 84.77
Gaussian Naive Bayes | 57.17 | 94.92 | 94.92
Logistic regression | 40.73 | 63.73 | 59.12

Fig. 1 Performance of classifiers before and after feature selection


4.3 Summary
As observed from the results, the embedded methods of feature selection gave better results than the unselected baselines. Our experiment used embedded methods based on Ridge and LASSO regression, which shrink a multi-dimensional problem to one with fewer relevant dimensions. Ridge imposes an L2 penalty, whereas LASSO uses an L1 penalty; the penalties are imposed to minimize the loss while reducing the number of variables in the model. Ridge regression defines L2 regularization by adding a penalty equal to the sum of the squares of the coefficients, whereas in LASSO the sum of the absolute values of the coefficients is used as the penalty (L1). LASSO's basic objective is shrinkage of absolute values (L1 penalty) toward zero rather than a sum of squares (L2 penalty). Ridge regression suffers from the limitation that it cannot shrink coefficients exactly to zero: it either keeps all coefficients or none. In contrast, LASSO can perform both parameter shrinkage and variable selection, because it can reduce the coefficients of collinear variables to zero. In this experiment, comparable results were obtained when the classifiers used features identified with LASSO and Ridge regression. The dataset had 76 attributes, of which 72 were considered for the study. The accuracy of all four classifiers increased remarkably when they were trained with the reduced attribute set. The GNB (Gaussian Naïve Bayes) classifier obtained the best results, with an accuracy of 94.92%. Comparing the features selected by LASSO and Ridge, LASSO gave slightly better results.
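A minimal sketch of this embedded-selection idea is given below, assuming a cleaned Cleveland CSV with a "target" column; the regularization strengths (alpha) and the default SelectFromModel thresholds are illustrative assumptions, not the settings used in the chapter.

```python
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Hypothetical loading step: X holds the 72 retained Cleveland attributes, y the target.
df = pd.read_csv("cleveland_clean.csv")
X, y = df.drop(columns="target"), df["target"]

def embedded_selection(estimator, X, y):
    """Keep only features whose coefficient survives SelectFromModel's threshold."""
    selector = SelectFromModel(estimator).fit(X, y)
    return X.columns[selector.get_support()]

# L1 (Lasso) can zero out coefficients, so it both shrinks and selects;
# L2 (Ridge) only shrinks, so selection falls back to a magnitude threshold.
lasso_feats = embedded_selection(Lasso(alpha=0.01), X, y)
ridge_feats = embedded_selection(Ridge(alpha=1.0), X, y)

for name, feats in [("Lasso", lasso_feats), ("Ridge", ridge_feats)]:
    acc = cross_val_score(GaussianNB(), X[feats], y, cv=10).mean()
    print(f"{name}: {len(feats)} features, GNB accuracy = {100 * acc:.2f}%")
```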

5 Enhancing the Performance of Extreme Learning Machines Using FS with GA for Identification of Heart Disease of Fetus
The classifiers' performance improved when feature selection methods were used, which aided in identifying the disease at an early stage. It is important to note that diseases of the heart also affect the fetus: they can appear as a birth anomaly in a new-born child and can prove fatal in some cases. It is therefore important to study the heart health of the fetus, which can be observed to detect irregular heartbeats and to forecast heart ailments that may cause concern for the new-born. In this section, we focus on fetal heart health and its assessment using an extreme learning machine (ELM) classifier. The classifier is studied with various activation functions and with feature selection using a genetic algorithm. The ELM has been studied to measure its performance with


standard activation functions (Fourier, hyperbolic tangent, sigmoid) and a user-defined novel activation function (roots). The classifier is studied with and without feature selection to measure the effectiveness of the novel activation function and to evaluate the model with the feature selection method [56]. Cardiotocography is one of the most widely used non-stress tests for determining the health of the fetus inside the mother's womb and during labor. It is useful because it records both uterine contractions and the fetal heart rate. The fetal heart rate is characterized by the baseline heart rate, variations in the baseline, accelerations (increases in heart rate), decelerations (decreases in heart rate), and uterine contractions. The cardiotocography test is therefore an important tool for examining the baseline heart rate and uterine contractions of the fetus. By analyzing the baseline heart rate pattern, healthcare experts obtain a clear picture of the condition of the fetus inside the mother's womb and can identify serious concerns, such as insufficient blood or oxygen supply to the body or any of its parts resulting from a malfunctioning heart. The National Institute of Child Health and Human Development (NICHD) has identified important factors such as the variability of the baseline heart rate, its accelerations and decelerations, and the non-stress test (NST), which are critical and need monitoring to maintain the physical well-being of the fetus [57]. The device used to monitor fetal heart health and to conduct cardiotocography is called an electronic fetal monitor [58]. This device generates two output signals: fetal heart rate (FHR) and uterine contractions (UC). Cardiotocography includes the non-stress test and the contraction stress test (CST) as its major components [59]. When the fetus is being monitored for its well-being, the NST helps determine whether the fetus is in discomfort, while any respiratory malfunction of the placenta is determined by the CST.

5.1 Dataset
The cardiotocography dataset used for the experiment was obtained from UCI and has 2126 instances with 23 attributes. It has two target classes: one containing the morphological pattern (1–10) and the other containing the fetal state (N: Normal, S: Suspect, P: Pathologic). Our experiment is carried out on 21 attributes with NSP as the target class.

5.1.1 Pre-processing of the Data Set for Training the Model
The columns considered for our experiment are those mentioned in Table 1 plus the two output classes "CLASS" and "NSP". The remaining columns of the original database have been removed and are not considered in this experiment. The dataset consisting of the above data


is named "DT" and has been split into two subsets named "DT CLASS" and "DT NSP", which carry the output class labels "CLASS" and "NSP", respectively. Twelve duplicate rows and four rows with null values were deleted. After pre-processing, the dataset DT NSP was split in an 80:20 ratio to train the ELM (80%) and to obtain results on the test data (20%).
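A minimal pre-processing sketch along these lines is shown below; the file name, sheet layout, and column labels of the UCI cardiotocography export are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical flat CSV export of the UCI cardiotocography data.
ctg = pd.read_csv("ctg.csv")

# Keep the 21 measurement columns plus the two label columns, as described above.
feature_cols = [c for c in ctg.columns if c not in ("CLASS", "NSP")][:21]
dt_nsp = ctg[feature_cols + ["NSP"]]

# Drop duplicate rows and rows containing null values before splitting.
dt_nsp = dt_nsp.drop_duplicates().dropna()

# 80:20 split for training the ELM and testing it.
X_train, X_test, y_train, y_test = train_test_split(
    dt_nsp[feature_cols], dt_nsp["NSP"], test_size=0.2, random_state=42)
```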

5.1.2 Selection of Important Features
We have already seen the importance of feature selection in designing a predictive model: it eliminates irrelevant features and thereby reduces the computation time of the classifiers. In this chapter, a genetic algorithm (GA) is used to remove unwanted and irrelevant features and to identify an optimized subset of features for training the classification models. The subset with the most relevant attributes is retained for study and experimentation. The 80% training split was used to study the ELM with different activation functions. Three regression functions, namely Linear, Ridge, and LASSO, are used for cross-validating the feature subsets generated by the GA. The best set of attributes is then selected, and the classification algorithms are trained using these as input features.

5.2 Genetic Algorithm (GA)
The genetic algorithm is based on Darwin's principle of "survival of the fittest". It is a simple, scalable algorithm that randomly produces a new population and determines the individuals with the best fitness values. Its purpose here is to determine the attributes with the maximum output weights [60]; it tries to find the attributes with the best fitness values so that they can adapt to adverse conditions. The population consists of the entire set of candidate solutions, and every possible solution is termed an individual. In our experiment, the genetic algorithm has been implemented with three different regression models, viz. Ridge, Linear, and LASSO, for cross-validation. The chromosomes are simply a representation of the attributes, and the fitness evaluation yields a true or false value for each attribute. The process is repeated over all n generations, at the end of which only the attributes with the best fitness values are retained and the final subset of features is determined, which is then used for studying the performance of the classifiers. The GA depends upon the total number of generations, the number of chromosomes, the number of offspring generated during crossover, and the best chromosomes obtained. Parent chromosomes are mated depending on their fitness values [61]; two parents are crossed over and then mutated to produce the new population. This entire method is iterated for 20 generations, and after some generations the fitness values of the features


remain unchanged. Finally, only those attributes that possess the highest fitness values are extracted.
Regression models: These models are efficient in determining the correlation between dependent and independent variables, with the dependent variable expressed in terms of the independent variables. They are supervised machine learning methods that are commonly used for feature reduction. The regression models experimented with in the GA are:
Linear regression: It is expressed by the equation
y_i = β_0 + Σ_{k=1}^{p} β_k x_{ik}   (1)
Ridge: This regression model is based on L2 regularization, where the penalty [62] is equal to the sum of the squares of the coefficients. Variance arises from multi-collinearity among variables, and ridge regression deals with this variance effectively [63].
Lasso: This regression model is based on L1 regularization, where the penalty is the sum of the absolute values of the coefficients. The coefficients of the least relevant variables are driven to zero, so the model helps in eliminating the least relevant features.

5.2.1 Algorithm for Selecting Best Features Using GA
1. The population was randomly initialized by creating individuals whose chromosome length equals the total number of input features; each feature is randomly included or excluded in the initial phase. One individual chromosome is illustrated in Fig. 2, where each box denotes a gene, i.e., one feature of the dataset. Green denotes "True" (the feature is included in the chromosome) and red denotes "False" (the feature is not included).
2. For each generation:
2.1 A fitness score is calculated for every individual as follows:
2.1.1 The target was modeled, using the regression models, with only the features included in the individual's chromosome.

Fig. 2 One individual chromosome representing features of one row of the dataset


Fig. 3 Chromosome after crossover and mutation

To make the visualization concrete, the chromosome shown above includes all features except those at positions 2, 8, 12, and 17.
2.1.2 The cross-validation scores were calculated using the negative mean square error (NMSE), given by
NMSE = −(1/n) Σ_{i=1}^{n} (x_i − (1/n) Σ_{i=1}^{n} x_i)^2 = (Σ_{i=1}^{n} x_i / n)^2 − (Σ_{i=1}^{n} x_i^2) / n   (2)

2.1.3 The fitness value is the mean of the cross-validation scores.
2.2 Individuals were sorted in ascending order of fitness value.
2.3 The last n individuals (i.e., the best n individuals in the population) were identified.
2.4 Among the selected individuals, for each i in the range n/2, the i-th and (n − i)-th individuals were crossed over, as depicted in Fig. 3.
2.5 A mutation process was carried out on the daughter chromosomes to generate the next population.
3. After the mutation process, the chromosomes with the best features, i.e., with the best fitness values, were retained.
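The sketch below is one possible reading of this procedure, assuming a NumPy feature matrix X and Ridge as the cross-validation model; population size, number of generations, and mutation rate are illustrative parameters, not the chapter's exact settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y, estimator):
    """Mean cross-validated negative MSE on the features selected by the boolean mask."""
    if not mask.any():
        return -np.inf
    scores = cross_val_score(estimator, X[:, mask], y,
                             scoring="neg_mean_squared_error", cv=5)
    return scores.mean()

def ga_select(X, y, estimator=Ridge(alpha=0.0001),
              pop_size=20, generations=20, mutation_rate=0.05):
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat)).astype(bool)
    for _ in range(generations):
        fit = np.array([fitness(ind, X, y, estimator) for ind in pop])
        order = np.argsort(fit)                       # ascending fitness, as in step 2.2
        n_keep = pop_size // 2
        best = pop[order[-n_keep:]]                   # keep the best half (step 2.3)
        children = []
        while len(children) < pop_size - n_keep:
            i = int(rng.integers(0, n_keep))
            a, b = best[i], best[n_keep - 1 - i]      # pair i-th with (n - i)-th (step 2.4)
            point = int(rng.integers(1, n_feat))
            child = np.concatenate([a[:point], b[point:]])
            flip = rng.random(n_feat) < mutation_rate
            children.append(child ^ flip)             # bit-flip mutation (step 2.5)
        pop = np.vstack([best, np.array(children)])
    fit = np.array([fitness(ind, X, y, estimator) for ind in pop])
    return pop[fit.argmax()]                          # boolean mask of the selected features
```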

5.3 ELM as a Classifier
Extreme learning machines, proposed by G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, effectively address the slowness of feed-forward neural networks caused by their iterative


training. ELMs are designed as extremely fast single-layer feed-forward networks (SLFNs) whose hidden neurons do not need further fine-tuning [64]. ELMs need less training time and can be used efficiently as classifiers, regressors, or feature extractors. An ELM randomly assigns the connections between the input layer and the hidden neurons, and these are not altered during the experiment; only the output weights are adjusted to a minimum-cost solution [65]. There are various types of ELM, such as pruned ELM, simple ELM, ensembled ELM, and incremental ELM [64, 65]. Our experiment has been conducted with a simple ELM and various activation functions [66, 67], and their performance has been compared on metrics such as accuracy, F-measure, training time, specificity, precision, sensitivity, and AUC. A simple ELM can be described as follows. For N arbitrary distinct samples (x_i, t_i) ∈ R^d × R^m, SLFNs with L hidden nodes having parameters (a_i, b_i), i ∈ {1, 2, ..., L}, are mathematically written as

Σ_{i=1}^{L} β_i g_i(x_j) = Σ_{i=1}^{L} β_i G(a_i, b_i, x_j) = o_j,  j ∈ {1, 2, ..., N}   (3)

where β_i represents the output weight of the i-th hidden node and g(x) represents the activation function. That SLFNs can approximate the N samples with zero error means that Σ_{j=1}^{N} ||o_j − t_j|| = 0, i.e., there exist (a_i, b_i) and β_i such that Σ_{i=1}^{L} β_i G(a_i, b_i, x_j) = t_j, j ∈ {1, 2, ..., N}. The above N equations can be rewritten compactly as
H β = T   (4)
where
H = [G(a_1, b_1, x_1) ⋯ G(a_L, b_L, x_1); ⋮ ⋱ ⋮; G(a_1, b_1, x_N) ⋯ G(a_L, b_L, x_N)]_{N×L}   (5)
β = [β_1^T; ⋯; β_L^T]_{L×m}   (6)
T = [t_1^T; ⋯; t_N^T]_{N×m}   (7)
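A minimal sketch of how Eqs. (3)–(7) translate into code is given below, assuming one-hot targets T and a tanh activation; the hidden-layer size L = 200 mirrors the setting discussed later but is otherwise an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def elm_train(X, T, L=200, activation=np.tanh):
    """Minimal single-hidden-layer ELM: random (a_i, b_i), output weights via pseudo-inverse."""
    d = X.shape[1]
    a = rng.standard_normal((d, L))      # random input weights, never re-tuned
    b = rng.standard_normal(L)           # random hidden biases
    H = activation(X @ a + b)            # hidden-layer output matrix (Eq. 5)
    beta = np.linalg.pinv(H) @ T         # minimum-norm least-squares solution of H beta = T
    return a, b, beta

def elm_predict(X, a, b, beta, activation=np.tanh):
    return activation(X @ a + b) @ beta  # o_j of Eq. (3)
```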


Four non-linear activation functions have been used for experimentation; one of them, termed "roots", has been defined by us. The activation functions are listed below:
Sigmoid: G(a, b, x) = 1 / (1 + e^{−(ax+b)})   (8)
Fourier: G(a, b, x) = sin(ax + b)   (9)
Hyperbolic tangent: G(a, b, x) = tanh(ax + b)   (10)
Roots (user-defined): G(a, b, x) = 0 for x = −b/a, and G(a, b, x) = |ax + b|^{n+1} / (ax + b) for x ≠ −b/a   (11)

where n ∈ R denotes a parameter that can take values between 0 and 1; when n = 1, the function reduces to a linear function. The graphs of the above-mentioned functions are illustrated in Fig. 4.

Fig. 4 Graph of different activation functions used in ELM
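A small sketch of the "roots" activation of Eq. (11) is given below; it operates directly on z = ax + b so that it can be passed as the activation argument of the ELM sketch shown earlier, and the default n = 0.4 matches the best-performing setting reported later.

```python
import numpy as np

def roots(z, n=0.4):
    """User-defined "roots" activation of Eq. (11): |z|^(n+1) / z, and 0 where z = 0."""
    return np.where(z == 0, 0.0, np.abs(z) ** (n + 1) / np.where(z == 0, 1.0, z))
```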


For N random distinct samples (x_i, t_i) ∈ R^d × R^m, SLFNs having L hidden nodes with parameters (a_i, b_i), i ∈ {1, 2, ..., L}, are mathematically expressed as
Σ_{i=1}^{L} β_i g_i(x_j) = Σ_{i=1}^{L} β_i G(a_i, b_i, x_j) = o_j,  j ∈ {1, 2, ..., N}   (12)
where β_i is the output weight of the i-th hidden node and g(x) is an activation function.

5.4 Results
The cardiotocography dataset is a multi-class dataset in which the output class with value "3" denotes pathological cases. These pathological cases are the focus of the current experiment, which aims to identify them in order to forecast heart disease of the fetus. The genetic algorithm (GA) has been used to identify the most relevant features. When linear regression and LASSO are used for cross-validation, the same group of 11 attributes is extracted: DS, DP, LB, MLTV, ASTV, UC, ALTV, Width, Max, Median, and Variance, which are used for studying the ELM. The performance of the ELM classifier has also been evaluated using ridge regression as the cross-validation function in the genetic algorithm; this yields the 12 best features, namely UC, DS, ASTV, LB, DP, MLTV, Min, ALTV, Variance, Max, Nmax, and Median. The case studies cover the ELM both without and with feature selection. The performance of the ELM is noted with various activation functions and with the best 11 and 12 features extracted by the GA. For evaluating the different ELMs, the novel activation function "roots" has been compared with three standard activation functions: Fourier, sigmoid, and hyperbolic tangent. The metrics considered for comparison include the confusion matrix, accuracy, F-score, precision, and AUC. The confusion matrix used to classify the pathological condition of the CTG dataset is shown in Table 7.
Table 7 Confusion matrix

Predicted → / Actual ↓ | 1 | 2 | 3
1 | TN | TN | FP
2 | TN | TN | FP
3 | FN | FN | TP


• TP: true positive, target with class 3, denoted as a pathological case
• TN: true negative, targets with class 1 and 2, denoted as non-pathological
• FP: false positive, targets with class 1 and 2 predicted as pathological
• FN: false negative, target with class 3 predicted as non-pathological.
The metrics considered for tabulating the success of classification are
Accuracy = (TP + TN) / (TP + TN + FP + FN)   (13)
Sensitivity = TP / (TP + FN)   (14)
Specificity = TN / (TN + FP)   (15)
Precision = TP / (TP + FP)   (16)
F-measure = (2 × Precision × Sensitivity) / (Precision + Sensitivity)   (17)
AUC = (Sensitivity + Specificity) / 2   (18)
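As a small illustration of Eqs. (13)–(18) applied to a three-class confusion matrix with class 3 treated as the positive (pathological) class, the sketch below reproduces the metrics from one of the matrices reported later in Table 9; note that the AUC here is the simplified form of Eq. (18), not a ROC-based AUC.

```python
import numpy as np

def pathological_metrics(cm):
    """Compute Eqs. (13)-(18) from a 3x3 confusion matrix, treating class 3 as positive."""
    cm = np.asarray(cm, dtype=float)
    tp = cm[2, 2]
    fn = cm[2, :2].sum()
    fp = cm[:2, 2].sum()
    tn = cm[:2, :2].sum()
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "sensitivity": sens,
        "specificity": spec,
        "precision": prec,
        "f_measure": 2 * prec * sens / (prec + sens),
        "auc": (sens + spec) / 2,
    }

# Example: the sigmoid / original-features matrix from Table 9.
print(pathological_metrics([[294, 29, 2], [26, 28, 2], [4, 23, 15]]))
```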

Our study compared the Python in-built ELM module with ELMs using the sigmoid, Fourier, hyperbolic tangent, and roots (n = 0.25, 0.4, 0.5) activation functions. Accuracy was used for forecasting heart disease both before and after feature selection. The results are presented in Table 8. From the experiment, it was found that the built-in ELM function available in Python suffers from underfitting.
Table 8 Performance of classifier (ELM) considering best features

Classifier | Before FS: original 21 features (Accuracy %) | After FS: Linear, best 11 features (Accuracy %) | After FS: Lasso (α = 0.0001), best 11 features (Accuracy %) | After FS: Ridge (α = 0.0001), best 12 features (Accuracy %)
ELM (inbuilt Python function) | 11.11 | 10.17 | 10.17 | 10.16
Roots (n = 0.25) | 94.56 | 95.98 | 95.98 | 95.74
Roots (n = 0.4) | 94.33 | 96.45 | 96.45 | 95.04
Roots (n = 0.5) | 94.80 | 96.21 | 96.21 | 95.04
Sigmoid | 92.67 | 94.56 | 94.56 | 94.33
Fourier | 90.07 | 90.07 | 90.07 | 90.07
Hyperbolic tangent | 93.14 | 94.56 | 94.56 | 93.85


To address this problem, an alternative group of activation functions was studied for building the model. Figure 4 depicts the accuracy of the different activation functions as the number of hidden nodes is varied; these functions were evaluated in the ELM before feature selection. The user-defined activation function, named "roots", is plotted along with the other activation functions and surpasses the other in-built activation functions in accuracy when the number of hidden inputs ranges from 0 to 1000. When the number of hidden inputs increases beyond 200, the curves of the hyperbolic tangent, sigmoid, and roots activation functions remain essentially unchanged; hence the number of hidden inputs is kept constant at 200 for the further evaluation of the "roots" function with the other metrics. From the results, we observe that n = 0.4 gives the best results for the roots function, as noted in Table 8, and the performance of these functions has been measured and plotted with roots taking n = 0.4. When the Fourier activation function was studied in the ELM, it underperformed the hyperbolic tangent, sigmoid, and roots activation functions; it is also unaffected by feature selection and, because of its lower performance, is not considered for further study. The three functions sigmoid, hyperbolic tangent, and roots were further studied with various other metrics, including the computation time for classifying pathological cases. The roots function took less time than sigmoid and hyperbolic tangent to compute the results on the test set. The results of the ELM with the original feature set and with the reduced feature set obtained by the GA are recorded in Table 9. Three cross-validation functions were used to identify the best features in the GA; the enhanced performance of the ELM was achieved using features selected with linear regression and LASSO, as compared to Ridge. The "roots" activation function outperformed the other two activation functions. The results depend on the number of hidden inputs taken and can change; even so, while varying the hidden inputs in the range 0 to 1000, the ELM with the roots activation function gave better results in most cases.

5.5 Summary
ELM with genetic-algorithm-based features gives an accuracy of 95% and above with the sigmoid and user-defined activation functions. ELMs are considered self-tuned and are therefore time-effective; however, the computation time depends on the activation function used and can affect the performance of the ELM. In this experiment, the computation time of the user-defined activation function (roots) is less than the training time of the sigmoid function with 200 hidden inputs. The genetic algorithm played an important role in selecting features, improving the accuracy of the ELMs compared with feature selection using Ridge and LASSO alone. Experiments can be carried out with different activation functions to study the

Table 9 ELM performance measured before and after feature selection by GA (Linear and Lasso)
ELM activation functions → / Metrics ↓ | Sigmoid, original DT | Sigmoid, best features (11 attributes) | Roots (n = 0.4), original DT | Roots (n = 0.4), best features (11 attributes) | Hyperbolic tangent, original DT | Hyperbolic tangent, best features (11 attributes)
Confusion matrix | [294 29 2; 26 28 2; 4 23 15] | [299 25 1; 24 30 2; 2 18 22] | [306 19 0; 21 32 3; 4 17 21] | [301 24 0; 24 32 0; 2 13 27] | [295 30 0; 24 30 2; 3 24 15] | [299 26 0; 24 29 3; 1 19 22]
Accuracy | 92.67 | 94.56 | 94.33 | 96.45 | 93.14 | 94.56
Sensitivity | 35.71 | 52.38 | 50.00 | 64.29 | 35.71 | 52.38
Specificity | 98.95 | 99.21 | 99.21 | 100.00 | 99.48 | 99.21
Precision | 78.95 | 88.00 | 87.50 | 100.00 | 88.24 | 88.00
F-measure | 49.18 | 65.67 | 63.64 | 78.26 | 50.85 | 65.67
AUC | 67.33 | 75.80 | 74.61 | 82.14 | 67.59 | 75.80
Computation time (s) | 3.05 | 3.27 | 2.31 | 2.50 | 2.67 | 2.67


effect of the parameters required for classifying pathological cases in the CTG dataset. Such models can serve as an effective prediction tool to determine whether the fetus is suffering from cardiological abnormalities and can be an important aid for medical experts.

6 COPD and Cardiovascular Diseases: Are They Interrelated?
After studying the critical features responsible for heart disease in the earlier sections, the effect of other factors, such as other critical diseases which can cause heart disease, is considered in the current study and experimentation. In this section, a critical illness of the lungs, its causes, and its impact on the heart of an individual are studied with the help of a novel dataset obtained from a government hospital. The critical factors responsible for causing the disease have been identified, and their impact on the heart health of patients has been studied. The basic objective is to identify the lifestyle factors responsible for causing diseases of the heart and the lungs.
What is COPD? Chronic obstructive pulmonary disease (COPD) is another disease of major concern among the human population due to the severity of its symptoms and fatalities. COPD is characterized by blockage of the airways over a consistent and considerable period. Among other complications, it causes patients to suffer from cardiovascular diseases and is one of the main causes of mortality. With the progression of COPD, several symptoms occur, and in severe cases it affects the heart and the brain, which makes management more complex for medical experts. As a result, the disease needs to be diagnosed in its early stages. It encompasses three conditions: bronchial asthma, bronchitis, and emphysema. Bronchial asthma is the inflammation of the air pathways, bronchitis is the swelling of the bronchial tubes, and emphysema is a condition in which the bronchioles inside the lungs become inflated. So, a person suffering from COPD suffers from all three of the above conditions. Common symptoms associated with COPD include coughing, wheezing, breathlessness, excess mucous discharge, weakness, chest pain, nervousness, and loss of weight. People with COPD are susceptible to various life-threatening conditions such as pneumonia, pneumothorax (lung collapse), arrhythmia (irregular heartbeat), osteoporosis, sleep apnea (repeated starting and stopping of breathing during sleep), edema, hepatomegaly (liver enlargement), cor pulmonale (failure of the right side of the heart), diabetes, stroke, hypertension, and heart failure. The relationship between COPD and cardiac diseases is illustrated in Fig. 5.

6.1 Dataset
The novel dataset collected from SCB Medical College is described in Table 10.


Fig. 5 Relationship between heart disease and COPD

6.2 Results
The comorbidities responsible for these deadly diseases are mainly related to lifestyle factors, and there is a dangerous interrelationship between heart disease and COPD. The classifiers' accuracy is reported in Table 11. The prime factors identified as causing COPD are smoking, age, and cor pulmonale. The correlated factors were obtained using a heatmap, shown in Fig. 6, and were then investigated with multiple algorithms to predict the outcome. It is also observed that smokers have higher risks of other respiratory diseases. The Gaussian Naïve Bayes classifier gave 77.5% accuracy with 80.95% precision and 77.27% recall. The Random Forest classifier gave the best results, with an accuracy of 87.5%, precision of 95.23%, and recall of 90.90%. It is an ensemble classifier, and in our case the ensemble classifier performed better than the other basic classifiers; random forests are flexible, do not require much parameter tuning, and yielded the best results here. Logistic regression also gave comparable performance, with 82.5% accuracy, 85.71% precision, and 81.81% recall. The classifiers' performance was also studied by plotting the ROC curves for all methods, and the results show the best curve for the random forest classifier.
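A rough sketch of this workflow, correlation inspection followed by a Random Forest, is shown below; the file name and column names of the hospital dataset are assumptions based on Table 10, not the authors' actual identifiers.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical CSV holding the Table 10 attributes; "type" is the COPD label.
copd = pd.read_csv("copd_scb.csv")

# Correlation of every attribute with the target, the basis of the heat-map inspection.
print(copd.corr(numeric_only=True)["type"].sort_values(ascending=False))

# Train and evaluate a Random Forest on the attributes most correlated with COPD.
features = ["smoking", "age", "cor_pulmonale", "systolic", "diastolic"]
X_tr, X_te, y_tr, y_te = train_test_split(copd[features], copd["type"],
                                          test_size=0.2, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)
print(accuracy_score(y_te, pred), precision_score(y_te, pred), recall_score(y_te, pred))
```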


Table 10 Dataset of COPD patients from SCB Medical College
# | Attribute | Values | Detailed description
1 | Gender | 0: Male, 1: Female | Gender of the patient
2 | Lifestyle | 0: Active, 1: Sedentary | Type of lifestyle of the patient
3 | Literacy | 0: Literate, 1: Illiterate | Whether the patient is literate or illiterate
4 | Respiratory diseases | 1: Present, 0: Absent | Whether the patient has other respiratory diseases
5 | Family history | 1: Patient has family history, 0: Patient does not have family history | Any family history of the disease
6 | Weight | | Weight of the patient in kilograms
7 | Systolic pressure | | Blood pressure while the heart is beating
8 | Diastolic pressure | | Blood pressure while the heart is resting
9 | Alcohol | 1: Patient drinks, 0: Patient does not drink | Whether the patient consumes alcohol
10 | Smoking | 1: Patient smokes, 0: Patient does not smoke | Whether the patient is a smoker
11 | Age | | Current age of the patient
12 | Cor pulmonale | 1: Present, 0: Absent | Heart failure resulting from pulmonary hypertension
13 | Hemoptysis | 1: Present, 0: Absent | Whether the patient is coughing out blood
14 | Type | 1: Patient with COPD detected, 0: Patient without COPD | Whether the patient is affected by COPD

Table 11 Comparison of accuracy of six classifiers
Classification method | Accuracy | Precision | Recall | AUC
Gaussian Naïve Bayes classifier | 77.5 | 80.95 | 77.27 | 0.77
Random forest classifier | 87.5 | 95.23 | 90.90 | 0.87
Decision tree classifier | 67.5 | 71.43 | 68.18 | 0.67
Logistic regression | 82.5 | 85.71 | 81.81 | 0.82
KNN classifier | 72.5 | 76.19 | 72.72 | 0.72
SVM classifier | 77.5 | 80.95 | 77.27 | 0.77

6.3 Summary
The critical attributes determined to cause COPD are cor pulmonale, age, and smoking; after these factors, systolic and diastolic blood pressure also have an impact on COPD. Cor pulmonale is an abnormal enlargement of the heart that can result in its failure.


Fig. 6 Heat map for important feature determination

The heart is pathologically compressed by the inflamed lungs or blood vessels, resulting in heart failure. A person affected by COPD often also has high blood pressure, which is itself a cause of heart disease. In this experiment, we have established a strong correlation between these two diseases.

7 Conclusion and Future Work
As per the India Fit report 2021, the lifestyle diseases affecting the Indian population include heart disease, high cholesterol, diabetes, high blood pressure, and thyroid disorders. Diabetes saw a very sharp surge during the year, with underlying factors of stress, poor sleep quality, physical inactivity, and an unhealthy diet. The report shows that 8.71% of teens are affected by diabetes, almost double the rate among young adults (4.46%). The situation is also alarming for adults, older adults, and seniors, where seniors show almost a threefold increase compared with adults. Our experiments were carried out on datasets with a sample size of 200 COPD patients and approximately 300 records each in the Cleveland and Statlog datasets. A more extensive study can be carried out with a larger number of patients to obtain more standardized models. The classification methods may also be combined with optimization techniques to improve the accuracy of the forecasts. Such models can aid medical experts in detecting heart diseases and pulmonary diseases like COPD well in advance.


From other research works, factors such as blood pressure, smoking, obesity, diabetes, sleep quality, physical activity, and anxiety were found to be contributory factors for heart disease, and in our experiments we have established similar factors responsible for causing the disease. This research work is intended to determine the lifestyle factors responsible for heart disease and to suggest preventive care to avoid the disease and fatal outcomes. Further, this research can be extended to large hospital databases and can assist medical experts in advising their patients. Smart technological readings can even be combined to give a true picture of an individual's underlying conditions, so that the necessary precautions can be taken in time.

References 1. Reid, P.P., Compton, W.D., Grossman, J.H., Fanjiang, G.: Information and communications systems: the backbone of the health care delivery system. In: Building a Better Delivery System: A New Engineering/Health Care Partnership. National Academies Press (US) (2005) 2. Ortiz, E., Clancy, C.M.: Use of information technology to improve the quality of health care in the United States. Health Serv. Res. 38(2), xi (2003) 3. Lun, K.C.: The role of information technology in healthcare cost containment. Singap. Med. J. 36, 32–34 (1995) 4. Pantelopoulos, A., Bourbakis, N.G.: A survey on wearable sensor-based systems for health monitoring and prognosis. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 40(1), 1–12 (2009) 5. Benjamin, E.J., Blaha, M.J., Chiuve, S.E., Cushman, M., Das, S.R., Deo, R., De Ferranti, S.D., Floyd, J., Fornage, M., Gillespie, C., Muntner, P.: Heart disease and stroke statistics—2017 update: a report from the American Heart Association. Circulation 135(10), e146-e603 (2017) 6. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., Kingsbury, B.: Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012) 7. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, 82–97 (2012) 8. Hirschberg, J., Manning, C.D.: Advances in natural language processing. Science 349, 261–266 (2015) 9. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Fei-Fei, L.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015) 10. Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., Cui, C., Corrado, G., Thrun, S., Dean, J.: A guide to deep learning in healthcare. Nat. Med. 25(1), 24–29 (2019) 11. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017) 12. Shen, D., Wu, G., Suk, H.I.: Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 19, 221–248 (2017) 13. Carneiro, G., Zheng, Y., Xing, F., Yang, L.: Review of deep learning methods in mammography, cardiovascular, and microscopy image analysis. In: Deep Learning and Convolutional Neural Networks for Medical Image Computing, pp. 11–32. Springer, Cham (2017)


14. Jang, H.J., Cho, K.O.: Applications of deep learning for the analysis of medical data. Arch. Pharmacal. Res. 42(6), 492–504 (2019) 15. Panda, D., Ray, R., Dash, S.R.: Feature selection: role in designing smart healthcare models. In: Smart Healthcare Analytics in IoT Enabled Environment, pp. 143–162. Springer, Cham (2020) 16. Palaniappan, S., Awang, R.: Intelligent heart disease prediction system using data mining techniques. In: 2008 IEEE/ACS International Conference on Computer Systems and Applications, pp. 108–115. IEEE (Mar 2008) 17. Bhatia, S., Prakash, P., Pillai, G.N.: SVM based decision support system for heart disease classification with integer-coded genetic algorithm to select critical features. In: Proceedings of the World Congress on Engineering and Computer Science, pp. 34–38 (Oct 2008) 18. Duangsoithong, R., Windeatt, T.: Relevant and redundant feature analysis with ensemble classification. In: 2009 Seventh International Conference on Advances in Pattern Recognition, pp. 247–250. IEEE. (Feb 2009) 19. Lin, K.C., & Chien, H.Y.: CSO-based feature selection and parameter optimization for support vector machine. In: 2009 Joint Conferences on Pervasive Computing (JCPC), pp. 783–788. IEEE. (Dec 2009) 20. Srinivas, K., Rao, G.R., Govardhan, A.: Analysis of coronary heart disease and prediction of heart attack in coal mining regions using data mining techniques. In: 2010 5th International Conference on Computer Science & Education, pp. 1344–1349. IEEE. (Aug 2010) 21. Son, Y.J., Kim, H.-G., et al.: Application of SVM in medical adherence in heart failure patients, pp. 253–259 (2010). ISSN 22. Khemphila, A., Boonjing, V.: Heart disease classification using neural network and feature selection. In: 2011 21st International Conference on Systems Engineering, pp. 406–409. IEEE (Aug 2011) 23. Fida, B., Nazir, M., Naveed, N., Akram, S.: Heart disease classification ensemble optimization using genetic algorithm. In: 2011 IEEE 14th International Multitopic Conference, pp. 19–24. IEEE (Dec 2011) 24. Bhatla, N., Jyoti, K.: An analysis of heart disease prediction using different data mining techniques. Int. J. Eng. 1(8), 1–4 (2012) 25. Sundar, N.A., Latha, P.P., Chandra, M.R.: Performance analysis of classification data mining techniques over heart disease database. Int. J. Eng. Sci. Adv. Technol. 2(3), 470–478 (2012) 26. Jabbar, M.A., Deekshatulu, B.L., Chandra, P.: Heart disease classification using nearest neighbor classifier with feature subset selection. Anale. Seria Informatica 11, 47–54 (2013) 27. Patel, S.B., Yadav, P.K., Shukla, D.P.: Predict the diagnosis of heart disease patients using classification mining techniques. IOSR J. Agric. Vet. Sci. (IOSR-JAVS) 4(2), 61–64 (2013) 28. Subanya, B., Rajalaxmi, R.: A novel feature selection algorithm for heart disease classification. Int. J. Comput. Intell. Inf. 4(2) (2014) 29. Kumar, S., Sahoo, G.: Classification of heart disease using Naive Bayes and genetic algorithm. In: Computational Intelligence in Data Mining, vol. 2, pp. 269–282. Springer, New Delhi (2015) 30. Krishnaiah, V., Srinivas, M., Narsimha, G., Chandra, N.S.: Diagnosis of heart disease patients using fuzzy classification technique. In: International Conference on Computing and Communication Technologies, pp. 1–7. IEEE (Dec 2014) 31. Lakshmi, K.P., Reddy, C.R.K.: Fast rule-based heart disease prediction using associative classification mining. In: 2015 International Conference on Computer, Communication and Control (IC4), pp. 1–5. IEEE (Sept 2015) 32. 
Radhimeenakshi, S., Nasira, G.M.: Remote heart risk monitoring system based on efficient neural network and evolutionary algorithm. Indian J. Sci. Technol. 8(14), 1 (2015) 33. Ali, L., Niamat, A., Golilarz, N.A., Ali, A., Xingzhong, X.: An expert system based on optimized stacked support vector machines for effective diagnosis of heart disease. IEEE Access (2019) 34. Kavitha, R., Kannan, E.: An efficient framework for heart disease classification using feature extraction and feature selection technique in data mining. In: 2016 International Conference on Emerging Trends in Engineering, Technology and Science (icetets), pp. 1–5. IEEE (Feb 2016)


35. Vivekanandan, T., Iyengar, N.C.S.N.: Optimal feature selection using a modified differential evolution algorithm and its effectiveness for prediction of heart disease. Comput. Biol. Med. 90, 125–136 (2017) 36. Rairikar, A., Kulkarni, V., Sabale, V., Kale, H., Lamgunde, A.: Heart disease prediction using data mining techniques. In: 2017 International Conference on Intelligent Computing and Control (I2C2), pp. 1–8. IEEE (June 2017) 37. Liu, X., Wang, X., Su, Q., Zhang, M., Zhu, Y., Wang, Q., Wang, Q.: A hybrid classification system for heart disease diagnosis based on the RFRS method. Comput. Math. Methods Med. (2017) 38. Haq, A.U., Li, J.P., Memon, M.H., Nazir, S., Sun, R.: A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mobile Inf. Syst. (2018) 39. Yazid, M.H.A., Satria, H., Talib, S., Azman, N.: Artificial neural network parameter tuning framework for heart disease classification. Proc. Electric. Eng. Comput. Sci. Inf. 5(1), 674–679 (2018) 40. Panda, N.K., Subashini, M.M., Kejriwal, M.: Rheumatic heart disease classification using adaptive filters. In: MATEC Web of Conferences, vol. 225, p. 03006. EDP Sciences (2018) 41. Haq, A. U., Li, J., Memon, M.H., Memon, M.H., Khan, J., Marium, S.M.: Heart disease prediction system using model of machine learning and sequential backward selection algorithm for features selection. In: 2019 IEEE 5th International Conference for Convergence in Technology (I2CT), pp. 1–4. IEEE (Mar 2019) 42. Javeed, A., Zhou, S., Yongjian, L., Qasim, I., Noor, A., Nour, R.: An intelligent learning system based on random search algorithm and optimized random forest model for improved heart disease detection. IEEE Access 7, 180235–180243 (2019) 43. Latha, C.B.C., Jeeva, S.C.: Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques. Inf. Med. Unlocked 16, 100203 (2019) 44. Baccouche, A., Garcia-Zapirain, B., Castillo Olea, C., Elmaghraby, A.: Ensemble deep learning models for heart disease classification: a case study from Mexico. Information 11(4), 207 (2020) 45. Akgül, M., Sönmez, Ö.E., Özcan, T.: Diagnosis of heart disease using an intelligent method: a hybrid ANN–GA approach. In: International Conference on Intelligent and Fuzzy Systems, pp. 1250–1257. Springer, Cham (July 2019) 46. Shah, S.M.S., Shah, F.A., Hussain, S.A., Batool, S.: Support vector machines-based heart disease diagnosis using feature subset, wrapping selection and extraction methods. Comput. Electr. Eng. 84, 106628 (2020) 47. Li, H., Luo, M., Zheng, J., Luo, J., Zeng, R., Feng, N., Du, Q., Fang, J.: An artificial neural network prediction model of congenital heart disease based on risk factors: a hospital-based case-control study. Medicine 96(6) (2017). 48. Ornish, D., Scherwitz, L.W., Billings, J.H., Gould, K.L., Merritt, T.A., Sparler, S., Armstrong, W.T., Ports, T.A., Kirkeeide, R.L., Hogeboom, C., Brand, R.J.: Intensive lifestyle changes for reversal of coronary heart disease. JAMA 280(23), 2001–2007 (1998) 49. Hu, F.B.: Diet and lifestyle influences on risk of coronary heart disease. Curr. Atheroscler. Rep. 11(4), 257–263 (2009) 50. Chiuve, S.E., Rexrode, K.M., Spiegelman, D., Logroscino, G., Manson, J.E., Rimm, E.B.: Primary prevention of stroke by healthy lifestyle. Circulation 118(9), 947 (2008) 51. 
Sin, N.L., Kumar, A.D., Gehi, A.K., Whooley, M.A.: Direction of association between depressive symptoms and lifestyle behaviors in patients with coronary heart disease: the Heart and Soul Study. Ann. Behav. Med. 50(4), 523–532 (2016) 52. Liu, G., Li, Y., Hu, Y., Zong, G., Li, S., Rimm, E.B., Hu, F.B., Manson, J.E., Rexrode, K.M., Shin, H.J., Sun, Q.: Influence of lifestyle on incident cardiovascular disease and mortality in patients with diabetes mellitus. J. Am. Coll. Cardiol. 71(25), 2867–2876 (2018) 53. Wilson, P.W., Abbott, R.D., Castelli, W.P.: High density lipoprotein cholesterol and mortality. The Framingham Heart Study. Arterioscler. Official J. Am. Heart Assoc. Inc. 8(6), 737–741 (1988) 54. UCI Machine Learning Repository [homepage on the Internet]. Arlington: The Association; 2006; updated 1996 Dec 3; cited 2011 Feb 2. http://archive.ics.uci.edu/ml/datasets/Heart+Dis ease


55. Panda, D., Ray, R., Abdullah, A.A., Dash, S.R.: Predictive systems: role of feature selection in prediction of heart disease. J. Phys. Conf. Ser. 1372(1), 012074 (Nov 2019). IOP Publishing 56. Panda, D., Panda, D., Dash, S.R., Parida, S.: Extreme earning Machines with feature selection using GA for effective prediction of fetal heart disease: a novel approach. Informatica 45(3) 57. National Institute of Child Health and Human Development Research Planning Workshop: Electronic fetal heart rate monitoring: research guidelines for interpretation. Am. J Obstet. Gynecol. 177, 1385–1390 (1997) 58. Schmidt, J.V., McCartney, P.R.: History and development of fetal heart assessment: a Composite. J. Obstet. Gynecol. Neonatal. Nurs. 29(3), 295–305 (2000) 59. Campos, D.A.D., Spong, C.Y., Chandraharan, E.: FIGO consensus guidelines on intrapartum fetal monitoring: Cardiotocography. Int. J. Gynecol. Obstet. 131(1), 13–24 (2015) 60. Singh, R.S., Saini, B.S., Sunkaria, R.K.: Detection of coronary artery disease by reduced features and extreme learning machine. Clujul Med. 91(2), 166 (2018) 61. Nikam, S., Shukla, P., Shah, M.: Cardiovascular disease prediction using genetic algorithm and neuro-fuzzy system (2017) 62. Comert, Z., Kocamaz, A.F., Gungor, S.: Classification and comparison of cardiotocography signals with artificial neural network and extreme learning machine 63. Hoodbhoy, Z., Noman, M., Shafique, A., Nasim, A., Chowdhury, D., Hasan, B.: Use of machine learning algorithms for prediction of fetal risk using cardiotocographic data. Int. J. Appl. Basic Med. Res. 9(4), 226 (2019) 64. Li, B., Li, Y., Rong, X.:The extreme learning machine learning algorithm with tunable activation function. Neural Comput. Appl. 1–9 (2013) 65. Huang, G.B., Wang, D.H., Lan, Y.: Extreme learning machines: a survey. Int. J. Mach. Learn. Cybern. 2(2), 107–122 (2011) 66. Cao, J., Lin, Z.: Extreme learning machines on high dimensional and large data applications: a survey. Math. Probl. Eng. (2015) 67. Miehe, Y., Sorjamaa, A., Bas, P., Simula, O., Jutten, C., Lendasse, A.: OP-ELM: optimally pruned extreme learning machine. IEEE Trans. Neural Netw. 21(1), 158–162 (2009)

Performance Optimization Strategies for Big Data Applications in Distributed Framework
Mir Wajahat Hussain and Diptendu Sinha Roy

Abstract The evolution and advancement of Information and Communication Technologies (ICT) have enabled large-scale distributed computing with a huge number of applications for a massive number of users. This has generated large volumes of data, severely burdening the processing capacity of computers as well as inflexible traditional networks. State-of-the-art methods for datacenter-level performance fixes are still found wanting when it comes to the processing, storage, and network movement of this voluminous data with proprietary protocols. In this chapter, the work focuses on addressing backend server performance through effective reducer placement, an intelligent compression policy, and the handling of slower tasks, as well as in-network performance-boosting techniques through effective traffic engineering, traffic classification, topology discovery, energy minimization, and load balancing in datacenter-oriented applications. Hadoop, the de facto standard for distributed big data storage and processing, has been designed to store and process large datasets reliably with its Hadoop Distributed File System (HDFS) and its MapReduce processing engine. However, the processing performance of Hadoop depends critically on the time taken to transfer data during the shuffle generated by MapReduce. Also, during concurrent execution of tasks, slower tasks need to be properly identified and efficiently handled to improve job completion time. To overcome these limitations, three contributions have been made: (i) compression of the generated map outputs at a suitable time, before all the map tasks are completed, to shift load from the network onto the CPU; (ii) placing the reducer on the nodes where the computation done is highest, based on a pair of counters maintained at the rack level and at the node level, to minimize run-time data copying; and (iii) placing the slower map tasks on the nodes where the computation done is highest, while the network is handled by prioritization.


Software defined networking (SDN) has been a boon for next-generation networking owing to the separation of the control plane from the data plane. It has the capability to address network requirements in a timely manner by setting up flows for every data movement and by gathering extensive network statistics at the controller to make informed decisions about the network. A core issue for the controller is traffic classification, which can substantially assist SDN controllers in making efficient routing and traffic engineering decisions. This chapter presents a traffic classification scheme utilizing three classifiers, namely Feed-forward Neural Network (FFNN), Logistic Regression (LR), and Naïve Bayes, and employing Particle Swarm Optimization (PSO) for improved traffic classification with low overhead and without overlooking the key Quality of Service (QoS) criteria. Minimizing energy consumption and link utilization is also important for lowering the operating cost of the network and utilizing it effectively. This issue is addressed in the chapter by formulating a multi-objective problem that simultaneously respects the QoS constraints; since no polynomial-time solution exists, an evolutionary metaheuristic based on clonal selection, namely Clonal Selection Based Energy Minimization (CSEM), has been devised. The obtained results show the efficacy of the proposed traffic classification scheme and the CSEM-based solution as compared with state-of-the-art techniques. SDN is a promising newer network paradigm, but security issues and the expensive capital procurement of SDN limit its full deployment; hence moving to a hybrid SDN (h-SDN) deployment is the logical way forward. The use of both centralized and decentralized paradigms in h-SDN, with its intrinsic interoperability issues, poses challenges for topology gathering by the controller (needed for proper allocation of network resources) and for traffic engineering for optimum network performance. State-of-the-art protocols for topology gathering, such as the Link Layer Discovery Protocol (LLDP) and the Broadcast Domain Discovery Protocol (BDDP), require a huge number of messages, and such schemes gather link information only for SDN devices, leaving out legacy switches' (LS) links, which results in sub-optimal performance. This chapter provides novel schemes that perform topology discovery with fewer messages and gather link information for all devices in both single- and multi-controller environments (the latter may be used when scalability is a concern in h-SDN). Traffic engineering problems in h-SDN are addressed by proper placement of SDN nodes, analyzing the key criteria of traffic details and node degree while lowering link utilization in real topologies. The results of the proposed schemes for topology discovery and SDN node placement demonstrate their merits as compared with state-of-the-art protocols.
Keywords Big data · SDN · Hybrid SDN · Link discovery · Traffic classification · Traffic engineering and energy minimization
M. W. Hussain, Department of Computer Science and Engineering, Alliance University, Anekal, Karnataka, India. D. S. Roy, Department of Computer Science and Engineering, National Institute of Technology Meghalaya, Shillong, Meghalaya, India.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. S. R. Dash et al. (eds.), Intelligent Technologies: Concepts, Applications, and Future Directions, Volume 2, Studies in Computational Intelligence 1098. https://doi.org/10.1007/978-981-99-1482-1_10


1 Performance Improvements in Big Data and SDN
1.1 Introduction
The continuous technological development of Information and Communication Technologies (ICT) has led to tremendous growth in the number of applications and a huge generation of data, which has left many unprepared. This huge generation of data is termed "big data": datasets whose size exceeds the capability of a typical database system. Serious research effort has gone into minimizing the average computation time of jobs over big data. The computation time of a job in the MapReduce computation engine can be minimized in several ways, such as proper scheduling of (map/reduce) tasks, effective placement of data, and tuning of job parameters. MapReduce as a computation paradigm is itself divided into sub-phases, where some sub-phases depend on others and some are independent [1]. The key issue in MapReduce computation is to minimize the job completion time [2]. MapReduce jobs tend to generate a large intermediate shuffle during the map phase. This shuffle must be delivered to the reduce phase, as the other reduce sub-phases do not proceed until the shuffle is completed [3]. Thus, reducing the shuffle can play a pertinent role in reducing the computation time of the job. This reduction can be addressed in two ways: through proper placement of the reducer, or by reducing the quantum of shuffle data by means of compression [4, 5]. In Hadoop, reduce tasks tend to be placed on arbitrary nodes, ignoring node information and computational workload, which can incur huge performance penalties due to run-time data transfers [6]. This implies that proper placement of the reducer is crucial: it can greatly alleviate real-time data copying across the nodes of the cluster and thus improve Hadoop performance. Compression can also prove to be a handy tool to reduce the quantum of shuffle generated from a cluster of nodes where the network is the bottleneck and I/O dominates the latency. Hadoop by default starts compression only after all the map tasks are completed; if a policy is implemented to start compression alongside the map tasks at an apt time, performance can be improved. Thus, intelligent placement of the reducer and choosing the apt time to start compression can both prove successful in reducing the completion time. Also, in a MapReduce application, not all nodes perform equally; there exist huge differences among the computational capabilities of the nodes within the cluster. Thus, the slowness of nodes while performing map tasks is a pertinent concern that significantly impacts the performance of a MapReduce application. Task slowness is addressed in Hadoop by speculating the task onto other nodes having an empty slot [7]. However, without considering the computational power of the node, such anonymous placement may lead to a formidable increase in the computation time of the job. Thus, scheduling the slowest tasks on proper nodes and improving network resources can improve the computation time of the job in a prolific way.


1.2 Open Issues
The works addressing the minimization of completion time in Hadoop have focused primarily on data placement, tuning of job parameters, and the map phase of the job, whereas both task speculation (which takes place if a task appears slow) and reducer placement have been less explored. Thus, developing appropriate algorithms to reduce the computation time is the need of the hour. After carrying out a thorough literature survey, the following gaps have been identified:
• In distributed systems, not all nodes perform equally, so performance may be affected by a slow-performing node. Since map tasks run in parallel, performance is seriously impacted by the slowest map task; hence a proper policy is needed for scheduling speculated tasks on nodes with better computational resources and network conditions.
• Task speculation has been addressed by several researchers, but most works have overlooked the bandwidth requirements and the data-skew factor, which are critical in deciding whether a task should be speculated.
• Speculated tasks should be placed on nodes which have performed better, as indicated by a node counter that is updated each time a map task is completed.
• CPU utilization is low for many MapReduce jobs, so the surplus processor cycles can be used to compress the map output, thereby improving the slowest reduce sub-phase (shuffle).
• The shuffle phase must be completed before the other sub-phases of Hadoop, so an intelligent compression policy is needed for performance improvement. Compression must be started at a suitable time while some map tasks are still running, so that map and compression occur concurrently to reduce both the time and the shuffle size.
• As pointed out earlier, the shuffle emanating from the map tasks tends to be huge, and the other phases do not proceed until the shuffle sub-phase of MapReduce is completed. Thus, reducer placement plays a critical role in reducing the computation time of the job and improving MapReduce performance.

1.3 Counter Based Reducer Placement

The topology-related information of the cluster nodes in Hadoop is maintained through the default rack awareness policy. Hadoop leverages the rack awareness policy to collocate map tasks with the data blocks they require, thereby avoiding performance degradation. Figure 1 describes the sequence of operations required to place the reducer under the counter-based reducer placement scheme. Two counters, MAP_COUNTER and RACK_COUNTER, are maintained at every node and rack in the cluster. These counters are incremented


Fig. 1 Counter based reducer placement sequence diagram

after every map task is completed. The JobTracker allocates tasks to the various TaskTrackers, and after the completion of the map tasks (early/late shuffle), several TaskTrackers send requests to run reduce tasks. The JobTracker then ascertains, from the node and rack counter values, the rack and node at which the most computation has been done. After identifying the node and rack with maximum computation, the JobTracker grants the reduce request to the corresponding TaskTracker [8]. This set of operations places the reducer where the shuffle data required for reduce is minimal, which eventually improves MapReduce performance.
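
The decision can be pictured with the following minimal Python sketch. It is purely illustrative: the names on_map_task_completed and pick_reducer_node are assumptions for this sketch, not Hadoop APIs; the dictionaries simply mirror the MAP_COUNTER/RACK_COUNTER idea described above.

from collections import defaultdict

# Per-node and per-rack counters, incremented on every completed map task.
MAP_COUNTER = defaultdict(int)    # node id -> completed map tasks
RACK_COUNTER = defaultdict(int)   # rack id -> completed map tasks

def on_map_task_completed(node, rack):
    """Called when a TaskTracker reports a finished map task."""
    MAP_COUNTER[node] += 1
    RACK_COUNTER[rack] += 1

def pick_reducer_node(nodes_by_rack):
    """Grant the reduce request to a node in the rack with maximum computation."""
    best_rack = max(nodes_by_rack, key=lambda r: RACK_COUNTER[r])
    # Within that rack, pick the node that has completed the most map tasks,
    # i.e. the node that already holds the largest share of intermediate data.
    return max(nodes_by_rack[best_rack], key=lambda n: MAP_COUNTER[n])

# Example: three completed maps on n1 (rack r1) and one on n3 (rack r2).
for node, rack in [("n1", "r1"), ("n1", "r1"), ("n1", "r1"), ("n3", "r2")]:
    on_map_task_completed(node, rack)
print(pick_reducer_node({"r1": ["n1", "n2"], "r2": ["n3"]}))  # -> n1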


Fig. 2 MapReduce job execution without compression

1.4 Intelligent Data Compression Policy

A MapReduce job is composed of multiple independent tasks, and each task needs a specific system slot in which to run. Figures 2, 3 and 4 plot the time schedule and the job lifespan on the X and Y axes respectively. Although each task may take a different amount of time to execute, we assume here that all tasks take a similar time. In Fig. 2, the job completes at time t20 as there is no compression. Figure 3 shows a scenario with compression: all the map tasks of the job complete at t14, and then compression is started. Compression shifts load from the network onto the CPU, requires one slot to complete, and finishes at t15. After the compression of the map output data, shuffling comes into the picture; since compression reduces the size of the data to be shuffled, a single time slot is required to complete the shuffle. Once the shuffle is done, the reduce tasks are executed and the job finally completes at t19. Reduce tasks in Hadoop must wait until the shuffle has completed; a policy can therefore be designed whereby map and shuffle operate concurrently to reduce the job completion time. Figure 4 demonstrates such a scenario, where compression starts at t13, prior to the completion of all the map tasks. Running map and shuffle together minimizes the time required, and the shuffle finally completes at t15; the reduce tasks are then executed at t16 and the job completes at t18. Thus, as shown in Fig. 4, enabling compression so that map tasks and shuffle occur simultaneously minimizes the Hadoop job computation time and improves both throughput and the resource utilization rate [4]. The stage at which map and compression operate together must be properly selected to reduce the job computation time.
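
A minimal sketch of such a start-time policy is given below; the monitoring hook, threshold value and function name are illustrative assumptions for this sketch, not part of Hadoop.

def should_start_compression(completed_maps, total_maps, threshold=0.8):
    """Start compressing map output before *all* maps finish.

    Returns True once a configurable fraction of map tasks has completed,
    so that compression and the remaining map tasks run concurrently
    (the t13 scenario of Fig. 4) instead of serially (Fig. 3).
    """
    return completed_maps / total_maps >= threshold

# Example with 9 map tasks: compression is triggered at the 8th completion.
total = 9
for done in range(1, total + 1):
    if should_start_compression(done, total):
        print(f"start compressing spilled map output after {done}/{total} maps")
        break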

1.5 Resource Aware Task Speculation

A Hadoop cluster of 4 machines is considered, with machines 1–3 acting as WorkerNodes and the last machine as the MasterNode. Figure 5 demonstrates the map task execution scenario; the task execution time is shown inside each block. From Fig. 5, it can be


Fig. 3 MapReduce job execution utilizing compression

Fig. 4 MapReduce job execution utilizing intelligent compression policy

deduced that after time 3t has elapsed, the task execution times are written in a particular notation: in the representation t++t, '++t' denotes the execution time still remaining, while 't' denotes the time for which the task has already executed. Assume that each task normally requires an execution time of either 't' or '2t'. The MapReduce job described has a total of 9 map tasks to complete. Node profiling is done at time quantum '4t' based on the number of map tasks completed by each node. From the node profiling, 4, 2 and 1 tasks have been completed by WorkerNodes 1, 2 and 3 respectively, and nodes 2 and 3 are each currently processing one further task. WorkerNode 1 is free after time '4t' and notifies the JobTracker through its heartbeat. At '4t' all the map tasks have already been scheduled, so node 1, being free, becomes suitable for task speculation. The remaining execution time of the tasks currently scheduled at nodes 2 and 3 is considered for speculation; let these remaining times be 't' and '2t', which can be verified in terms of the records processed/unprocessed. Based on the remaining time, the task currently scheduled at node 3 is the best candidate for speculation owing to its higher remaining time. The speculative copy of this task is executed at node 1 and takes '2t' time to complete. When the copy completes at node 1, the MasterNode is informed and the original task still running at node 3 is killed. In total, a time of 't' is saved through proper task scheduling, and performance improves owing to the controlled speculation [9]. The above three works were carried out as an initial investigation; the subsequent contributions are primarily targeted at the domain of SDN.
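
The decision logic of this example can be summarized by the illustrative sketch below; it is a simplification under the stated assumptions, and the function names and data structures are hypothetical, not Hadoop internals.

def pick_task_to_speculate(running_tasks):
    """running_tasks: list of (task_id, node, remaining_time) tuples."""
    # Speculate the task with the largest remaining execution time,
    # since re-running it elsewhere yields the largest potential saving.
    return max(running_tasks, key=lambda t: t[2])

def pick_speculation_node(free_nodes, completed_maps):
    """Prefer the free node that has completed the most map tasks so far."""
    return max(free_nodes, key=lambda n: completed_maps.get(n, 0))

# Profiling at time 4t: WorkerNode 1 is free and has completed 4 maps;
# the tasks at nodes 2 and 3 have 't' and '2t' remaining, respectively.
task = pick_task_to_speculate([("m8", "node2", 1), ("m9", "node3", 2)])
node = pick_speculation_node(["node1"], {"node1": 4, "node2": 2, "node3": 1})
print(task[0], "is speculated on", node)   # m9 is speculated on node1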


Fig. 5 Task speculation using the resource aware execution technique

1.6 Results-Counter Based Reducer Placement

A small Hadoop cluster was set up on an HP ProLiant blade server for carrying out the experiments. Using the Hyper-V manager and virtualization, 12 nodes were created and segregated into 3 racks following Hadoop's rack awareness policy-oriented configuration; all the nodes are equivalent in their software/hardware capability. Several benchmarks were utilized to test the efficacy of the proposed technique, including Sort, WordCount and TeraSort. The key performance indicator used is the job execution time. In each iteration the number of map tasks is scaled up to observe the varying pattern. Figures 6, 7 and 8 demonstrate the effect on the completion time of the job as the number of map tasks increases. TeraSort and Sort show major performance improvement because they generate much more intermediate data than WordCount. The average performance improvement in WordCount, Sort and TeraSort is about 7%, 15.3% and 17.2% respectively.

1.7 Results-Intelligent Compression

The compression policy was evaluated on a testbed of 6 nodes: a single Master node and 5 Slave nodes. The Master node acts as NameNode and JobTracker, while each Slave node acts as TaskTracker and DataNode. To test the efficacy of the proposed scheme, a variety of benchmarks such as GREP, Sort and TeraSort are run. The key performance indicators include the computation time required for

Fig. 6 WordCount benchmark with late shuffle in CBRP scheme

Fig. 7 Sort benchmark with late shuffle in CBRP scheme

Fig. 8 TeraSort benchmark with late shuffle in CBRP scheme


the job. In each iteration, the number of map tasks is continuously increased to observe the varying pattern. Figures 9, 10 and 11 show the effect on the average computation time as the number of map tasks and the data size increase in the cluster. TeraSort and Sort show greater improvement than GREP owing to their generation of a large intermediate shuffle; GREP is a map-intensive task and shows only a small performance improvement. Thus, a proper compression policy is required to minimize the quantum of shuffle and reduce the computation time. The average computation time for the GREP, Sort and TeraSort benchmarks is reduced by about 8%, 14% and 15% respectively.

Fig. 9 Computation time required for the GREP benchmark in intelligent compression

Fig. 10 Computation time required for the TeraSort benchmark in intelligent compression


Fig. 11 Computation time required for the Sort benchmark in intelligent compression

1.8 Results-Resource Aware Task Speculation

The proposed resource aware task speculation was evaluated on a testbed of 10 nodes: a single Master node and 9 Slave nodes. The Master node acts as JobTracker and NameNode, while each Slave node acts as TaskTracker and DataNode. The nodes were separated by an SDN switch into 2 racks. Based on the number of map tasks completed, the nodes were labelled as better-performing or low-performing nodes. To test the efficacy of the proposed scheme, a variety of benchmarks such as WordCount, Sort and TeraSort are run. The key performance indicators include the computation time required for the job. In each iteration, the number of map tasks is continuously increased to observe the varying pattern. Figures 12, 13 and 14 describe the effect on the average computation time as the number of map tasks and the data size increase in the cluster. When the data size is small, there is no significant difference between speculation being on or off; as the data size increases, the performance improvement of the proposed task speculation is much better than with speculation turned off or with anonymous task speculation. The average computation time for the above benchmarks is reduced by about 10–15%.

2 Topology Discovery in Hybrid SDN

2.1 Introduction

Topology discovery is an essential service initiated by the controller; it is a continuous process for learning the network resources available at a particular instant [10, 11].


Fig. 12 Computation time required for the WordCount benchmark in resource aware task scheme

Fig. 13 Computation time required for the TeraSort benchmark in resource aware task scheme

Since the controller manages the whole network, care must be taken to devise efficient topology discovery solutions that do not overwhelm the controller. Here, topology discovery is handled in the context of h-SDN, which includes both traditional and SDN nodes and where the challenges are considerable owing to the varied protocols used by the devices [12]. Firstly, the number of messages emanating from the controller is minimized while fetching the link information of SDN nodes in h-SDN. Secondly, a scheme is proposed that finds the link information of all nodes (traditional and SDN) while keeping messages to a minimum in a single-controller environment. Finally, as the number of switches in the network scales up, a single controller is no longer enough to maintain the overall network; thus, link discovery needs to be fetched


Fig. 14 Computation time required for the Sort benchmark in resource aware task scheme

in a multi-controller h-SDN environment to find the links of all devices, which ensures optimal performance in the network.

2.2 Open Issues

Works pertaining to SDN link discovery have either focused on minimizing the messages required or on addressing security concerns in the Link Layer Discovery Protocol (LLDP), and most have addressed these issues in a complete SDN environment. In the h-SDN environment, both OpenFlow switches (OFS) and Legacy switches (LS) operate, so for the controller to allocate network resources to applications, topology discovery is a must. The following gaps have been identified:
• Link discovery protocols such as LLDP and the Broadcast Domain Discovery Protocol (BDDP) incur a considerable number of messages to fetch the link information of SDN devices, proportional to the number of active switch ports, which puts a considerable load on the controller. Thus, a need arises to fetch the link information of SDN devices with fewer messages.
• State-of-the-art protocols are unable to unearth the link information of traditional devices, which results in sub-optimal performance for the controller in the network. Thus, a scheme is needed through which the controller gathers link information of all the devices (OpenFlow and Legacy) in the network.
• Scalability issues are prevalent in partial deployments owing to the increasing interoperability between nodes and the scaling up of OpenFlow devices in the network. Thus, a definite scheme needs to be devised for gathering link discovery in a multi-controller environment, since link discovery is a continuous process.


• A robust mechanism must be put in place to avoid security concerns during link discovery in h-SDN owing to interoperability.

2.3 Indirect Link Discovery (ILD)

State-of-the-art protocols for link discovery in SDN require a huge number of messages (Packet-Out, Pout) to be dispatched from the controller, and this large number of Pout events limits the controller's performance. The ILD scheme therefore detects links in a partial deployment of OFS by restricting the number of Pout events to 1; this single message is spread across the network to gather the link information [13].

2.4 Broadcast Based Link Discovery (BBLD)

State-of-the-art protocols require a Pout for every active interface of every OFS. The SDN controller must also perform other key roles such as host addition/deletion, routing, network management and monitoring, and a large number of Pout events limits its performance. The BBLD scheme gathers link information through a single Pout emanating from the controller, which is broadcast across all the OFS; on reception of the Pout, each OFS replies to the controller via a Packet-In (Pin) [14].
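
As a rough illustration of the resulting controller load (a sketch only, based on the per-port, per-switch and single-message behaviours described in this section; the function name and example figures are assumptions), the Packet-Out count per discovery round can be compared as follows.

def pout_per_round(scheme, switches, active_ports):
    """Packet-Out messages per discovery round for each scheme (illustrative)."""
    if scheme == "OFDP":      # one Pout per active switch interface
        return active_ports
    if scheme == "OFDP-v2":   # one Pout per switch
        return switches
    if scheme == "BBLD":      # a single broadcast Pout for the whole network
        return 1
    raise ValueError(scheme)

# Example: a topology of 10 OFS with 40 active ports in total.
for s in ("OFDP", "OFDP-v2", "BBLD"):
    print(s, pout_per_round(s, switches=10, active_ports=40))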

2.5 Indirect Controller Legacy Forwarding (ICLF)

The controller is responsible for initiating link discovery in both directions of each link, which requires an increased quantum of messages. Moreover, state-of-the-art protocols are unable to detect direct/interconnected LS–LS links, which leaves the global link information incomplete. ICLF gathers link information by sending a single Pout from the controller into the network. The frame emanating from the controller is circulated into every nook and corner of the network; successive OFS and LS append their link information inside the frame and ultimately dispatch it to the controller. On every reception of a Pin, the link information is updated bidirectionally, thus requiring fewer messages to capture the topology [15]. Further, a novel frame format is proposed for carrying the link information.
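
The hop-by-hop appending of link information can be pictured with the simplified sketch below; the record layout and function names are assumptions made for illustration only, not the frame format proposed in [15].

def forward_discovery_frame(frame, switch_id, in_port, out_ports):
    """Append this hop's (switch, port) record and flood to the remaining ports.

    frame: list of (switch_id, port) records accumulated so far.
    out_ports: ports of this switch other than the ingress port.
    """
    frame = frame + [(switch_id, in_port)]
    return [(frame, p) for p in out_ports]

def on_packet_in(frame, links):
    """Controller side: derive bidirectional links from consecutive records."""
    for (sw_a, port_a), (sw_b, port_b) in zip(frame, frame[1:]):
        links.add(((sw_a, port_a), (sw_b, port_b)))
        links.add(((sw_b, port_b), (sw_a, port_a)))   # update both directions

links = set()
frame_at_edge = [("ofs1", 2), ("ls1", 1), ("ofs2", 3)]   # path the frame traversed
on_packet_in(frame_at_edge, links)
print(len(links))   # 4 directed link entries recovered from a single Pin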


2.6 Extended Indirect Controller Legacy Forwarding (E-ICLF)

In E-ICLF, link information is gathered by sending a single Pout from each domain controller; the Pout is disseminated across all the controller domains, with nodes appending their interface information to the packet. The accumulated link information is carried back in Pin messages and finally dispatched to the controllers, enabling link discovery per controller domain. Inter-area links are captured when an edge switch in one controller domain appends its interface information and an edge switch in a neighbouring domain, on receiving the packet, appends its own information and encapsulates the frame as a Pin to its controller [16].

2.7 Evaluation Platform

The evaluation platform used for the ILD, BBLD, ICLF and E-ICLF schemes is Mininet, one of the pioneering platforms for creating networks comprised of switches (OFS, LS), hosts, controllers, and links. Ryu, which is written in Python, is used as the controller for the emulation. The terms OpenFlow Topology Discovery Protocol (OFDP) and LLDP are used interchangeably. For the BBLD scheme a pure SDN environment is considered, while for ILD, ICLF and E-ICLF h-SDN topologies are used. For all schemes except E-ICLF, the number of controllers in the network is kept constant at 1; in the E-ICLF scheme, it is kept constant at 2. In the ICLF and E-ICLF schemes the number of LS is scaled up. The topologies used for evaluation are tree, linear and fat-tree, and the number of switches in these topologies is continuously scaled up.
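
A minimal sketch of this kind of emulation setup is shown below, assuming a Ryu controller is already running locally on OpenFlow port 6633; the topology size, address and port are illustrative, not the exact configuration used for the experiments.

from mininet.net import Mininet
from mininet.node import RemoteController, OVSSwitch
from mininet.topo import LinearTopo

# Linear topology with 4 switches and one host per switch,
# attached to an external (Ryu) controller instead of Mininet's default one.
net = Mininet(topo=LinearTopo(k=4, n=1),
              switch=OVSSwitch,
              controller=lambda name: RemoteController(name,
                                                       ip='127.0.0.1',
                                                       port=6633))
net.start()
print(net.pingAll())   # basic reachability check before measuring discovery messages
net.stop()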

2.8 Result Analysis-ILD

Figures 15 and 16 show the number of Pout messages sent by the controller and the number of packets lost while obtaining the link information in the various topologies. OFDP requires the most Pout messages and loses the most packets. SLDP optimizes this by restricting Pout messages to the OFS ports connected to hosts, thereby reducing the number of packets lost [17]. ILD requires only a single Pout to capture the link information; the received Pout is disseminated to all OFS, reducing the number of packets lost to 1. The experiments show that Pout events are reduced on average by about 75.76–48.94%, 66.28–30.95% and 78.95–60% in the linear, tree and fat-tree topologies in comparison to OFDP and OFDP-v2, and that packet loss is reduced on average by about 50–48.95%, 57.89–50% and 71.4–50% in the linear, tree and fat-tree topologies in comparison to OFDP and OFDP-v2.


Fig. 15 Pout sent for link discovery in ILD scheme

Fig. 16 Packets lost for link discovery in ILD scheme

2.9 Result Analysis-BBLD

Figures 17, 18 and 19 show the messages required to obtain the link information as the number of nodes increases in the Linear, Tree and Fat-tree topologies. The messages required are Pout and Pin. The Pout count is highest in OFDP because those messages are dispatched to all active interfaces of the OFS, while the number of Pout messages required in OFDP-v2 equals the number of OFS in the network. BBLD requires the fewest messages to obtain the link information. The number of Pout and Pin


events for BBLD is 1 and roughly one-half the number of links in the network, respectively. The experiments show a message reduction of about (42.4%, 31%), (41.8%, 30.3%) and (38.2%, 28%) compared to OFDP and OFDP-v2 in the Linear, Tree and Fat-tree topologies as the number of OFS is scaled up.

Fig. 17 Link discovery messages required in linear topology with BBLD scheme

Fig. 18 Link discovery messages required in tree topology with BBLD scheme


Fig. 19 Link discovery messages required in fat-tree topology with BBLD scheme

2.10 Result Analysis-ICLF

Figures 20, 21 and 22 describe the number of ports detected as LSs are scaled up in the network for each topology. Increasing the number of LSs shows a downward trend in port detection for both OFDP and SLDP: LLDP frames are dropped by LSs, which reduces the number of ports detected at the Ryu controller. SLDP achieves a higher degree of port detection owing to the forwarding of frames by the LSs. The proposed ICLF scheme captures both OFS and LS ports and hence shows the highest port detection. The experiments demonstrate that port detection improves on average by about (97.9%, 186.3%), (61.1%, 56.5%), (50%, 100%) and (34.6%, 73.2%) in the varied topologies. It must be noted that, due to the lack of any peer scheme in h-SDN for gathering link discovery, the increase in port detection appears substantially higher.

2.11 Result Analysis-E-ICLF

Figures 23, 24 and 25 describe the number of ports detected as the numbers of nodes and LSs are scaled up in the network. All the figures under consideration demonstrate that port detection is highest with the E-ICLF scheme, which shows the proposed scheme's efficacy: E-ICLF gathers all switch ports, including ports connected to both LSs and OFSs. SLDP shows the next highest percentage of detected ports, while LLDP shows the lowest, because it only gathers OpenFlow ports that are a single hop away [15]. The experiments show that the percentage of detected ports improves by an


Fig. 20 Ports detected in linear based h-SDN topology with ICLF scheme

Fig. 21 Ports detected in tree based h-SDN topology with ICLF scheme

Fig. 22 Ports detected in fat-tree based h-SDN topology with ICLF scheme


average of about (3.2%, 4.8%), (6.5%, 8.1%), (5.1%, 10.2%), (8.9%, 17%), (5.4%, 10.8%) and (12.3%, 21.6%) in the Linear, Tree and Fat-tree topologies respectively.

Fig. 23 Ports detected in linear based h-SDN topology with E-ICLF scheme

Fig. 24 Ports detected in tree based h-SDN topology with E-ICLF scheme


Fig. 25 Ports detected in fat-tree based h-SDN topology with E-ICLF scheme

3 Traffic Classification and Energy Minimization in SDN

3.1 Introduction

The controller gathers considerable statistics from the data plane, which makes it an ideal place to develop proper solutions for improving network performance (classifying traffic, minimizing energy consumption, and improving link utilization). Applications using the network need to be allocated proper network resources to avoid QoS degradation, which requires traffic classification at the controller [18, 19]. Also, lowering the cost of the network is essential for increasing the profit of enterprises, whose networking devices often do not operate at full capacity; algorithms for minimizing the energy consumption and improving the link utilization of the network are therefore a must [20]. In this section the focus is on network traffic classification and on minimizing the total energy consumed in the network. The section equips the SDN controller's decision-making capability with Machine Learning (ML) algorithms to enable QoS-based and real-time traffic classification. An Artificial Neural Network and Particle Swarm Optimization (PSO) are used along with classifiers such as Naïve Bayes, Logistic Regression (LR) and a Feed Forward Neural Network (FFNN) to improve performance. This framework jointly leverages semi-supervised ML and optimization algorithms for better classification of traffic while keeping the overhead between SDN switches and controllers to a minimum, and its merits are shown using real-time Internet traffic datasets. The centralized nature of SDN also enables load balancing and energy consumption to be optimized jointly under a couple of constraints. For the load balancing and energy consumption, a multi-objective optimization problem (MOOP)


is proposed. A novel discrete metaheuristic solution, Clonal Selection based Energy Minimization (CSEM), is proposed to solve it. Experiments show that CSEM has merit in jointly minimizing energy and balancing load, and the results have been validated using multiple benchmarks.

3.2 Open Issues

The centralized controller in SDN is a perfect place to develop intelligent algorithms for addressing network concerns. Traffic classification is an important requirement for application-aware networking: traffic classification enabled at the SDN controller can reveal the applications' network requirements. Also, energy optimization and load balancing algorithms need to be developed to minimize the overall network costs for an enterprise. The following gaps were found in the literature:
• Machine learning algorithms have been applied with several classifiers (FFNN, NB and LR) to learn the traffic classification pattern of SDN networks and make informed decisions about the underlying applications and their QoS requirements; the proposed method does not incur any significant overhead.
• A novel multi-objective Clonal Selection Algorithm (CSA) based solution, namely CSEM, is proposed for the joint optimization of link utilization and energy consumption.

3.3 Traffic Classification Using Intelligent SDNs

Traffic classification is a prerequisite for knowing about applications and their QoS requirements, and ML plays a pivotal role here. Three classifiers (FFNN, NB and LR) were leveraged and juxtaposed with a hybrid NN-PSO approach. Normalization was applied to both the training and the testing datasets, which were obtained from open sources. The accuracy of traffic classification was evaluated using these ML algorithms, and NN-PSO improves the classification accuracy achieved with the same classifiers. The proposed traffic classification shows better performance with no significant overhead.
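
As a rough illustration of the classifier comparison (a sketch only: the synthetic data, feature choice and scikit-learn usage are assumptions, not the chapter's actual pipeline or datasets), the three baseline classifiers could be juxtaposed as follows.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for flow features such as packet length and inter-arrival time.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # two traffic classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)           # normalization step
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for name, clf in [("NB", GaussianNB()),
                  ("LR", LogisticRegression()),
                  ("FFNN", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500))]:
    clf.fit(X_train, y_train)
    print(name, round(clf.score(X_test, y_test), 3))   # per-classifier accuracy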

3.4 Clonal Selection Based Energy Minimization

Minimizing energy consumption and balancing load simultaneously under QoS constraints is an NP-hard problem. Exact solutions to such problems are computationally prohibitive, hence metaheuristics are required. The clonal selection algorithm is a nature-inspired metaheuristic, and a multi-objective problem is


formulated based on minimizing the energy consumed and minimizing the link utilization while satisfying a few QoS constraints [21].
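
A minimal, generic clonal selection loop of the kind CSEM builds on is sketched below; it is a simplification with a scalarized objective and illustrative parameters, not the CSEM formulation of [21].

import random

def fitness(solution, energy_cost, load_cost, alpha=0.5):
    # Scalarized multi-objective: weighted sum of energy and link-utilization cost.
    return alpha * energy_cost(solution) + (1 - alpha) * load_cost(solution)

def clonal_selection(init_pop, energy_cost, load_cost,
                     generations=50, clones_per_best=5, mutation_rate=0.1):
    pop = list(init_pop)
    for _ in range(generations):
        pop.sort(key=lambda s: fitness(s, energy_cost, load_cost))
        best = pop[: len(pop) // 2]                  # select the fittest antibodies
        clones = []
        for s in best:
            for _ in range(clones_per_best):         # clone each selected antibody ...
                clones.append([bit ^ (random.random() < mutation_rate) for bit in s])  # ... and hypermutate
        pop = (best + clones)[: len(init_pop)]       # keep the population size fixed
    return min(pop, key=lambda s: fitness(s, energy_cost, load_cost))

# Toy example: a 0/1 vector deciding which links stay powered on.
links = 8
energy = lambda s: sum(s)                 # more powered links -> more energy
load = lambda s: links - sum(s) + 1       # fewer powered links -> higher utilization
population = [[random.randint(0, 1) for _ in range(links)] for _ in range(20)]
best = clonal_selection(population, energy, load)
print(best, fitness(best, energy, load))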

3.5 Dataset Description and Experimental Analysis

The datasets employed for the traffic classification were obtained from Kaggle and the Ant datasets [22]. To test the efficacy of the proposed traffic classification, three experiments were conducted. The first noted the change in accuracy when the traffic characteristics of the dataset were changed. The second demonstrated the impact on online classification accuracy when the majority and minority real classes were not taken into consideration in the training datasets. The final experiment checked whether the traffic features (inter-arrival time and packet length) change when a hybridized optimization algorithm is applied, i.e. whether classification accuracy can be improved via evolutionary algorithms (EA).

3.6 Result Analysis Traffic Classification

From Fig. 26 it can be concluded that the classifier with the lowest accuracy is FFNN; FFNN has lower accuracy because it handles a higher quantum of traffic. Instant Messaging has the highest classification accuracy owing to its lower traffic volume. Three types of ML algorithms are considered; the traffic-specific proportion percentage has been fetched for each, and the accuracy comparison of the three algorithms is presented in Fig. 27, which gives the complete picture. The pertinent question is whether implementing EA brings any appreciable change. The answer is presented in Fig. 28, which clearly shows that the inclusion of a hybrid EA improves accuracy over a single optimization algorithm. Thus, it can be concluded that the use of EA improves traffic classification considerably.

3.7 Simulation Setup

SNDLIB has been used to import the traffic dataset, which includes the traffic and link data needed to execute the simulation [23]. It is assumed that the chassis power and per-line power of an 8-port OFS are 100 W and 20 W respectively. The dataset came in a different format and had to be fitted properly into the MATLAB program; the mean and standard deviation of the data were computed to obtain the min and max values and to check the variation of the data. MATLAB was used to run the code for the CSEM algorithm.


Fig. 26 TP percentage of different ML algorithms based on QoS parameters

Fig. 27 Accuracy comparison among the different ML algorithms

3.8 Results Energy Minimization

To test the efficacy of the proposed algorithm, a comparison is made with the Genetic Algorithm (GA). All the simulations under consideration use an equivalent search space of 40 × 200 rows and columns, where each entry has a value of 0 or 1. For benchmarking GA and CSA, two pertinent parameters are considered: the mutation rate and the cloning rate. The performance of both GA and CSA is captured in Figs. 29, 30 and 31.
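
For reference, the commonly used textbook forms of the test functions named in Figs. 29, 30 and 31 are sketched below; the exact variants and dimensions used in the chapter's experiments may differ.

import math

def cross_in_tray(x, y):
    """Cross-in-tray test function (standard two-dimensional form)."""
    a = abs(math.sin(x) * math.sin(y)
            * math.exp(abs(100 - math.sqrt(x * x + y * y) / math.pi)))
    return -0.0001 * (a + 1) ** 0.1

def rosenbrock(xs):
    """Rosenbrock valley: global minimum 0 at (1, ..., 1)."""
    return sum(100 * (xs[i + 1] - xs[i] ** 2) ** 2 + (1 - xs[i]) ** 2
               for i in range(len(xs) - 1))

def de_jong(xs):
    """De Jong's first (sphere) function: global minimum 0 at the origin."""
    return sum(x * x for x in xs)

print(cross_in_tray(1.34941, 1.34941))          # approximately -2.06261 (a known optimum)
print(rosenbrock([1.0, 1.0]), de_jong([0.0, 0.0]))  # both 0 at their optima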


Fig. 28 Accuracy improvement using the optimization algorithm

Fig. 29 Response for Cross-in Tray function with CSA

4 Traffic Engineering in SDN

4.1 Introduction

Traffic management is necessary to maintain the overall performance of the network. Traffic Engineering (TE) ensures that the performance of the applications running in the


Fig. 30 Response for Rosenbrock function with CSA

Fig. 31 Response for De Jong’s function with CSA

network is maintained [4]. The goals of TE include: (i) minimizing the cost of the network, and (ii) minimizing the maximum link utilization. In this section, the placement of SDN nodes is crucial for TE [24]. Conventional network devices transfer data via the shortest path computed by Open Shortest Path First (OSPF). With shortest-path routing, not all links are utilized efficiently, which puts strain on some links that eventually become over-utilized [25]. This work


performs the proper placement of SDN nodes so that the maximum link utilization is minimized. SDN nodes are placed based on the traffic entering a node and the data received from other nodes. SDN nodes enable flow splitting and the traversal of non-shortest paths, thereby overcoming the inflexibility of conventional networks.

4.2 Open Issues

The placement of SDN nodes is crucial for TE, as it provides flexibility in the network [26, 27]. Most works on SDN node deployment have placed nodes where the edges are denser. The gaps in the literature on TE are summarized as follows:
• The traffic volume, which includes both the intermediate data arriving at a node and the node's own traffic requirements, needs to be considered for the placement of SDN nodes.
• SDN nodes need to be placed in a manner that does not cause loops in the network, given that centralized and decentralized paradigms run side by side in the network.

4.3 Intelligent Node Placement (INP)

Research works in h-SDN have shown that the preferable location for an SDN node is where the number of paths through the node is large. However, placing SDN nodes based only on a high number of paths is not suitable, as utilization is an important criterion that must not be overlooked. The proposed INP scheme uses the OSPF protocol and places SDN nodes using multiple parameters such as the degree of the node, the traffic matrix of the network and loop-free path formation [28].
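
A simplified sketch of such a scoring-based placement is given below; the scoring formula, weights and function names are assumptions made for illustration, not the INP algorithm of [28].

def node_score(node, degree, traffic_in, own_demand, w_deg=0.2, w_traffic=0.8):
    """Rank candidate nodes by degree and by the traffic volume they handle."""
    return w_deg * degree[node] + w_traffic * (traffic_in[node] + own_demand[node])

def place_sdn_nodes(nodes, degree, traffic_in, own_demand, budget, creates_loop):
    """Greedily upgrade the highest-scoring nodes, skipping any candidate that
    would create a routing loop between the OSPF and SDN forwarding paradigms."""
    chosen = []
    for n in sorted(nodes,
                    key=lambda n: node_score(n, degree, traffic_in, own_demand),
                    reverse=True):
        if len(chosen) == budget:
            break
        if not creates_loop(n, chosen):
            chosen.append(n)
    return chosen

# Toy example with three candidate nodes and a budget of two SDN upgrades.
deg = {"a": 3, "b": 2, "c": 4}
t_in = {"a": 50, "b": 80, "c": 10}
dem = {"a": 20, "b": 30, "c": 5}
print(place_sdn_nodes(["a", "b", "c"], deg, t_in, dem,
                      budget=2, creates_loop=lambda n, chosen: False))  # -> ['b', 'a']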

4.4 Simulation Platform

For the INP scheme, real-world topologies obtained from SNDLIB are used for the simulation [23]. The topologies imported from SNDLIB carry information pertaining to their traffic matrices. The topologies under consideration are Geant, Abilene and Atlanta, which have average node degrees of about 3.27, 2.5 and 4, with about 462, 132 and 210 demands respectively.


Fig. 32 Maximum link utilization in Abilene topology with INP scheme

4.5 Result Analysis

The performance indicator used for INP is the maximum link utilization (MLU). The INP scheme is compared both with a greedy technique and with traditional OSPF. In the greedy technique, SDN nodes are placed at the locations where the number of paths is highest. For Geant, Abilene and Atlanta, the numbers of SDN nodes required by the greedy and INP schemes are (6, 3, 4) and (4, 2, 2) respectively. OSPF performs worse than the greedy technique because SDN node placement plays a critical role in minimizing MLU. INP performs much better than both greedy and OSPF as the placement of SDN nodes is based on multiple parameters; in some situations the greedy technique outperforms the proposed INP technique due to its larger number of SDN nodes. INP reduces the average MLU by about 5–19%, as clearly demonstrated in Figs. 32, 33 and 34.

5 Conclusions and Future Work

With reference to the performance improvement in MapReduce demonstrated in Sect. 1, three works have been presented, namely reducer placement, intelligent compression, and handling of speculated map tasks. Placing the reducer at the node with the highest computation lowers the job completion time. However, certain issues were not addressed in this work and could be further investigated, such as finding the optimal


Fig. 33 Maximum link utilization in Atlanta topology with INP scheme

Fig. 34 Maximum link utilization in Geant topology with INP scheme


number of reducers for an application and how their concurrent execution affects the job completion time due to the unavailability of resources in the cluster. As far as intelligent compression is concerned, only a single compression algorithm, namely Lempel–Ziv–Oberhumer (LZO), was utilized for all the benchmarks. However, a single compression algorithm might not be suitable for all the different benchmarks, so a variety of compression algorithms and their suitability to different benchmarks, based on compression/decompression speeds, need to be studied precisely. Speculated map tasks were efficiently placed at nodes where the computation power was high and the network was prioritized via SDN. This work studied only slow map tasks; it did not consider slow reduce tasks and how they might be a major cause of performance deterioration for concurrently executing jobs. Furthermore, speculation during failures of map/reduce tasks needs to be duly investigated.

Section 2 proposes a variety of novel schemes for obtaining link discovery in SDN and h-SDN. These works focused on reducing link discovery messages. However, while minimizing messages, robust security features were not incorporated, and hence false topology information might be injected into the controller; this could make the controller vulnerable to attacks and limit its performance, and thus needs to be properly investigated. Also, the E-ICLF scheme did not consider resource sharing between multiple controllers, which might limit the performance of networks where the maintenance of global link information is a must.

Section 3 employs evolutionary algorithms for traffic classification in SDN. Only a few applications are considered for traffic classification; future work may focus on categorizing a greater number of applications, and the proposed traffic classification also needs to be tested on a variety of platforms such as Windows, Linux and iOS. The section also proposes a novel CSA algorithm for the joint optimization of energy minimization and load balancing in SDN while maintaining QoS constraints. This energy minimization algorithm was designed for datasets imported from SNDLIB; in future, the algorithm needs to be tested on varied network topologies and on networks comprising links with different capacities.

Section 4 proposes a novel INP scheme for the placement of SDN nodes in an incremental fashion in h-SDN, based on location and on the static traffic (outgoing/incoming) of nodes. The proposed placement of SDN nodes in h-SDN needs to be tested in settings where traffic is generated in real time, taking historical traffic matrices into account via machine learning algorithms for effective and improved TE in the network.


References

1. White, T.: Hadoop: The Definitive Guide. O'Reilly Media, Inc. (2012)
2. Hussain, M.W., Roy, D.S.: A counter-based profiling scheme for improving locality through data and reducer placement. In: Advances in Machine Learning for Big Data Analysis, pp. 101–118. Springer, Singapore (2022)
3. Chen, Q., Liu, C., Xiao, Z.: Improving MapReduce performance using smart speculative execution strategy. IEEE Trans. Comput. 63(4), 954–967 (2013)
4. Ashu, A., Hussain, M.W., Sinha Roy, D., Reddy, H.K.: Intelligent data compression policy for Hadoop performance optimization. In: International Conference on Soft Computing and Pattern Recognition, pp. 80–89. Springer, Cham (2019)
5. Hammoud, M., Sakr, M.F.: Locality-aware reduce task scheduling for MapReduce. In: 2011 IEEE Third International Conference on Cloud Computing Technology and Science, pp. 570–576. IEEE (2011)
6. Singh, A.P., Hemant Kumar, G., Paik, S.S., Sinha Roy, D.: Storage and analysis of synchrophasor data for event detection in Indian power system using Hadoop ecosystem. In: Data and Communication Networks, pp. 291–304. Springer, Singapore (2019)
7. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R.H., Stoica, I.: Improving MapReduce performance in heterogeneous environments. In: OSDI, vol. 8, no. 4, p. 7 (2008)
8. Hussain, M.W., Reddy, K.H., Roy, D.S.: A counter based approach for reducer placement with augmented Hadoop rack awareness. Turkish J. Electric. Eng. Comput. Sci. 29(1), 437–453 (2021)
9. Hussain, M.W., Reddy, K.H.K., Roy, D.S.: Resource aware execution of speculated tasks in Hadoop with SDN. Int. J. Adv. Sci. Technol. 28(13), 72–84 (2019)
10. Pakzad, F., Portmann, M., Tan, W.L., Indulska, J.: Efficient topology discovery in software defined networks. In: 2014 8th International Conference on Signal Processing and Communication Systems (ICSPCS), pp. 1–8. IEEE (2014)
11. Pakzad, F., Portmann, M., Tan, W.L., Indulska, J.: Efficient topology discovery in OpenFlow-based software defined networks. Comput. Commun. 77, 52–61 (2016)
12. Sinha, Y., Haribabu, K.: A survey: hybrid SDN. J. Netw. Comput. Appl. 100, 35–55 (2017)
13. Hussain, M.W., Sinha Roy, D.: Enabling indirect link discovery between SDN switches. In: Proceedings of the International Conference on Computing and Communication Systems, pp. 471–481. Springer, Singapore (2021)
14. Hussain, M.W., Moulik, S., Roy, D.S.: A broadcast based link discovery scheme for minimizing messages in software defined networks. In: 2021 IEEE Globecom Workshops (GC Wkshps), pp. 1–6. IEEE (2021)
15. Hussain, M.W., Reddy, K.H.K., Rodrigues, J.J., Roy, D.S.: An indirect controller-legacy switch forwarding scheme for link discovery in hybrid SDN. IEEE Syst. J. 15(2), 3142–3149 (2020)
16. Hussain, M.W., Khan, M.S., Reddy, K.H.K., Roy, D.S.: Extended indirect controller-legacy switch forwarding for link discovery in hybrid multi-controller SDN. Comput. Commun. 189, 148–157 (2022)
17. Nehra, A., Tripathi, M., Gaur, M.S., Battula, R.B., Lal, C.: SLDP: a secure and lightweight link discovery protocol for software defined networking. Comput. Netw. 150, 102–116 (2019)
18. Khan, S., Gani, A., Wahab, A.W.A., Guizani, M., Khan, M.K.: Topology discovery in software defined networks: threats, taxonomy, and state-of-the-art. IEEE Commun. Surv. Tutor. 19(1), 303–324 (2016)
19. Pradhan, B., Hussain, M.W., Srivastava, G., Debbarma, M.K., Barik, R.K., Lin, J.C.W.: A neuro-evolutionary approach for software defined wireless network traffic classification. IET Commun. (2022)
20. Zhu, R., Wang, H., Gao, Y., Yi, S., Zhu, F.: Energy saving and load balancing for SDN based on multi-objective particle swarm optimization. In: International Conference on Algorithms and Architectures for Parallel Processing, pp. 176–189. Springer, Cham (2015)


21. Hussain, M.W., Pradhan, B., Gao, X.Z., Reddy, K.H.K., Roy, D.S.: Clonal selection algorithm for energy minimization in software defined networks. Appl. Soft Comput. 96, 106617 (2020)
22. Ant Dataset (2010). https://ant.isi.edu/datasets/index.html
23. SNDLIB. http://sndlib.zib.de/home.action
24. Agarwal, S., Kodialam, M., Lakshman, T.V.: Traffic engineering in software defined networks. In: 2013 Proceedings IEEE INFOCOM, pp. 2211–2219. IEEE (2013)
25. Vissicchio, S., Vanbever, L., Rexford, J.: Sweet little lies: fake topologies for flexible routing. In: Proceedings of the 13th ACM Workshop on Hot Topics in Networks, pp. 1–7 (2014)
26. Guo, Y., Wang, Z., Yin, X., Shi, X., Wu, J.: Traffic engineering in SDN/OSPF hybrid network. In: 2014 IEEE 22nd International Conference on Network Protocols, pp. 563–568. IEEE (2014)
27. Caria, M., Jukan, A., Hoffmann, M.: A performance study of network migration to SDN-enabled traffic engineering. In: 2013 IEEE Global Communications Conference (GLOBECOM), pp. 1391–1396. IEEE (2013)
28. Hussain, M.W., Sinha Roy, D.: Intelligent node placement for improving traffic engineering in hybrid SDN. In: Advances in Communication, Devices and Networking, pp. 287–296. Springer, Singapore (2022)