Advances in Intelligent Systems and Computing 1442
Abhishek Bhattacharya Soumi Dutta Paramartha Dutta Vincenzo Piuri Editors
Innovations in Data Analytics Selected Papers of ICIDA 2022
Advances in Intelligent Systems and Computing Volume 1442
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Advisory Editors Nikhil R. Pal, Indian Statistical Institute, Kolkata, India Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba Emilio S. Corchado, University of Salamanca, Salamanca, Spain Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil Ngoc Thanh Nguyen , Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. Indexed by DBLP, INSPEC, WTI Frankfurt eG, zbMATH, Japanese Science and Technology Agency (JST). All books published in the series are submitted for consideration in Web of Science. For proposals from Asia please contact Aninda Bose ([email protected]).
Editors Abhishek Bhattacharya Institute of Engineering and Management Kolkata, West Bengal, India
Soumi Dutta Institute of Engineering and Management Kolkata, West Bengal, India
Paramartha Dutta Visva-Bharati University Shantiniketan, West Bengal, India
Vincenzo Piuri Università degli Studi di Milano Milan, Italy
ISSN 2194-5357 ISSN 2194-5365 (electronic) Advances in Intelligent Systems and Computing ISBN 978-981-99-0549-2 ISBN 978-981-99-0550-8 (eBook) https://doi.org/10.1007/978-981-99-0550-8 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Foreword
Welcome to the International Conference on Innovations in Data Analytics (ICIDA 2022), held on 29–30 November 2022 in Kolkata, India. As a premier conference in the field, ICIDA 2022 provided a highly competitive forum for reporting the latest developments in the research and application of data analytics and data mining. We are pleased to present the proceedings of the conference as its published record. The theme this year was the crossroads of data mining and data analytics, a topic that is quickly gaining traction in both academic and industrial discussions because of the relevance of privacy-preserving data mining (PPDM). ICIDA is a young conference for research in the areas of data science, big data, and data mining. Although 2022 was the debut year for ICIDA, it has already witnessed significant growth: the conference received a record 415 submissions. The authors of the submitted papers come from 21 countries and regions, and the authors of the accepted papers come from 11 countries. We hope that this program will further stimulate research in information security and data mining, and provide practitioners with better techniques, algorithms, and tools for deployment. We feel honored and privileged to bring you the best recent developments in the field of data mining through this exciting program. Dr. Satyajit Chakrabarti, President of IEM Group, Chief Patron, IEMIS 2022, India
Preface
This volume presents the proceedings of the International Conference on Innovations in Data Analytics (ICIDA 2022), which took place at the Eminent College of Management and Technology, India, on 29–30 November 2022. The volume appears in the series “Advances in Intelligent Systems and Computing” (AISC), published by Springer Nature, one of the largest and most prestigious scientific publishers; AISC is one of the fastest-growing book series in their program. The AISC series is meant to include high-quality and timely publications, primarily proceedings of relevant conferences, congresses, and symposia, but also monographs, on the theory, applications, and design methods of intelligent systems and intelligent computing. Virtually all disciplines, such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, and life science, are covered. The list of topics spans all the areas of modern intelligent systems and computing, such as computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, perception and vision, DNA and immune-based systems, self-organizing and adaptive systems, e-learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision-making and support, intelligent network security, trust management, interactive entertainment, and Web intelligence and multimedia, to mention just a few.
These areas are at the forefront of science and technology, and have been found useful and powerful in a wide variety of disciplines such as engineering, natural sciences, computer, computation and information sciences, ICT, economics, business, e-commerce, environment, health care, life science, and social sciences. The AISC book series is indexed by DBLP, INSPEC, WTI Frankfurt eG, zbMATH, Japanese Science and Technology Agency (JST). All books published in the series are submitted for consideration in Web of Science.
In this volume of “Advances in Intelligent Systems and Computing”, we present the results of studies on selected problems of data mining and data analytics, whose implementation is a contemporary answer to new challenges in the threat evaluation of complex systems. This book will be of great value to a wide variety of professionals, researchers, and students focusing on the human side of the Internet and on the effective evaluation of security measures, interfaces, user-centered design, and design for special populations, particularly the elderly. We hope this book is informative but, even more, that it is thought-provoking, inspiring readers to examine further questions, applications, and potential solutions in creating safe and secure designs for all. The Programme Committee of the ICIDA 2022 conference, its organizers, and the editors of these proceedings gratefully acknowledge the participation of all reviewers, who helped refine the contents of this volume and evaluated the conference submissions. Our thanks go to all our respected keynote speakers, Dr. Wen Cheng Lai, Dr. Saptarshi Ghosh, Dr. Nilanjan Dey, Prof. Carlos A. Coello Coello, Dr. Tanupriya Choudhury, and Dr. Debabrata Samanta, and to all our session chairs. Thanking all the authors who have chosen ICIDA 2022 as the publication platform for their research, we express our hope that their papers will help further developments in the design and analysis of engineering aspects of complex systems, serving as valuable source material for scientists, researchers, practitioners, and students who work in these areas.
Abhishek Bhattacharya (Kolkata, India)
Soumi Dutta (Kolkata, India)
Paramartha Dutta (Shantiniketan, India)
Vincenzo Piuri (Milan, Italy)
Contents
Computational Intelligence

Transience in COVID Patients with Comorbidity Issues—A Systematic Review and Meta-Analysis Based on Indian and Southeast Asian Context . . . . . 3
Sharmistha Dey and Mauparna Nandan

NFT HUB—The Decentralized Non-Fungible Token Market Place . . . . . 17
P. Divya, B. Bharath Sudha Chandra, Y. Harsha Vardhan, T. Ananth Kumar, S. Jayalakshmi, and R. Parthiban

Hashgraph: A Decentralized Security Approach Based on Blockchain with NFT Transactions . . . . . 33
P. Divya, S. Rajeshwaran, R. Parthiban, T. Ananth Kumar, and S. Jayalakshmi

Med Card: An Innovative Way to Keep Your Medical Records Handy and Safe . . . . . 51
Abhishek Goel, Mandeep Singh, Jaya Gupta, and Nancy Mangla

Delta Operator-Based Modelling and Control of High Power Induction Motor Using Novel Chaotic Gorilla Troop Optimizer . . . . . 61
Rahul Chaudhary, Tapsi Nagpal, and Souvik Ganguli

An Enhanced Optimize Outlier Detection Using Different Machine Learning Classifier . . . . . 71
Himanee Mishra and Chetan Gupta

Prediction of Disease Diagnosis for Smart Healthcare Systems Using Machine Learning Algorithm . . . . . 85
Nidhi Sabre and Chetan Gupta

Optimization Accuracy on an Intelligent Approach to Detect Fake News on Twitter Using LSTM Neural Network . . . . . 99
Kanchan Chourasiya, Kapil Chaturvedi, and Ritu Shrivastava
Advance Computing

Mining User Interest Using Bayesian-PMF and Markov Chain Monte Carlo for Personalised Recommendation Systems . . . . . 115
Bam Bahadur Sinha and R. Dhanalakshmi

Big Data and Its Role in Cybersecurity . . . . . 131
Faheem Ahmad, Shafiqul Abidin, Imran Qureshi, and Mohammad Ishrat

QR Code-Based Digital Payment System Using Visual Cryptography . . . . . 145
Surajit Goon, Debdutta Pal, Souvik Dihidar, and Subham Roy

A Study of Different Approaches of Offloading for Mobile Cloud Computing . . . . . 159
Rajani Singh, Nitin Pandey, Deepti Mehrotra, and Devraj Mishra

Use of Machine Learning Models for Analyzing the Accuracy of Predicting the Cancerous Diseases . . . . . 169
Shanthi Makka, Gagandeep Arora, Sai Sindhu Theja Reddy, and Sunitha Lingam

Predict Employee Promotion Using Supervised Classification Approaches . . . . . 181
Mithila Hoq, Papel Chandra, Shakayet Hossain, Sudipto Ghosh, Md. Ifran Ahamad, Md. Shariar Rahman Oion, and Md. Abu Foyez

Smart Grid Analytics—Analyzing and Identifying Power Distribution Losses with the Help of Efficient Data Regression Algorithms . . . . . 193
Amita Mishra, Babita Verma, Smita Agarwal, and K. Subhashini Spurjeon

Load Balancing on Cloud Using Genetic Algorithms . . . . . 203
Kapila Malhotra, Rajiv Chopra, and Ritu Sibal

Network Security and Telecommunication

Hybrid Feature Selection Approach to Classify IoT Network Traffic for Intrusion Detection System . . . . . 211
Sanskriti Goel and Puneet Jai Kaur

A Deep Learning-Based Framework for Analyzing Stress Factors Among Working Women . . . . . 225
Chhaya Gupta, Sangeeta, Nasib Singh Gill, and Preeti Gulia

Automatic Question Generation . . . . . 235
Abhishek Phalak, Shubhankar Yevale, Gaurav Muley, and Amit Nerurkar

Automatic Construction of Named Entity Corpus for Adverse Drug Reaction Prediction . . . . . 245
Samridhi Dev and Aditi Sharan
A Space Efficient Metadata Structure for Ranking Subset Sums . . . . . 257
Biswajit Sanyal, Subhashis Majumder, and Priya Ranjan Sinha Mahapatra

Brain Tumour Detection Using Machine Learning . . . . . 277
Samriddhi Singh

Implementation of a Smart Patient Health Tracking and Monitoring System Based on IoT and Wireless Technology . . . . . 285
Addagatla Prashanth, Kote Rahulsree, and Panasakarla Niharika

Diabetes Disease Prediction Using KNN . . . . . 293
Makarand Shahade, Ashish Awate, Bhushan Nandwalkar, and Mayuri Kulkarni

Review: Application of Internet of Things (IoT) for Vehicle Simulation System . . . . . 307
Rishav Pal, Arghyadeep Hazra, Subham Maji, Sayan Kabir, and Pravin Kumar Samanta

Data Science and Data Analytics

Time Series Analysis and Forecast Accuracy Comparison of Models Using RMSE–Artificial Neural Networks . . . . . 317
Nama Deepak Chowdary, Tadepally Hrushikesh, Kusampudi Madhava Varma, and Shaik Ali Basha

A Non-Recursive Space-Efficient Blind Approach to Find All Possible Solutions to the N-Queens Problem . . . . . 327
Suklav Ghosh and Sarbajit Manna

Handling Missing Values Using Fuzzy Clustering: A Review . . . . . 341
Jyoti, Jaspreeti Singh, and Anjana Gosain

Application of Ensemble Methods in Medical Diagnosis . . . . . 355
Ramya Shree, Suraj Madagaonkar, Lakshmi Aashish Prateek, Alan Tony, M. V. Rathnamma, V. Venkata Ramana, and K. Chandrasekaran

Some Modified Activation Functions of Hyperbolic Tangent (TanH) Activation Function for Artificial Neural Networks . . . . . 369
Arvind Kumar and Sartaj Singh Sodhi

Advancements and Challenges in Smart Shopping System . . . . . 393
Mamta and Suman Sangwan

Ensemble Model for Music Genre Classification . . . . . 407
Kriti Singhal, Shubham Chawla, Arnav Agarwal, Pranshu Goyal, Abhiroop Agarwal, and Prashant Singh Rana
Decoding Low-Code/No-Code Development Hype—Study of Rapid Application Development Worthiness and Overview of Various Platforms . . . . . 419
Nisha Jaglan and Divya Upadhyay

Pattern Recognition

IoT Framework for Quality-of-Service Enhancement in Building Management System . . . . . 431
Uma Tomer and Parul Gandhi

Blockchain IoT: Challenges and Solutions for Building Management System . . . . . 439
Uma Tomer and Parul Gandhi

Evolving Connections in Metaverse . . . . . 451
M. Ranjitha, M. O. Divya, Anju Treesa Thomas, Prarthana Ponnath, and Juthy Shaji

A Systematic Approach on Blockchain Security Methodologies in IoT . . . . . 467
Monika and Brij Mohan Goel

A Systematic Analysis on Airborne Infectious Virus Diseases: A Review . . . . . 489
Sapna Kumari and Munish Bhatia

Comparison of Different Similarity Methods for Text Categorization . . . . . 499
Ulligaddala Srinivasarao, R. Karthikeyan, Mohammad J Bilal, and Shanmugasundaram Hariharan

The Upsurge of Deep Learning for Disease Prediction in Healthcare . . . . . 511
Aman and Rajender Singh Chhillar

SDN-Based Cryptographic Client Authentication: A New Approach to DHCP Starvation Mitigation . . . . . 519
Gilbert C. Urama, Chiagoziem C. Ukwuoma, Md Belal Bin Heyat, Victor K. Agbesi, Faijan Akhtar, Muhammed Amin Abdullah, Nitish Pathak, Neelam Sharma, and Soumi Dutta

Information Retrieval

Power of Deep Learning Models in Bioinformatics . . . . . 535
Preeti Thareja and Rajender Singh Chhillar

Deep Neural Network with Optimal Tuned Weights for Automated Crowd Anomaly Detection . . . . . 543
Rashmi Chaudhary and Manoj Kumar

Tuning Geofencing in Child Kidnapping Prevention Methods . . . . . 565
Parul Arora and Suman Deswal
Segmentation of the Eye Fundus Images Using Edge Detection, Improved Image, and Clustering of Density in Diabetic Retinopathy . . . . . 577
Abhijeet Kumar, Naveen Kumar, and Khushboo Singh

Securing Medical Images Using Quantum Key Distribution Scheme BB84 . . . . . 585
Siddhartha Roy and Anushka Ghosh

Design and Development of AUV for Coral Reef Inspection and Geotagging Using CV/ML . . . . . 595
Austin Davis and Surekha Paneerselvam

Theoretical Approach of Proposed Algorithm for Channel Estimation in OFDM MIMO System . . . . . 611
Chirag and Vikas Sindhu

An Approach for Digital-Social Network Analysis Using Twitter API . . . . . 625
Erita Cunaku, Jona Ndrecaj, Shkurte Berisha, Debabrata Samanta, Soumi Dutta, and Abhishek Bhattacharya

Author Index . . . . . 637
About the Editors
Dr. Abhishek Bhattacharya is Assistant Professor at the Institute of Engineering & Management, India. He completed his Ph.D. (CSE) at BIT, Mesra. He was certified as a Publons Academy Peer Reviewer in 2020. His research interests are data mining, cyber-security, and mobile computing. He has published 25 conference and journal papers with Springer, IEEE, IGI Global, Taylor & Francis, etc. He has contributed 3 chapters to Taylor & Francis Group and EAI volumes. He is a peer reviewer and TPC member for different international journals. He was an editor of IEMIS 2020, IEMIS 2018, and special issues of IJWLTT. He is a member of several technical bodies such as IEEE, IFERP, MACUL, SDIWC, Internet Society, ICSES, ASR, AIDASCO, USERN, IRAN, and IAENG. He has published 3 patents. Dr. Soumi Dutta is Associate Professor at the Institute of Engineering & Management, India. She completed her Ph.D. (CST) at IIEST, Shibpur. She received her B.Tech. (IT) and M.Tech. (CSE), securing 1st position (Gold Medallist), from MAKAUT. She was certified as a Publons Academy Peer Reviewer in 2020 and as a Certified Microsoft Innovative Educator in 2020. Her research interests are data mining, information retrieval, online social media analysis, and image processing. She has published 30 conference and journal papers with Springer, IEEE, IGI Global, Taylor & Francis, etc. She has contributed 5 chapters to Taylor & Francis Group and IGI Global volumes. She is a peer reviewer and TPC member for different international journals. She was an editor of CIPR 2020, CIPR 2019, IEMIS 2020, CIIR 2021, IEMIS 2018, and special issues of IJWLTT. She is a member of several technical bodies such as ACM, IEEE, IFERP, MACUL, SDIWC, Internet Society, ICSES, ASR, AIDASCO, USERN, IRAN, and IAENG. She has published 3 patents. Dr. Soumi Dutta has delivered 15 keynote talks at different international conferences.
She has been awarded the Rashtriya Shiksha Ratna Award, the InSc Research Education Excellence Award, and the International Teacher Award 2020–2021 by the Ministry of MSME, Government of India. Dr. Paramartha Dutta is currently Professor in the Department of Computer and System Sciences at Visva-Bharati University, Shantiniketan, India. He earned his Bachelor's and Master's degrees in Statistics from ISI, Kolkata, India. Subsequently, he earned a Master
of Technology in Computer Science from ISI, Kolkata, India. He received his Ph.D. (Engineering) from BESU, Shibpur, India. He is co-author of eight authored books, apart from thirteen edited books, and more than 240 research publications in peer-reviewed journals and conference proceedings. He is co-inventor of 17 published patents. He is a Fellow of IETE, the Optical Society of India, and IEI; a Senior Member of ACM, IEEE, the Computer Society of India, and the International Association for Computer Science and Information Technology; and a Member of the Advanced Computing and Communications Society, the Indian Unit of Pattern Recognition and AI (the Indian affiliate of the International Association for Pattern Recognition), ISCA, the Indian Society for Technical Education, and the System Society of India. Vincenzo Piuri received his Ph.D. in computer engineering from the Polytechnic of Milan, Italy (1989). He has been Full Professor in computer engineering at the University of Milan, Italy, since 2000. He has been Associate Professor at the Polytechnic of Milan, Italy, Visiting Professor at the University of Texas at Austin, USA, and Visiting Researcher at George Mason University, USA. His main research interests are artificial intelligence, computational intelligence, intelligent systems, machine learning, pattern analysis and recognition, signal and image processing, biometrics, intelligent measurement systems, industrial applications, digital processing architectures, fault tolerance, cloud computing infrastructures, and the Internet of Things. His original results have been published in 400+ papers in international journals, proceedings of international conferences, books, and chapters. He is a Fellow of the IEEE, Distinguished Scientist of ACM, and Senior Member of INNS.
He is President of the IEEE Systems Council (2020–21) and IEEE Region 8 Director-elect (2021–22), and has been IEEE Vice President for Technical Activities (2015), IEEE Director, President of the IEEE Computational Intelligence Society, Vice President for Education of the IEEE Biometrics Council, Vice President for Publications of the IEEE Instrumentation and Measurement Society and the IEEE Systems Council, and Vice President for Membership of the IEEE Computational Intelligence Society. He has been Editor-in-Chief of the IEEE Systems Journal (2013-19). He is Associate Editor of the IEEE Transactions on Cloud Computing and has been Associate Editor of the IEEE Transactions on Computers, the IEEE Transactions on Neural Networks, the IEEE Transactions on Instrumentation and Measurement, and IEEE Access. He received the IEEE Instrumentation and Measurement Society Technical Award (2002) and the IEEE TAB Hall of Honor (2019). He is Honorary Professor at Obuda University, Hungary; Guangdong University of Petrochemical Technology, China; Northeastern University, China; Muroran Institute of Technology, Japan; Amity University, India; and Galgotias University, India.
Computational Intelligence
Transience in COVID Patients with Comorbidity Issues—A Systematic Review and Meta-Analysis Based on Indian and Southeast Asian Context Sharmistha Dey and Mauparna Nandan
Abstract The past years have been very important for human civilization, and full of lessons. During the pandemic, human civilization witnessed a massive change in lifestyle, which had a disruptive impact on physical as well as mental well-being. As every situation in our lives may be treated as either a reward or a lesson, we have learned the importance of immunity in fighting any disease. Recent studies on COVID have observed that the mortality rate of COVID patients varies with different comorbidities such as hypertension, diabetes, COPD, cancer, and other lung-related diseases. The after-effects for patients with comorbidity issues are also much more severe than for patients with no comorbidity issues. A systematic review has been performed to analyze the effect of comorbidity issues on COVID patients compared with the general population. In this review, we cover three major comorbid conditions: diabetes, COPD, and incurable diseases such as cancer or HIV. We selected several peer-reviewed articles from popular databases such as PubMed, BMJ, Google Scholar, and The Lancet, and performed a meta-analysis based on several comorbidity factors and the risks associated with them. Keywords Adaptive immunity · Comorbidity · Diabetes mellitus · COPD · Cytokine storm
1 Introduction During the recent lockdown phase, we witnessed a massive change in our lifestyle as well as a major paradigm shift in medical research on our physical and mental health. According to the statistics available up to August 2020, the number of COVID-affected persons was nearly 1.93 million, and the total death count S. Dey (B) Chandigarh University, Ajitgarh, Punjab, India e-mail: [email protected] M. Nandan Haldia Institute of Technology, Haldia, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_1
3
4
S. Dey and M. Nandan
Fig. 1 Different types of immunities (Natural and Adaptive Immunities)
was nearly 8 million [1]. The report published by WHO shows that the after-effects of COVID have a much greater impact on comorbid patients than on normal ones [2–5]. The immunity of a person plays a major role in recovery from COVID, and this may affect mortality. Factors like age, genetic history, and infection problems can reduce a person's normal immunity. To battle such diseases, the body relies on different types of immunity, which can be affected by comorbidity issues. The following figure (see Fig. 1) shows the different types of immunity. According to a survey by the American Psychological Association (2020), there are certain reasons why an individual's immunity may become damaged [5]. Mental health issues like stress, anxiety, and depression are among the prominent factors in losing immunity, as they weaken our digestive, nervous, and respiratory systems. Diabetes is another killer disease that has a harmful impact on COVID patients by weakening the immune system. Diabetes, incurable diseases like HIV or cancer, COPD, and mental health conditions all reduce immunity and hence pose greater risks for COVID patients. In the next section, we shall discuss the risk factors of COVID related to a few vulnerable diseases.
1.1 Risk of COVID Patients with Comorbidity Issues SARS-CoV-2 infects persons of all ages; however, people above 60 years of age, as well as those with comorbidities such as diabetes, chronic respiratory disease, and cardiovascular disease, are at a higher risk of contracting an infection. Due to reduced phagocytic cell capacities, people with diabetes are more likely to contract infections. COVID-19 is also made more likely in diabetes patients for a number
Transience in COVID Patients with Comorbidity Issues …
5
of other reasons. A Mendelian randomization study indicated that a higher level of ACE2 receptors is causally associated with diabetes; this may predispose patients with diabetes to SARS-CoV-2 infection [6]. The Furin proprotein convertase in the phagocytic cell aids the virus's entry into the host cell by reducing SARS-CoV-2's reliance on human proteases, which in turn activates the SARS-CoV-2 spike (S) protein that attaches to ACE-2 receptors. This pre-activation of the S protein permits the virus to enter the cell and avoid detection by the human immune system [7]. As a result, a dysregulated immune response characterized by elevated ACE-2 receptor expression may result in greater lung inflammation and reduced insulin levels. For diabetic individuals, the virus's easy entry creates a life-threatening condition [8]. Patients with diabetes account for 11–58% of all COVID-19 patients, according to new data, and diabetic patients have an 8% COVID-19 fatality rate [7]. COVID-19 patients with diabetic comorbidity had a 14.2% higher probability of ICU admission than those without diabetes [9].
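Risk estimates of the kind quoted above (e.g., higher ICU-admission rates for diabetic patients) are usually derived per study from a 2×2 contingency table of exposure (comorbidity) versus outcome, and then pooled in a meta-analysis. A minimal Python sketch of the single-study step, computing an odds ratio with its 95% confidence interval; the counts used here are purely hypothetical and are not taken from the reviewed studies:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and 95% CI from a 2x2 table:
    a = exposed with outcome,   b = exposed without outcome,
    c = unexposed with outcome, d = unexposed without outcome."""
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) via the Woolf method
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical example: 30 of 100 diabetic COVID patients admitted to ICU,
# versus 20 of 200 non-diabetic patients.
or_, lo, hi = odds_ratio_ci(30, 70, 20, 180)
print(f"OR = {or_:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

In a full meta-analysis, the per-study log odds ratios would then be combined with inverse-variance weights to obtain a pooled estimate.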
2 COVID Patient Statistics with Comorbidity Issues Though this research field is very new, researchers are actively working in this area. Some studies have established a relationship between immunity and death rate using comorbidity factors. In different countries, people have shown different mortality rates according to age, physical condition, and immunity level or comorbidity factors. The following study was performed by the Government of the United States of America (shown in Fig. 2), where we can observe the region-wise COVID death rate and confirmed-case rate in the USA, which gives us an idea of the demographics of the cases.
Fig. 2 Region-wise death rate and confirmed cases of COVID-19
S. Dey and M. Nandan
spreading of the disease area-wise. In the following figure, the data are shown according to the regional category of people. Like other countries, India saw COVID cases rise from the end of March to August last year. After a complete lockdown and social distancing, cases were under control from August 2020. This year, the second wave of COVID has practically devastated India, and the mortality rate is much higher. In the following diagrams, statistics for the five most affected Indian states are presented for both 2020 and 2021. The most affected states or union territories in India are Delhi, Maharashtra, Kerala, Uttar Pradesh, and West Bengal. The figures depict the COVID case scenario for the period during which cases were at their peak. Delhi was one of the most affected areas in India in both years, whereas Uttar Pradesh had minimal cases in 2020 but far more in 2021. West Bengal followed the same pattern: deaths as well as the overall infection rate in 2020 were much lower than in 2021. These statistics help to understand the overall picture of COVID cases in different states (shown in Figs. 3, 4, 5, 6, 7 and 8). These diagrams give an idea of the overall COVID scenario and the death or cured cases in different states of India. Among the death cases of COVID-19,
Fig. 3 COVID cases in Delhi (May–August 2020 and March–June 2021): Confirmed, cured, and death cases
Fig. 4 COVID cases in Kerala (May–August 2020 and March–June 2021): Cured cases, death, and confirmed cases
Transience in COVID Patients with Comorbidity Issues …
Fig. 5 COVID cases in Maharashtra (May–August 2020 and March–June 2021): Cured cases, death, and confirmed cases
Fig. 6 COVID cases in Uttar Pradesh (May–August 2020 and March–June 2021): Cured cases, death, and confirmed cases
Fig. 7 COVID cases in West Bengal (May–August 2020 and March–June 2021): Cured cases, death, and confirmed cases
comorbidity has been noted as a prime factor. Many deaths have been recorded among patients with high diabetes, heart disease, COPD, etc. [10–24]. Among COVID patients, it has been observed that the infection rate of comorbid patients is much higher, and immunity levels may also vary with the natural immunity determined by genetic factors. Because of its high prevalence, diabetes is a common comorbidity in patients with coronavirus. Diabetes prevalence in COVID-19 patients differs by geographical region, age, and ethnicity, as one might expect. It is unclear whether diabetics with well-controlled
Fig. 8 COVID cases according to different racial categories (shows the natural immunity level)
blood glucose levels are at a higher risk of severe acute respiratory syndrome coronavirus infection. COVID-19 infection is more likely to affect people with uncontrolled medical conditions such as diabetes, hypertension, and lung, liver, and renal illness; cancer patients on chemotherapy; smokers; transplant recipients; and patients on long-term steroids. Studies reveal that patients with diabetes who are affected by COVID-19 are more prone to a worse prognosis and a higher mortality rate. Hypertension (15.8%), cardiovascular and cerebrovascular disorders (11.7%), and diabetes (9.4%) are the most common comorbidities observed in patients with COVID-19. Coexisting HIV and hepatitis B infection (1.5%), malignancy (1.5%), respiratory illnesses (1.4%), renal disorders (0.8%), and immune deficiencies (0.01%) were the less prevalent comorbidities. Comorbidity generally affects natural immunity and can greatly influence the death or severity rate of COVID cases. The following diagram (see Fig. 9) helps us understand the severity issues related to comorbid diseases and gives statistics for the different comorbidity conditions. The next section discusses our search criteria and article selection procedure.
Fig. 9 COVID-affected cases with different comorbidity diseases
3 Research Methodology

We have performed a systematic review of several popular medical journal databases such as PubMed, BMJ, and The Lancet, as well as several popular conference papers.
3.1 Search Strategy

We performed a systematic review to identify all studies that examined mortality factors for comorbidities related to COVID-19, in accordance with the PRISMA Statement methodology checklist [23]. A total of 530 papers were initially retrieved from different databases. After an initial screening by title, 250 papers remained. After checking topic relevance and studying the full papers, 70 papers were selected, and from those 70 we finally chose 12 papers for qualitative analysis, based on their relevance to the purpose of our study and on data availability. The following figure presents the PRISMA flow we followed for our analysis (see Fig. 10), prepared as per the PRISMA Statement guidelines [25].

Criteria of our study. We initially selected a popular medical database for choosing our articles. We chose mostly papers from 2020 and 2021, since papers from these two years best cover the aspects of COVID. For immunity-related issues, we chose papers from 2013 to 2021, and for basics such as the comorbidity indexes we also selected papers from earlier years. The main niche of our study is comorbidities, and we have segmented the targeted papers according to these criteria.
Fig. 10 Prisma statement to demonstrate article selection process for the review
Study Quality. We rated the selected papers in three categories: good, fair, and poor. Articles that match more than 50% of the purpose of our study were rated good and selected for primary study after identification and screening. Papers that do not satisfy the purpose of our study directly but support the basis of our idea were included under the fair category, and some supporting facts were taken from them. Papers that do not match our study purpose but have title-level similarities were categorized as poor. With this research methodology, we have analyzed the research going on in this field. The next section gives a complete view of several works performed in this area.
4 Related Literature Survey in This Area

Several researchers have studied COVID comorbidity issues and diseases such as high diabetes, heart and lung disease, and COPD, which reduce immunity and drive up the mortality rate of patients suffering from COVID-19. Some researchers have emphasized the development of the internal immune system [2], whereas many have focused on the immunity index they either followed or developed in their research [8, 9, 12, 26]. The authors of [2] present a hypothesis for developing a powerful immune system against COVID-19. They discuss increasing Cellular Adenosine Triphosphate (C-ATP) for potential improvement of our immunity, arguing that a person's inner immunity can help fight COVID-19. According to the researchers, C-ATP can stop the cytokine storm, a major cause of severe COVID-19. In their hypothesis, they biologically illustrate why some people are more prone to COVID-19 and reach a more vulnerable condition. Chronic medical illnesses such as chronic heart, pulmonary, and kidney disease may lower immunity because the body cannot produce C-ATP in sufficient amounts [7]. The major comorbidity factors are high diabetes, lung disease, renal problems, and diseases such as cancer and other oncological disorders, which can directly affect the body's immunity. The authors of [27] discuss diabetes as a comorbid factor for COVID patients [10]. According to their research, patients with high diabetes are more prone to a high COVID-19 mortality rate [28]. The researchers raised some research questions and addressed them with the support of a dataset and analysis.
They used a predictive model to analyze the impact of glucose levels on COVID patients, but approached the problem only through specific research questions rather than the overall impact of diabetes as a vital comorbidity factor. Researchers from multiple countries [29] are working on comorbidity factors. One important analysis was performed in [30], which estimated the risk factors for different comorbidities from a Korean health insurance claims database using the Charlson comorbidity index. However, their study did not consider age as a constraint, although it is a prime factor in comorbidity [13]. Zhao et al. [31] discussed the estimation of different risk factors for COVID-19 mortality related to comorbidity, covering chronic respiratory disease and cardiovascular disease. The major limitation of their study is that it did not report accurate prevalence by gender and age, or age-wise comorbidities, which may affect the result [12, 32–40]. Most of the studies covered here are from China and Southeast Asia, but the authors of [41–46] studied Italy, where the death rate was among the highest in the 2020 pandemic. Italy has a predominantly older population, and it has been observed that the fatality rate among old people is much higher even without comorbidity issues; with comorbidity issues, the risk is much greater [46]. They
Table 1 Literature survey based on comorbid diseases (diabetes, COPD, hypertension, and cancer)

| Comorbidity types | References | Geographical area covered | Mortality rate | Sample size |
|---|---|---|---|---|
| High diabetes | (i) Gupta et al. [10]; (ii) Guan et al. [15] | China | (i) With 95% CI, odds ratio of mortality rate 1.35–4.05; (ii) hypertension 19.3%, pulmonary disease 12.6%, diabetes mellitus 11.9% | (i) 201 (age range 43–60); (ii) 1590 patients from 575 hospitals (mean age 47 years) |
| High diabetes | Cho et al. [13] | South Korea | With 92% CI, odds ratio of mortality rate 1.60–3.75 | 3095 male and 4495 female |
| Cardiovascular disease, chronic respiratory problem | Caramelo et al. [12] | China | Cardiovascular disease (adjusted odds ratio = 12.83); chronic respiratory disease OR = 7.7925 | 504 deaths in 20,812 confirmed cases (age groups: above 60, 60 to above 30, and below 30) |
| Hypertension, organ dysfunction, diabetes | Wu et al. [15] | China | 19% higher death rate than patients in normal condition (patients with diabetes); 13.7% higher death rate for patients with hypertension | 201 patients aged 43–60 years |
| COPD, heart disease, cancer | Moccia et al. [31] | China, Italy | Diabetes 33.9%, COPD 13.7%, hypertension 73.8%, ischemic heart disease 30%, cancer (last 5 years) 19% | N.A. |
| COPD | Zhao et al. [34] | South East Asia (five databases) | Pre-existing COPD: 60% mortality; persons with smoking habit: 95% CI 1.29–3.05 | 2002 patients |
have basically performed a comparative study over a few journals and case reports and identified the comorbidity issues according to age and gender. Table 1 summarizes the impact of several comorbidity factors such as diabetes, COPD, kidney dysfunction, and cancer-like diseases on COVID patients and how they raise the overall mortality rate compared with normal conditions. From the table, we can conclude that diseases such as diabetes, COPD, other cardiovascular diseases, cancer, and HIV can increase mortality and infection rates beyond those of patients in normal condition. Several prediction models can estimate the mortality
rates of COVID patients. In the following sections, we discuss several prediction models and comorbidity indexes that have been used in this research area.
5 Discussion

Researchers have focused on comorbidity factors such as hypertension, diabetes mellitus, cardiovascular diseases, and diseases like cancer or HIV as the major comorbid factors responsible for high mortality rates in COVID patients, and certain popular comorbidity indexes exist to measure those comorbidities. Researchers have defined different comorbidity indexes based on certain criteria and diagnosis processes [44]; some of them are described below:

• Charlson Comorbidity Index (CCI)—a survival-rate calculation tool that assesses the comorbidity risk associated with several health conditions, predicting ten-year survival. The index covers 17 patient health conditions such as pulmonary disease, kidney dysfunction, diabetes mellitus, and HIV. Some researchers have used this index to determine the impact of comorbidity on COVID-19 mortality [13].

• Elixhauser Comorbidity Measure (ECM)—developed for use in administrative patient databases to predict hospital charges and in-hospital mortality, using the records of 1,779,167 patients in a hospital database. It is one of the most commonly used indexes in comorbidity research.

• Multimorbidity Index (MMI)—a validated index comprising 40 conditions in the population, developed based on health-related quality of life. The MMI is calculated in two ways: by counting morbidities, or by weighting morbidities and analyzing them with regression techniques.
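To make the CCI-style scoring concrete, the following is a minimal illustrative sketch of how a weighted comorbidity score is computed. The condition names and the subset of weights shown are only for demonstration; the validated CCI covers 17 conditions and includes age adjustments, so this is not a clinical tool.

```python
# Illustrative subset of Charlson-style condition weights (demonstration only;
# the validated index has 17 conditions plus age adjustment).
CCI_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "copd": 1,
    "diabetes_uncomplicated": 1,
    "diabetes_with_complications": 2,
    "renal_disease": 2,
    "any_malignancy": 2,
    "metastatic_cancer": 6,
    "aids_hiv": 6,
}

def charlson_score(conditions):
    """Sum the weights of a patient's recorded comorbid conditions."""
    return sum(CCI_WEIGHTS.get(c, 0) for c in conditions)

patient = ["copd", "diabetes_with_complications", "renal_disease"]
print(charlson_score(patient))  # 1 + 2 + 2 = 5
```

A higher total indicates greater comorbidity burden; studies such as [13] relate such scores to COVID-19 mortality.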
6 Conclusion and Future Scope

Several studies have been performed over the last two years to correlate the mortality rate of COVID patients having comorbidity issues such as high diabetes, COPD, organ dysfunction, and other fatal diseases that generally reduce immunity. Most studies make clear that high diabetes, COPD, heart disease, and cancer-like diseases are the prime comorbid factors contributing to COVID-19 mortality and severity. Most of the studies, however, were performed on patients from China or other Southeast Asian countries, and factors beyond comorbid disease may play a major role in overall immunity, such as lifestyle and drinking, smoking, or eating habits; these are also very significant contributors to overall immunity.
Still, all the comorbidity issues covered here have an impact on COVID patients. This area leaves ample scope for budding researchers, and in the future advanced algorithms such as deep learning may be used to build more specific prediction models for these comorbidity factors.
References

1. C. Zhang, H. He, L. Wang, N. Zhang, H. Huang, Q. Xiong et al., Virus-triggered ATP release limits viral replication through facilitating IFN-β production in a P2X7-dependent manner. J. Immunol. 199(4), 1372–1381 (2017)
2. F.T. Hesary, H. Akbari, The powerful immune system against powerful COVID-19: a hypothesis. Med. Hypotheses 140, 109762 (2021)
3. V. Firoozeh, Geyrayeli, S. Milne, C. Cheung et al., COPD and the risk of poor outcomes in COVID-19: a systematic review and meta-analysis. EClinicalMedicine 33, 100789 (2021). https://doi.org/10.1016/j.eclinm.2021.100789
4. L. Zeming, L. Jinpeng, H. Jianglong, G. Liang, G. Rongfen et al., Association between diabetes and COVID-19: a retrospective observational study with a large sample of 1,880 cases in Leishenshan hospital, Wuhan. Frontiers Endocrinol. 11 (2020). https://doi.org/10.3389/fendo.2020.00478
5. H. Nadia, 6 signs you have a weakened immune system (2020). https://www.pennmedicine.org/up-dates/blogs/health-and-wellness. Accessed May 20, 2022, 11:00 AM, India
6. Things that suppress your immune system, WebMD. https://www.webmd.com/cold-and-flu/ss/slideshow-how-you-suppress-immune-system/. Accessed May 20, 2022, 11:30 AM, India
7. S. Cherri, S. Noventa, A. Zaniboni, Is the oncologic patient more susceptible to covid19 but perhaps less likely to undergo severe infection-related complications due to fewer cytokines storm as a consequent of the associated immunodeficiency? Med. Hypotheses 141, 109758 (2020)
8. E. Prompetchara, C. Ketloy, T. Palaga, Immune responses in COVID-19 and potential vaccines: lessons learned from SARS and MERS epidemic. Asian Pac. J. Allergy Immunol. (2020)
9. R. Gupta, A. Hussain, A. Misra, Diabetes and COVID-19: evidence, current status and unanswered research questions. Eur. J. Clin. Nutr. 74(2), 864–870 (2020)
10. J. Yang, Y. Zheng, X. Gou, K. Pu, Z. Chen, Q. Guo et al., Prevalence of comorbidities in the novel Wuhan coronavirus (COVID-19) infection: a systematic review and meta-analysis. Int. J. Infect. Dis. (2020). https://doi.org/10.1016/j.ijid.2020.03.017
11. F. Caramelo, N. Ferreira, B. Oliveiros, Estimation of risk factors for COVID-19 mortality—preliminary results. medRxiv (2020). https://doi.org/10.1101/2020.02.24.20027268 (preprint)
12. S.I. Cho, S. Yoon, H.J. Lee, Impact of comorbidity burden on mortality in patients with COVID-19 using the Korean health insurance database. Scientific Reports 11, 6375 (2021). https://doi.org/10.1038/s41598-021-85813-2
13. C. Wang, P.W. Hornby, F.G. Hayden, G.F. Gao, A novel coronavirus outbreak of global health concern. Lancet 395(10223), 470–473. https://doi.org/10.1016/S0140-6736(20)30185-9 (preprint)
14. W.J. Guan et al., Comorbidity and its impact on 1590 patients with COVID-19 in China: a nationwide analysis. Eur. Respir. J. 55(5), 201–208 (2020). https://doi.org/10.1183/13993003.00547-2020
15. H. Quan, V. Sundararajan, P. Halfon, A. Fong, B. Burnand, J.C. Luthi, L.D. Saunders, C.A. Beck, T.E. Feasby, W.A. Ghali, Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med. Care 43(11), 1130–1139 (2005). https://doi.org/10.1097/01.mlr.0000182534.19832.83
16. C. Wu, X. Chen, Y. Cai, J. Xia, X. Zhou, S. Xu et al., Risk factors associated with acute respiratory distress syndrome and death in patients with coronavirus disease 2019 pneumonia in Wuhan, China. JAMA Intern. Med. (2020). https://doi.org/10.1001/jamainternmed.2020.0994
17. M. Pourhomayoun, M. Shakibi, Predicting mortality risk in patients with COVID-19 using artificial intelligence to help medical decision-making. medRxiv (2020) (published online April 1). https://doi.org/10.1101/2020.03.30.20047308
18. V. Jain, J.M. Yuan, Predictive symptoms and comorbidities for severe COVID-19 and intensive care unit admission: a systematic review and meta-analysis. Int. J. Public Health 65(5), 533–546. https://doi.org/10.1007/s00038-020-01390-7
19. M. Mudatsir et al., Predictors of COVID-19 severity: a systematic review and meta-analysis. F1000Res. 9, 1107 (2020). https://doi.org/10.12688/f1000research.26186.2
20. Y. Tjendra, A.F. Al Mana, A.P. Espejo, Y. Akgun, N.C. Millan, C. Gomez-Fernandez, C. Cray, Predicting disease severity and outcome in COVID-19 patients: a review of multiple biomarkers. Arch. Pathol. Lab. Med. 144(12), 1465–1474 (2020). https://doi.org/10.5858/arpa.2020-0471-SA
21. A. Sanyaolu et al., Comorbidity and its impact on patients with COVID-19. SN Comprehensive Clinical Med. 2(1), 1069–1076 (2020). https://doi.org/10.1007/s42399-020-00363-4
22. D. Moher, A. Liberati, J. Tetzlaff, D. Altman, Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Ann. Intern. Med. 151, 264–269 (2009)
23. M. Goodwin, What is a coronavirus? https://www.healthline.com/health/coronavirustypes. Accessed May 22, 2022, 11:00 AM, India
24. A. Jain, Coronavirus—what is it? How does it spread? What are the symptoms and cure? https://www.indiatimes.com/trending/human-interest/coronavirus-symptoms-cure-medicines-sars-505173.html. Accessed May 20, 2022, 11:23 AM, India
25. Y. Gao, Structure of the RNA-dependent RNA polymerase from COVID-19 virus. 368(6492), 779–782 (2021). https://doi.org/10.1126/science.abb7498
26. Y. Chen, Q. Liu, D. Guo, Emerging coronaviruses: genome structure, replication, and pathogenesis. J. Med. Virology
27. Five factors that affect the immune system. https://www.kemin.com/na/en-us/blog/human-nutrition/five-factors-that-affect-immune-system. Accessed May 23, 2022, 10:00 PM, India
28. S. Nadeem, Coronavirus COVID-19: available free literature provided by various companies. J. Organ. Around World J. Ong. Chem. Res. 5(1), 7–13 (2020), Document ID: 2020JOCR37. https://doi.org/10.5281/zenodo.3722904
29. L. Wiegner, D. Hange, C. Björkelund, G. Ahlborg, Prevalence of perceived stress and associations to symptoms of exhaustion, depression and anxiety in a working age population seeking primary care—an observational study. BMC Fam. Pract. 16, 38 (2015)
30. B. Wang, R. Li, Z. Lu, Y. Huang, Does comorbidity increase the risk of patients with COVID-19: evidence from meta-analysis. Aging 12(7) (2020)
31. Q. Zhao, M. Meng, R. Kumar, Y. Wu, J. Huang, N. Lian, Y. Deng, S. Lin, The impact of COPD and smoking history on the severity of COVID-19: a systemic review and meta-analysis. J. Med. Virol. 92(10), 1915–1921 (2020). https://doi.org/10.1002/jmv.25889
32. COVID-19 immunity: 15% immune instead of expected 40% in Sweden; what immunological dark matter means and more. https://www.firstpost.com/health/covid-19-immunity-15-immune-instead-of-expected-40-in-sweden-what-immunological-dark-matter-means-and-more-8701191.html
33. M. Caminati et al., BCG vaccination and COVID-19: much ado about nothing? Med. Hypotheses 144 (2020). https://doi.org/10.1016/j.mehy.2020.110109
34. Y. Shi, Y. Wang, C. Shao et al., COVID-19 infection: the perspectives on immune responses. Cell Death Differ. 27, 1451–1454 (2020). https://doi.org/10.1038/s41418-020-0530-3
35. A. Miller et al., Correlation between universal BCG vaccination policy and reduced morbidity and mortality for COVID-19: an epidemiological study. https://doi.org/10.1101/2020.03.24.2004293
36. S. Dey, C. Chakraborty, Emotional intelligence—creating a new roadmap for artificial intelligence. Int. J. Eng. Simul. Model. Syst. 12(4), 291–300 (2021)
37. K. Newman, Study: high blood pressure, obesity are most common comorbidities in COVID-19 patients. https://www.usnews.com/news/healthiest-communities/articles/2020-04-22/obesity-hypertension-most-common-comorbidities-for-coronavirus-patients. Accessed May 24, 2022, 11:00 AM, India
38. https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-by-Sex-Age-and-S/9bhghcku/data
39. F.T. Hesary, H. Akbari, The powerful immune system against powerful COVID-19: a hypothesis. Med. Hypotheses 140, 109762 (2020). https://doi.org/10.1016/j.mehy.2020.109762
40. R. Ganji, P.H. Reddy, Impact of COVID-19 on mitochondrial-based immunity in aging and age-related diseases. Front. Aging Neurosci. 12, 614650 (2020). https://doi.org/10.3389/fnagi.2020.614650. PMID: 33510633; PMCID: PMC7835331
41. Y. Xu, D.J. Baylink, C.S. Chen, M.E. Reeves, J. Xiao, C. Lacy, E. Lau, H. Cao, The importance of vitamin D metabolism as a potential prophylactic, immunoregulatory and neuroprotective treatment for COVID-19. J. Transl. Med. 18(1), 322. https://doi.org/10.1186/s12967-020-02488-5
42. K. Holder, P.H. Reddy, The COVID-19 effect on the immune system and mitochondrial dynamics in diabetes, obesity, and dementia. Neuroscientist 26 (2020). https://doi.org/10.1177/1073858420960443
43. K.T. Bajgain, S. Badal, B.B. Bajgain, M.J. Santana, Prevalence of comorbidities among individuals with COVID-19: a rapid review of current literature. Am. J. Infect. Control 49(2), 238–246 (2021). https://doi.org/10.1016/j.ajic.2020.06.213
44. D. Ferreira-Santos, P. Maranhão, M. Monteiro-Soares, Identifying common baseline clinical features of COVID-19: a scoping review. BMJ Open 10(9). https://doi.org/10.1136/bmjopen-2020-04107
45. Charlson Comorbidity Index. https://www.mdapp.co/charlson-comorbidity-index-cci-calculator-131/. Accessed May 25, 2022, 10:30 AM, India
46. A.K. Singh, R. Gupta, A. Ghosh, A. Misra, Diabetes in COVID-19: prevalence, pathophysiology, prognosis, and practical considerations. Diabetes Metab. Syndr. Clin. Res. Rev. 14(4), 303–310 (2014)
NFT HUB—The Decentralized Non-Fungible Token Market Place P. Divya, B. Bharath Sudha Chandra, Y. Harsha Vardhan, T. Ananth Kumar, S. Jayalakshmi, and R. Parthiban
Abstract In today's digital environment, content ownership is a huge issue. No social networking platform offers monetization for material such as artwork, photographs, novels, GIFs, and memes. This project enables digital creators and owners to produce and sell customized crypto assets that serve as proof of ownership and help detect and prevent counterfeiting and copying of digital content, thereby raising the ownership standards for digital content. A non-fungible token can be made out of any digital file. NFTs are files tracked using the same blockchain technology that supports cryptocurrencies such as Bitcoin and Ethereum; this system allows buyers and sellers to keep track of who owns which files. A Non-Fungible Token (NFT) is a non-transferable data unit that may be bought and traded on a digital ledger known as a blockchain. Web content such as photography, videos, and audio files can be linked to NFT data units. NFTs differ from exchangeable cryptocurrencies like Bitcoin in that each token is uniquely identifiable. Using the NFT marketplace, we can identify the owners of digital content and provide remuneration to its creators. Digital assets, such as cryptocurrencies and tokens, are supported by blockchain technology, and tokens are typically created using smart contracts on top of the blockchain network. Blockchain is a new technique that can solve many problems at once, like killing ten birds with one stone, so separate answers need not be devised for each difficulty. In the next 5–10 years, blockchain technology is expected to be embraced by some 20 different sorts of companies. Blockchain is a decentralized system that solves the issues that centralized systems have, combining peer-to-peer networks with cryptographic techniques to create a safe and secure platform for application development.
P. Divya (B) · B. B. S. Chandra · Y. H. Vardhan · T. Ananth Kumar · S. Jayalakshmi · R. Parthiban IFET College of Engineering, Villupuram, India e-mail: [email protected] T. Ananth Kumar e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_2
Keywords Non-fungible tokens · Blockchain · Polygon network · Secure hash algorithm · Ethereum
1 Introduction

1.1 Non-Fungible Token

The Non-Fungible Token (NFT) concept is based on the Ethereum token standard, which differentiates every token using a unique identifier [1]. These tokens act as unique identifiers and can be connected to virtual/digital properties. These cryptographic assets, also known as non-fungible tokens (NFTs), are based on blockchain technology and contain unique identifiers and information that distinguish them from one another. Real-world assets, such as artwork and real estate, can be represented by such tokens [2]. NFTs are blockchain-based cryptographic tokens that are one of a kind and cannot be duplicated. Real-world tangible goods may be "tokenized" to make buying, selling, and exchanging them more efficient while also decreasing the risk of fraud. All of the aforementioned assets may be freely exchanged using NFTs at values configurable depending on their age, rarity, liquidity, and other factors [3]. NFTs have significantly expedited the growth of the decentralized application (DApp) industry. A Non-Fungible Token is a crypto asset created using Ethereum's smart contracts.
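The core semantics of non-fungibility — every token has a unique identifier mapped to exactly one owner — can be illustrated with a minimal in-memory sketch. This is not the project's smart contract; the class and method names below are illustrative, loosely modeled on ERC-721-style operations.

```python
import itertools

class NFTRegistry:
    """Toy ERC-721-style registry: every token ID is unique and maps to
    exactly one owner, which is what makes tokens non-fungible."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._owner = {}       # token_id -> owner address
        self._metadata = {}    # token_id -> URI of the digital asset

    def mint(self, owner, token_uri):
        token_id = next(self._next_id)   # unique identifier per token
        self._owner[token_id] = owner
        self._metadata[token_id] = token_uri
        return token_id

    def owner_of(self, token_id):
        return self._owner[token_id]

    def transfer(self, sender, recipient, token_id):
        if self._owner[token_id] != sender:
            raise PermissionError("only the current owner may transfer")
        self._owner[token_id] = recipient

reg = NFTRegistry()
art = reg.mint("0xAlice", "ipfs://artwork.png")
reg.transfer("0xAlice", "0xBob", art)
print(reg.owner_of(art))  # 0xBob
```

In a real deployment this bookkeeping lives in a smart contract on the blockchain, so the owner mapping is replicated and tamper-resistant rather than held in one process.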
1.2 Blockchain

Blockchain functions as an immutable ledger, enabling decentralized transactions [4]. In a blockchain, all committed transactions are maintained in a list of blocks, which may be thought of as a public ledger. Asymmetric cryptography and distributed consensus approaches ensure user security and ledger consistency [5]. Decentralization, persistence, anonymity, and auditability are all significant elements of blockchain technology. The key elements of blockchain are:

1. Distributed Ledger Technology
2. Immutable Records
3. Smart Contracts.
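The immutability described above comes from each block storing the hash of its predecessor, so any change to history breaks the chain. The following toy sketch (not the system's actual implementation, and using SHA-256 from Python's standard library for illustration) shows this tamper-evidence:

```python
import hashlib
import json

def block_hash(block):
    """Hash a block's canonical JSON encoding with SHA-256."""
    encoded = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()

def append_block(chain, transactions):
    """Link a new block to the tip of the chain via prev_hash."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})

def verify(chain):
    """Each block must store the hash of its predecessor; any edit to an
    earlier block changes that hash and invalidates every later link."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = []
append_block(chain, [{"from": "alice", "to": "bob", "amount": 3}])
append_block(chain, [{"from": "bob", "to": "carol", "amount": 1}])
print(verify(chain))                        # True
chain[0]["transactions"][0]["amount"] = 99  # tamper with history
print(verify(chain))                        # False
```

A real blockchain adds consensus and peer-to-peer replication on top of this hash-linking, so no single party can rewrite the ledger.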
2 Literature Survey

Non-fungible tokens (NFTs) have grown in popularity in 2021, with billions of dollars in transaction volume, and market intelligence software has been developed to track summary information on pricing and sales activity across several NFT collections [6]. We demonstrate how marketplace design affects market intelligence: bidding expenses, transaction costs, the number of selling bots, and the interface for making bids all influence prices. We build an empirical model of the dynamic interplay between sellers and buyers using data from the CryptoPunks marketplace. According to counterfactual simulations, lowering bidding costs has no effect on the number of sales, but it does raise the percentage of sales resulting from bids [7]. Listing prices climb as sellers expect more bids, making products appear more desirable. The disparity between the listing-to-realized-price ratios of rare and common assets narrows, giving the market a more uniform appearance. Collections listed on two independent marketplaces may show radically different market statistics because of differences in bidding costs rather than differences in fundamental merit [8]. The findings have implications for the interpretation of NFT market intelligence. Because of the rapid growth of blockchain technology and the growing demand for partial decentralization of the Internet, the technology underlying the blockchain has come under wide scrutiny [9]. Along with decentralized products, Ethereum, a programmable money system, is gaining traction. Smart contracts, however, jeopardize security in the guise of decentralization. With a vast number of users, Ethereum faces a potentially fatal problem: negligent contract drafting by users poses a threat to the Ethereum network as a whole [10]. As a result, the purpose of the essay is to investigate and expand the uses of smart contracts on the Ethereum blockchain.
We start with the fundamentals of Ethereum's structure, then move on to smart contract security issues. Finally [11], an effective smart contract auction application is created, which helps to further consolidate and understand smart contracts in practice. The use of modern technology to keep track of sensitive papers relating to education, health, or money has skyrocketed; it helps prevent attackers from gaining illegal access to data [12]. All current advanced technologies, however, have certain limitations due to inherent unpredictability: these systems have flaws in terms of privacy, security, transparency, reliability, and flexibility. These properties are required when dealing with sensitive data, such as school or medical certificates. As a result, the article [13] constructed an industry-oriented blockchain application on the Remix Ethereum blockchain to administer medical certificates. The application also makes use of a distributed application that employs an RPC-based test Ethereum blockchain as well as a user expert system as a knowledge agent [14]. A significant strength of that study is the use of a logistic-map encryption cipher on existing medical certificates while uploading them to the blockchain. The tool assists in the quick examination of birth, death, and sickness rates by characteristics such as location and year.
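To illustrate the logistic-map cipher idea mentioned above: the logistic map x_{k+1} = r·x_k·(1 − x_k) behaves chaotically for r near 4, so iterating it from a secret seed yields a keystream that can be XORed with the data. The sketch below is a toy illustration under assumed parameters, not the cited study's implementation, and such ad hoc chaotic ciphers are not considered cryptographically secure.

```python
def logistic_keystream(seed, r, n):
    """Generate n pseudo-random bytes by iterating the logistic map
    x_{k+1} = r * x_k * (1 - x_k) and quantizing each state to a byte."""
    x, out = seed, []
    for _ in range(n):
        x = r * x * (1 - x)
        out.append(int(x * 256) % 256)
    return bytes(out)

def xor_cipher(data, seed=0.61803, r=3.9999):
    """Encrypt/decrypt by XOR with the chaotic keystream; XOR is its own
    inverse, so the same seed and r recover the plaintext."""
    stream = logistic_keystream(seed, r, len(data))
    return bytes(b ^ s for b, s in zip(data, stream))

plain = b"medical certificate #1042"
cipher = xor_cipher(plain)
assert xor_cipher(cipher) == plain  # symmetric: same call decrypts
```

The (seed, r) pair acts as the shared key; only a party holding it can regenerate the keystream and decrypt the certificate after downloading it from the chain.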
P. Divya et al.
Fig. 1 Block diagram
3 Proposed System

Nowadays, content creators do not get appropriate recognition and credit on social media platforms [15]. In this project, we provide a revenue-generating platform for content creators that reduces the gas fee compared to the existing system. The previous system utilized the Ethereum blockchain network, which has a higher gas charge [16]. So, in this project, we use the Polygon network-based Ethereum blockchain to bring the gas fee down to less than about 10% of its previous value.
3.1 Blockchain

Blockchain technology creates a decentralized database that is safe and secure, as well as simple to read. It facilitates buyer-seller interaction on a peer-to-peer basis [17]. It makes the entire procedure transparent, and the data is unchangeable once entered. Our proposed solution is a low-cost, revenue-generating platform for content creators. It also supports Ethereum-based NFTs, allowing users to trade their NFTs on the Ethereum blockchain (Fig. 1).
NFT HUB—The Decentralized Non-Fungible Token Market Place
Fig. 2 Structure of Ethereum blockchain
3.2 Ethereum Blockchain

The blockchain is a type of digital ledger. Ethereum technology is well suited to digital product exchange and is simple to integrate with artificial intelligence and virtual currency [18]. The industrial revolution is characterized by communicating, sharing, opening, and releasing information or data via cloud computing systems, analyzing big data, and using smart devices to collect information and enhance people's well-being through the application of Information and Communications Technology (ICT) (Fig. 2).
3.3 Properties of Blockchain

1. Distributed Systems
2. Peer-to-Peer Networks
3. Distributed Ledger
4. Append-Only
5. Updatable via Consensus (Fig. 3)
3.4 Generic Elements of the Blockchain

1. Address
2. Transaction
3. Peer-to-Peer Network
Fig. 3 Generic structure of blockchain
Fig. 4 Structure of blockchain
4. Transaction
5. Scripting or Programming Language
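To make the append-only linkage of these generic elements concrete, here is a minimal sketch (an illustrative assumption, not a production design) of hash-linked blocks: each block stores the hash of its predecessor, so altering any earlier block invalidates every later link.

```python
# Minimal hash-linked block sketch: illustrative only.
import hashlib
import json

def block_hash(block: dict) -> str:
    # Canonical serialization so equal blocks hash identically.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def make_block(transactions, prev_hash):
    return {"transactions": transactions,
            "prev_hash": prev_hash,
            "timestamp": 0}  # fixed timestamp for reproducibility

genesis = make_block(["genesis"], "0" * 64)
block1 = make_block(["alice -> bob: 1 NFT"], block_hash(genesis))

# Tampering with the genesis block breaks the chain link:
tampered = dict(genesis, transactions=["forged"])
assert block1["prev_hash"] == block_hash(genesis)
assert block1["prev_hash"] != block_hash(tampered)
```

The append-only property follows from this linkage: rewriting history would require recomputing every subsequent block's hash, which consensus rejects.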
3.5 Secure Hash Algorithm

Security has grown in importance as a research topic in recent years, and many cryptographic algorithms have been created to improve the performance of information-protection techniques [19]. A hash function such as SHA-1 is a cryptographic technique that does not require a key. The National Institute of Standards and Technology (NIST) standard specifies the secure hash algorithms SHA-1, SHA-224, SHA-256, SHA-384, and SHA-512. During data transmission, cryptographic hash algorithms are used to generate the key stream [20]. A hash function produces a fixed-length output from an arbitrary-length message input [18]. Because it is a one-way function, it is computationally hard to recover the input message from a hash value, and it is likewise computationally infeasible to find a second message that produces the identical hash result. These criteria are crucial in ensuring that a hash algorithm functions effectively (Fig. 4).

Table 1 SHA-256 hash function characteristics

  Characteristic              SHA-256
  The hash value's size       256
  Ct, number of constants     64
  The message block's size    512
  Size of the words           32
  Digest round number         64
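The fixed-length, one-way behaviour described above can be demonstrated with Python's standard hashlib module:

```python
# SHA-256 maps inputs of any length to a 256-bit digest.
import hashlib

short = hashlib.sha256(b"a").hexdigest()
long_ = hashlib.sha256(b"a" * 10_000).hexdigest()
assert len(short) == len(long_) == 64  # 256 bits = 64 hex characters

# A small change in the message yields an unrelated digest (avalanche effect).
assert hashlib.sha256(b"abc").hexdigest() != hashlib.sha256(b"abd").hexdigest()
```

There is no practical way to run this mapping backwards; recovering a message from its digest requires brute-force search.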
3.5.1 SHA-256 (Secure Hash Algorithm)
The SHA-256 algorithm has two stages: pre-processing and hash computation. Pre-processing consists of padding the message, parsing the padded message into m-blocks, and setting the initialization values to be used in the hash computation. Hash computation generates a message schedule from the padded message and repeatedly applies that schedule together with functions, constants, and word operations to produce the hash value; the digested message is evaluated using the resulting hash value [12] (Table 1). The message scheduler computes the words Mt of SHA-256. For 0 ≤ Tr ≤ 15, the message word is taken directly from the input message, while for 16 ≤ Tr ≤ 63 the word Mt is computed using Eq. (1), where
Table 2 SHA-256 initial hash values

  Register  Buffer initialization
  h0        32'h6a09e667
  h1        32'hbb67ae85
  h2        32'h3c6ef372
  h3        32'ha54ff53a
  h4        32'h510e527f
  h5        32'h9b05688c
  h6        32'h1f83d9ab
  h7        32'h5be0cd19
Tr denotes the transformation round number, RRn(l) denotes the right rotation of l by n bits, and RSHn(l) denotes the right shift of l by n bits. The message scheduler of SHA-256 computes Mt as

Mt = input message word t, for 0 ≤ Tr ≤ 15
Mt = σ1^256(Mt−2) + Mt−7 + σ0^256(Mt−15) + Mt−16, for 16 ≤ Tr ≤ 63  (1)

where

σ0^256(l) = RR7(l) ⊕ RR18(l) ⊕ RSH3(l)  (2)

σ1^256(l) = RR17(l) ⊕ RR19(l) ⊕ RSH10(l)  (3)
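Equations (1), (2), and (3) can be sketched directly in Python, assuming 32-bit words; the rotations and shifts below are the RRn and RSHn operations defined in the text.

```python
# Message schedule of SHA-256, Eqs. (1)-(3).
MASK32 = 0xFFFFFFFF

def rr(l, n):   # right rotation of l by n bits, RRn(l)
    return ((l >> n) | (l << (32 - n))) & MASK32

def rsh(l, n):  # right shift of l by n bits, RSHn(l)
    return l >> n

def sigma0(l):  # Eq. (2)
    return rr(l, 7) ^ rr(l, 18) ^ rsh(l, 3)

def sigma1(l):  # Eq. (3)
    return rr(l, 17) ^ rr(l, 19) ^ rsh(l, 10)

def message_schedule(block):
    """Expand sixteen 32-bit input words into the 64-word schedule Mt."""
    m = list(block)                 # Mt for 0 <= Tr <= 15
    for t in range(16, 64):         # Eq. (1); additions are mod 2**32
        m.append((sigma1(m[t - 2]) + m[t - 7]
                  + sigma0(m[t - 15]) + m[t - 16]) & MASK32)
    return m
```

Each expanded word mixes four earlier words, so a change in any input word propagates through the whole schedule.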
Parameters s, t, u, v, w, x, y, and z are assigned the initial hash values h0 to h7, as indicated in Table 2. The initial hash values are 32-bit words. To update the SHA-256 hash state, the 32-bit constant Ct takes 64 values (Table 2).

The length of the hash function affects the security of the SHA-256 hash algorithm. Pre-processing is the first stage of the SHA-256 hash algorithm: the source message is padded. After the message input is received, padding appends a single 1-bit at the end of the message, followed by n 0-bits until the message length is congruent to 448 modulo 512. The final 64 bits are allotted for encoding the message length, so each block of the padded input is 512 bits. Hash computation consists of four functions applied over rounds Tr = 0 to Tr = 63. The four functions are Ch(a, b, c), Maj(s, t, u), Σ0(d), and Σ1(f), where ∧, ¬, and ⊕ denote the logical AND, NOT, and XOR operations:

Ch(a, b, c) = (a ∧ b) ⊕ (¬a ∧ c)  (4)

Maj(s, t, u) = (s ∧ t) ⊕ (s ∧ u) ⊕ (t ∧ u)  (5)
Σ0(d) = RR2(d) ⊕ RR13(d) ⊕ RR22(d)  (6)

Σ1(f) = RR6(f) ⊕ RR11(f) ⊕ RR25(f)  (7)
To evaluate the four functions of Eqs. (4), (5), (6), and (7), hash computation maintains eight working variables initialized with the hash values: s, t, u, v, w, x, y, and z. The 64 iterative rounds combine the constant Ct and the message word Mt. In each round the working variables are updated using the calculations below:

Tmp1 = z + Σ1(w) + Ch(w, x, y) + Ct + Mt
Tmp2 = Σ0(s) + Maj(s, t, u)
z = y
y = x
x = w
w = v + Tmp1
v = u
u = t
t = s
s = Tmp1 + Tmp2

After 64 rounds, the hash values h0 to h7 are updated using the modulo-2^32 adders:

h0 = s + h0, h1 = t + h1, h2 = u + h2, h3 = v + h3
h4 = w + h4, h5 = x + h5, h6 = y + h6, h7 = z + h7

MessageDigest = h0 || h1 || h2 || h3 || h4 || h5 || h6 || h7
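The complete flow (padding, message schedule, 64 compression rounds, final addition) can be sketched end to end and checked against Python's hashlib. The working-variable names s..z follow the text; the round constants Ct are derived from the cube roots of the first 64 primes, as specified in FIPS 180-4.

```python
# Runnable SHA-256 sketch following the notation in the text.
import hashlib
import struct

MASK32 = 0xFFFFFFFF

def rr(l, n):  # right rotation RRn(l)
    return ((l >> n) | (l << (32 - n))) & MASK32

def _primes(n):
    ps, c = [], 2
    while len(ps) < n:
        if all(c % p for p in ps):
            ps.append(c)
        c += 1
    return ps

def _frac_cbrt(p):
    # First 32 bits of the fractional part of p**(1/3), via integer bisection.
    target, lo, hi = p << 96, 0, 1 << 41
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        lo, hi = (mid, hi) if mid ** 3 <= target else (lo, mid)
    return lo & MASK32

CT = [_frac_cbrt(p) for p in _primes(64)]      # round constants Ct

def sha256(msg: bytes) -> bytes:
    h = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,   # Table 2
         0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]
    # Pre-processing: append a 1-bit, zero-pad to 448 mod 512, then
    # the 64-bit message length, giving 512-bit blocks.
    bitlen = 8 * len(msg)
    msg += b"\x80" + b"\x00" * ((55 - len(msg)) % 64) + struct.pack(">Q", bitlen)
    for off in range(0, len(msg), 64):
        m = list(struct.unpack(">16I", msg[off:off + 64]))
        for t in range(16, 64):                             # Eq. (1)
            s0 = rr(m[t-15], 7) ^ rr(m[t-15], 18) ^ (m[t-15] >> 3)
            s1 = rr(m[t-2], 17) ^ rr(m[t-2], 19) ^ (m[t-2] >> 10)
            m.append((s1 + m[t-7] + s0 + m[t-16]) & MASK32)
        s, t_, u, v, w, x, y, z = h
        for t in range(64):                                 # 64 rounds
            ch = (w & x) ^ (~w & y)                         # Eq. (4)
            maj = (s & t_) ^ (s & u) ^ (t_ & u)             # Eq. (5)
            S0 = rr(s, 2) ^ rr(s, 13) ^ rr(s, 22)           # Eq. (6)
            S1 = rr(w, 6) ^ rr(w, 11) ^ rr(w, 25)           # Eq. (7)
            tmp1 = (z + S1 + ch + CT[t] + m[t]) & MASK32
            tmp2 = (S0 + maj) & MASK32
            z, y, x = y, x, w
            w = (v + tmp1) & MASK32
            v, u, t_ = u, t_, s
            s = (tmp1 + tmp2) & MASK32
        h = [(a + b) & MASK32 for a, b in zip(h, [s, t_, u, v, w, x, y, z])]
    return b"".join(struct.pack(">I", v) for v in h)        # h0 || ... || h7

assert sha256(b"abc") == hashlib.sha256(b"abc").digest()
```

The final assertion confirms the sketch matches the standard library's SHA-256 on a sample input.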
3.5.2 Design of SHA-256
The Bitcoin platform's hashing and mining technique is the Secure Hash Algorithm SHA-256, a cryptographic hash that generates a 256-bit digest. It is in charge of the creation and management of addresses, including the validation of transactions. The SHA-256 technique is still hard to crack. Furthermore, SHA-256, like SHA-512, is among the most widely utilized algorithms, since it computes faster than most other secure designs [18] (Fig. 5).
Fig. 5 Computation diagram of SHA-256
4 Results and Discussion

4.1 Deployment Confirmation

Once the contract is deployed, the transaction is recorded on the Etherscan network (Figs. 6 and 7).
Fig. 6 Computation diagram of SHA-256
Fig. 7 Status of contract creation
4.2 Metamask Wallet

The MetaMask wallet used for the accounts and transactions is shown in Fig. 8.
4.3 Minting Page

The minting (selling) page for NFTs is shown in Fig. 9.
4.4 Transaction Address

The transaction hash values (token IDs) generated for each transaction are shown in Fig. 10.
Fig. 8 Metamask wallet
Fig. 9 Minting (selling) NFTs
Fig. 10 Hash values (token IDs)
Fig. 11 Dashboard (created NFTs)
4.5 Dashboard

The dashboard listing the created NFTs is shown in Fig. 11.
5 Conclusion

The proposed NFT marketplace is decentralized. With the aid of this prototype, we can improve data interpretation standards and create a safe NFT marketplace ecosystem by preventing data breaches in NFT marketplaces. This protects people's data from hackers and avoids ransomware assaults. Finally, the NFT method allows artists to get credit for and market their work.
5.1 Future Scope of the Project

NFTs are expanding faster than we anticipate. Their popularity is at an all-time high, attracting significant capital from a variety of industries, including e-commerce, and opening up a slew of opportunities to boost popular adoption of Dapps [4]. It is fair to assume that you will be tokenizing your prized possessions in the future. In terms of scope, NFTs have undoubtedly benefited intellectual property and licensing. Users will soon be able to buy the goods they want using non-fungible tokens and then borrow against them utilizing decentralized finance. Furthermore, in almost every industry, NFTs are on their way to finding a multitude of opportunities, liquidity, and value.
References

1. S. Nakamoto, Bitcoin: A Peer-to-Peer Electronic Cash System
2. I. Bashir, Mastering Blockchain (Packt Publishing Ltd., 2017)
3. Z. Zheng, S. Xie, H. Dai, X. Chen, H. Wang, An overview of blockchain technology: architecture, consensus, and future trends, in 2017 IEEE International Congress on Big Data (BigData Congress) (IEEE, 2017), pp. 557–564
4. S. Raval, Decentralized Applications: Harnessing Bitcoin's Blockchain Technology (O'Reilly Media, Inc., 2016)
5. W. Cai, Z. Wang, J.B. Ernst, Z. Hong, C. Feng, V.C. Leung, Decentralized applications: the blockchain-empowered software system. IEEE Access 6, 53019–53033 (2018)
6. A. Bogner, M. Chanson, A. Meeuw, A decentralised sharing app running a smart contract on the Ethereum blockchain, in Proceedings of the 6th International Conference on the Internet of Things (2016), pp. 177–178
7. W. Rehman, H. Zainab, J. Imran, N.Z. Bawany, NFTs: applications and challenges, in 2021 22nd International Arab Conference on Information Technology (ACIT) (IEEE, 2021), pp. 1–7
8. K.B. Muthe, K. Sharma, K.E.N. Sri, A blockchain based decentralized computing and NFT infrastructure for game networks, in 2020 Second International Conference on Blockchain Computing and Applications (BCCA) (IEEE, 2020), pp. 73–77
9. S. Casale-Brunet, P. Ribeca, P. Doyle, M. Mattavelli, Networks of Ethereum non-fungible tokens: a graph-based analysis of the ERC-721 ecosystem, in 2021 IEEE International Conference on Blockchain (Blockchain) (IEEE, 2021), pp. 188–195
10. M. Nadini, L. Alessandretti, F. Di Giacinto, M. Martino, L.M. Aiello, A. Baronchelli, Mapping the NFT revolution: market trends, trade networks, and visual features. Sci. Rep. 11(1), 1–11 (2021)
11. J. Saravanan, R. Rajendran, P. Muthu, D. Pulikodi, R.R. Duraisamy, Performance analysis of digital twin edge network implementing bandwidth optimization algorithm. Int. J. Comput. Digital Syst. (2021)
12. R.A. Canessane, N. Srinivasan, A. Beuria, A. Singh, B.M. Kumar, Decentralised applications using Ethereum blockchain, in 2019 Fifth International Conference on Science Technology Engineering and Mathematics (ICONSTEM), vol. 1 (IEEE, 2019), pp. 75–79
13. S. Nakamoto, Bitcoin: a peer-to-peer electronic cash system. Decent. Bus. Rev. 21260 (2008)
14. K.S. Kumar, T. Ananth Kumar, A.S. Radhamani, S. Sundaresan, Blockchain technology: an insight into architecture, use cases, and its application with industrial IoT and big data, in Blockchain Technology (CRC Press, 2020), pp. 23–42
15. T.T. Kuo, H.E. Kim, L. Ohno-Machado, Blockchain distributed ledger technologies for biomedical and health care applications. J. Am. Med. Inform. Assoc. 24(6), 1211–1220 (2017)
16. M.B. Hoy, An introduction to the blockchain and its implications for libraries and medicine. Med. Ref. Serv. Q. 36(3), 273–279 (2017)
17. S.A. Talesh, Data breach, privacy, and cyber insurance: how insurance companies act as "compliance managers" for businesses. Law Soc. Inq. 43(2), 417–440 (2018)
18. K. Peterson, R. Deeduvanu, P. Kanjamala, K. Boles, A blockchain-based approach to health information exchange networks (2016), https://www.healthit.gov/sites/default/files/12-55blockchain-based-approach-final.pdf. Accessed 05 September 2019
19. K.S. Kumar, T. Ananth Kumar, A.S. Radhamani, S. Sundaresan, 3 blockchain technology, in Blockchain Technology: Fundamentals, Applications, and Case Studies, vol. 23 (2020)
20. P. Zhang, M.A. Walker, J. White, D.C. Schmidt, G. Lenz, Metrics for assessing blockchain-based healthcare decentralized apps, in 2017 IEEE 19th International Conference on e-Health Networking, Applications and Services (Healthcom) (IEEE, 2017), pp. 1–4
21. N. Chaudhry, M.M. Yousaf, Consensus algorithms in blockchain: comparative analysis, challenges and opportunities, in 2018 12th International Conference on Open Source Systems and Technologies (ICOSST) (IEEE, 2018), pp. 54–63
Hashgraph: A Decentralized Security Approach Based on Blockchain with NFT Transactions P. Divya, S. Rajeshwaran, R. Parthiban, T. Ananth Kumar, and S. Jayalakshmi
Abstract Blockchain technology is used by developers to establish computational confidence in their products. As a result, organizations and individuals who may not know or trust one another can collaborate quickly and cheaply. One can create and exchange value, establish identity, and verify it with public distributed ledgers, and the proposed blockchain server is unique in that it achieves the same result as the most widely used public blockchains (such as Bitcoin or Ethereum). These advantages are due to the underlying hashgraph consensus technique and the global enterprise governing body that owns and manages our proposed project's efficient blockchain nodes. We execute NFT transactions using our Hedera-optimized blockchain network. This recommended approach combines a variety of areas, including encryption, banking, and financial transactions. The introductory portion below describes the study that allows NFT transactions in the Hedera-optimized network phase, one of the three phases. This is the first prominent research on the NFT-transaction-on-blockchain approach that we are aware of. By adopting an efficient method between data centers and HCS technology, it is possible to assess our multichain approach in banking and financial circumstances with ease. This technique for computational confidence in banking and financial situations is both economical and effective. The incident response and prevention team can quickly enhance any difficult and complicated procedure. It falls within the capability of NFT with blockchain to act rapidly and decide in a perplexing condition. It offers the easiest, most efficient route throughout the transaction phase for interacting with potential customers, integrating test-net, main-net, and mirror-net. With its simplicity of calculation and computation, this enhanced blockchain server will draw new customers in the future.
Keywords Blockchain · NFT (non-fungible token) · Security · Scalability · Authentication · Decentralized security · HBAR · Token service P. Divya (B) · S. Rajeshwaran · R. Parthiban · T. Ananth Kumar · S. Jayalakshmi IFET College of Engineering, Villupuram, India e-mail: [email protected] T. Ananth Kumar e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_3
1 Introduction

Our proposed optimized blockchain server is a proof-of-stake public distributed ledger that aspires to achieve full independence by combining a "route to trustless" (network nodes) and a "route to widespread coin distribution" (HBAR money) to keep the network secure [1]. Let us look at trustless nodes and coin distribution, as well as the role they play in achieving and maintaining independence in a secure manner [2]. Our proposed optimized blockchain server launches as a public blockchain network with open access in the upper left quadrant: the network's nodes are administered by Executive Committee members who have been invited to join as network operators [3]. As the network's performance, security, stability, and incentives improve, our optimized blockchain network will reduce permissions and open node operation to additional companies and individuals [4]. This is the path the master server will take to achieve its goal of being the most distributed public trustless ledger on the market while safeguarding security at every step. DeFi protocols are used to transfer funds from one blockchain server to another while ensuring that no outliers are detected [4]. In NFT transactions, DeFi protocols are used to detect financial and banking operations, such as loan amount computation, the credit and debit status of client details, and the customer due date, so that each transaction is logged and recorded [5].
1.1 Optimized Blockchain Server

The goal of an optimized blockchain server is to provide information through the blockchain node while minimizing use of the root server, and to access a large number of transactions and data from multiple blockchain servers in a timely manner [6]. The method is to split a single blockchain node into numerous blocks that may be utilized for transactions. This can reduce the use of NFT transactions in high-risk areas, allowing data theft or failure to be prevented and the data saved on an optimized blockchain server [7]. By dividing and fragmenting the network into three nodes (test-net, main-net, and mirror-net) to commit transactions to the database, this can also lower the gas charge during large-scale transactions [8]. WSN nodes and edge computing devices communicate with devices in the blockchain nodes across the hashgraph network to commit NFT transactions in banking operations. These transactions are carried out in the most efficient manner possible on the blockchain nodes in order to interact with NFT transactions [9] (Fig. 1).

Fig. 1 Phase of blockchain server network optimization

To communicate and conduct operations effectively, NFT transactions in the blockchain node use hashgraph and its tokens. This decreases the network path and the cost of executing the blockchain nodes, making transactions in the blockchain node safer [10]. This can be applied in the consensus network to perform network operations by interacting with gossip and hashgraph to execute NFT transactions in the communication between various blockchain nodes. This HCS approach is used to perform operations in the blockchain node and communicate effectively. The gossip protocol is used to communicate in both private and public mode, utilizing the HCS approach to establish the consensus operation [10]. The consensus engine may be used to monitor and regulate both public and private nodes. With a lower gas cost in the blockchain node, HCS may be utilized in DeFi and other decentralized applications. It can also create up to 100 million tokens with lower fees and a more secure link between blockchain nodes. We go into more detail in the next section [11].
2 Literature Survey

2.1 Survey Report: On Token Service

Baird et al. [4] describe the token service, which consists of three phases (main-net, mirror-net, and test-net) used to implement the NFT transaction in our network. The main-net phase contains main-net nodes, mirror nodes, network services, and support. The mirror-net phase contains information about community mirror nodes, ETL, the mirror node, one-click node deployment, and running your own beta mirror node [13]. The beta version can be implemented only if the NFT transaction phase occurs before deployment to the main-net phase. The test-net phase contains test-net nodes, mirror nodes, network services, and support. The network services consist of cryptocurrency, consensus, tokens, files, and smart contracts [14].
2.2 Survey Report: On Main-Net Access

Baird et al. [4] note that the main-net phase requires a main-net account to interact with and pay for any of the network services (cryptocurrency, consensus, tokens, files, and smart contracts). The account holds the balance of HBAR used for transfers between accounts or payments for network services.
2.3 Survey Report: Remove Centralized Control: Hala Systems

Singh et al. [3] observe that storing data in a central database necessitates trust: faith that the database will be adequately secured and that the data will not be tampered with or erased. Hala Systems needed a solution that could be implemented quickly after it had been written [17], acting as an unbiased observer that is not influenced by them or any political party: a unitary organization, but one that is distributed and shares different services among several neutral parties. The distributed nature of the system, together with its immutability [18], provides several assurances: one can discover who provided the data, when it was received, and what information it contained. As a result of illegal access to Hala Systems' network domain, over 2 million people's accounts were alerted, and unauthorized access to company data resulted in 140 warning letters being sent to their clients [19]. During the case study [20], 250k persons experienced reduced trauma as a result of using this service. Every network service uses the consensus mechanism to establish distributed consensus. To achieve the most mathematically efficient type of consensus, the consortium blockchain proposed two unique approaches: gossip-about-gossip and virtual voting [21]. Hashgraph is not only efficient but also secure; owing to its asynchronous Byzantine fault tolerance, it has been found to be more robust in more circumstances than alternative consensus algorithms.
3 Proposed System

3.1 Non-Fungible Transaction

A non-fungible token (NFT) is a type of cryptographic token whose properties are frequently included in the issuing smart contract and form part of the NFT's fundamental native setup before it is issued. The amount can be sent to the main-net, which checks and validates it; the transaction and record are received by the mirror-net, which then transmits the transaction record to the test-net for log tracking and outlier identification during the transaction. If the record is legitimate, a token ID is generated for each transaction. The user then approaches the service provider and asks for a receipt. Finally, the NFT class is produced for transacting the amount in the NFT phase using the blockchain. On the network service provider's side, customized fee schedules are dynamically assigned and created to complete NFT transactions and allow users to transact in streams (Fig. 2).

Fig. 2 Transaction phase of admin to user

The providers then dynamically set the non-fungible token so that each user's identity is distinct from that of other users. In this article, we discuss the mint and burn transactions in the NFT blockchain. Minting is the process of establishing an NFT transaction on the blockchain network. When we talk about a burn in the NFT blockchain, we are discussing the destruction of the transaction record in the network; even after the record is burned, it remains stored on the blockchain network (Fig. 3).

Fig. 3 Transaction phase of NFT
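The mint/burn semantics described above can be sketched as follows. This is an illustrative model, not the Hedera API; all class and field names are assumptions. The key point it demonstrates is that burning marks a token as destroyed while its record stays on the append-only log.

```python
# Illustrative mint/burn ledger model (not the Hedera SDK).
from dataclasses import dataclass
import hashlib
import json
import time

@dataclass
class NftRecord:
    token_id: str
    owner: str
    burned: bool = False

class NftLedger:
    def __init__(self):
        self.log = []      # append-only transaction log
        self.tokens = {}

    def _record(self, entry):
        entry["ts"] = time.time()
        self.log.append(entry)          # records are never deleted

    def mint(self, owner, metadata):
        # Token ID derived from a hash of the metadata (cf. Fig. 10 hashes).
        token_id = hashlib.sha256(json.dumps(metadata).encode()).hexdigest()
        self.tokens[token_id] = NftRecord(token_id, owner)
        self._record({"op": "mint", "token": token_id, "owner": owner})
        return token_id

    def burn(self, token_id):
        self.tokens[token_id].burned = True                  # destroyed...
        self._record({"op": "burn", "token": token_id})      # ...but still logged

ledger = NftLedger()
tid = ledger.mint("alice", {"name": "artwork-1"})
ledger.burn(tid)
assert ledger.tokens[tid].burned and len(ledger.log) == 2
```

Even after the burn, both the mint and burn entries remain in `ledger.log`, mirroring the text's point that a burned record stays on the blockchain network.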
3.2 Blockchain Methodology

As the word is used in the blockchain world, hashgraph is 100% efficient. In blockchain, mining work is occasionally lost when a block is subsequently deemed stale and abandoned by the community. The equivalent of a "block" in hashgraph never grows stale. Hashgraph also makes good use of bandwidth: it adds only a little cost above the bandwidth necessary simply to tell all nodes about a particular transaction (even without obtaining consensus on a timestamp for that transaction).
3.3 Gossip Protocol

A gossip protocol, also known as an epidemic protocol, is a peer-to-peer communication mechanism modeled on how epidemics propagate. To guarantee that data is delivered to all members of a group, numerous distributed methods utilize peer-to-peer gossip. This is the technique that allows all nodes to commit their transactions via the gossip protocol, and it can by itself resolve errors in multicast broadcasting communication. Participants in the gossip protocol on the blockchain transmit new information (called gossip) about transactions, as well as about the gossip itself.
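The epidemic spreading that the protocol relies on can be shown with a toy simulation (an illustrative assumption, not the actual network implementation): each round, every informed node syncs with one random peer, so knowledge spreads to all nodes in roughly O(log n) rounds.

```python
# Toy gossip simulation: counts rounds until all nodes are informed.
import random

def gossip_rounds(n_nodes, seed=0):
    rng = random.Random(seed)
    informed = {0}                       # node 0 learns a new transaction
    rounds = 0
    while len(informed) < n_nodes:
        for node in list(informed):
            peer = rng.randrange(n_nodes)  # each informed node picks a peer
            informed.add(peer)             # the peer now knows the gossip
        rounds += 1
    return rounds

print(gossip_rounds(1000))  # far fewer rounds than nodes
```

The informed set roughly doubles each round, which is why gossip reaches a thousand nodes in a handful of syncs rather than a thousand.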
3.4 Hashgraph

Hashgraph employs its own consensus method and data structure, also called hashgraph, to perform 100x more transactions per second than blockchain-based alternatives. Hashgraph is incredibly efficient thanks to two features: virtual voting and gossip-about-gossip. On hashgraph, each event holds the following information: self-parent hash, other-parent hash, transactions, timestamp, and digital signature (Table 1).
3.5 Consensus Algorithm [In Gossip Protocol]

Gossip-about-gossip is the history of how events are connected to one another through their parent hashes. This history is represented as a directed acyclic graph (DAG): a hashgraph, or a graph of hashes. In a private network, the link between neighboring nodes is established during the network transaction phase. The heart of an NFT network transaction is consensus, which allows nodes to communicate with one another; in the NFT blockchain transaction, it is a private and low-cost communication mechanism. App-net provides communication in the NFT blockchain with privacy, trust, low cost, and transparency. It is a decentralized strategy for connecting neighboring nodes in order to communicate with various users in a secure manner. During the transaction phase of a public network, all network blocks and network nodes are trusted and transparent, and any user can access it at any moment. It can provide a secure method of retrieving data from and sending data to a public network (Fig. 4).

Table 1 Consensus algorithm components

  S. no  Item               Description
  1      Timestamp          The timestamp of when the member created the event commemorating the gossip sync
  2      Transaction        The event can hold zero or more transactions
  3      Hash 1             Self-parent hash
  4      Hash 2             Other-parent hash
  5      Digital signature  Cryptographically signed by the creator of the event

Fig. 4 Flow of NFT transaction phases

For example, Alice declares, "Hello, Bob! I'd like you to perform service ABC for me at the agreed-upon price of $120." Bob responds, "Hello, Alice! Agreed! At a cost of $120, I'm giving you the service ABC." Bob also states, "Hello, Alice. I require service DEF from you at the agreed-upon fee of $80." Alice responds, "Hello, Bob! Agreed! At a cost of $80, I'm providing you with the DEF service." This is how the services were distributed among the different users, instead of via proof-of-stake application providers.
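The event structure of Table 1 and its DAG-of-hashes linkage can be sketched as below. Names are illustrative assumptions, and the digital signature is omitted for brevity; the point is that each event references a self-parent hash and an other-parent hash, so the parent links form the hashgraph.

```python
# Hashgraph event sketch: parent hashes link events into a DAG.
from dataclasses import dataclass
from typing import Optional
import hashlib
import json

@dataclass(frozen=True)
class Event:
    creator: str
    transactions: tuple           # zero or more transactions
    timestamp: float
    self_parent: Optional[str]    # Hash 1: self-parent hash
    other_parent: Optional[str]   # Hash 2: other-parent hash

    def hash(self) -> str:
        payload = json.dumps([self.creator, list(self.transactions),
                              self.timestamp, self.self_parent,
                              self.other_parent])
        return hashlib.sha256(payload.encode()).hexdigest()

# Alice's latest event references her own previous event (self-parent)
# and an event heard from Bob during a gossip sync (other-parent):
genesis = Event("alice", (), 0.0, None, None)
bob_ev = Event("bob", ("ABC for $120",), 1.0, None, genesis.hash())
alice_ev = Event("alice", ("DEF for $80",), 2.0, genesis.hash(), bob_ev.hash())
assert alice_ev.self_parent == genesis.hash()
```

Because every event's hash covers its parent hashes, the whole gossip history is tamper-evident: rewriting one event changes its hash and breaks every descendant's link.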
3.6 HCS Implementation

On the AdsDax platform, a payable event is preserved, which is used to track NFT transactions in the blockchain; the platform monitors it to manage the HCS implementation, securely sending and receiving NFT blockchain transactions via the message queue. Following verification, a job is added to the message queue, which manages all NFT blockchain transactions and maintains a log for each. The user must identify the mirror node, which is the network node that establishes the connection to the network communication node. The HCS provider supplies information about all messages, such as state proofs for order and timestamp, as well as the unique identity of each user from the consensus node of the HCS server (Fig. 5).

Each authorized party can get data from the controlled API and check directly to validate the information and integrity of the state of a coupon. The information for the applicable coupon can be obtained through the API, and the hash can be recalculated by the partner application. Once the hash is confirmed to match, the transaction ID can be checked on a mirror node to ensure it was recorded appropriately on the public ledger. Events are routed to a separate queue for HCS processing, which can handle each NFT blockchain transaction established by the HCS provider, which in turn provides a network service to securely connect network nodes. Because the private and public keys of the NFT blockchain transaction change in each transaction phase of HCS (main-net, test-net, and mirror-net), events are picked up by workers but reformatted for HCS. These phases can be used to connect each network node with a secure connection provided by the consensus engine.
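The coupon-integrity check described above can be sketched as follows; the field names and helper functions are illustrative assumptions. The partner application recalculates the hash of the coupon data fetched from the API and compares it with the hash previously recorded on the ledger before trusting the coupon's state.

```python
# Coupon integrity check: recompute the hash and compare with the ledger.
import hashlib
import json

def coupon_hash(coupon: dict) -> str:
    # Canonical serialization so both sides compute the same digest.
    return hashlib.sha256(
        json.dumps(coupon, sort_keys=True).encode()).hexdigest()

def verify_coupon(coupon: dict, ledger_hash: str) -> bool:
    return coupon_hash(coupon) == ledger_hash

coupon = {"id": "C-1", "state": "redeemed", "value": 10}
recorded = coupon_hash(coupon)           # hash recorded via HCS earlier
assert verify_coupon(coupon, recorded)
assert not verify_coupon({**coupon, "value": 99}, recorded)
```

Only after the hashes match would the application go on to check the transaction ID on a mirror node, as the text describes.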
Fig. 5 HCS data flow diagram

4 Performance Evaluation

This is a report on the smart contract service's performance. The smart contract service enables blockchain nodes to connect with unknown parties in order to transfer wallet balances from one to another at the same time. It uses a decentralized-systems approach to securely commit transactions in a multicast way. This adds solidity between transactions, acting as a bridge between users throughout the transaction phase.
4.1 Transaction Cost Denomination of Our Network

  1 gigabar = 1 Gℏ = 1,000,000,000 ℏ
  1 megabar = 1 Mℏ = 1,000,000 ℏ
  1 kilobar = 1 Kℏ = 1,000 ℏ
  1 hbar    = 1 ℏ
  1 millibar = 1 mℏ; 1,000 mℏ = 1 ℏ
  1 microbar = 1 µℏ; 1,000,000 µℏ = 1 ℏ
  1 tinybar  = 1 tℏ; 100,000,000 tℏ = 1 ℏ

These are the transaction-fee denominations for the token service on the blockchain. Bulk NFT transactions can be divided into these denominations, and a transaction charge is assigned down to each tinybar [16]. We can carry out millions of transactions at low cost, with a low, predictable gas fee and low carbon burn.
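A small helper based on the denominations listed above (an illustrative sketch, assuming 100,000,000 tinybars per HBAR) converts between HBAR and tinybars, the smallest unit in which fees are charged:

```python
# HBAR <-> tinybar conversion, per the denomination table above.
TINYBARS_PER_HBAR = 100_000_000

def hbar_to_tinybar(hbar: float) -> int:
    return round(hbar * TINYBARS_PER_HBAR)

def tinybar_to_hbar(tinybar: int) -> float:
    return tinybar / TINYBARS_PER_HBAR

assert hbar_to_tinybar(1) == 100_000_000
assert hbar_to_tinybar(0.001) == 100_000   # 1 millibar
```

Fees quoted in fractions of an HBAR therefore always resolve to a whole number of tinybars.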
P. Divya et al.

Fig. 6 Transaction cost of network on Blockchain NFT

4.2 Transaction Cost of Network on Blockchain NFT

Each function in the Solidity programming language is assigned a pre-defined cost in the various processes conducted by the consensus network (Fig. 6). The network's currency must be broadly distributed to ensure the network's security under a trustless design. In our network's proof-of-stake consensus procedure, coins represent the "stake" of voting power: more coins equal greater voting power over consensus. HBARS must be broadly dispersed to ensure network security, with no single attacker or group of attackers possessing more than one-third of the coins. The goal is to combine a "road to permissionless" with a "road to broad currency distribution" in order to preserve the network while also encouraging independence. First, until enough coins are released, this will remain a permissioned network. The network will stay permissioned for the sake of security until the total value of all circulating coins reaches a level where a hostile user (or group of users) could not purchase a third of them to carry out an attack [15]. The amount of HBARS that can be proxy-staked to a single node will be limited once proxy-staking is enabled. Second, the network has a slow and deliberate release timetable, with only 34% of all HBARS scheduled to be released before 2025. This delayed release plan is designed to ensure that the network grows in a stable and orderly manner, allowing it to scale without losing the security required for a truly useful and pervasive network that delivers on the promise of a trusted, empowering, and secure online world. The service charges incurred by the Network NFT Transactions committee (Cryptocurrency service, Consensus service, Token service, Schedule service, Contract service, and so on) are listed in Table 2(a). Compared with the costs of other network providers, the service provider charges a moderate incidental fee for all operations. Another characteristic of the distributed blockchain NFT transaction is the private network, which helps identify each network node by its private key; public networks, by contrast, use a consensus engine to keep track of transactions in each network node [12]. The performance ratio in the graph below represents the amount of CPU power and the number of blocks required by each network and consensus engine. Take, for example, Hbar, which processes millions of transactions per day. CUDOS Miner, which generates millions of blocks, can run at a lower CPU rate on each network node. BRD additionally offers the NFT blockchain service on Bitcoin transactions at a rate of millions per day. These transaction volumes are handled daily in a secure manner.
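The one-third security bound described above can be sketched numerically. This is an illustrative check, not code from the paper: the network is considered safe while no attacker coalition holds a third or more of the circulating coins.

```python
# Sketch: the one-third stake bound for proof-of-stake security.
# Safe while attackers hold strictly less than 1/3 of circulating coins.

def network_is_safe(attacker_coins: int, circulating_coins: int) -> bool:
    """True while attackers hold strictly less than 1/3 of circulating coins."""
    return 3 * attacker_coins < circulating_coins

print(network_is_safe(33, 100))  # True  (33% < 1/3)
print(network_is_safe(34, 100))  # False (34% >= 1/3)
```

The integer comparison `3 * attacker_coins < circulating_coins` avoids floating-point error when coin supplies are large.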
5 Result and Discussion The account balances are updated simultaneously for each transaction and its related fee payments. The account’s HBAR is only accessible to the person who holds the private key. The 2021 NFT boom proves that blockchains can:
Table 2 Network service cost information

S. no  Service                      Cost ($)  Others ($)
1      Cryptocurrency service sum*  0.026     0.030
2      Consensus service*           0.02      0.10
3      Token service*               3.16      3.8
4      Schedule fee*                0.01      0.0102
5      Contract service*            1.16      1.5
6      Miscellaneous fee*           0.00      0.10
1. solve real-world problems,
2. be quickly deployed, and
3. generate wealth for both users and underlying networks.

There are other competing blockchains that use consensus algorithms different from Ethereum's. They venture into NFT technology with the goal of enhancing transaction speed and decreasing fees, aiming to increase market valuations and expand the user base. NFT sales volume on the OpenSea marketplace came close to $4 billion in August 2021, up from only $8 million on 17 January 2021, a jump of 50,000%. This highlights the industry's massive revenue potential. NFT economics is constructed on top of existing blockchains. As a result of the excess demand curve, various current blockchains began incorporating NFT infrastructure that enables NFT minting, trading, auctioning, mining, staking, and other operations [22]. It seems reasonable to assume that the NFT craze sparked a significant increase in blockchain adoption, an important step toward blockchain technology's eventual revolution of commerce and financial services.
Paul Madsen et al. [4] describe an attack that takes out a single trustworthy node at a time with a DDoS attack, so that other nodes are unable to sync with the DDoSed node. The attacker may also control a limited number of malicious nodes (less than 1/3) that are aware of what is going on and can help guide the attack by alternating which node is DDoSed, giving the ability to shut down the network for as long as the attack continues. This is a DDoS (Distributed Denial of Service) attack [23]: the attacker has gained access to a large number of machines on the internet, which can be used to flood a single computer with so many packets that it shuts down for the duration of the attack. When it comes to choosing a DDoS target, it does not matter how intelligent the attacker is; it makes no difference whether the attacker uses a malicious node as a spy to assist in selecting the victim. The attack is still ruled out. An attacker cannot freeze the network for as long as the attack continues, provided every pair of honest nodes eventually syncs and more than 2/3 of the nodes are honest. If the attack went on indefinitely, all four Conditions would hold true, but the Results would not; the ABFT proof rules this out. As a result, the ABFT proof declares this type of liveness attack impossible, and the claim must be incorrect. The Leader attack may even come with a mathematical proof that the protocol is BFT; however, it cannot have an ABFT proof, because an ABFT proof would ensure immunity to that attack. This is one of the key ways in which ABFT outperforms BFT: a protocol with a BFT proof is guaranteed correct, but if the proof can be upgraded to an ABFT proof, the correctness guarantee is strengthened to provide liveness guarantees as well.
Dragos I. Musan et al. [7] note that a "fork" of the blockchain can occur spontaneously when a parent block has numerous child blocks, which happens when different miners mine competing blocks at nearly the same moment. When only one child block is extended, this is usually simple to handle. A soft fork introduces backward-compatible network standards, allowing nodes that have not upgraded their software to process transactions as long as they meet the new rules. Hard forks, on the other hand, introduce changes by relaxing the blockchain's restrictions, so newly minted blocks may not be verified by old nodes. On Ethereum, there are a variety of DApps with which consumers can engage; Decentralized Finance applications, a subclass of DApps that build new protocols for managing and trading blockchain assets, are also discussed in this study.
Our system also executes network operations on the blockchain, allowing DApp transactions to be processed quickly and easily. This paper discusses the application's overall technique for performing operations on each NFT-transacted blockchain in a secure manner. Our software, which is ERC-20 and ERC-721 compliant, provides hassle-free transfers of funds from one blockchain to another using regular, safe distributed-ledger technologies. An audit thread ID is created for each transaction in the NFT blockchains, managed using the application message ID; each block in each transaction is generated using the thread ID. The credit details in the NFT blockchain log are preserved securely, allowing trusted and authorized access. PayerName, RecipientName, and ServiceReference can be created during the network phase, and CreditDate and CreditTime can be created for each block. A malicious attacker could install malicious files inside the middleware and use these logging mechanisms to track behavior. HCS Settlement Demo Page (a) is a demo that shows the customer's transaction history on a different blockchain. It also enables a distributed method of obtaining data, using an async-parallel notion to receive a callback on each NFT blockchain transaction.
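The audit-trail idea above can be sketched as follows. This is a hypothetical illustration, not the paper's actual API: a thread ID is derived from the application message ID, and log entries carrying the fields named above (PayerName, RecipientName, ServiceReference, CreditDate, CreditTime) are chained by hashing so tampering is detectable.

```python
# Hypothetical sketch of the audit thread-ID and chained-log scheme described
# above. Field and function names are illustrative assumptions.
import hashlib
import json
from datetime import datetime, timezone

def audit_thread_id(message_id: str) -> str:
    """Derive a stable thread ID from the application message ID."""
    return hashlib.sha256(message_id.encode()).hexdigest()[:16]

def log_entry(thread_id: str, payer: str, recipient: str, service_ref: str,
              prev_hash: str = "0" * 64) -> dict:
    """Build one log entry; its hash covers all fields plus the prior hash."""
    now = datetime.now(timezone.utc)
    entry = {
        "ThreadId": thread_id,
        "PayerName": payer,
        "RecipientName": recipient,
        "ServiceReference": service_ref,
        "CreditDate": now.date().isoformat(),
        "CreditTime": now.time().isoformat(timespec="seconds"),
        "PrevHash": prev_hash,
    }
    entry["Hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    return entry

tid = audit_thread_id("msg-0001")
e1 = log_entry(tid, "Alice", "Bob", "SVC-42")
e2 = log_entry(tid, "Bob", "Carol", "SVC-43", prev_hash=e1["Hash"])
```

Because each entry's hash covers the previous entry's hash, altering any record invalidates every later record in the same thread.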
These options were implemented in the outstanding credits and debits part of the HCS Settlement demo section, shown below in HCS—Settlement Demo Page (b). To transfer money from one blockchain network to another, the user can use the thread ID and public key for each transaction on the NFT blockchain. This may assist the user in detecting outliers in a blockchain node, resulting in hassle-free transactions on NFT blockchains. For adding credit and debit, the public key and thread ID for each transaction can be generated by the user; in the NFT transaction phase, this identifies the unique ID of the unique user. Alerts in the transaction records should identify any discrepancies in the transaction phase, as shown in HCS—Settlement Demo Page (b). HCS—Settlement Demo Pages (c) and (d) show the blockchain transaction used in the NFT transaction phase. They also allow each user's settlements, audits, credits, and debits to be monitored and recorded in order to detect fraud and outliers throughout the transaction phase. In comparison, in Fig. 8 [DeFi—NFT Lend Page] of Dragos I. Musan et al. [7], NFT addresses can be produced each time a transaction is established and then used by the DeFi application to manipulate the operation during the NFT transaction phase; this causes the transaction's NFT name and NFT address to be produced at each level of the transaction. Figure 9, the HCS—Settlement Demo Page, provides the thread ID for manipulating and evaluating each transaction in our NFT transaction phase. The names of the payer and the recipient are transmitted to the blockchain for a secure transaction. For suspicious and untrustworthy transactions, fields such as service reference, additional remarks, created date, and created time are also recorded. Timestamps for each phase are monitored and evaluated, and they are assigned based on the values associated with each user (Fig. 7).
Fig. 7 Performance evaluation ratings of HCS out of 5
Fig. 8 DeFi—NFT lend page
This is the effective evaluation ratio for the blockchain providers listed below, such as HCS, Skuchain, and Alros, which may be examined for time, performance, speed, and security. HCS has a higher level of security than the other blockchain service providers, ensuring the security of nano- and micro-transactions in each blockchain while they are being executed. This provides an effective evaluation of the NFT blockchain network service to be built in banking and finance for improved accuracy and efficiency in transferring money from one person to another over a secure blockchain. For quality purposes, each uncompleted transaction should be documented and maintained in a transaction log, implying that each suspicious and malicious transaction in the blockchain should be investigated. HCS can be applied across domains such as IoT, edge computing, and DeFi to be more secure, utilizing smart contracts with high throughput and lower maintenance costs. HCS may be utilized in a variety of fields to maximize resource utilization while avoiding the creation of numerous clusters in network nodes (Figs. 8, 9, 10, 11, 12 and 13).
Fig. 9 HCS—Settlement demo page
Fig. 10 HCS—Settlement demo page list
Fig. 11 HCS—Settlement demo page add services

Fig. 12 HCS—Settlement demo profile page

Fig. 13 HCS—Settlement demo page audit

6 Conclusion

The hashgraph consensus technique enables Decentralized Applications 2.0 developers to set acceptable and predictable gas pricing. Our network can process up to 15 million gas per second, which matches Ethereum's single-block capacity. Smart Contracts 2.0 transactions benefit from our network's high transfer speeds and security standards. The network uses hashgraph to achieve Asynchronous Byzantine Fault Tolerance (ABFT), the highest level of security for a distributed ledger, implying that no single person or group can prevent the algorithm from reaching consensus. The Smart Contract Service runs Solidity, a programming language used by 30% of all Web3 developers, and is compatible with the EVM (Ethereum Virtual Machine). The Token Service combines the flexibility of our tokenization infrastructure with the compatibility of Solidity and EVM-compliant smart contracts to support native tokens and NFTs. Developers will be able to assess the utility of smart contracts and incorporate hashgraph-based tokenization into their services, providing customers with more choices.
References

1. Q. Wang, R. Li, Q. Wang, S. Chen, Non-fungible token (NFT): overview, evaluation, opportunities and challenges. arXiv preprint arXiv:2105.07447 (2021)
2. M.J. Haddadin (ed.), Water Resources in Jordan: Evolving Policies for Development, the Environment, and Conflict Resolution (Resources for the Future, 2006)
3. J. Singh, S. Venkatesan, Blockchain mechanism with Byzantine fault tolerance consensus for Internet of Drones services. Trans. Emerg. Telecommun. Technol. 32(4), e4235 (2021)
4. L. Baird, M. Harmon, P. Madsen, Hedera: a public Hashgraph network and governing council. White Paper 1 (2019)
5. D.I. Musan, J. William, A. Gervais, NFT.finance: leveraging non-fungible tokens (Imperial College London, Department of Computing, 2020)
6. L. Baird, B. Gross, T. Donald, Hedera Consensus Service (Hedera Hashgraph, 2020)
7. L. Ante, The non-fungible token (NFT) market and its relationship with Bitcoin and Ethereum. FinTech 1(3), 216–224 (2022)
8. D.R. Kong, T.C. Lin, Alternative investments in the Fintech era: the risk and return of non-fungible token (NFT). Available at SSRN 3914085 (2021)
9. J.C. Castro, Visualizing the collective learner through decentralized networks. Int. J. Educ. Arts 16(4) (2015)
10. C. Jaag, C. Bach, Blockchain technology and cryptocurrencies: opportunities for postal financial services, in The Changing Postal and Delivery Sector (Springer, Cham, 2017), pp. 205–221
11. D. Dolenc, J. Turk, M. Pustišek, Distributed ledger technologies for IoT and business DApps, in 2020 International Conference on Broadband Communications for Next Generation Networks and Multimedia Applications (CoBCom) (IEEE, 2020), pp. 1–8
12. T. Yang, Q. Guo, X. Tai, H. Sun, B. Zhang, W. Zhao, C. Lin, Applying blockchain technology to decentralized operation in future energy internet, in 2017 IEEE Conference on Energy Internet and Energy System Integration (EI2) (IEEE, 2017), pp. 1–5
13. I.G. Varsha, J.N. Babu, G.J. Puneeth, Survey on blockchain: backbone of cryptocurrency. Int. J. Res. Appl. Sci. Eng. Technol. (IJRASET) (2020)
14. M. Pilkington, Blockchain technology: principles and applications, in Research Handbook on Digital Transformations (Edward Elgar Publishing, 2016)
15. K.S. Kumar, T. Ananth Kumar, A.S. Radhamani, S. Sundaresan, Blockchain technology, in Blockchain Technology: Fundamentals, Applications, and Case Studies (2020), p. 23
16. Q.E. Abbas, J. Sung-Bong, A survey of blockchain and its applications, in 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC) (IEEE, 2019), pp. 001–003
17. D.Y. Leung, G. Caramanna, M.M. Maroto-Valer, An overview of current status of carbon dioxide capture and storage technologies. Renew. Sustain. Energy Rev. 39, 426–443 (2014)
18. K. Iyer, C. Dannen, Crypto-economics and game theory, in Building Games with Ethereum Smart Contracts (Apress, Berkeley, CA, 2018), pp. 129–141
19. B. Sriman, S.G. Kumar, Enhanced transaction confirmation performances without gas by using Ethereum blockchain. Webology 19(1) (2022)
20. I. Tsabary, A. Manuskin, I. Eyal, LedgerHedger: gas reservation for smart-contract security (Cryptology ePrint Archive, 2022)
21. K.S. Kumar, T. Ananth Kumar, A.S. Radhamani, S. Sundaresan, Blockchain technology: an insight into architecture, use cases, and its application with industrial IoT and big data, in Blockchain Technology (CRC Press, 2020), pp. 23–42
22. S. Jayalakshmi, N. Sangeetha, S. Swetha, T.A. Kumar, Network slicing and performance analysis of 5G networks based on priority. Int. J. Sci. Technol. Res. 8(11), 3623–3627 (2019)
23. T. Shah, S. Jani, Applications of Blockchain Technology in Banking and Finance (Parul University, Vadodara, India, 2018)
Med Card: An Innovative Way to Keep Your Medical Records Handy and Safe Abhishek Goel, Mandeep Singh, Jaya Gupta, and Nancy Mangla
Abstract The healthcare sector has advanced greatly, from new machines that enable effective surgeries to medicines that cure diseases once considered incurable. But these technical advancements have not yet reached medical records. The traditional method of recording everything on paper is still in use; the only difference is that printed material is now provided in place of handwritten notes. These papers are costly and cumbersome to manage in a sector like health care. Today, when everything revolves around privacy, why not treat our medical records the same way? We propose a system where patients can keep their records handy in a medical card issued to them. All their medical history and prescriptions are stored there, and all clinical records are well maintained and kept safe from any breach. Our proposed system is cost-effective and quite feasible, as no resources are used at the users' end. Here, we present a prototype that closely resembles the actual layout of our project. Applying it will save many resources that are currently time-consuming and inefficient. We also plan to have a duration period for which a person opts into the med card service. Each individual who takes the med card first pays the card value to the organization, contingent upon the card tier. This data is shared with all hospital branches so that they can maintain a global record. If a Med Card holder falls ill, they are given initial treatment with no consultancy charges. Keywords Health care · Blockchain · Machine learning
A. Goel · J. Gupta (B) · N. Mangla IT Department, HMRITM, New Delhi, India e-mail: [email protected] M. Singh CSE Department, HMRITM, New Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_4
A. Goel et al.
1 Introduction

As technologies advance day by day, so do we; med cards are the result of one such advancement. The med card is a solution to the traditional method of recording medical details on paper and searching for them at the time of a doctor's appointment or while claiming a health insurance policy. Even today, many people face complications in their treatment, or things get worse, because they receive treatment for their current problem without their previous health record being considered. The med card addresses this. It is implemented using blockchain technology, and an additional feature, the recommendation system, also plays a crucial role in this project and is implemented using machine learning.
1.1 What is Blockchain?

Most applications run on centralized systems to which we have no access; all the data stored in a central database server can be accessed only by a single authority, which can exploit the data easily and is neither reliable nor trustworthy [1, 2]. Blockchain addresses this by running on a decentralized system where all data is shared among every user of the blockchain network. Data is stored in an immutable ledger made of blocks, each containing the relevant information, its own hash value, and the hash value of the preceding block, so each block depends on the one before it. If the hash value of any block changes, it changes the value of the next block, and this cascades onward, making it very hard for hackers to tamper with any data stored in a block [3]. For faster execution, we create smart contracts in which all the terms and conditions based on the requirements are stored digitally in a programming language called Solidity, a high-level, Turing-complete language; once a contract has been deployed on the Ethereum blockchain, it cannot be changed [4].
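The block structure described above can be sketched in a few lines. This is an illustrative model, not Ethereum's actual block format: each block stores its own hash and the hash of the preceding block, so altering any block invalidates every block after it.

```python
# Minimal sketch of a hash-linked chain of blocks (illustrative only).
import hashlib

def block_hash(data: str, prev_hash: str) -> str:
    return hashlib.sha256((prev_hash + data).encode()).hexdigest()

def build_chain(records):
    """Link each record to its predecessor via the previous block's hash."""
    chain, prev = [], "0" * 64          # genesis block has no predecessor
    for data in records:
        h = block_hash(data, prev)
        chain.append({"data": data, "prev_hash": prev, "hash": h})
        prev = h
    return chain

def is_valid(chain) -> bool:
    """Recompute every hash; a tampered block breaks all links after it."""
    prev = "0" * 64
    for b in chain:
        if b["prev_hash"] != prev or block_hash(b["data"], prev) != b["hash"]:
            return False
        prev = b["hash"]
    return True

chain = build_chain(["rx: amoxicillin", "lab: cbc", "visit: follow-up"])
print(is_valid(chain))        # True
chain[1]["data"] = "tampered"
print(is_valid(chain))        # False
```

Changing a single record makes `is_valid` fail, which is exactly the dependency between blocks that the text describes.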
1.2 What is Machine Learning?

Machine Learning is a branch of Artificial Intelligence (AI) that deals with writing programs or creating algorithms that mimic the way humans learn, gradually improving their performance through experience [5]. In simple terms, it is the study of creating self-learning algorithms for machines. The two most widely adopted methods in Machine Learning are supervised learning and unsupervised learning [5].
2 Literature Review

Legacy systems normally offer healthcare resources only within the clinical and medical services field and are not fully compatible with external systems. Nevertheless, evidence demonstrates numerous advantages of integrating these systems for interconnected and better healthcare, calling for interconnection between various organizations by health informatics researchers. Possibly the most fundamental issue is multi-organizational information exchange, which demands that clinical information obtained by one healthcare provider be readily accessible to other organizations, for example a physician or a research institution. In many healthcare implementations, blockchain technology rethinks information handling and administration, owing to its flexibility and its strong partitioning, security, and sharing of clinical information and services. In the healthcare industry, blockchain technology is at the leading edge of many current developments. With advances in electronic health data, cloud data storage, and patient data protection regulations, new opportunities are opening up for the management of health data, as well as convenience for patients in accessing and sharing their health data [6]. Guaranteeing data security, storage, and exchange, and managing their smooth integration, is hugely significant to any data-driven organization, particularly in healthcare, where blockchain technology can potentially tackle these fundamental issues in a robust and effective manner. In this section, blockchain-based applications including data sharing, data management, data storage, and EHRs are examined in detail. Emerging blockchain-based healthcare platforms, including data sources, blockchain technology, healthcare applications, and stakeholders, are conceptually separated into several layers.
• Hsuan-Yu Chen et al. have presented a cryptography-based Integrated Medical Information System (IMIS) with three phases: the Registration Phase, the Diagnosis Phase, and the Collecting Medicine Phase. Each phase is accessible with proper authentication and privacy protection through digital signatures, including proxy and group signatures. The interface includes e-patient records and e-prescriptions to reduce paperwork and the time spent on it [7]. • Chalumuru Suresh et al. have implemented an Android application for maintaining and storing medical records in a cloud database, where all the data is managed in a centralized manner and users are authenticated efficiently. After authentication and storage of the data, a smart card containing a QR code is generated [8]. • Samiullah Paracha et al. have proposed a smart-card-based e-health network in Rwanda covering four kinds of patient details: (1) patient prescription data, (2) physician prescription data, (3) medical laboratory data, and (4) pharmacy patient medication data; a PDA device is introduced to read and update the data on the smart card ID [9].
• Xiaolan He et al. have implemented an Electronic Medical Record (EMR) system to help patients find their medical history and help hospitals know all the information about their patients; .NET development is used to connect to the database cache that stores data related to patients, hospitals, treatments, laboratory tests, etc., and white-box and black-box testing are used to test the application [10]. • Bimanna Alaladinni et al. have built an application providing a smart card health security system with different categories of smart cards; a patient can buy one according to their needs, it contains all of the patient's information, and patients can avail discounts on their medical treatments depending on the smart card they have chosen [11]. • Asma Khatoon et al. describe the applications of blockchain in the healthcare sector: how a decentralized application works, how smart contracts can be built and used for issuing medical prescriptions and storing patients' laboratory test records, the data flow for reimbursement in the healthcare system, the EHR workflow for surgical treatment, and cost estimation using a dataset [12]. • Aleksandr Kormiltsyn et al. delineate how smart contracts can improve the healthcare management system; a detailed description of the booking and consultation process is given with the help of a Business Process Model and Notation diagram and integrated with a smart contract that identifies a patient's diseases, decides on that basis whether the patient needs to consult a doctor, books a slot if required, and updates the smart contract with the current details [13]. The comparative analysis of smart contracts and the recommendation model is illustrated in Tables 1 and 2, respectively.
3 Technologies Used

Med Card is a solution for hassle-free treatment. We are concerned with the problems we all face while visiting or consulting a doctor for a disease, or while claiming a health insurance policy, namely getting all details in one secure place. The technology we think best suits the security and implementation purpose is blockchain, and for the additional feature, the recommendation system, we turn to machine learning.
Table 1 Comparative analysis of smart contracts

Record Keeping:
  Traditional: Medical records are kept in registers or in centralized databases managed by a central authority.
  Blockchain-based: Records are kept in a decentralized system where no central authority is needed to manage them.
Security:
  Traditional: Less secure, as a central database can easily be tampered with by hackers, and records may be lost.
  Blockchain-based: More secure, as data is stored in a distributed environment where millions of devices are connected, making it very difficult to tamper with any data.
Time Consumption:
  Traditional: More time-consuming to maintain and manage all the paperwork.
  Blockchain-based: Less time-consuming to maintain all the records.
Efficiency:
  Traditional: Less efficient, as data is stored in a central system where it can be changed easily and there are more chances of data redundancy.
  Blockchain-based: More efficient, as blockchain is a distributed immutable ledger where data, once recorded, cannot be changed, reducing data redundancy.
Trust:
  Traditional: Not trustworthy, as it is not transparent; data is managed by a single server and can be manipulated.
  Blockchain-based: Trustworthy, as it is completely transparent; data is shared with all members of the network, so it cannot be manipulated by anyone.

Table 2 Comparative analysis of the recommendation model

Filtering:
  Existing: Recommendations are based on the user's previous behaviour.
  Proposed: Recommendations are based on previous user behaviour and also other filters the user applies, for example doctors' fees and location.
Efficiency:
  Existing: Less efficient, as data is recommended based on user behaviour alone.
  Proposed: More efficient, as data is recommended based on user behaviour and other filters.
Accuracy:
  Existing: Accuracy is quite high.
  Proposed: Accuracy lies in the range of 80–85%.

3.1 Blockchain in Health Care

Blockchain is a disruptive technology, as it has completely changed the traditional approach into a new, enhanced version [14]. It is a distributed immutable ledger that is completely transparent and decentralized: once data is stored in the form of a block and deployed on the Ethereum blockchain, it remains unchanged and cannot be deleted by anyone. It works on a distributed P2P network instead of a centralized network; there is no central authority, and all information is distributed throughout the network. In the healthcare sector, records are maintained in registers or stored in a centralized database, which may not be secure since it can easily be tampered with or changed by anyone. Blockchain can help minimize the problems of this traditional approach, as it provides greater security, builds trust, saves time, is cost-effective, and requires no third-party validation; all verification and authorization can be done between the members of the network, providing an efficient way of record-keeping [15]. It is a chain of blocks in which each block is connected to the next, i.e. each block stores the hash value of the current block and the previous block; an identical copy of the blocks is kept by every member of the network, so it is very difficult to hack millions of systems at the same time [16]. With the help of a decentralized application, the information of a patient stored in one hospital can be easily accessed by other hospitals via a public shareable key shared by the patient; each patient has a unique private key and a public key that allow doctors to view all of the patient's medical history [17, 18]. Blockchain smart contracts can be used to manage and keep all the records in a safer way, easily accessible by anyone with the proper validation and authorization, replacing the traditional way of keeping records [19, 20].
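The sharing model described above can be sketched as follows. This is an illustrative stand-in, not an actual smart contract: a patient grants a doctor access by sharing a public key, and records are append-only, mirroring the "read or write but never delete" rule.

```python
# Illustrative sketch of patient-controlled, append-only record sharing.
# Class and method names are hypothetical, not from the paper.
import secrets

class HealthLedger:
    def __init__(self):
        self.records = []        # append-only: no delete method exists
        self.granted = set()     # public keys the patient has authorized

    def grant(self, doctor_public_key: str):
        """Patient shares access with a doctor's public key."""
        self.granted.add(doctor_public_key)

    def append(self, entry: str):
        self.records.append(entry)

    def read(self, public_key: str):
        if public_key not in self.granted:
            raise PermissionError("access not granted by patient")
        return list(self.records)  # copy: callers cannot mutate the ledger

ledger = HealthLedger()
doctor_key = secrets.token_hex(16)   # stand-in for a real asymmetric keypair
ledger.append("2021-07-04: MRI scan, Hospital B")
ledger.grant(doctor_key)
print(ledger.read(doctor_key))
```

A real deployment would use asymmetric cryptography and on-chain storage; the point here is only the access-control and append-only shape of the design.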
3.2 Machine Learning in Health Care Machine Learning is a branch of Artificial Intelligence that deals with creating algorithms that can learn by themselves or in simple terms self-improving algorithms, by experience or through learning with the help of datasets fed to them [21, 22]. Recommender systems are machine learning systems that help users discover new products and services; whether to buy an online product or watch a movie, a recommendation system is present for you to make the best deal out of all. ML is widely used in medical fields due to many reasons such as it is capable of more precisely perceiving a disease at an earlier stage, serving to minimize the number of readmissions in hospitals and clinics. Technology has created many aspects or opened a great range of potentials in order to help patients to deal with complicated conditions [23]. Nowadays, digital information is often available on various web platforms, which creates difficulty for users from finding potential information regarding improvement in wellbeing. In addition to this, many already available drugs, tests, and treatment recommendations aim to solve the hassle of prime appropriate medication for patients. The recommender systems for the medical field should be often used to remove these gaps and create a better platform that supports both patients and medical professionals in creating better decisions regarding healthcare-related problems [24]. Lately, these approaches have been extensively put into the healthcare domain, or in other terms we can say Health Recommender Systems, i.e. HRS for better medical proposals. Unlike Pioneer in the health domain (e.g. medical expert systems), HRS tends to offer a better personification that increases the characteristics of provided recommendations and refines users’ understanding in a much appropriate and easy-to-understand manner of their medical condition [25]. 
These approaches are proving to be better in every respect, from user experience to improved health outcomes, which ultimately motivates patients to follow a healthier lifestyle.
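As a concrete illustration of the recommender idea described above, the sketch below ranks doctors by the cosine similarity between a patient's condition history and each doctor's specialty profile. The vocabulary, the doctor names, and the `recommend` helper are all hypothetical; the paper does not prescribe a particular algorithm.

```python
import math

# Hypothetical toy data: patient history and doctor profiles are represented
# as term-count vectors over a shared vocabulary of conditions.
VOCAB = ["diabetes", "hypertension", "cardiac", "renal", "dermatology"]

def vectorize(terms):
    """Map a list of condition terms to a count vector over VOCAB."""
    return [terms.count(t) for t in VOCAB]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(history_terms, doctors, top_k=2):
    """Rank doctors by similarity of specialty profile to patient history."""
    p = vectorize(history_terms)
    scored = [(cosine(p, vectorize(spec)), name) for name, spec in doctors.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

doctors = {
    "Dr. A (endocrinologist)": ["diabetes", "hypertension"],
    "Dr. B (cardiologist)": ["cardiac", "hypertension"],
    "Dr. C (dermatologist)": ["dermatology"],
}
history = ["diabetes", "diabetes", "hypertension"]
print(recommend(history, doctors))  # → ['Dr. A (endocrinologist)', 'Dr. B (cardiologist)']
```

A production HRS would use richer features and collaborative signals, but the ranking principle is the same.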
Med Card: An Innovative Way to Keep Your Medical Records Handy …

4 Proposed Idea

We propose a system in which each patient is assigned a medical card. By scanning the QR code associated with the card, one can access the complete medical history of the patient, including treatments, prescriptions, and laboratory tests, which makes record keeping efficient, time-saving, secure, and cost-effective. The system also recommends doctors and medicines according to the patient's medical history, which helps patients easily find suitable doctors for their treatment.
4.1 What is a Med Card?

The Med Card contains a QR code and an encrypted key that uniquely identify each user. By scanning the QR code via a desktop or mobile device, one can retrieve the patient's complete medical history, the hospitals they visited, their treatments, and the doctor details: all of this information is stored in one single card, called the Med Card. All the information contained in the Med Card is secure, as each card has its own unique key generated by blockchain smart contracts, which cannot be changed. The medical professional staff as well as the patients can read or write the information but cannot delete any data, which prevents data loss and data redundancy [26].
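The read/write-but-never-delete policy can be pictured with a small append-only store, sketched below. The `MedCardLedger` class, its key derivation, and the hash chaining are illustrative assumptions; in the actual system the unique key would be produced by a blockchain smart contract rather than a local SHA-256 hash.

```python
import hashlib
import json
import time

class MedCardLedger:
    """A minimal sketch (not the authors' implementation) of an append-only
    record store: records can be added and read but never deleted, mirroring
    the Med Card's read/write-but-no-delete policy."""

    def __init__(self, patient_secret: str):
        # Assumption: the card key is derived from a patient secret; on the
        # real system it would come from a blockchain smart contract.
        self.card_key = hashlib.sha256(patient_secret.encode()).hexdigest()
        self._records = []  # internal list; no delete method is exposed

    def add_record(self, kind: str, payload: dict) -> str:
        """Append a record, chaining it to the previous one by hash."""
        entry = {"kind": kind, "payload": payload, "ts": time.time(),
                 "prev": self._records[-1]["hash"] if self._records else None}
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        self._records.append(entry)
        return entry["hash"]

    def history(self):
        """Read-only copy of the full medical history."""
        return list(self._records)

card = MedCardLedger("patient-42-secret")
card.add_record("prescription", {"drug": "metformin", "dose": "500mg"})
card.add_record("lab", {"test": "HbA1c", "value": 6.8})
print(len(card.history()))  # → 2
```

The hash chain makes silent modification of earlier entries detectable, which is the property the blockchain layer provides at scale.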
4.2 Workflow

The workflow of the proposed system (Fig. 1) proceeds as follows. First, the user presents their unique QR code, which is associated with a key secured using blockchain technology. The scanning system then connects that QR code with the user's data; in other words, the scanning system obtains a detailed description of the patient and their medical history. This makes it easy for medical professionals as well as patients to maintain a well-structured record of diagnosis, treatment, and health, which ultimately speeds up the diagnosis process. In addition, the system recommends doctors as well as medicines to the user according to the medical data already stored in the system. Users can also choose what kind of information they want to show and can view their own data, but a particular user is not allowed to delete any data, for now. The flowchart given below describes the working of the system in an abstract manner (Fig. 2).
Fig. 1 Flowchart of the med card
A. Goel et al.
Fig. 2 Data flow diagram of the med card
5 Conclusion and Future Scope As technology advances day by day, we can see lots of drastic changes in every sector, including health care. The introduction of med cards would enable these technologies to have a step forward in dynamic and personalized automated health care. The security provided by blockchain and the recommendation system provided using Machine Learning facilitate a more secure, user-friendly, and hassle-free solution for both patients and medical professionals in order to exchange information [27]. The dataset which will be collected by this system will result in a valuable foundation for future research [28].
References

1. Z. Zheng, S. Xie, H.-N. Dai, X. Chen, H. Wang, An overview of blockchain technology: architecture, consensus, and future trends (2017). https://doi.org/10.1109/BigDataCongress.2017.85
2. M. Hölbl, M. Kompara, A. Kamisalic, L. Nemec Zlatolas, A systematic review of the use of blockchain in healthcare. Symmetry (2018). https://doi.org/10.3390/sym10100470
3. A. Tsilidou, G. Foroglou, Further applications of the blockchain (2015)
4. C.D. Parameswari, V. Mandadi, Healthcare data protection based on blockchain using Solidity, 577–580 (2020). https://doi.org/10.1109/WorldS450073.2020.9210296
5. A. Simon, M. Deo, V. Selvam, R. Babu, An overview of machine learning and its applications. Int. J. Electr. Sci. Eng., 22–24 (2016)
6. A. Goel, S. Gautam, N. Tyagi, N. Sharma, M. Sagayam, Securing biometric framework with cryptanalysis. Intell. Data Analyt. Terror Threat Prediction, 181–208 (2021). https://doi.org/10.1002/9781119711629.ch9
7. H.-Y. Chen, Z.-Y. Wu, T.-L. Chen, Y.-M. Huang, C.-H. Liu, Security, privacy and policy for cryptographic based electronic medical information system (2021). MDPI. Retrieved November 14, 2021, from https://www.mdpi.com/1424-8220/21/3/713
8. B. Alaladinni, Prashant, H. Kowsar, B. Heena, A.A. A, Smart Card Health Secur. Syst. 7(5), 12156–12159 (n.d.)
9. C. Suresh, C. Chandrakiran, K. Prashanth, K.V. Sagar, K. Priyanka, “Mobile medical card”—an android application for medical data maintenance. Second Int. Conf. Inventive Res. Comput. Appl. (ICIRCA) 2020, 143–149 (2020). https://doi.org/10.1109/ICIRCA48905.2020.9183307
10. S. Paracha, The prospects of smart card based e-health networks in Rwanda: Integrated Patient Health Record System (IPHRS) (2019)
11. X. He, L. Cai, S. Huang, X. Ma, X. Zhou, The design of electronic medical records for patients of continuous care. J. Infection Public Health 14 (2019). https://doi.org/10.1016/j.jiph.2019.07.013
12. A. Khatoon, A blockchain-based smart contract system for healthcare management. Electronics 9(1), 94 (2020). https://doi.org/10.3390/electronics9010094
13. A. Kormiltsyn, C. Udokwu, K. Karu, K. Thangalimodzi, A. Norta, Improving healthcare processes with smart contracts (2019). https://doi.org/10.1007/978-3-030-20485-3_39
14. J. Yli-Huumo, D. Ko, S. Choi, S. Park, K. Smolander, Where is current research on blockchain technology?—a systematic review. PLoS ONE 11(10), e0163477 (2016). https://doi.org/10.1371/journal.pone.0163477
15. X. Du, B. Chen, M. Ma, Y. Zhang, Research on the application of blockchain in smart healthcare: constructing a hierarchical framework. J. Healthcare Eng. 2021, 1–13 (2021). https://doi.org/10.1155/2021/6698122
16. N. Tyagi, S. Gautam, A. Goel, P. Mann, A framework for blockchain technology including features. Adv. Intell. Syst. Comput., 633–645 (2021). https://doi.org/10.1007/978-981-15-9927-9_62
17. K. Shuaib, J. Abdella, F. Sallabi, M.A. Serhani, Secure decentralized electronic health records sharing system based on blockchains. J. King Saud Univ. Comput. Inf. Sci. (2021). https://doi.org/10.1016/j.jksuci.2021.05.002
18. P. Zhang, M.A. Walker, J. White, D.C. Schmidt, G. Lenz, Metrics for assessing blockchain-based healthcare decentralized apps. 2017 IEEE 19th International Conference on e-Health Networking, Applications and Services (Healthcom) (2017). https://doi.org/10.1109/healthcom.2017.8210842
19. M. Ribeiro, A. Vasconcelos, Medblock: using blockchain in health healthcare applications based on blockchain and smart contracts. Proceedings of the 22nd International Conference on Enterprise Information Systems (2020). https://doi.org/10.5220/0009417101560164
20. S.N. Khan, F. Loukil, C. Ghedira-Guegan, E. Benkhelifa, A. Bani-Hani, Blockchain smart contracts: applications, challenges, and future trends. Peer-to-Peer Netw. Appl. 14(5), 2901–2925 (2021). https://doi.org/10.1007/s12083-021-01127-0
21. J. Akhil, S. Samreen, R. Aluvalu, The future of health care: machine learning. Int. J. Eng. Technol. (UAE) 7, 23–25 (2018). https://doi.org/10.14419/ijet.v7i4.6.20226
22. G. Parashar, A. Chaudhary, A. Rana, Systematic mapping study of AI/machine learning in healthcare and future directions. SN Comput. Sci. 2, 461 (2021). https://doi.org/10.1007/s42979-021-00848-6
23. B.A. Mateen, J. Liley, A.K. Denniston et al., Improving the quality of machine learning in health applications and clinical research. Nat. Mach. Intell. 2, 554–556 (2020). https://doi.org/10.1038/s42256-020-00239-1
24. T.N.T. Tran, A. Felfernig, C. Trattner et al., Recommender systems in the healthcare domain: state-of-the-art and research issues. J. Intell. Inf. Syst. 57, 171–201 (2021). https://doi.org/10.1007/s10844-020-00633-6
25. M. Wiesner, D. Pfeifer, Health recommender systems: concepts, requirements, technical basics and challenges. Int. J. Environ. Res. Public Health 11(3), 2580–2607 (2014). https://doi.org/10.3390/ijerph110302580
26. G. Kardas, E. Tunali, Design and implementation of a smart card based healthcare information system. Comput. Methods Programs Biomed. 81, 66–78 (2006). https://doi.org/10.1016/j.cmpb.2005.10.006
27. A. Goel, B. Bhushan, B. Tyagi, H. Garg, S. Gautam, Blockchain and machine learning: background, integration challenges and application areas. Emerg. Technol. Data Mining Inf. Secur., 295–304 (2021). https://doi.org/10.1007/978-981-15-9774-9_29
28. L. Bouayad, A. Ialynytchev, B. Padmanabhan, Patient health record systems scope and functionalities: literature review and future directions [published correction appears in J. Med. Internet Res. 21(9), e15796 (2019)]. J. Med. Internet Res. 19(11), e388 (2017). Published 2017 Nov 15. https://doi.org/10.2196/jmir.8073
Delta Operator-Based Modelling and Control of High Power Induction Motor Using Novel Chaotic Gorilla Troop Optimizer Rahul Chaudhary, Tapsi Nagpal, and Souvik Ganguli
Abstract A new chaotic gorilla troop optimizer is developed in this study. Different one-dimensional chaotic maps are used to alter two parameters of the parent algorithm; the new chaotic algorithms are thus based on ten well-known and widely cited chaotic maps. The suggested method is tested using two varieties of benchmark functions, namely unimodal and multi-modal. Using the proposed algorithm, a 500 hp ac motor model is reduced in order. Further, a controller is developed taking advantage of the unified domain method, implemented with the help of the approximate model matching technique. The proposed methods outclass manifold cutting-edge approaches with respect to both the optimal solutions and the convergence graphs. To account for the stability of the proposed algorithms, the standard deviation of the optimal values is examined. Selected results of the proposed methods are displayed in this work. All of the experiments conducted have yielded promising results. Keywords Gorilla troop optimizer (GTO) · Chaos · Induction motor · Delta transform theory · Model order reduction · Controller realization
R. Chaudhary · S. Ganguli (B) Department of Electrical and Instrumentation Engineering, Thapar Institute of Engineering and Technology, Patiala, Punjab 147004, India e-mail: [email protected] R. Chaudhary e-mail: [email protected] T. Nagpal Department of Computer Science and Information Technology, Lingayas University, Faridabad, Haryana, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_5
1 Introduction

Owing to their ruggedness and robust nature, induction motors are widely used across a wide range of industries; in fact, more than 80% of motors used in industries are induction motors. They also do not need as much maintenance as other options, and they are readily available and inexpensive [1]. Their mathematical models, on the other hand, tend to result in complex systems that are difficult to manage and obviously require higher-order controllers. Such higher-order controllers may necessitate additional hardware and may not be feasible to implement. As a result, it is necessary to reduce the order of the original system [2]. Compared with traditional discrete-time systems, delta operator-based modelling and control has a number of distinct advantages, some of which are worth noting here. With the conventional discrete-time (shift) operator, numerical ill-conditioning occurs when dealing with fast-sampled digital data; the delta-domain operator, on the other hand, ensures that computations are fast and accurate. At very small sampling times, discrete delta systems converge to their continuous-time counterparts, thus providing a unified modelling approach [3]. Using the delta operator framework, metaheuristic techniques are frequently used to model and control various systems [4–6]. The use of chaotic metaheuristic techniques to solve engineering problems has become increasingly popular in recent years. Metaheuristic techniques can incorporate chaos in a variety of ways. Typically, the update equation is modified to improve the accuracy of the results. Some metaheuristic algorithm parameters can also be tweaked with chaos maps to improve overall performance, and chaos maps can even replace the random numbers used in the algorithms [7–9].
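The convergence of delta-domain models to their continuous-time counterparts can be checked numerically. Under the exact discretization z = e^(sΔ) and the delta transform γ = (z − 1)/Δ, a delta-domain pole approaches the continuous-time pole as the sampling time Δ shrinks. The pole value below is illustrative, not taken from the paper's motor model.

```python
import math

def delta_pole(s_pole: float, delta: float) -> float:
    """Map a continuous-time pole to the delta (gamma) domain via the
    exact discretization z = exp(s * delta) and gamma = (z - 1) / delta."""
    z = math.exp(s_pole * delta)
    return (z - 1.0) / delta

s_pole = -5.0  # illustrative continuous-time pole
for d in [1e-1, 1e-2, 1e-3, 1e-4]:
    # The gamma-domain pole tends to s_pole as the sampling time d -> 0.
    print(d, delta_pole(s_pole, d))
```

This is the property exploited throughout the paper: the same delta-domain machinery covers both the discrete model and its continuous-time limit.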
A new metaheuristic method coined the gorilla troop optimizer (GTO), which mimics the social intelligence of gorillas, has been developed recently [10]. There are several ways to modify the performance of an algorithm, and the chaotic version is one popular choice. An effective chaos-embedded gorilla troop optimizer (CGTO) is proposed in this work using popular chaotic maps from the literature [11]. Two controlling parameters of the algorithm are varied chaotically to bring about better performance. The initial validation of these algorithms is done with higher-dimensional test functions; two categories of experimentation are used in this research, unimodal and multi-modal. A practical test system is also used to conduct an additional experiment: precisely, we attempt to model and control a highly rated induction motor using these new CGTO methods. The rest of the paper is arranged as follows. Section 2 explains how to reduce a higher-order system and implement a controller. The novel chaos-enhanced gorilla troop optimizer, developed by regulating the two controlling parameters of the algorithm with the help of chaos maps, is presented in Sect. 3. Five benchmark functions and a 500-horsepower induction motor are optimized in Sect. 4 applying the novel CGTO methods. Section 5 sums up the findings of the research.
2 Formulation of the Problem

To develop the solution to the problem stated in this paper, two stages are necessary: model order reduction, followed by controller synthesis. A model of order two with a predefined configuration is taken up for the order diminution stage. This lower-order system has four decision variables that are optimized. The higher-order system is compared to the lower-order system with unknown parameters, and both systems receive the same random input signals. The integral of the time-weighted absolute error (ITAE) is minimized to obtain the unidentified parameters of the diminished model. The constraints that must be satisfied in order to produce this predefined diminished model are as follows:

• Equating the steady-state gain with that of the original system.
• Keeping the minimum phase characteristics intact.
• Ensuring stability.

To gain the benefit of a unified approach, the reduced-order modelling is performed using the delta operator framework; a thorough mathematical description of the order reduction procedure is given in reference [5]. A controller is then designed for the lower-order model. The controller synthesis technique used here is approximate model matching (AMM) [12]. Since exact model matching (EMM) [13] cannot guarantee a feasible controller synthesis, AMM is recommended over EMM. In AMM, a desired reference model is employed to find the unknown parameters of the controller. The controlled plant's closed-loop output is compared to a reference model with predefined design parameters, and an error function, typically the square error, is minimized to identify the controller parameters. The widely utilized proportional-integral-derivative (PID) controller is employed for this purpose. Finally, the controlled plant's delta-domain step response is matched to the reference model output.
Section 3 proposes a new chaos map-based gorilla troop optimizer, which is applied to the optimization of a few benchmark functions, performs the model diminution, and also utilizes AMM to find the parameters of the controller in the unified domain of analysis. For comparison, a host of benchmark heuristic methods, along with the original GTO approach, is utilized. Reference [6] gives the details of the steps for the controller synthesis for interested readers.
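To make the reduction step concrete, the sketch below reduces an illustrative two-pole plant (not the paper's motor model) to a first-order model: the steady-state gain is fixed to match the original, and the remaining pole is chosen by a coarse grid search minimizing the ITAE between Euler-simulated step responses. A metaheuristic would replace the grid search in practice.

```python
# Illustrative ITAE-based order reduction under a matched-gain constraint.
# Plant: K/((s+a)(s+b)); reduced model: k/(s+p) with k = dc_gain * p.

def step_first_order(k, p, t_end=5.0, dt=1e-3):
    """Euler-simulated unit-step response of k/(s+p)."""
    y, out = 0.0, []
    for _ in range(int(t_end / dt)):
        y += dt * (k - p * y)   # y' = k*u - p*y with u = 1
        out.append(y)
    return out

def step_two_pole(K, a, b, t_end=5.0, dt=1e-3):
    """Euler-simulated unit-step response of K/((s+a)(s+b)),
    realized as two cascaded first-order states."""
    x1 = x2 = 0.0
    out = []
    for _ in range(int(t_end / dt)):
        x1 += dt * (1.0 - a * x1)     # first stage driven by the step input
        x2 += dt * (K * x1 - b * x2)  # second stage
        out.append(x2)
    return out

def itae(y_full, y_red, dt=1e-3):
    """Integral of time-weighted absolute error between two responses."""
    return sum((i * dt) * abs(yf - yr) * dt
               for i, (yf, yr) in enumerate(zip(y_full, y_red)))

K, a, b = 10.0, 1.0, 10.0
y_full = step_two_pole(K, a, b)
dc_gain = K / (a * b)  # steady-state gain to preserve (here 1.0)

# Grid search over the reduced pole; gain k is tied to p by the constraint.
best = min(((itae(y_full, step_first_order(dc_gain * p, p)), p)
            for p in [0.5, 0.8, 1.0, 1.2, 1.5, 2.0]),
           key=lambda t: t[0])
print("best pole:", best[1])
```

The dominant pole of the illustrative plant is at −1, so the search settles near there; the paper's method does the same minimization over four parameters of a second-order delta-domain model.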
3 New Chaotic Gorilla Troop Optimizer

The gorilla troop optimization method, abbreviated as GTO, is a relatively new nature-inspired algorithm. It is employed to handle problems with several types of complexity, such as non-linearity, non-convexity, and non-smoothness. The social intelligence of gorilla groups motivated the development of this algorithm. The GTO method employs gorilla-based operators for simulation to define appropriately
the exploration and exploitation features of the optimization process. Three distinct operators are utilized throughout the exploration phase. The first is relocation to an unknown position, which enhances the exploration capability of the GTO. The second operator, a movement towards other gorillas, helps to maintain the equilibrium between the exploration and exploitation phases. The last operator in the exploration part, viz. migration towards a known location, considerably enhances the GTO's capacity to search the various optimization spaces. In the exploitation phase, only two operators are utilized, considerably improving search performance. Reference [10] gives a thorough overview of this approach as well as its mathematical modelling. There are two controlling parameters in the parent artificial gorilla troop optimization technique which contribute significantly to the performance of the algorithm; they are denoted in the algorithm by ‘ω’ and ‘P’, respectively. Replacing them with chaotic maps greatly influences the results of the GTO algorithm. Ten chaotic maps, widely accepted in the literature, are considered here. The proposed methods are designated as chaos-inspired gorilla troop optimizers, denoted by CGTO in this work; the algorithms are numbered from CGTO-1 to CGTO-20 according to the parameter varied and the chaos map used. Three sets of experiments are considered to validate the suggested approach. To demonstrate the performance of the proposed CGTOs, two unimodal test functions and three multi-modal benchmarks are taken up. An induction motor's higher-order model is then reduced to a lower-order model, typically of order two, and a controller of PID type is constructed with the help of the above-mentioned approach. This last experiment is carried out using the unified delta-domain method, and the controller is realized by matching the reference model approximately.
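For example, the logistic map, one of the widely used one-dimensional chaotic maps, can generate the sequence substituted for a control parameter such as ‘P’. The seed and the direct use of the map output as a parameter in (0, 1) are illustrative choices, not the paper's exact configuration.

```python
def logistic_map(x0=0.7, r=4.0, n=10):
    """Generate n values of the logistic map x_{k+1} = r * x_k * (1 - x_k).
    For r = 4 and a generic seed, the iterates stay in (0, 1) and behave
    chaotically rather than periodically."""
    seq, x = [], x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        seq.append(x)
    return seq

# One chaotic value per iteration replaces the parameter P in (0, 1);
# a rescaled copy could drive a parameter with a different range.
seq = logistic_map()
print(seq[:3])
```

Swapping in a tent, sine, or Chebyshev map in place of `logistic_map` yields the other CGTO variants in the same way.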
4 Simulation Output and Analysis

Test-1: Two unimodal test functions are experimented with first to see whether the proposed methods are effective. The mathematical descriptions, search domains, and ideal optimum values of these test functions are found in [14]. In this test, the two functions are labelled F1 and F2, and 100 decision variables are optimized in each of them. These test cases use a population size of 30 and a maximum of 500 iterations, so the number of function evaluations for the experiment equals 15,000, the product of the population size and the iteration count. This produces stiff competition with respect to the number of decision variables taken up in this investigation. The above-mentioned test functions have been verified with the help of our method, and several new metaheuristic techniques with standard notations from the literature are employed for comparison. Figure 1 displays the convergence graphs of the benchmark test functions (F1-F2). The suggested CGTO approaches outperform a wide range of existing methods, which is evident from the convergence speed and accuracy reflected in Fig. 1.
Fig. 1 Convergence diagrams of the unimodal test functions (F1-F2)
Test-2: Similar experiments are also performed with three standard benchmark functions which are multi-modal in nature, referred to in this work as F3, F4, and F5. A detailed description of these functions is provided in [14]. These functions are also tested with 100 decision variables, considering the same population size and total iterations as in Test-1. A handful of algorithms are applied for the purpose of comparison, and selected convergence diagrams are provided in Fig. 2.

Test-3: This study also considers a 500 HP induction motor [15]. According to its transfer function, the machine model is of fifth order. The input-output relationship of this model is thus given by

G(s) = (1930s^3 + 267900s^2 + 8.065e06 s + 2.566e08) / (s^5 + 142s^4 + 149200s^3 + 8.549e06 s^2 + 4.026e08 s + 7.645e09)    (1)
The delta-domain-based model in the γ-domain, obtained with a sampling frequency of 400 Hz, is represented as
Fig. 2 Convergence characteristics of the multi-modal test functions (F4-F5)
G(γ) = (2.214γ^4 + 2065.32γ^3 + 2.3476e05γ^2 + 7.308e06γ + 2.0052e08) / (γ^5 + 430.25γ^4 + 1.442e05γ^3 + 8.336e06γ^2 + 3.519e08γ + 5.974e09)    (2)
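A quick sanity check on Eqs. (1) and (2): setting s = 0 in (1) and γ = 0 in (2) shows that the two models share the same steady-state gain, as expected from a consistent delta-domain conversion.

```python
# Steady-state (DC) gains of the two motor models, read off as the ratio of
# the constant terms of numerator and denominator.

gain_s = 2.566e08 / 7.645e09       # G(0) from Eq. (1)
gain_gamma = 2.0052e08 / 5.974e09  # G(0) from Eq. (2)
print(round(gain_s, 5), round(gain_gamma, 5))
assert abs(gain_s - gain_gamma) < 1e-4  # both ≈ 0.0336
```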
Developing a realizable controller for this higher-dimensional machine model is extremely difficult. As a result, the proposed CGTO-6, CGTO-13, and CGTO-20 methods are employed to reduce this system to respective second-order systems. Only four decision variables are involved in this experiment, so a population size of twenty with one hundred iterations is used. Many of the latest techniques, including the basic GTO approach, are put to the test in this study, with standard symbols used for the different algorithms compared. Table 1 shows the developed lower-order models. This table also provides the mean and standard deviation of the optimized error value, because only heuristic techniques were used to reduce the model. For the sake of clarity, the best error values are written in bold letters.
Table 1 Lower-order systems of the 500 hp ac motor in the combined domain of analysis

Algorithms | Second-order transfer functions in the unified domain | Mean error | Std. error
CGTO-06 | (3.533γ + 13.997)/(γ^2 + 119.965γ + 143.398) | 0.009986 | 5.01e−09
CGTO-13 | (3.523γ + 13.996)/(γ^2 + 119.966γ + 143.398) | 0.009945 | 1.06e−10
CGTO-20 | (3.523γ + 14.044)/(γ^2 + 119.966γ + 144.01) | 0.009986 | 6.26e−10
GTO | (3.857γ + 3.52)/(γ^2 + 137γ + 2000) | 0.012519 | 3.14e−11
BWOA | (0.9346γ + 2.905)/(γ^2 + 20γ + 200) | 0.012217 | 2.64e−04
ChOA | (4.233γ + 20.84)/(γ^2 + 150γ + 262.7) | 0.009949 | 1.84e−05
ChSA | (3.372γ + 34.77)/(γ^2 + 100γ + 1500) | 0.010214 | 2.80e−05
DOA | (2.029γ + 3.393)/(γ^2 + 50γ + 150) | 0.011065 | 8.35e−05
MPA | (4.167γ + 16.6)/(γ^2 + 141.8γ + 170) | 0.009945 | 3.99e−11
From Table 1, it is observed that the suggested CGTO-13 method surpasses the other methods in terms of the minimized mean ITAE error; only the MPA method attains the same average value. The parent GTO method as well as the MPA technique provide a smaller standard deviation of the error, and these algorithms are therefore more stable than the proposed method. Moreover, the convergence diagrams are developed as found in Fig. 3. From the convergence plots of Fig. 3, it is observed that the showcased techniques, viz. CGTO-6, CGTO-13, and CGTO-20, outperform a number of the algorithms used for comparison.

Fig. 3 Convergence characteristics of the reduced test system

A PID controller for this motor is also developed using the proposed methodology. In order to evaluate the controller parameters, the square error is optimized, and the AMM technique in the unified modelling framework is used to determine them. A specific reference model is used as the basis for the investigation. For the sake of comparison, a host of brand-new algorithms is employed. In the controller tuning process, there are only three decision variables; as a result, the population size and the maximum number of iterations for this optimization problem are set to 20 and 100, respectively. The intelligent PID controller given by CGTO-06 is represented as

G_c(γ) = 13.637 + 100.85/γ + 0.10413γ    (3)

Fig. 4 Convergence diagrams of the developed controller models
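One way to realize a PID law with the gains of Eq. (3) digitally is the incremental (velocity-form) difference equation below, run at the 400 Hz sampling rate used earlier. The first-order plant is purely hypothetical; the sketch only illustrates that the integral action drives the tracking error to zero.

```python
# Discrete realization of the PID gains from Eq. (3) at 400 Hz, applied to
# a hypothetical first-order plant (NOT the paper's reduced motor model).

KP, KI, KD = 13.637, 100.85, 0.10413  # gains from Eq. (3)
T = 1.0 / 400.0                       # sampling period (400 Hz)

def closed_loop_step(n_steps=800):
    """Simulate a unit-step reference for n_steps samples (2 s here)."""
    y = u = 0.0
    e_prev = e_prev2 = 0.0
    for _ in range(n_steps):
        e = 1.0 - y  # tracking error against the unit-step reference
        # Incremental (velocity-form) PID update
        u += (KP * (e - e_prev) + KI * T * e
              + KD / T * (e - 2.0 * e_prev + e_prev2))
        e_prev2, e_prev = e_prev, e
        # Hypothetical plant y' = -2y + u, integrated with an Euler step
        y += T * (-2.0 * y + u)
    return y

print(closed_loop_step())  # settles close to the reference value 1.0
```

The velocity form avoids integrator wind-up bookkeeping, since only the increment of the control signal is computed at each sample.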
However, the ChSA method produces the minimum square error; the results of the proposed CGTO techniques, as well as those of the MPA and ChOA methods, are close to it. In addition, Fig. 4 depicts the convergence plots for this controller tuning problem using the CGTO-06 and CGTO-13 methods, with other standard approaches used for comparison. From Fig. 4, we can see that the proposed CGTO techniques show appreciably good convergence in comparison to numerous recent techniques reported in the literature. The proposed CGTO approaches are thus successful in modelling and controlling a 500 hp ac motor. It is also possible to create new chaotic GTOs in a variety of ways: the GTO algorithm's two controlling parameters can be varied in further chaotic manners, and one-dimensional chaotic maps can be substituted for some of the random numbers used in the GTO method to develop even more new chaotic GTO methods.
5 Conclusions

This paper proposes an effective chaos-inspired gorilla troop optimizer. The ‘ω’ and ‘P’ parameters of the parent algorithm are moderated with the aid of ten popular chaotic maps. Both unimodal and multi-modal benchmark functions are used to verify the effectiveness of the suggested method. The model reduction of a 500 hp induction motor model is also performed, and its controller is synthesized using the new algorithm with the help of the unified transform theory. An approximate model matching framework is used to realize the controller. With respect to convergence speed and accuracy, the proposed methods surpass some of the most recent techniques. The stability of the proposed algorithms is also taken into account by examining statistical measures of the optimal values. All of the experiments that have been conducted have yielded promising results. Modelling of induction motor drives can also be done using the proposed technique.
References

1. R. Parekh, AC Induction Motor Fundamentals. Microchip Technology Inc. (DS00887A), 1–24 (2003)
2. P. Sarkar, J. Pal, A unified approach for controller reduction in delta domain. IETE J. Res. 50(5), 373–378 (2004)
3. S. Ganguli, G. Kaur, P. Sarkar, A hybrid intelligent technique for model order reduction in the delta domain: a unified approach. Soft. Comput. 23(13), 4801–4814 (2019)
4. S. Ganguli, P. Nijhawan, M.K. Singla, J. Gupta, A. Kumar, Model order reduction of some critical systems using an intelligent computing technique, in Intelligent Communication and Automation Systems, pp. 221–238 (2021)
5. S. Ganguli, G. Kaur, P. Sarkar, A novel hybrid metaheuristic algorithm for model order reduction in the delta domain: a unified approach. Neural Comput. Appl. 31(10), 6207–6221 (2019)
6. S. Ganguli, G. Kaur, P. Sarkar, An approximate model matching technique for controller design of linear time-invariant systems using hybrid firefly-based algorithms. ISA Trans. 127, 437–448 (2022)
7. G.I. Sayed, A. Tharwat, A.E. Hassanien, Chaotic dragonfly algorithm: an improved metaheuristic algorithm for feature selection. Appl. Intell. 49(1), 188–205 (2019)
8. S. Arora, M. Sharma, P. Anand, A novel chaotic interior search algorithm for global optimization and feature selection. Appl. Artif. Intell. 34(4), 292–328 (2020)
9. A. Ibrahim, H.A. Ali, M.M. Eid, E.S.M. El-kenawy, Chaotic harris hawks optimization for unconstrained function optimization, in 2020 16th International Computer Engineering Conference (ICENCO), pp. 153–158 (2020)
10. B. Abdollahzadeh, F. Soleimanian Gharehchopogh, S. Mirjalili, Artificial gorilla troops optimizer: a new nature-inspired metaheuristic algorithm for global optimization problems. Int. J. Intell. Syst. 36(10), 5887–5958 (2021)
11. S. Saremi, S. Mirjalili, A. Lewis, Biogeography-based optimisation with chaos. Neural Comput. Appl. 25(5), 1077–1097 (2014)
12. J. Hammer, Tracking and approximate model matching for non-linear systems. Int. J. Control 77(14), 1269–1286 (2004)
13. S. Tzafestas, P. Paraskevopoulos, On the exact model matching controller design. IEEE Trans. Autom. Control 21(2), 242–246 (1976)
14. M. Jamil, X.S. Yang, A literature survey of benchmark functions for global optimisation problems. Int. J. Math. Model. Numer. Optim. 4(2), 150–194 (2013)
15. O. Wasynezuk, Y.M. Diao, P.C. Krause, Theory and comparison of reduced order models of induction machines. IEEE Trans. Power Appar. Syst. 3, 598–606 (1985)
An Enhanced Optimize Outlier Detection Using Different Machine Learning Classifier Himanee Mishra and Chetan Gupta
Abstract Data mining (DM) is an efficient tool used to mine hidden information from databases enriched with historical data. The mined information provides useful knowledge for decision makers to make suitable decisions. Depending on the application, the knowledge required by the decision makers differs, and different mining techniques are therefore needed. Hence, an ample set of mining techniques such as classification, clustering, association mining, regression analysis, and outlier analysis is used in practice for knowledge discovery. These mining techniques utilize various Machine Learning (ML) algorithms. ML algorithms assume the normal objects to be highly probable and the outliers to have low probability. The global outliers, which occur very rarely, deviate totally from the normal objects and can be easily distinguished by unsupervised ML algorithms, whereas the collective outliers, which occur rarely as groups, deviate from the normal objects and can likewise be distinguished by ML algorithms. This paper analyzes the outliers and class imbalance for diabetes prediction for different ML algorithms, i.e. logistic regression (LR), decision tree (DT), random forest (RF), K-nearest neighbors (K-NN), and XG-Boosting (XGB). Keywords Outlier detection · Data mining · Machine learning · Decision tree · Accuracy · Random forest
H. Mishra (B) Department of Computer Science and Engineering, SIRTS, Bhopal, India e-mail: [email protected] C. Gupta Department of Computer Science and Engineering, SIRT, Bhopal, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_6
1 Introduction

Databases are enriched with operational data. Even though the operational data reflect the current state, they are not directly utilized in the decision-making process; historical information is what is used for understanding the facts and for future decision-making. Hence, the historical data are stored for data analysis, from which knowledge is obtained to support decision-making. The quickly growing, tremendous quantum of information stored in several kinds of data repositories is collectively named a “Data Warehouse”. The data in the warehouse are enriched with information for decision-making [1, 2]. Though decision makers are skilled enough in decision-making, they struggle to extract the valuable information hidden in these abundant repositories. Hence, decision makers started seeking an effective tool that would help in recovering information from the data warehouse. A potent tool that satisfies the decision makers is DM: the process of extracting explicitly unknown, but implicitly known, information from databases for data analysis and decision-making. The foundation of data mining comes from interdisciplinary scientific fields such as ML. The DM tasks include clustering, classification, association rule mining, sequential pattern mining, and anomaly/outlier detection (OD). Among these, this research work focuses on OD, also called anomaly detection or rare event mining [3, 4]. “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism,” stated Hawkins (1980). In practice, outliers are the unexpected or unusual objects that differ greatly from other normal objects in the database. Outliers are also referred to as anomalies, noise, novelties, and deviations [5]. Outliers, which rarely occur in the database, comprise a few exceptional characteristics which can help in alerting the system to take precautionary actions.
Detection of such exceptional characteristics endows us with useful application-specific insights. A few of them are credit card fraud detection, adverse drug reaction during medical treatment, financial crisis identification, crime detection, weak student identification, fault detection in safety-critical systems, and voting irregularity analysis [6]. The outliers present in such domains are of different types and possess different characteristics.
2 Outlier Detection Outliers occur as individuals or as a group. Based on the occurrence, behavior, and the level of deviation from other normal objects, outliers fall into three major categories as follows [7, 8]: (i) Global outliers, (ii) Collective outliers, and (iii) Contextual outliers. A database may have one or more of these types of outliers. Also, an outlier may belong to more than one of these types. The presence of these types of outliers in a
An Enhanced Optimize Outlier Detection …
Fig. 1 Types of outliers—in general, and in diabetes database a Outliers in general and b Outliers in a diabetes database
database is understood from the scatter plots depicted in Fig. 1. Figure 1 consists of the general scatter plot to understand the different types of outliers. It also presents the sample outliers observed in the diabetes database. The glucose level and the Body Mass Index (BMI) of randomly selected patients are considered by normalizing the values in percentage [9].
The global outliers occur very rarely and are the simplest form of outliers to be identified. For example, the intrusions in computer networks, frauds in credit card transactions, tsunami and earthquake hazards in weather monitoring systems, machinery breakdowns in companies, and the product defects during manufacturing belong to global/point outliers [10, 11]. The global outliers have exceptional values compared to the normal objects. The patients with more deviation in the values of glucose and BMI levels (“O1” and “O2”) are categorized as global outliers [12].
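As an illustrative sketch (not the paper's stated method), global/point outliers such as "O1" and "O2" can be flagged with a simple z-score rule; the 3-sigma threshold below is a common convention, not a value taken from the paper:

```python
import numpy as np

def global_outliers(values, threshold=3.0):
    """Return indices of values whose z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.flatnonzero(np.abs(z) > threshold)
```

For the diabetes example, `values` would be the normalized glucose or BMI column of the patient records.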
3 Literature Review Yu et al. [1] designed an incremental algorithm for outlier detection in streaming data that uses a sliding window and a kernel function to handle online streaming information; implemented in a CUDA framework on a Graphics Processing Unit (GPU), it reports high throughput when processing continuous streams. According to Cai et al. [13], since outliers are a major factor affecting accuracy in data science, many outlier detection approaches have been proposed for effectively identifying the implicit outliers in static datasets, thereby improving the reliability of the data. Recently, data streams have become the dominant form of data, and the elements in a data stream are not always of equal importance. However, existing outlier detection approaches do not consider these weight conditions; consequently, they are not suitable for processing weighted data streams. According to Qin et al. [14], local outlier techniques are known to be effective for detecting outliers in skewed data, where subsets of the data exhibit different distribution properties. However, existing methods are poorly equipped to support modern high-speed data streams, owing to the high complexity of the detection algorithms and their sensitivity to data updates. Hamlet et al. [15] proposed an incremental adaptation of the Local Outlier Probabilities algorithm, which is commonly used for anomaly detection, allowing it to identify outliers almost instantly in data streams. The strength of the proposed incremental algorithm lies in restricting which points are incrementally inserted into the dataset.
This blocks the anomaly scores of other points from being updated (saving valuable computational time) while introducing only a small amount of error compared with an exact approach. According to Yang et al. [16], their outlier detection algorithm achieves good accuracy in identifying both global and local outliers. However, the algorithm must traverse the entire dataset while computing the local outlier factor of every element, which adds time overhead and makes execution inefficient. Moreover, if the K-distance neighborhood of an
outlier point P contains several outliers that are incorrectly judged by the algorithm to be normal points, then P may be misidentified as a normal point.
4 Proposed Methodology A few decades ago, most research effort was devoted to frequent pattern mining. Hence, outlier detection methods were initially used as noise removal techniques to improve the prediction accuracy of ML algorithms. Later, researchers realized the value of rare/abnormal patterns and started focusing on detecting outliers themselves. Outliers are thus helpful in alerting decision makers to unusual or exceptional events. The ML algorithms employed in practice to detect outliers can be categorized in different ways: density-based, proximity-based, clustering-based, wavelet-based, classification-based, etc. Since this research work makes use of labeled data for outlier detection, it belongs to the target-attribute-value-based (supervised) category of ML algorithms. The steps are as follows: 1. Access the dataset: https://data.world/datasociety/pima-indiansdiabetes-database 2. Load and visualize the dataset.
3. Information of data.
Fig. 2 Outcomes
4. Describe data.
5. Plot outliers in dataset column "Outcome" (0 = negative, 1 = Positive) (Fig. 2).
6. Outlier values percentage (Fig. 3).
7. Perform exploratory data analysis (Figs. 4, 5, 6, 7, 8, 9 and 10).
8. Correlation matrix (Fig. 11).
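Steps 1-8 can be sketched in Python as follows. The column names match the public Pima Indians diabetes dataset; the file name `diabetes.csv` and the 1.5×IQR outlier rule are assumptions, not necessarily the paper's exact choices:

```python
import pandas as pd

# Column names of the public Pima Indians diabetes dataset
COLUMNS = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]

def outlier_percentage(series):
    # IQR rule: values beyond 1.5*IQR from the quartiles count as outliers
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return 100.0 * mask.mean()

def explore(path_or_buffer="diabetes.csv"):
    df = pd.read_csv(path_or_buffer, names=COLUMNS, header=0)   # steps 1-2
    df.info()                                                   # step 3
    print(df.describe())                                        # step 4
    counts = df["Outcome"].value_counts()   # step 5: 0 = negative, 1 = positive
    pct = {c: outlier_percentage(df[c]) for c in COLUMNS[:-1]}  # step 6
    corr = df.corr()                        # step 8: correlation matrix
    return df, counts, pct, corr
```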
5 Simulation Result The models generated by the ML algorithms are evaluated using Recall, Precision, F-Score, and other metrics (Figs. 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11). The algorithms make use of the outlier-containing, imbalanced diabetes disease data. The performance of the generated models is tabulated in Table 1 to assess the quality of each model.
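The Precision, Recall, F1-Score, and Support columns of Table 1 can be computed per class with scikit-learn; the label arrays below are illustrative placeholders, not the paper's predictions:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # placeholder ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # placeholder model predictions

# One array entry per class (class 0, class 1), matching Table 1's columns;
# index 1 gives the positive-class figures.
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
```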
Fig. 3 Outcomes percentage
Fig. 4 Attributes for diabetes disease
Fig. 5 Plot for pregnancies versus density
Fig. 6 Insulin for diabetes and no diabetes disease
Fig. 7 Plot for age versus outcomes
Fig. 8 Plot for BP versus outcomes
Fig. 9 Plot for BP versus age
Fig. 10 Diabetes pedigree function
Fig. 11 Correlation of attributes
Table 1 Simulation result

ML model (proposed, N-estimators) | Accuracy | Precision | Recall | F1-Score | Support
Naïve Bayes | 71 | 67 | 82 | 74 | 68
SVM | 70 | 66 | 81 | 73 | 68
LR | 78.9 | 80 | 89 | 84 | 75
DT | 100 | 100 | 100 | 100 | 100
RFC | 100 | 100 | 100 | 100 | 100
K-NN | 81.25 | 84 | 87 | 85 | 75
XG-Boost | 100 | 100 | 100 | 100 | 125
Table 2 Cross validation result

S. No | CrossVal means | CrossVal errors | Algorithm
0 | 0.783091 | 0.035100 | Logistic regression
1 | 0.701270 | 0.046378 | Decision tree
2 | 0.750121 | 0.048499 | Random forest
3 | 0.757139 | 0.053968 | K_Neighbors
4 | 0.772626 | 0.052496 | XG boost
The data are prepared for training and testing in a ratio of 75:25. The cross-validation results are presented in Table 2.
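The 75:25 split and the cross-validation summary behind Table 2 can be sketched as follows; synthetic data stands in for the diabetes features, and the fold count of 5 is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the eight Pima feature columns
X, y = make_classification(n_samples=400, n_features=8, random_state=42)

# 75:25 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# "CrossVal means" and "CrossVal errors" as reported in Table 2
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
cv_mean, cv_error = scores.mean(), scores.std()
```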
The proposed work trains the N-estimator ML models and compares them with the algorithm of the base paper.
(Figures show the training and testing accuracy and the K-fold cross-validation scores.)
6 Conclusion Outliers are rare patterns that provide useful alerting information for decision-making. Different types of outliers, such as collective, global, and contextual outliers, convey different information. Existing ML algorithms are incapable of detecting all types of outliers from a skewed database (due to class imbalance in the training set) in a single pass. Contextual outliers in particular are the most useful, yet they are misclassified by most ML algorithms. Using an ensemble of ML algorithms to address this issue would be time-consuming and not cost-effective. Hence, there is a need for a hybrid ML algorithm that predicts all types of outliers.
References
1. K. Yu, W. Shi, N. Santoro, Designing a streaming algorithm for outlier detection in data mining—an incremental approach. Sensors (MDPI) (2020)
2. K. Yu, W. Shi, N. Santoro, X. Ma, Real-time outlier detection over streaming data, in Proceedings of the IEEE Smart World Congress (SWC 2019), Leicester, UK, 19–23 August 2019
3. Q. Wang, Z. Luo, J. Huang, Y. Feng, Z. Liu, A novel ensemble method for imbalanced data learning: bagging of extrapolation-SMOTE SVM. Comput. Intell. Neurosci. 2017, 1827016 (2017)
4. C. Tantithamthavorn, A. Hassan, K. Matsumoto, The impact of class rebalancing techniques on the performance and interpretation of defect prediction models. arXiv (2018), arXiv:1801.10269
5. K. Yu, W. Shi, N. Santoro, Designing a streaming algorithm for outlier detection in data mining—an incremental approach, in Proceedings of the 5th IEEE Smart World Congress (SWC 2019), Leicester, UK, 19 August 2019
6. A. Karale, M. Lazarova, P. Koleva, V. Poulkov, A hybrid PSO-MiLOF approach for outlier detection in streaming data, in Proceedings of TSP 2020, IEEE (2020)
7. O. Alghushairy, R. Alsini, T. Soule, X. Ma, A review of local outlier factor algorithms for outlier detection in big data streams. Big Data Cogn. Comput. (2021)
8. Z.-M. Wang, G.-H. Song, C. Gao, An isolation-based distributed outlier detection framework using nearest neighbor ensembles for wireless sensor networks. IEEE Access 7, 96319–96333 (2019)
9. S. Rajendran, W. Meert, V. Lenders, S. Pollin, SAIFE: unsupervised wireless spectrum anomaly detection with interpretable features, in Proceedings of the IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), 22–25 October 2018
10. T. Cooklev, V. Poulkov, I. Iliev, D. Bennett, K. Tonchev, Enabling RF data analytics services and applications via cloudification. IEEE Aerosp. Electron. Syst. Mag. 33(5–6), 44–55 (2018)
11. O. Alghushairy, R. Alsini, X. Ma, T. Soule, Improving the efficiency of genetic-based incremental local outlier factor algorithm for network intrusion detection, in Proceedings of the 4th International Conference on Applied Cognitive Computing, Las Vegas, NV, USA, 27–30 July 2020
12. N. Reunanen, T. Räty, J.J. Jokinen, T. Hoyt, D. Culler, Unsupervised online detection and prediction of outliers in streams of sensor data. Int. J. Data Sci. Anal. 9, 285–314 (2019)
13. S. Cai, Q. Li, S. Li, G. Yuan, R. Sun, An efficient maximal frequent-pattern-based outlier detection approach for weighted data streams. Inf. Technol. Control 48, 505–521 (2019)
14. X. Qin, L. Cao, E.A. Rundensteiner, S. Madden, Scalable kernel density estimation-based local outlier detection over large data streams, in EDBT 2019, pp. 421–432
15. C. Hamlet, J. Straub, M. Russell, S. Kerlin, An incremental and approximate local outlier probability algorithm for intrusion detection and its evaluation. J. Cyber Secur. Technol. 1, 75–87 (2017)
16. P. Yang, D. Wang, Z. Wei, X. Du, T. Li, An outlier detection approach based on improved self-organizing feature map clustering algorithm. IEEE Access 7, 115914–115925 (2019)
Prediction of Disease Diagnosis for Smart Healthcare Systems Using Machine Learning Algorithm Nidhi Sabre and Chetan Gupta
Abstract In the field of clinical diagnosis, machine learning (ML) techniques are widely adopted for prediction and classification tasks. The aim of ML techniques is to classify disease more accurately and efficiently for diagnosis. There is constant growth in patient life-care machines and systems, and this growth increases people's average lifespan. However, these healthcare systems face several challenges and issues, such as misleading patient data, data security, lack of accurate data, lack of medical information, classifiers for prediction, and many more. The aim of this study is to propose an ML-based model to diagnose patients with diabetes and heart disease in smart hospitals. In this sense, it was emphasized that characterizing the role of ML models is important to advances in the smart hospital environment. The accuracy of diagnosis (classification) based on laboratory findings can be improved through lightweight ML models. Three ML models, namely Support Vector Machines (SVM), Decision Tree (DT), and Gradient Boosting (GB), are trained and tested on laboratory datasets. Three main scenarios for diabetes and heart disease diagnosis are presented, based on the original and normalized datasets and on feature selection. The proposed ML-based model can serve as a clinical decision support system. Keywords Smart healthcare systems · Machine learning · Diabetes disease · Heart disease
1 Introduction For many years there has been steady development in patient life-care machines and systems, and this development increases the average lifespan of individuals. However, these healthcare systems face several challenges and issues such as
N. Sabre · C. Gupta (B) Department of Computer Science and Engineering, SIRT, Bhopal, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_7
N. Sabre and C. Gupta
misleading patient data, data security, lack of accurate data, lack of medical information, classifiers for prediction, and many more. To handle these issues, a large number of disease diagnosis and prediction systems have been developed, such as expert systems, clinical prediction systems, decision support systems, and personal health record systems [1–3]. The aim of these systems is to help specialists and physicians diagnose diseases accurately. Disease diagnosis can be described as identifying the symptoms of the disease more precisely; once the symptoms are recognized correctly, curing the disease is straightforward. However, it is observed that these clinical systems require large processing power and resources, and that they are computationally intensive. Diabetes is one of the major non-communicable diseases throughout the world [4] and will affect 300 million people by 2030 [5]. Many researchers address the problem of missing values in clinical data, either by identifying the missing values and deleting the corresponding data instances from the dataset, or by adopting default strategies such as the mean, median, or nearest neighbor to fill the missing values. However, both strategies fail to produce optimal results. Moreover, outliers are also present in the data and degrade classifier performance. Few researchers have focused on outlier detection in clinical datasets, and it has not been fully explored to date. This work considers both well-known data issues, i.e., (i) missing value imputation and (ii) outliers. The missing value imputation problem is addressed through a K-Means++-based data imputation technique.
This technique also validates the data through clustering and computes values for the missing data. Diabetes is a condition in which the glucose level in the human body is too high and the body does not produce enough insulin to control the glucose level [6]. As noted above, healthcare systems face several challenges, such as misleading patient data, data security, lack of accurate data, lack of medical information, and classifiers for prediction. To handle these issues, a large number of disease diagnosis and prediction systems have been developed, such as expert systems, clinical prediction systems, decision support systems, and personal health record systems [7]. The aim of these systems is to assist specialists and physicians in the accurate diagnosis of diseases. Disease diagnosis can be described as identifying the symptoms of the disease precisely; once the symptoms are identified correctly, curing the disease is straightforward.
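A hedged sketch of the K-Means++-based imputation idea described above: fill missing values with column means, cluster, then replace each missing entry with the corresponding centroid value. The cluster count and the two-pass scheme are assumptions, not necessarily the authors' exact method:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_impute(X, n_clusters=3, seed=42):
    """Fill NaNs with column means, cluster with k-means++, refine with centroids."""
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])    # initial fill
    km = KMeans(n_clusters=n_clusters, init="k-means++",
                n_init=10, random_state=seed).fit(X)
    centroids = km.cluster_centers_[km.labels_]              # per-row centroid
    X[missing] = centroids[missing]                          # refined fill
    return X
```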
2 Heart Disease Heart disease (HD) refers to conditions that affect the functioning of the heart and blood vessels. This section will discuss the various heart diseases, symptoms of heart disease, and risk factors for heart disease [8].
Prediction of Disease Diagnosis for Smart Healthcare Systems …
Table 1 Types of HD

S. no | Heart disease | Description
1 | Coronary artery disease | Blockage occurs in arteries that supply oxygenated blood to the heart
2 | Cerebrovascular disease | Blockage occurs in blood vessels that supply blood to the brain
3 | Peripheral artery disease | Blockage occurs in arteries that supply blood to the limbs
4 | Congenital heart disease | Birth defects affect the heart due to improper development of the structure in the fetus
5 | Heart failure | The heart becomes weak and is unable to pump blood through the circulatory system
6 | Arrhythmia | The rate of heartbeat is not normal
There are several types of HD, as shown in Table 1. The various risk factors that contribute to HD are discussed below: i. Gender Men have a higher risk of CAD, while women have a higher risk of cerebrovascular disease. Women have a lower risk of CAD because of the cardioprotective effects of estrogen. Estrogen raises HDL cholesterol levels and lowers LDL levels, thereby reducing the risk of coronary artery disease. Estrogen decreases with menopause; hence, the risk of CAD in women increases with age [9]. ii. Age The risk of CAD increases with age. However, this factor is more relevant for women than for men. Cholesterol and blood pressure increase with age, and in women the increase in blood pressure with age is greater than in men. Age is a significant risk factor for heart disease, but its effects can be reduced with a better lifestyle [10]. iii. Obesity Obesity refers to an increase in body fat to a level that harms health. One cause of obesity can be a low metabolic rate, due to which fat accumulates in the body; another can be an improper lifestyle, such as an unhealthy diet. Obesity is not directly related to disease, but it does increase other risk factors. Obesity affects cholesterol and blood pressure, which are among the main causes of heart disease [11]. iv. High blood pressure If the blood pressure in the vessels becomes high, the arteries begin to narrow. In narrowed arteries, there is a higher chance of plaque accumulation, which increases the risk of HD [12].
v. Smoking Smoking constricts blood vessels and increases the viscosity of the blood, which leads to the formation of blood clots. Smoking reduces oxygen in the blood; due to less oxygen in the blood, the heart needs more effort to pump the blood. Smoking also causes high blood pressure, which is another risk factor [13]. vi. Diabetes People who have diabetes have a higher risk of developing HD. The insulin hormone is produced by the pancreas, and the body's cells need insulin to use sugar from food. Diabetes occurs when the pancreas does not produce enough insulin. Elevated levels of glucose in the blood can cause hardening of the blood vessels, which increases the risk of HD [14].
3 Diabetes Mellitus Diabetes mellitus (DM) is a chronic disease and a major global public health challenge. It occurs when the body is unable to respond to or produce enough insulin, which is needed to keep the glucose level under control [15]. Diabetes can be controlled with the aid of insulin injections, a healthy diet, and regular exercise, but no complete cure is available. Diabetes leads to various other ailments, for instance, visual impairment, stroke, heart disease, kidney disease, and nerve damage. Severe diabetes may also substantially raise the risk of death. There are three main types of diabetes mellitus, as follows [16]: 'Type 1 DM' is also called 'Insulin-Dependent DM' (IDDM) or 'juvenile diabetes'. In this case, the pancreas is unable to produce insulin. The root cause of this type is unknown; approximately 10% of diabetes patients are affected by Type 1 DM. Children are most commonly affected by this type. It can be managed only with insulin injections [17]. 'Type 2 DM' is also referred to as 'Non-Insulin-Dependent DM' (NIDDM) or 'adult-onset diabetes'. The pancreas does not make enough insulin, or the body's cells fail to respond to the insulin produced [18, 19]. The most common causes are excess weight and insufficient exercise. It may be treated by proper exercise, diet, and medication, with or without insulin. Around 90% of diabetes patients are of this type.
4 Proposed Methodology In machine learning, a model is built from the training dataset using a classification algorithm. This learning can be categorized into three possible
classification paradigms. In supervised learning, labeled data are present from the beginning. In semi-supervised learning, some of the class labels are known, whereas in unsupervised learning there are no class labels for the dataset. Once the training phase is finished, features are extracted from the data based on term frequency, and then the classification technique is applied. The classifiers utilized here are SVM, DT, and GB (Fig. 1). Algorithm steps:
Fig. 1 Flow chart of the proposed methodology
Input: training set D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}; loss function L(y, F(x))
Where: L(y, F(x)) is the differentiable loss function.
Begin
Initialize F_0(x) = argmin_γ Σ_i L(y_i, γ)
For m = 1 to M:
  Train weak learner h_m(x) on the training data
  Calculate the multiplier γ_m on the training data
  Update F_m(x) = F_{m−1}(x) + γ_m h_m(x)
End for
End
Output: F_M(x)
DT: A DT is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way of displaying an algorithm that contains only conditional control statements. DTs are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but they are also a popular tool in ML. GB: The GB algorithm is one of the most powerful algorithms in the field of machine learning. The errors of machine learning algorithms are broadly classified into two categories, namely bias error and variance error. As gradient boosting is one of the boosting algorithms, it is used to minimize the bias error of the model. SVM: In ML, SVMs are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. To separate the two classes of data points, there are many possible hyperplanes that could be chosen. The objective is to find a plane with the maximum margin, i.e., the maximum distance between data points of the two classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
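Training the three classifiers compared in this paper can be sketched with scikit-learn; a synthetic dataset stands in for the heart/diabetes features (which are not bundled here), and default hyperparameters are used rather than the paper's tuned settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the disease records (13 features, as in many heart datasets)
X, y = make_classification(n_samples=500, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "SVM": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Gradient boost": GradientBoostingClassifier(random_state=42),
}
accuracy = {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
            for name, m in models.items()}
```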
Steps for DT, GB, and SVM:
Step 1: Import the libraries and packages.
Step 2: Initialize the parameters.
Step 3: Read the paths of the input files and initialize the output data.
Step 4: Pre-process the heart disease data to give it as input to the model.
Step 5: Convert the heart and diabetic disease data to matrix form; flatten each record into an array vector.
Step 6: Assign the labels to the heart and diabetic disease data classes.
Step 7: Shuffle the data to prevent overfitting and to aid generalization during training.
Step 8: Separate the training data and the test data.
Step 9: Normalize the heart and diabetic disease data.
Step 10: Define a model and its individual layers.
Step 11: Compile the model.
Step 12: Fit the data to the compiled model, i.e., train the model using the initially defined parameters.
Step 13: Plot the accuracy curves of the training process.
Step 14: Print the classification report and confusion matrix of the training process.
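Steps 5-9 above (vectorize, label, shuffle, split, normalize) can be sketched as follows; the shapes and synthetic records are illustrative, not the paper's exact setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder records standing in for the flattened disease data
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 13))          # step 5: each record as an array vector
y = rng.integers(0, 2, size=300)        # step 6: assigned class labels

# Steps 7-8: shuffle and split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=42)

# Step 9: normalize, fitting the scaler on the training data only
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```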
5 Simulation Results Python is a general-purpose programming language and is widely used in many disciplines such as general programming, web development, software development, data analysis, and AI. Python is used for this project because it is very versatile and easy to use, and its documentation and community are very large (Tables 2 and 3).

Table 2 Simulation parameters

N_estimators | 9, 16, 25, 36, 49, 64, 81
n_jobs | −1
random_state | 42, none
Table 3 Training SVM, decision tree, and gradient boost classifiers on heart dataset

Algorithms | Accuracy | Precision | Recall | F score
Romany et al. [1] | 96.16 | 94.30 | 96.38 | –
SVM | 100 | 100 | 100 | 100
Decision tree | 96 | 95 | 96 | 95
Gradient boost | 94 | 100 | 100 | 100
Fig. 2 Load data and convert it into a pandas data frame
Fig. 3 Label encoding for changing string value to Numeric
NumPy is a powerful package that enables scientific computing. It comes with modern functions and can handle N-dimensional arrays, linear algebra, Fourier transforms, and more. NumPy is used everywhere in data analysis and image processing, and many other libraries are built on top of NumPy, which acts as a base stack for them. The graphical representation of the heart dataset is presented in Fig. 7, and the graphical representation of the diabetic dataset is presented in Fig. 8. The results are presented as follows (Figs. 2, 3, 4, 5, 6, 7 and 8):
1. See Fig. 2.
2. See Fig. 3.
3. See Fig. 4.
4. See Fig. 5.
5. See Fig. 6.
6. Remove outliers.
7. Feature selection and data splitting into train and test data in the ratio of 80:20.
8. Choose random forest as the model for a primary data-metrics check.
9. Perform standard scaling.
10. Use the synthetic minority oversampling technique (SMOTE) for oversampling to address data imbalance.
11. Perform cross-validation and tune hyperparameters (Figs. 7 and 8 and Table 4).
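Step 10 uses SMOTE, usually via the third-party imbalanced-learn package. As a dependency-free stand-in, the sketch below equalizes class counts by random oversampling; true SMOTE instead synthesizes new points by interpolating between minority-class neighbors:

```python
import numpy as np

def oversample_minority(X, y, seed=42):
    """Duplicate minority-class rows at random until all classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    Xs, ys = [X], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            extra = rng.choice(idx, size=target - count, replace=True)
            Xs.append(X[extra])
            ys.append(y[extra])
    return np.vstack(Xs), np.concatenate(ys)
```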
Fig. 4 Find Correlation among data
6 Conclusion Most researchers have relied on ML classifiers for developing their classification or forecasting models. In this study, several ML models have been implemented to predict and classify diabetes and heart disease. The classifier attempts to solve two problems: categorizing patients with respect to diabetic types, and predicting diabetic versus non-diabetic cases. The hybrid approaches yield better results than single classifiers. The goal of this work is to evaluate the Pima Indian diabetic and heart datasets using ML algorithms and to classify them. The analysis shows that the proposed work takes less execution time.
Fig. 5 Visualization
Fig. 6 Outlier detection in data
Fig. 7 Graphical representation of heart dataset
Fig. 8 Graphical representation of diabetic dataset

Table 4 Training SVM, decision tree, and gradient boost classifier on diabetic dataset

Algorithms | Accuracy | Precision | Recall | F score
Romany et al. [1] | 97.27 | 96.94 | 98.62 | –
SVM | 94 | 85 | 96 | 95
Decision tree | 91 | 92 | 89 | 91
Gradient boost | 100 | 100 | 100 | 100
References
1. R.F. Mansour, A. El Amraoui, I. Nouaouri, V.G. Díaz, D. Gupta, S. Kumar, Artificial intelligence and internet of things enabled disease diagnosis model for smart healthcare systems. IEEE Access (2021)
2. G. Muhammad, M.S. Hossain, N. Kumar, EEG-based pathology detection for home health monitoring. IEEE J. Sel. Areas Commun. 39(2), 603–610 (2021)
3. A.A. Mutlag, M.K.A. Ghani, M.A. Mohammed, M.S. Maashi, O. Mohd, S.A. Mostafa, K.H. Abdulkareem, G. Marques, I. de la Torre Díez, MAFC: multi-agent fog computing model for healthcare critical tasks management. Sensors 20(7), 1853 (2020)
4. M.S. Hossain, G. Muhammad, Deep learning based pathology detection for smart connected healthcare. IEEE Netw. 34(6), 120–125 (2020)
5. M.S. Hossain, G. Muhammad, A. Alamri, Smart healthcare monitoring: a voice pathology detection paradigm for smart cities. Multimedia Syst. 25(5), 565–575 (2019)
6. V. Krishnapraseeda, M.S. Geetha Devasena, V. Venkatesh, A. Kousalya, Predictive analytics on diabetes data using machine learning techniques, in 7th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 458–463, IEEE (2021)
7. V. Mounika, D.S. Neeli, G.S. Sree, P. Mourya, M.A. Babu, Prediction of type-2 diabetes using machine learning algorithms, in International Conference on Artificial Intelligence and Smart Systems (ICAIS), pp. 167–173, IEEE (2021)
8. I.K. Mujawar, B.T. Jadhav, V.B. Waghmare, R.Y. Patil, Development of diabetes diagnosis system with artificial neural network and open source environment, in International Conference on Emerging Smart Computing and Informatics (ESCI), pp. 778–784, IEEE (2021)
9. C. Saint-Pierre, F. Prieto, V. Herskovic, M. Sepúlveda, Team collaboration networks and multidisciplinarity in diabetes care: implications for patient outcomes. IEEE J. Biomed. Health Inf. 14(1), 319–329 (2020)
10. M.S. Hossain, G. Muhammad, Emotion-aware connected healthcare big data towards 5G. IEEE Internet Things J. 5(4), 2399–2406 (2018)
11. M. Pham, Y. Mengistu, H. Do, W. Sheng, Delivering home healthcare through a cloud-based smart home environment (CoSHE). Future Gener. Comput. Syst. 81, 129–140 (2018)
12. A. Kaur, A. Jasuja, Health monitoring based on IoT using Raspberry Pi, in Proceedings of the International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, pp. 1335–1340 (2017)
13. U. Satija, B. Ramkumar, M. Sabarimalai Manikandan, Real-time signal quality-aware ECG telemetry system for IoT-based health care monitoring. IEEE Internet Things J. 4(3), 815–823 (2017)
14. O.S. Alwan, K. Prahald Rao, Dedicated real-time monitoring system for health care using ZigBee. Healthcare Technol. Lett. 4(4), 142–144 (2017)
15. P. Kakria, N.K. Tripathi, P. Kitipawang, A real-time health monitoring system for remote cardiac patients using smartphone and wearable sensors. Int. J. Telemed. Appl. (2015)
16. G. Villarrubia, J. Bajo, J. De Paz, J. Corchado, Monitoring and detection platform to prevent anomalous situations in home care. Sensors 14(6), 9900–9921 (2014)
17. M.A. Alsheikh et al., Machine learning in wireless sensor networks: algorithms, strategies and applications. IEEE Commun. Surv. Tutorials 16(4), 1996–2018, Fourth Quarter (2014)
18. F.F. Gong, X.Z. Sun, J. Lin, X.D. Gu, Primary exploration in establishment of China's intelligent medical treatment (in Chinese). Mod. Hosp. Manag. 11(2), 28–29 (2013)
19. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in Proceedings of the 25th International Conference on Neural Information Processing Systems, pp. 1097–1105 (2012)
Optimization Accuracy on an Intelligent Approach to Detect Fake News on Twitter Using LSTM Neural Network Kanchan Chourasiya, Kapil Chaturvedi, and Ritu Shrivastava
Abstract Fake identity is a critical problem on social media nowadays. Fake news is spread rapidly by fake identities and bots, which creates a trustworthiness problem on social media. Identifying fake profiles and accounts using soft computing algorithms is necessary to improve the trustworthiness of social media. The Recurrent Neural Network (RNN) categorizes each profile based on training and testing modules. This work focuses on classifying entries as bots or humans according to their extracted features using machine learning. Once the training phase is completed, features are extracted from the dataset based on term frequency, on which the classification technique is applied. The proposed work is very effective in detecting malicious accounts from an imbalanced social media dataset. The system provides maximum accuracy for the classification of fake and real identities on the social media dataset, achieving good accuracy with an RNN long short-term memory (LSTM) network. The classification accuracy improves as the number of folds in cross-validation increases. In the experimental analysis, testing was done on real-time social media datasets; we achieve around 96% accuracy, 100% precision, 99% recall, and a 96% F1 score on the real-time dataset. Keywords Fake news · RNN · LSTM · Accuracy
1 Introduction Social media platforms have gradually become part of our daily lives and incorporate many features of everyday social activity. People play out specific social roles on these platforms, mirroring what they do in real life. K. Chourasiya (B) Sagar Institute of Research and Technology and Science, Bhopal, India e-mail: [email protected] K. Chaturvedi · R. Shrivastava Sagar Institute of Research and Technology, Bhopal, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_8
People carry their behaviours, perceptions, actions, and habits into these systems (the social culture of social media). As the value of such media rises, this phenomenon becomes more critical to research and test. This work gives scholarly attention to the issue by focusing on the social aspects of the most widely used social media applications. Social networking is growing rapidly nowadays, which is crucial for advertising strategies and for celebrities who try to exploit it by expanding their numbers of followers and fans [1]. However, fake accounts, apparently created for organisations or individuals, can destroy their credibility and decrease their numbers of friends and comments [2]. Users still experience false notifications and unnecessary ambiguity with others. Fake profiles of various types create adverse consequences that undermine the advertising and promotion benefits that organisations expect from social media, and they lay the foundation for online harassment [3]. In the online world, users have specific concerns about their privacy. Popular social media sites include Twitter, Google+, YouTube, Instagram, Flickr, Facebook, and Snapchat [4, 5]. There were 823 million people who used social media on their mobile phones each day, up from 654 million in the previous financial quarter. Social networking sites such as Facebook cannot yet offer real-time detection of fake accounts. For semi-technically-savvy users, distinguishing between genuine and fake accounts is impossible. Other big data issues, in particular data collection, managing data streams, and delivering immediate responses to users, must also be addressed while operating on huge volumes of data simultaneously in order to achieve reliable profile recognition performance.
Prior work on fake accounts involves experimentation that considers preventive measures against fake user behaviour [6]. A Facebook social media research project using the Google Maps API analysed quantifiable information such as the number of a person's friends, data access reports, clustering over mutual acquaintances, details about education and work, the location of Facebook users, and common interests. Safety measures to shield users from attackers include knowledge about data, privacy laws, techniques to enhance security, and awareness-raising training [7]. This research also tackles the challenge of the increasing volume, velocity, and variety of data. Figure 1 shows the flow of common methods in the detection of genuine and fake profiles [4, 8]. The enormous data gathered by social networks can properly be used to separate fake profiles from genuine ones, rather than serving requests only from genuine profiles. This is vital for non-expert users, teenagers, and children who do not understand the privacy settings. The research should also improve by assisting the user in distinguishing between genuine and fake profiles in real time.
Fig. 1 General process flow in the detection of real or fake profiles on social media
2 Road Map of Methodology First, the framework gathers data from Twitter accounts using the Twitter API, which extracts the data from various tweets or comments recently viewed by users. The fundamental problem is that social media applications cannot distinguish bots or fake accounts. In this study, we employ a combination of NLP and machine learning algorithms to overcome such issues in existing systems [9]. First, we gather data from various social media sources. Once the data has been obtained, it is stored in data stores and dataset files. Because the data is collected from web sources such as Twitter, it can be unstructured, so it is mandatory to pre-process it with a suitable sampling strategy and data filtration techniques [10]. The systematic sampling method is used for data separation, and a bloom filter is used to eliminate misclassified instances. Lexical analysis provides sentence detection and tokenization, respectively; tokenized words are stored in a string array that supports reliable string matching. Additional NLP steps are stop-word removal and lemmatization, which fall under natural language processing. After this cleaning cycle, the framework performs feature extraction techniques such as TF-IDF; co-occurrence correlation is another strategy. Feature selection is performed using various quality thresholds, and the selected features are passed to the classification algorithm. A similar process is followed for data training and testing, respectively. We used various machine learning
and deep learning algorithms such as Fuzzy Random Forest [11], Naive Bayes [12], Support Vector Machine [13, 14], and RNN [15] to classify the entire system. The proposed system can detect suspected entries in a run-time environment and eliminate them using a deep learning algorithm. It also provides the highest accuracy for synthetic as well as real-time streaming data from various sources, and it shows a low error rate on homogeneous and heterogeneous datasets alike.
2.1 Data Collection and Filtration The data is collected from Twitter accounts using a search query; data cleaning and filtration remove null values from records and also remove records associated with misclassified instances or values [16]. Data cleaning can be done interactively through transaction processing or with a systematic sampling technique. Systematic sampling has been used for data filtering; once filtration is done, the result is balanced data in which the misclassified instances of the normalized dataset have been eliminated.
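The systematic sampling step described above can be illustrated with a minimal sketch (the record structure and sampling interval here are assumptions for the example, not taken from the paper):

```python
def systematic_sample(records, interval):
    """Select every `interval`-th record, starting from the first.

    Systematic sampling keeps the data ordered while thinning it to a
    smaller, more balanced subset.
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return records[::interval]

# Hypothetical usage: thin 10 collected tweets down to every 3rd one.
tweets = [f"tweet_{i}" for i in range(10)]
sample = systematic_sample(tweets, 3)
print(sample)  # ['tweet_0', 'tweet_3', 'tweet_6', 'tweet_9']
```
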
2.2 Pre-processing In the pre-processing phase, the following methods have been used: • Tokenization: translate complete sentences into separate words. • Eliminate excessive punctuation and tags. • Stop-word removal: remove common words like ‘the’, ‘is’, etc. that carry no semantic meaning on their own. • Stemming: tokens are reduced to a root form by removing inflection, typically by stripping a suffix. • Lemmatization: this approach removes inflection by determining the part of speech and consulting a comprehensive NLP linguistic dataset. This process is also called lemma feature generation in NLP.
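A minimal sketch of the first pre-processing steps above (the stop-word list and suffix rules are deliberately tiny illustrations, not the NLP resources the paper uses):

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "s")  # naive stemming rules for this sketch only

def preprocess(sentence):
    # Tokenization: lowercase and split on non-alphanumeric characters,
    # which also strips punctuation and tags.
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Naive stemming: strip a known suffix if the stem stays long enough.
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The bots are spreading fake news in the network!"))
```

A full pipeline would use a real stemmer/lemmatizer backed by a linguistic resource, but the flow of tokenize → filter → reduce is the same.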
2.3 Feature Extraction Various techniques have been used to extract features such as TF-IDF, correlation co-occurrence, relational features, and dependency features from the entire dataset [17, 18].
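The TF-IDF weighting mentioned above can be computed directly from tokenized documents; a self-contained sketch (the toy documents are illustrative):

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute a TF-IDF weight for every term in every tokenized document.

    TF is the term's relative frequency within a document; IDF is
    log(N / document-frequency), so terms appearing everywhere score low.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["fake", "news", "alert"], ["real", "news", "today"]]
w = tf_idf(docs)
# "news" appears in both documents, so its IDF (and hence weight) is 0.
```
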
2.4 Feature Selection Once feature extraction has been done, we optimized the feature set using a few quality thresholds; this step is called feature selection [4, 6]. The weighted term frequency technique has been used to optimize the features, which are then forwarded to the training module.
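Threshold-based selection over weighted term frequencies can be sketched as follows (the weights and threshold are hypothetical values for illustration):

```python
def select_features(weights, threshold):
    """Keep only the terms whose weight exceeds a quality threshold.

    `weights` maps term -> weight (e.g. a TF-IDF score); the surviving
    terms form the optimized feature set passed to the training module.
    """
    return {term for term, w in weights.items() if w > threshold}

weights = {"fake": 0.35, "news": 0.0, "alert": 0.12}
print(select_features(weights, 0.1))  # {'fake', 'alert'}
```
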
2.5 Classification Finally, the framework recognises each entry as either fake or genuine using a supervised classification technique. In addition, the framework also demonstrates fake news classification on social media datasets. In this work, we used RNN and LSTM as supervised classification algorithms. Supervised machine learning is then applied to train the classifier; class-labelled data is available at the start. A deep learning RNN is applied to detect identity deception in social networks, in which multiple decision trees are created using randomly selected features from the feature set, and the majority output class across all decision trees is taken as the result of the RNN [19]. The results have been evaluated using the confusion matrix, and the F1 score and PR-AUC are produced [9].
3 Proposed Methodology A conventional neural network operates like a black box in which decisions are made on the basis of given inputs. It uses static memory, in the form of weights, to store information about learning experiences. To give an explicit representation of memory in RNNs, the LSTM network was introduced. The memory unit is described as a ‘cell’ in the network; these models are variants of RNNs and are best suited to sequential data. In this proposed research, we examine the effectiveness of LSTM for sentiment classification of short texts with a distributed representation in social media using an evolutionary algorithm. The working principle is given in Fig. 3. As per Fig. 2, the LSTM unit takes three inputs: ‘St’, ‘At-1’, and ‘Pt-1’. ‘St’ is the input vector for the current time step; ‘At-1’ is the output or hidden state passed from the previous LSTM unit; and ‘Pt-1’ is the memory element or cell state of the previous unit. It has two outputs, ‘At’ and ‘Pt’, where ‘At’ is the output of the current unit and ‘Pt’ is the memory element of the current unit. Each decision is made after considering the current input, the previous output, and the previous memory. Once the current output is obtained, the memory is updated. The ‘S’ indicates the ‘forget’ element of the multiplication. When the value of the forget element is ‘0’, the unit forgets a
Fig. 2 Working of LSTM
Fig. 3 Flowchart of proposed methodology
lot of old memory. For all other values, a fraction of the old memory is allowed through by the unit. The steps are as follows (Figs. 4, 5, 6, 7, 8, 9, and 10):

1. Load libraries and the dataset
2. Process the data with a pandas data frame
3. Perform EDA. As we can see, the classes are imbalanced, so some kind of resampling could be considered; we will study it later, although it does not appear to be necessary.
4. Data pre-processing
Fig. 4 Process data with pandas data frame
Fig. 5 Perform EDA
Fig. 6 Data pre-processing
Fig. 7 Apply stop words
Fig. 8 Stemming and lemmatization
Now we are going to engineer the data to make it easier for the model to classify; this step is very important for reducing the dimensionality of the problem.

5. Apply stop words
6. Stemming and lemmatization
7. Encoding target classes
8. Token visualization
9. Vectorization of the data
Fig. 9 Encoding target classes
Fig. 10 Token visualization
10. Splitting the data for training and testing
11. Perform count vectorization
12. Perform TF-IDF
13. Word embedding using GloVe.
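The gating behaviour described for Fig. 2 can be written compactly. The following is a standard LSTM formulation added for clarity; St, At, and Pt follow the paper's notation, while the weight matrices W and biases b are the usual LSTM parameters (not given in the paper):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[A_{t-1}, S_t] + b_f\right) && \text{forget gate}\\
i_t &= \sigma\!\left(W_i\,[A_{t-1}, S_t] + b_i\right) && \text{input gate}\\
\tilde{P}_t &= \tanh\!\left(W_P\,[A_{t-1}, S_t] + b_P\right) && \text{candidate memory}\\
P_t &= f_t \odot P_{t-1} + i_t \odot \tilde{P}_t && \text{updated cell state}\\
o_t &= \sigma\!\left(W_o\,[A_{t-1}, S_t] + b_o\right) && \text{output gate}\\
A_t &= o_t \odot \tanh(P_t) && \text{current output}
\end{aligned}
```

When the forget gate $f_t$ is 0, the old memory $P_{t-1}$ is discarded entirely, matching the ‘forget’ behaviour described above.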
4 Simulation Results Based on the confusion matrix, the precision (Table 1), recall, F-measure, and accuracy have been calculated (Figs. 11, 12, 13 and 14) to evaluate the performance of the proposed model. Precision (P) is the ratio of correctly classified positive instances to the total number of instances predicted as positive. Recall (R) is the ratio of correctly classified positive instances to the total number of actual positive instances. Accuracy refers to the ratio of correctly predicted instances to the total number of instances. F-measure is the harmonic mean of precision and recall. Since the F-measure summarises both precision and recall, it is considered more stable for evaluating a classifier's performance than the former two measures. The results of GloVe-LSTM are presented in Table 2. GloVe-LSTM for the Twitter Disaster dataset (Figs. 11 and 12):

1. Load the dataset and make a pandas data frame
2. EDA
3. Cleaning data and text
4. Remove stop words (Figs. 15 and 16).
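The four metrics above follow directly from confusion-matrix counts; a minimal sketch (the TP/FP/FN/TN values are illustrative, not the paper's results):

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F-measure, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy

p, r, f, a = metrics(tp=90, fp=10, fn=5, tn=95)
print(round(p, 2), round(r, 2), round(f, 2), round(a, 2))
```
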
5 Conclusion The performance of the system varies with the classification technique and dataset used. Furthermore, the experimental results show that the proposed approach achieved satisfactory results when evaluated against current state-of-the-art methods. We demonstrate an innovative way to detect fake accounts on Online Social Networks (OSNs). As the production of fake profiles has become more sophisticated, detection systems have had to become much more efficient, and the variables that current systems depend on are unpredictable. We used balancing factors in this analysis, such as pre-processing, data

Table 1 Modeling accuracy on spam dataset
Algorithm        Train accuracy    Test accuracy
Multinomial NB   97                95
XGBoost          98                96
LSTM             98                97
BERT             99                98
Fig. 11 Loss versus validation loss
Fig. 12 Accuracy
sampling, data filtration, training, and testing modules to improve the predictions’ precision.
Fig. 13 Load dataset and make pandas data frame
Fig. 14 EDA
Table 2 Modeling GloVe-LSTM

Algorithm    Accuracy    Precision    Recall    F1 score
GloVe-LSTM   96          100          99        96
Fig. 15 Loss versus validation loss
Fig. 16 Accuracy
References 1. E. Elsaeed, O. Ouda, M.M. Elmogy, A. Atwan, E. El-daydamony, Detecting fake news in social media using voting classifier. IEEE Access (2021) 2. H. Saleh, A. Alharbi, S.H. Alsamhi, OPCNN-FAKE: optimized convolutional neural network for fake news detection. IEEE Access 9, 129471–129489 (2021) 3. K. Chaturvedi et al., A fuzzy inference approach for association rule mining. IOSR J. Comput. Eng. 16, 57–66 (2014) 4. S. Hakak, M. Alazab, S. Khan, T.R. Gadekallu, P.K.R. Maddikunta, W.Z. Khan, An ensemble machine learning approach through effective feature extraction to classify fake news. Future Gener. Comput. Syst. 117, 47–58 (2021) 5. D. Keskar, S. Palwe, A. Gupta, Fake news classification on Twitter using Flume, n-gram analysis, and decision tree machine learning technique. Springer, Singapore, pp. 139–147 (2020) 6. Y. HaCohen-Kerner, D. Miller, Y. Yigal, The influence of preprocessing on text classification using a bag-of-words representation. PLoS ONE 15(5), 1–22 (2020) 7. P. Ksieniewicz, M. Choraś, R. Kozik, M. Woźniak, Machine learning methods for fake news classification, in Intelligent Data Engineering and Automated Learning (Lecture Notes in Computer Science), vol. 11872. Manchester, U.K.: Springer (2019) 8. Y. Madani, M. Erritali, B. Bouikhalene, Using artificial intelligence techniques for detecting COVID-19 epidemic fake news in Moroccan tweets. Results Phys. 25, Art. no. 104266 (2021) 9. J.A. Nasir, O.S. Khan, I. Varlamis, Fake news detection: a hybrid CNN-RNN based deep learning approach. Int. J. Inf. Manage. Data Insights 1(1) (2021) 10. M.S. Hossain, G. Muhammad, Emotion-aware connected healthcare big data towards 5G. IEEE Internet Things J. 5(4), 2399–2406 (2018) 11. A. Kaur, A. Jasuja, Health monitoring based on IoT using Raspberry Pi, in Proceedings of the International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, pp. 1335–1340 (2017) 12. O.S. Alwan, K. Prahald Rao, Dedicated real-time monitoring system for health care using ZigBee. Healthcare Technol. Lett. 4(4), 142–144 (2017) 13. U. Satija, B. Ramkumar, M. Sabarimalai Manikandan, Real-time signal quality-aware ECG telemetry system for IoT-based health care monitoring. IEEE Internet Things J. 4(3), 815–823 (2017) 14. K. Chaturvedi, R. Patel, D.K. Swami, An efficient binary to decimal conversion approach for discovering frequent patterns. Int. J. Comput. Appl. 75(12), 29–34 (2013) 15. G. Villarrubia, J. Bajo, J. De Paz, J. Corchado, Monitoring and detection platform to prevent anomalous situations in home care. Sensors 14(6), 9900–9921 (2014) 16. F.F. Gong, X.Z. Sun, J. Lin, X.D. Gu, Primary exploration in establishment of China's intelligent medical treatment (in Chinese). Mod. Hosp. Manag. 11(2), 28–29 (2013) 17. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, Y. Singer, Online passive-aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006) 18. K. Chaturvedi, R. Patel, D.K. Swami, An inference mechanism framework for association rule mining. Int. J. Adv. Res. Artif. Intell. 3(9) (2014) 19. K. Chaturvedi, R. Patel, D.K. Swami, Fuzzy c-means based inference mechanism for association rule mining: a clinical data mining approach. Int. J. Adv. Comput. Sci. Appl. 6(6), 103–110 (2015)
Advance Computing
Mining User Interest Using Bayesian-PMF and Markov Chain Monte Carlo for Personalised Recommendation Systems Bam Bahadur Sinha and R. Dhanalakshmi
Abstract It is easy and beneficial to use low-rank matrix approximation techniques in collaborative filtering systems. Models of this kind are often fit by obtaining MAP estimates of the parameters, a technique that is efficient even for exceptionally large datasets. This method, however, is tedious and prone to overfitting unless the regularisation parameters are properly adjusted, since it finds only a single point estimate of the parameters. To manage model capacity automatically, this research work integrates out all model hyperparameters and parameters in the Bayesian approach to probabilistic matrix factorization (PMF). The MovieLens-100K dataset, which contains over 100K movie ratings, shows that Bayesian-PMF can be effectively trained using the MCMC (Markov chain Monte Carlo) technique. The proposed model achieves better efficacy as compared to PMF-MAP models. Keywords Recommender system · Matrix factorization · Collaborative filtering · Probabilistic model · MAP
1 Introduction In collaborative filtering, low-dimensional factor models are one of the most prevalent methods. They are based on the notion that a limited number of unseen variables influence a user’s opinion or preferences. User preferences are represented in linear factor models by linearly mixing item factors using user coefficients. According to [1] in the case of [X] users and [Y] movies, the preference matrix [Z ](X ×Y ) is obtained B. B. Sinha (B) Computer Science and Engineering, Indian Institute of Information Technology Ranchi, Ranchi, Jharkhand, India e-mail: [email protected] R. Dhanalakshmi Computer Science and Engineering, Indian Institute of Information Technology Tiruchirappalli, Tiruchirappalli, Tamil Nadu, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_9
by multiplying a user coefficient matrix ‘U T ’ by the factor matrix V(D×Y ) . Under the defined loss function, developing such a model involves determining the best rank-D estimate of the specified matrix [Z]. In recent years, a number of probabilistic models have been proposed [2–4]. We can think of them as graphs in which the hidden factors have direct connections with user ratings. Due to the intractable nature of exact inference, the posterior distribution over hidden factors in such models must be computed using potentially slow approximations. Rating variables and factor variables are considered to be conditionally and marginally independent in these models, respectively. These models have a major flaw: inferring the distribution (posterior) of a factor given the ratings is a challenging task. For example, several of the current techniques use MAP estimation [5] to estimate the parameters of the model. As a result of this, even extremely large datasets may be effectively trained by maximising the log-posterior over the model parameters. SVD (Singular Value Decomposition) [6] may be used to find low-rank estimates based on reducing the sum-squared variation; Z = U T V of the defined rank is found by using the SVD algorithm. Most of the entries in [Z] will be missing since real-world datasets are often sparse. As a result, only the observed elements of the target matrix [Z] are taken into account when computing the sum-squared variation. Srebro and Jaakkola [7] show that this seemingly minor change leads to a complex non-convex optimization problem that cannot be addressed with conventional SVD implementations. By penalising the U and V norms, Shanthakumar et al. [8] propose a method to constrain the rank of the approximation matrix [Z = U T V ]. In this paradigm, however, learning involves solving a sparse semidefinite programme (SDP), which makes this method problematic for real datasets that contain millions of entries.
For two reasons, none of the methods described above has been very effective. (i) With the exception of matrix-factorization-based methods, none of these approaches scales effectively to large datasets. (ii) Most algorithms have difficulty generating accurate predictions for users with relatively few ratings. People who work on collaborative filtering often delete individuals who have fewer than a certain number of ratings. Because the most challenging instances have been eliminated, results published on common datasets, such as MovieLens-100K, seem remarkable. This paper provides an analysis of Probabilistic Matrix Factorization (PMF) [9] using a Bayesian approach [10]. Markov chain Monte Carlo (MCMC) [11] techniques are used for approximate inference in the Bayesian-PMF model. As a result of their apparent slowness, MCMC techniques are seldom used for large-scale problems in practice. As demonstrated in this paper, the MovieLens-100K dataset, which contains over 100K ratings, can be effectively analysed using MCMC. Additionally, the model's prediction accuracy has increased significantly, particularly for infrequent users, compared to the conventional PMF-MAP model with regularisation parameters that have been properly tuned on the validation set. In the past, variational approximations have been utilised to obtain the inference whilst using Bayesian-PMF techniques for collaborative filtering. A simplified, factorised distribution, in which user factors are independent of item/movie factors, is used to estimate the true posterior distribution. A variational posterior distribution [12] is a combination of two multivariate Gaussians: (i) one for the user factors, and (ii) one for the item/movie factors. As
per our experiments, such models have non-Gaussian factor distributions. Bayesian-PMF models beat their PMF-MAP counterparts by a considerably greater margin than variational models. The following are the major contributions of our research work: • Increased significance of Bayesian-PMF by introducing Markov chain Monte Carlo for efficient training. • Cutting back the sparsity of the MovieLens-100K dataset using a regression imputation technique. • Testing the Bayesian-PMF model after applying other baseline sparsity reduction techniques such as the uniform random baseline, global mean baseline, and mean of means baseline. • Monitoring the stability and convergence of latent variables via graphical representation for Bayesian-PMF and PMF-MAP. The remaining sections of the paper are structured as follows: Sect. 2 emphasises the related work and background. The experimental methodology is discussed in Sect. 3. Section 4 delineates the flow model of the proposed architecture. Section 5 highlights the experimental outcomes and a performance comparison with other baseline models. Finally, the closing section of the paper highlights the conclusion and future directions.
2 Related Work and Background Non-negative factorization [13] and Principal Component Analysis (PCA) [14] are well-known methods for factoring matrices in the machine learning domain. These methods have also been used successfully in the field of recommendation [15]. A matrix representation of items has also been developed using similar methods, where each component X i,j represents the number of times the word wi occurs in the description of item ‘j’. Unlike the collaborative filtering matrix, this matrix has no unknown entries, even though it includes a lot of zeros. Now let us look at how to decompose it. Different approaches that can be used to perform the decomposition include Latent Semantic Analysis (LSA) [16], probabilistic LSA [17], Latent Dirichlet Allocation (LDA) [18], and non-negative matrix factorization [13]. In LSA, by breaking the descriptive matrix into two, latent variables describing the occurrence of each word in the description of each item are taken into consideration. Words are represented in one of these matrices, whereas descriptions of items are represented in the other. But since the elements of these matrices may take any real values, they can be difficult to interpret, as in [19] for example. Non-negative factorization of item descriptions may be used in conjunction with probabilistic LSA [17]. This method differs from LSA in that the elements of the matrices are restricted to positive values. Using this method, we can predict the likelihood that a term will occur in a document's description. The interpretation of these matrices is facilitated by the restriction that their elements are non-negative. Each
matrix may be divided into two parts: the likelihood that a word occurs in each cluster description, and the likelihood that each item corresponds to each item cluster. On the other hand, the Bayesian probabilistic models used in LDA [18] avoid the overfitting problems of probabilistic LSA. This method also estimates the likelihood of the occurrence of a word in the document, just as probabilistic LSA does. Each item is assigned a probability that it is part of a cluster of items, and each word is assigned a probability that it occurs in the description of that cluster of items. All the aforementioned techniques have been widely used for content-based recommender systems and have also been used to improve the accuracy of collaborative filtering systems. Because of the highly sparse nature of the matrix, none of these techniques has yielded promising results in terms of rating prediction accuracy (MAE and RMSE). When using methods like LSA, probabilistic LSA, or LDA, we implicitly fill the matrix with ‘0’ for unrated items, implying that the user dislikes those items. The effectiveness of these methods would therefore be substantially reduced if one attempted to use them for forecasting a user's rating, because such methods are only somewhat effective compared to metrics such as the number of times a user likes an item. Poisson matrix factorization [20] works very well in this context since it can accurately predict how often a user will like an item. This paper provides a successful alternative for the efficient decomposition of the rating matrix after sparsity reduction using a regression imputation technique. The Bayesian framework helps minimise the possibility of overfitting the model.
3 Experimental Methodology This section discusses the different techniques and performance measures used in our research work. Complete testing of the proposed model was done on the MovieLens-100K dataset.
3.1 Probabilistic Matrix Factorization (PMF) Consider a scenario where we have ‘Nu’ users and ‘MI’ items. The rating scale lies in the range [1, R]. Let Z_ij denote the rating received by item ‘j’ from user ‘i’. The latent user and latent item matrices are given by $U \in \mathbb{R}^{D \times N_u}$ and $V \in \mathbb{R}^{D \times M_I}$ respectively, where $U_i$ and $V_j$ represent the feature vectors for user ‘i’ and item ‘j’. The conditional distribution over the predicted ratings is defined by Eq. 1, and the prior distributions over ‘U’ and ‘V’ are given by Eqs. 2 and 3 respectively.

$$P_d(Z \mid U, V, \alpha) = \prod_{i=1}^{N_u} \prod_{j=1}^{M_I} \left[ F\!\left(Z_{ij} \mid U_i^T V_j, \alpha^{-1}\right) \right]^{Q_{ij}} \quad (1)$$
Fig. 1 Graphical representation of PMF
where $F(Z_{ij} \mid U_i^T V_j, \alpha^{-1})$ is a Gaussian distribution with mean $U_i^T V_j$ and precision $\alpha$; $Q_{ij} = 1$ if user ‘i’ rated item ‘j’ and $Q_{ij} = 0$ otherwise; and $P_d$ denotes the probability density function.

$$P_d(U \mid \alpha_U) = \prod_{i=1}^{N_u} F\!\left(U_i \mid 0, \alpha_U^{-1} \mathbf{I}\right) \quad (2)$$

$$P_d(V \mid \alpha_V) = \prod_{j=1}^{M_I} F\!\left(V_j \mid 0, \alpha_V^{-1} \mathbf{I}\right) \quad (3)$$

where $\mathbf{I}$ is the identity matrix.
The objective function given by Eq. 4 is minimised; the minimisation of the SSE (sum of squared errors) is equivalent to maximisation of the posterior distribution (Fig. 1).

$$E = \frac{1}{2} \sum_{i=1}^{N_u} \sum_{j=1}^{M_I} Q_{ij} \left(Z_{ij} - U_i^T V_j\right)^2 + \frac{\lambda_U}{2} \sum_{i=1}^{N_u} \|U_i\|_F^2 + \frac{\lambda_V}{2} \sum_{j=1}^{M_I} \|V_j\|_F^2 \quad (4)$$
where $\|\cdot\|_F^2$ represents the Frobenius norm [21]. A disadvantage of this approach is that ensuring the model generalises effectively, especially on sparse and unbalanced datasets, requires manual complexity adjustment. Searching for suitable values of the regularisation parameters $\lambda_U$ and $\lambda_V$ is one way to manage the model complexity: a model may be trained for each setting over an acceptable range of parameter values, and then the model that excels on a
validation set could be chosen. Due to the fact that a wide variety of models must be trained instead of just one, this method is computationally expensive. One way to control the complexity is by using a Bayesian-PMF that uses MCMC for integrating model parameters and hyperparameters.
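The MAP training loop for Eq. 4 can be sketched with plain gradient descent (a minimal illustration, not the authors' implementation; the toy matrix, learning rate, and regularisation value are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observed rating matrix Z with indicator Q (1 = observed), D = 2 factors.
Z = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0]])
Q = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
D, lam, lr = 2, 0.1, 0.01
U = 0.1 * rng.standard_normal((D, Z.shape[0]))  # latent user factors
V = 0.1 * rng.standard_normal((D, Z.shape[1]))  # latent item factors

for _ in range(2000):
    err = Q * (Z - U.T @ V)            # residuals on observed entries only
    U += lr * (V @ err.T - lam * U)    # gradient step on Eq. 4 w.r.t. U
    V += lr * (U @ err - lam * V)      # gradient step on Eq. 4 w.r.t. V

sse = float(np.sum((Q * (Z - U.T @ V)) ** 2))
```

After training, `U.T @ V` reconstructs the observed entries closely while the regularisation terms keep the factor norms small, which is exactly the trade-off Eq. 4 expresses.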
3.2 Bayesian-PMF with MCMC The predictive distribution, integrating over [U], [V], and the model hyperparameters, is obtained using Eq. 5 [22].

$$P_d(Z_{ij}^{*} \mid Z, \beta_0) = \iint P_d(Z_{ij}^{*} \mid U_i, V_j)\, P_d(U, V \mid Z, \beta_U, \beta_V)\, P_d(\beta_U, \beta_V \mid \beta_0)\, d\{U, V\}\, d\{\beta_U, \beta_V\} \quad (5)$$

where $\beta_0 = \{\mu_0, V_0, W_0\}$, $V_0 = D$, and $W_0$ is the identity matrix. Our only option is to use approximate inference, since the complexity of the posterior makes exact evaluation of this predictive distribution mathematically impossible. Variational methods scale well, but they tend to generate inaccurate results because of overly simplistic approximations. MCMC-based methods, on the other hand, make use of Monte Carlo estimates of the predictive distribution, as represented by Eq. 6.

$$P_d(Z_{ij}^{*} \mid Z, \beta_0) = \frac{1}{K} \sum_{k=1}^{K} P_d(Z_{ij}^{*} \mid U_i^{k}, V_j^{k}) \quad (6)$$

The samples $U_i^{k}$ and $V_j^{k}$ can be generated from the stationary distribution of a Markov chain, which corresponds to the posterior distribution. Gibbs sampling is a simple yet powerful method for Bayesian inference; it is typically used when the conditional distributions can be readily sampled.
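The Monte Carlo estimate in Eq. 6 is just an average of predictions over posterior samples; a sketch of that final averaging step (the samples here are synthetic stand-ins for Gibbs draws, not real posterior samples):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for K posterior samples of U_i and V_j (dimension D) that a
# Gibbs sampler would draw from the posterior in Eq. 5 — synthetic here.
K, D = 200, 5
U_i_samples = 0.6 + 0.05 * rng.standard_normal((K, D))
V_j_samples = 1.0 + 0.05 * rng.standard_normal((K, D))

# Eq. 6: the predictive rating is the mean of U_i^T V_j over the K samples.
pred = float(np.mean(np.einsum("kd,kd->k", U_i_samples, V_j_samples)))
```

Averaging over many posterior samples, rather than using a single MAP point estimate, is what gives Bayesian-PMF its robustness against overfitting.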
3.3 Sparsity Reduction

3.3.1 Uniform Random
Our initial baseline is about as simple as it gets: at each location in Z where a value is absent, we populate it with a number chosen at random from the range [1, 5]. We anticipate that this approach will perform the poorest by far. Mathematically, it is represented by Eq. 7:

$$Z^*_{ij} = \mathrm{Uniform}(1, 5) \qquad (7)$$
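The uniform-random fill of Eq. 7 can be sketched in a few lines; the toy matrix below (with 0 marking a missing rating) is our own example.

```python
import random

random.seed(42)
Z = [[5, 0, 3],
     [0, 4, 0]]          # toy utility matrix; 0 marks a missing rating

# Eq. 7: fill every missing cell with an integer drawn uniformly from [1, 5]
Z_filled = [[z if z != 0 else random.randint(1, 5) for z in row] for row in Z]
```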
Mining User Interest Using Bayesian-PMF and Markov Chain Monte Carlo …
3.3.2 Global Mean
This technique is only somewhat superior to the previous one: wherever there is an absent value, we substitute the mean of all collected scores. The mathematical representation is given by Eq. 8:

$$G = \frac{1}{N_u \cdot M_I}\sum_{i=1}^{N_u}\sum_{j=1}^{M_I} Q_{ij} Z_{ij}, \qquad Z^*_{ij} = G \qquad (8)$$
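The global-mean fill of Eq. 8 is a one-liner once the observed ratings are collected; the toy matrix is again our own example, with the mean taken over observed entries only.

```python
Z = [[5, 0, 4],
     [0, 3, 0]]          # toy utility matrix; 0 marks a missing rating

observed = [z for row in Z for z in row if z != 0]
G = sum(observed) / len(observed)      # Eq. 8: mean of all collected scores
Z_filled = [[z if z != 0 else G for z in row] for row in Z]
```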
3.3.3 Mean of Means
Now we’re going to start improving our intelligence. We believe that certain users may be readily entertained and therefore motivated to give all items a higher rating. Others may be the polar opposite. Additionally, certain items may be more amusing than others, and as a result, all users may score some items higher than others in aggregate. We’ll try to capture these broad changes using user and movie rating data. Additionally, we’ll include the global mean to help smooth things out. Therefore, if a value is lacking in cell Z i j , we will combine the global mean with the means of Ui and I j and then use that result to fill it in. Equation 9 represents the mathematical formula for computing the mean of means rating. Z i∗j =
3.3.4
M N 1 Σ 1 Σ Q i j (Z i j ) + Q i j (Z i j ) + G M I j=1 Nu i=1
(9)
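A sketch of the mean-of-means fill follows. The text says the three means are "combined"; this sketch averages the user mean, the item mean, and the global mean, which is a common choice but our own assumption, as is the fallback to G for users or items with no ratings.

```python
Z = [[5, 0, 4],
     [0, 3, 0],
     [4, 2, 0]]          # toy utility matrix; 0 marks a missing rating

observed = [z for row in Z for z in row if z != 0]
G = sum(observed) / len(observed)

def mean_or(values, fallback):
    """Mean over observed (non-zero) entries, or the fallback if none."""
    vals = [v for v in values if v != 0]
    return sum(vals) / len(vals) if vals else fallback

user_mean = [mean_or(row, G) for row in Z]
item_mean = [mean_or(col, G) for col in zip(*Z)]

# Fill each missing cell by combining the three means (here: their average)
Z_filled = [[z if z != 0 else (user_mean[i] + item_mean[j] + G) / 3
             for j, z in enumerate(row)] for i, row in enumerate(Z)]
```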
3.3.4 Regression Imputation
The regression imputation [23] procedure fits a model to a variable that has missing information; the regression model's predictions then supply forecasts for the missing values. The steps involved in using a linear regression approach to handle missing data in the rating matrix are as follows:

i. A linear regression model is estimated, with predictors 'X' taken into consideration alongside data on the dependent variable 'Y'.
ii. The predicted values of 'Y' are used to forecast the missing cases; these predictions fill in the missing values of 'Y'.

Because the imputed values are based on regression models, the correlations between 'X' and 'Y' are retained. The method demonstrates better efficacy than more straightforward imputation approaches such as zero substitution or mean imputation.
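Steps i and ii can be sketched with a single predictor and closed-form simple linear regression; the data, the choice of one predictor, and all names are our own illustration, since the paper does not specify which predictors it uses.

```python
# Step i: fit Y = a + b*X on the complete cases (closed-form simple regression)
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [2.1, 4.0, 6.1, 7.9, None, None]   # None marks missing values to impute

pairs = [(x, y) for x, y in zip(X, Y) if y is not None]
mx = sum(x for x, _ in pairs) / len(pairs)
my = sum(y for _, y in pairs) / len(pairs)
b = (sum((x - mx) * (y - my) for x, y in pairs)
     / sum((x - mx) ** 2 for x, _ in pairs))
a = my - b * mx

# Step ii: replace each missing Y with the regression prediction
Y_imputed = [y if y is not None else a + b * x for x, y in zip(X, Y)]
```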
3.4 Performance Measure

In this paper, two error measures, Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) [24], are used to test the performance of the proposed model. The RMSE is a quadratic scoring method that indicates the average magnitude of the error; Eq. 10 gives its mathematical formula. In other words, the differences between predicted and actual values are squared, averaged across the sample, and then the square root of the average is taken. Because the errors are squared before being averaged, the RMSE gives a disproportionate weight to large errors, which makes it most helpful when large errors are especially unwelcome. The MAE determines the average magnitude of forecasting errors in a collection of predictions without considering their direction; it quantifies accuracy for continuous variables. In simpler terms, the MAE is the average of the absolute differences between each prediction and the associated observation across the verification sample. Because the MAE is a linear score, all individual differences are weighted equally in the average. Equation 11 gives the mathematical formula for the MAE.

$$RMSE = \sqrt{\frac{\sum (P - A)^2}{N}} \qquad (10)$$

$$MAE = \frac{\sum |P - A|}{N} \qquad (11)$$
where P denotes the predicted rating, A the actual rating, and 'N' the total number of predictions made by the model. The MAE and RMSE may be used in conjunction to diagnose the variance in a set of prediction errors. The RMSE is always greater than or equal to the MAE; the bigger the gap between them, the greater the variance in the sample's individual errors, and if the RMSE equals the MAE, all errors are of the same magnitude. MAE and RMSE both range from 0 to ∞. They are negatively oriented scores: lower values are preferable.
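Equations 10 and 11 translate directly into code; the function names are our own.

```python
import math

def rmse(P, A):
    """Eq. 10: square the errors, average them, then take the square root."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(P, A)) / len(P))

def mae(P, A):
    """Eq. 11: average the absolute errors, ignoring their direction."""
    return sum(abs(p - a) for p, a in zip(P, A)) / len(P)
```

For errors [0, 0, 2], for example, the MAE is 2/3 while the RMSE is sqrt(4/3), illustrating that the RMSE is never below the MAE.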
4 Proposed Architecture

The flow of the proposed architecture can be categorised into four phases, as illustrated in Fig. 2. The first phase transforms the dataset into a proper structural representation, so that the complete rating data can be moulded into the form of a utility matrix: the rows of the utility matrix denote the set of users present in the system and the columns denote the item set. The second phase reduces the sparsity using different baseline techniques, namely Uniform Random, Global Mean, Mean of Means, and regression imputation. After cutting back the sparsity of the dataset, the data is divided into a 70/30 train/test split. In the third phase, the model is checked for training and test error under PMF-MAP and PMF-MCMC. In the fourth phase, the efficacy of the model is tested using the error measures RMSE and MAE; the low RMSE and MAE values obtained by the proposed model indicate superior performance in terms of efficient rating prediction.

Fig. 2 Blueprint of proposed architecture
5 Experimental Outcomes and Investigation

This section highlights the experimental setup and the results obtained at each phase of the proposed model. A performance comparison with other baseline models is also discussed in this section.
Fig. 3 Frequency of rating given by users
5.1 Dataset Description

In this paper, the MovieLens-100K dataset is used. The dataset comprises 943 users, 1682 items, and around 100,000 ratings. Ratings lie in the range [+1, +5], where +1 represents the least interest of a user in an item and +5 represents that the user really liked the item. The overall sparsity of the dataset is around 93.70%. Exploratory data analysis for the MovieLens dataset is illustrated in Figs. 3 and 4.
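The quoted sparsity follows directly from the dataset dimensions; a one-line check:

```python
# Fraction of the 943 x 1682 utility matrix with no rating,
# given ~100,000 observed ratings (MovieLens-100K).
n_users, n_items, n_ratings = 943, 1682, 100_000
sparsity = 1 - n_ratings / (n_users * n_items)
# sparsity ≈ 0.9370, i.e. the ~93.70% stated above
```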
Fig. 4 Top-30 and Bottom-30 movies in dataset
Table 1 Error scores obtained on ML-100K

| Approach                | RMSE    | MAE     |
|-------------------------|---------|---------|
| Uniform random baseline | 1.71238 | 1.40608 |
| Global mean baseline    | 1.13083 | 0.94894 |
| Mean of means baseline  | 1.01743 | 0.83898 |
After transforming the dataset into an appropriate utility matrix structure, we reduced the sparsity using the three baseline methods and checked the resulting error scores; the obtained RMSE and MAE values are given in Table 1. It is quite obvious from these error scores that the model does not perform well enough for real-world applications. After performing the train-test split, we implemented PMF-MAP estimation on the sparse dataset and PMF-MCMC estimation on the newly generated set with a reduced sparsity level, the reduction being done with the regression imputation technique. The results obtained by PMF-MAP and PMF-MCMC for different values of α and dimension (Dim) on the sparse dataset are given in Table 2; higher values of 'Dim' indicate a greater number of latent factors. It can be clearly observed from Table 2 that there is a slight difference between the error values obtained on the training set and on the test set. A slight difference is always expected, but a large difference (for example, very low training error with very high test error) indicates overfitting in the model. We tried to prevent the model from overfitting by testing the model accuracy for different values of α and of the number of latent factors. The optimal α and 'Dim' values used by the PMF-MCMC model after reducing sparsity using regression imputation are 4 and 10, respectively; the model achieved a minimal train-test difference for these optimal hyperparameters. Comparing the results obtained by PMF-MAP with the mean-of-means baseline model, we can say that PMF-MAP is not performing well; but when we performed posterior estimation using MCMC sampling, the model showed promising results. We used the Frobenius norm to monitor the stabilisation of the latent variables 'U' and 'V'. The formula for computing the Frobenius norm is given by Eq. 12.
In order to get some idea of the number of samples that can be discarded, a traceplot for 100 samples is shown in Fig. 5; similar testing can be done for other numbers of samples.

$$\|U\|_{Fr} = \sqrt{\sum_{u=1}^{N_u}\sum_{d=1}^{D} |U_{ud}|^2}, \qquad \|V\|_{Fr} = \sqrt{\sum_{i=1}^{M_I}\sum_{d=1}^{D} |V_{id}|^2} \qquad (12)$$
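The Frobenius norm of Eq. 12 is simply the root of the sum of squared entries; a minimal sketch:

```python
import math

def frobenius(M):
    """Eq. 12: root of the sum of squared entries of a matrix (list of rows)."""
    return math.sqrt(sum(x * x for row in M for x in row))

U = [[3.0, 4.0],
     [0.0, 0.0]]
# frobenius(U) == 5.0, since sqrt(9 + 16) = 5
```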
Table 2 Performance measure of PMF-MAP and PMF-MCMC

| α  | Model    | Split | RMSE (Dim=10) | MAE (Dim=10) | RMSE (Dim=20) | MAE (Dim=20) | RMSE (Dim=50) | MAE (Dim=50) |
|----|----------|-------|---------------|--------------|---------------|--------------|---------------|--------------|
| 2  | PMF-MAP  | Train | 1.01289 | 0.79325 | 0.93702 | 0.73121 | 0.84275 | 0.65267 |
| 2  | PMF-MAP  | Test  | 1.14163 | 0.89145 | 1.16075 | 0.90448 | 1.18571 | 0.92549 |
| 2  | PMF-MCMC | Train | 0.78209 | 0.62781 | 0.71041 | 0.57261 | 0.61411 | 0.49810 |
| 2  | PMF-MCMC | Test  | 0.92321 | 0.72703 | 0.91877 | 0.72793 | 0.93933 | 0.74173 |
| 4  | PMF-MAP  | Train | 0.85692 | 0.67038 | 0.74995 | 0.58348 | 0.58881 | 0.45279 |
| 4  | PMF-MAP  | Test  | 1.05569 | 0.82085 | 1.11289 | 0.86405 | 1.14986 | 0.89535 |
| 4  | PMF-MCMC | Train | 0.71599 | 0.57188 | 0.59594 | 0.47880 | 0.42318 | 0.34492 |
| 4  | PMF-MCMC | Test  | 0.92475 | 0.72488 | 0.9481  | 0.74219 | 0.93845 | 0.73713 |
| 6  | PMF-MAP  | Train | 0.79843 | 0.62377 | 0.67443 | 0.52363 | 0.47881 | 0.36705 |
| 6  | PMF-MAP  | Test  | 1.04377 | 0.80747 | 1.11539 | 0.85822 | 1.15846 | 0.90284 |
| 6  | PMF-MCMC | Train | 0.69551 | 0.55164 | 0.55419 | 0.44275 | 0.33953 | 0.27621 |
| 6  | PMF-MCMC | Test  | 0.94505 | 0.73462 | 0.98689 | 0.76278 | 0.95153 | 0.75465 |
| 8  | PMF-MAP  | Train | 0.76443 | 0.59557 | 0.62980 | 0.48669 | 0.41878 | 0.31970 |
| 8  | PMF-MAP  | Test  | 1.03583 | 0.79557 | 1.12976 | 0.86477 | 1.19216 | 0.92712 |
| 8  | PMF-MCMC | Train | 0.68535 | 0.55164 | 0.53446 | 0.42346 | 0.29384 | 0.23830 |
| 8  | PMF-MCMC | Test  | 0.95823 | 0.73462 | 1.02204 | 0.78388 | 0.99255 | 0.78537 |
| 10 | PMF-MAP  | Train | 0.74779 | 0.58224 | 0.60337 | 0.46465 | 0.37734 | 0.28678 |
| 10 | PMF-MAP  | Test  | 1.02642 | 0.78514 | 1.13834 | 0.86879 | 1.21367 | 0.93965 |
| 10 | PMF-MCMC | Train | 0.68535 | 0.55164 | 0.52167 | 0.41076 | 0.26448 | 0.21336 |
| 10 | PMF-MCMC | Test  | 0.95823 | 0.73462 | 1.02778 | 0.78895 | 1.02789 | 0.80383 |
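The hyperparameter screening described in the text (reject settings whose train-test gap signals overfitting, then take the lowest test RMSE among the rest) can be sketched as follows; the grid values and the 0.15 gap tolerance here are invented for illustration, not taken from Table 2.

```python
# Hypothetical (alpha, Dim) -> (train_rmse, test_rmse) grid; numbers invented.
results = {
    (2, 10): (0.95, 1.02),
    (4, 10): (0.88, 0.96),
    (4, 50): (0.45, 1.10),   # low train / high test error: overfitting
    (8, 50): (0.30, 1.20),   # even more extreme overfitting
}
MAX_GAP = 0.15               # illustrative tolerance on the train-test gap
candidates = {k: v for k, v in results.items() if v[1] - v[0] <= MAX_GAP}
best = min(candidates, key=lambda k: candidates[k][1])   # lowest test RMSE
```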
Fig. 5 Frobenius norm of 'U' and 'V'

Table 3 Performance measure of PMF-MAP and PMF-MCMC after sparsity reduction (α = 4)

| Model             | Split | RMSE    | MAE     |
|-------------------|-------|---------|---------|
| PMF-MAP (Dim=20)  | Train | 0.91291 | 0.72272 |
| PMF-MAP (Dim=20)  | Test  | 0.93511 | 0.73887 |
| PMF-MCMC (Dim=10) | Train | 0.69871 | 0.55168 |
| PMF-MCMC (Dim=10) | Test  | 0.70733 | 0.61375 |
After testing the same configuration setup on the newly generated dataset with sparsity reduced by regression imputation, it is found that the model yields efficient performance; the obtained RMSE and MAE values for this setup are given in Table 3. The RMSE of 0.70733 and MAE of 0.61375 obtained by the PMF-MCMC + regression model on the MovieLens-100K dataset demonstrate the efficacy of the model, although the time complexity of training the model is high. We need to further improve the model to reduce the complexity of sampling without compromising the accuracy of predictions.
6 Conclusion and Future Work

In this paper, the significance of the Bayesian approach to probabilistic matrix factorization is presented. To approximate inference in Bayesian-PMF, we have used the Markov chain Monte Carlo method. We demonstrated that Bayesian-PMF with MCMC can work even better when the datasets are less sparse in nature and can be
applied to large datasets. The proposed model suffers from sampling overhead whilst training, but with appropriate regularisation parameters it yields much better results than other base approaches such as PMF-MAP estimates. In the future, we need to look for alternatives for improving the Markov chain approach, as it is a tedious task to decide when the MCMC has converged to the desired distribution.
References

1. M. Mohammadian, Y. Forghani, M.N. Torshiz, An initialization method to improve the training time of matrix factorization algorithm for fast recommendation. Soft Comput. 25(5), 3975–3987 (2021)
2. A. Pujahari, D.S. Sisodia, Pair-wise preference relation based probabilistic matrix factorization for collaborative filtering in recommender system. Knowl. Based Syst. 196, 105798 (2020)
3. L. Cheng, X. Tong, S. Wang, Y.C. Wu, H.V. Poor, Learning nonnegative factors from tensor data: probabilistic modeling and inference algorithm. IEEE Trans. Signal Process. 68, 1792–1806 (2020)
4. F. Ortega, R. Lara-Cabrera, Á. González-Prieto, J. Bobadilla, Providing reliability in recommender systems through Bernoulli matrix factorization. Inf. Sci. 553, 110–128 (2021)
5. X. Bui, H. Vu, O. Nguyen, K. Than, MAP estimation with Bernoulli randomness, and its application to text analysis and recommender systems. IEEE Access 8, 127818–127833 (2020)
6. X. Yuan, L. Han, S. Qian, G. Xu, H. Yan, Singular value decomposition based recommendation using imputed data. Knowl. Based Syst. 163, 485–494 (2019)
7. N. Srebro, T. Jaakkola, Weighted low-rank approximations, in Proceedings of the 20th International Conference on Machine Learning (ICML-03) (2003), pp. 720–727
8. V.A. Shanthakumar, C. Barnett, K. Warnick, P.A. Sudyanti, V. Gerbuz, T. Mukherjee, Item based recommendation using matrix-factorization-like embeddings from deep networks, in Proceedings of the 2021 ACM Southeast Conference (2021), pp. 71–78
9. H. Ma, H. Yang, M.R. Lyu, I. King, SoRec: social recommendation using probabilistic matrix factorization, in Proceedings of the 17th ACM Conference on Information and Knowledge Management (2008), pp. 931–940
10. R. Salakhutdinov, A. Mnih, Bayesian probabilistic matrix factorization using Markov chain Monte Carlo, in Proceedings of the 25th International Conference on Machine Learning (2008), pp. 880–887
11. D. Van Ravenzwaaij, P. Cassey, S.D. Brown, A simple introduction to Markov chain Monte Carlo sampling. Psychon. Bull. Rev. 25(1), 143–154 (2018)
12. F. Zhang, C. Gao, Convergence rates of variational posterior distributions. Ann. Stat. 48(4), 2180–2207 (2020)
13. D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
14. A.J. Landgraf, Y. Lee, Generalized principal component analysis: projection of saturated model parameters. Technometrics 62(4), 459–472 (2020)
15. B.B. Sinha, R. Dhanalakshmi, Evolution of recommender paradigm optimization over time. J. King Saud Univ. Comput. Inf. Sci. 34(4), 1047–1059 (2019)
16. S. Kim, H. Park, J. Lee, Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: a study on blockchain technology trend analysis. Expert Syst. Appl. 152, 113401 (2020)
17. L. Huang, W. Tan, Y. Sun, Collaborative recommendation algorithm based on probabilistic matrix factorization in probabilistic latent semantic analysis. Multimed. Tools Appl. 78(7), 8711–8722 (2019)
18. M. Uto, S. Louvigné, Y. Kato, T. Ishii, Y. Miyazawa, Diverse reports recommendation system based on latent Dirichlet allocation. Behaviormetrika 44(2), 425–444 (2017)
19. A. Hassani, A. Iranmanesh, N. Mansouri, Text mining using nonnegative matrix factorization and latent semantic analysis. Neural Comput. Appl. 1–22 (2021)
20. L. Charlin, R. Ranganath, J. McInerney, D.M. Blei, Dynamic Poisson factorization, in Proceedings of the 9th ACM Conference on Recommender Systems (2015), pp. 155–162
21. H. Knirsch, M. Petz, G. Plonka, Optimal rank-1 Hankel approximation of matrices: Frobenius norm and spectral norm and Cadzow's algorithm, in Linear Algebra and Its Applications (2021)
22. J. Liu, C. Wu, W. Liu, Bayesian probabilistic matrix factorization with social relations and item contents for recommendation. Decis. Support Syst. 55(3), 838–850 (2013)
23. A. Ngueilbaye, H. Wang, D.A. Mahamat, S.B. Junaidu, Modulo 9 model-based learning for missing data imputation. Appl. Soft Comput. 103, 107167 (2021)
24. B.B. Sinha, R. Dhanalakshmi, R. Regmi, TimeFly algorithm: a novel behavior-inspired movie recommendation paradigm. Pattern Anal. Appl. 23(4), 1727–1734 (2020)
Big Data and Its Role in Cybersecurity

Faheem Ahmad, Shafiqul Abidin, Imran Qureshi, and Mohammad Ishrat
Abstract Big Data Analytics (BDA) is the process of acquiring, storing, and processing enormous volumes of data for later analysis. Data is being generated at an alarmingly rapid rate: the Internet's fast expansion, the Internet of Things (IoT), social networking sites, and other technical breakthroughs are the primary sources of big data. Big data is a critical asset in cybersecurity, where the purpose is to safeguard assets; furthermore, the increasing value of data has elevated big data to the status of a high-value target. In this study, we look at recent cybersecurity research in connection with big data. We discuss how big data is safeguarded and how it can be utilized as a cybersecurity tool, and we also discuss cybersecurity in the age of big data as well as trends and challenges in its research.

Keywords Big data · Big data analytics · Cybersecurity · Machine learning
1 Introduction

For the last two decades, data has expanded exponentially across different applications, driving the big data era. Big data has shown a few unconventional
F. Ahmad · I. Qureshi Department of Information Technology, University of Technology and Applied Science Al Musanna, Muladdah, Sultanate of Oman e-mail: [email protected] I. Qureshi e-mail: [email protected] S. Abidin Department of Computer Science, Aligarh Muslim University Aligarh, Aligarh, UP, India M. Ishrat (B) Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, AP, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_10
F. Ahmad et al.
Fig. 1 Top industries using big data
highlights that could be utilized in different areas (Fig. 1). One of the most important is the utilization of big data for identifying threats. "As our mechanical powers increase, the side impacts and potential dangers have also escalated," as cited from Alvin Toffler. Hacking was at first associated with vandalism: programmers hacked for enjoyment and for reputation. Currently, however, attacks are far more premeditated and advanced, and countries blame each other for cyberattacks. It has also been observed that surveillance, often state-driven, has increased significantly in an attempt to accumulate the data required. Big data has penetrated every industry and business domain: it can be observed in businesses whether in health care, retail, government, education, or elsewhere. Likewise, as vulnerabilities and hacking advanced, cybersecurity was given much attention. Cybersecurity aims to reduce attacks to a minimum. The huge volume of data easily available these days pushes cybersecurity beyond conventional methods, which centred on prevention, towards a more advanced model called Prevention, Detection, and Response (PDR). It is anticipated that big data will play a major role in this model. Due to the development of the Internet, IoT, social networking, etc., gigantic amounts of data are being produced at an alarming rate; the different sources of big data and their volumes are shown in Fig. 3. Laney [1] discussed the three Vs of big data: volume, velocity, and variety. Later on, a fourth V, veracity, was added (Fig. 2).
Volume reflects the fact that the information being created is gigantic; velocity, that data is being produced at an alarming rate; and variety, that the data comes in all sorts of forms. Value refers to the usability and importance of the data, while the fifth term, veracity, relates to the reliability of the data, essentially characterising data at rest, according to Miloslavskaya et al. [2]. The generated data can be structured or unstructured, and is more often than not hard to handle in the right way. Huge amounts of data are being produced, growing from terabytes to petabytes, and are consequently getting difficult to handle [3]. New ways of accommodating the data are therefore required, and new and advanced models need to be created that can manage this data and extract insight (Fig. 3). With this paper, we have tried to survey the research that has been done on the security of big data.
Fig. 2 Features of big data
Fig. 3 Sources of big data
Despite the fact that there are related survey papers on big data security, we present more up-to-date approaches, insights, perspectives, and recent trends in the rapidly progressing study of big data within the cybersecurity space [4–10]. Our survey discusses how big data can be utilized to provide security for our data, the rise of big data as a very useful resource, the research work that has been completed, and how the use of big data can be enhanced. In particular, the main contributions of this article include the following:

• Providing a complete examination of big data security aspects by classifying them into two parts: securing big data itself, and using big data to provide security.
• Providing an outline of big data threats and responses, and their comparison.
• Providing a discussion of research challenges, current trends, insights, and unresolved cybersecurity problems for big data in cybersecurity.

The remainder of this paper is structured as follows. We begin by classifying our efforts into two main segments. We offer an inclusive consideration
of security utilizing big data along with securing big data itself: one segment centres on the utilization of big data as a security instrument, while the other handles how big data is being secured. A further segment highlights major research publications related to this paper, as well as how this work differs from the remaining studies. The next segment presents some research difficulties and prospective directions in this area, and the last section presents a summary of the article.
2 Security Using Big Data

Top security firms banded together to acquire insight by exchanging the data they hold (SecIntel Exchange), aiming to furnish a dependable security tool to their customers. They needed to learn as much as possible from the new threats generated every day and recognized the value of working together for the greater benefit, a necessity owing to the emergence of different types of malware and associated risks. These companies need a large amount of threat-related data in order to properly grasp the threats they face and be able to combat them. Conventional techniques for malware classification were increasingly ineffective, whereas SecIntel Exchange data enabled them to generate useful insights from massive amounts of data. Human study and conventional approaches were unable to keep up with the rate at which data was created [11, 12]; there was a need for modern techniques. "A typical Security Information and Event Management (SIEM) system would take between 20 min and an hour to query a month's worth of security data" (Zions Bancorporation [13]), whereas tools based on Hadoop technology could obtain the equivalent findings in around one minute. As a result, Big Data Analytics (BDA) has emerged as a critical instrument in cybersecurity. Numerous studies have proven that conventional techniques are incapable of keeping up with big data, and BDA is one of the most effective strategies for dealing with these challenges.
2.1 Use of Big Data Analytics (BDA) as a Defense Tool

Figure 4 depicts an example of an exploit that may be mitigated using big data analytics. "If you know the enemy and know yourself, you need not fear the result of a hundred battles," according to Sun Tzu, the legendary Chinese general. In other words, while we may not be able to learn enough about our opponent, we can certainly learn everything we can about ourselves and the assets we guard. To do so, we must first obtain information about the asset, which is enabled by the data it creates. This data must be examined and conclusions drawn. BDA can assist in the preparation, cleaning, and querying of diverse data with incomplete information, which would be difficult for people to achieve [14]. Data analysis is difficult when the data is heterogeneous [15].
Fig. 4 Use of BDA as a security tool
One line of research presented OwlSight, an architecture aimed at delivering real-time detection of cyberthreats. The architecture is made up of many parts, including data sources and big data analytics, and is capable of collecting enormous volumes of data from different types of sources, analyzing it, and displaying the results. The authors did face several concerns with the data's heterogeneity; in order for computers to perform properly, some type of human aspect is still needed. The authors put it well: "identifying a problem is only half the fight", and they addressed it by developing a methodology that blended big data analytics with semantic methodologies in order to obtain additional understanding of this diverse data [16]. According to Verizon's 2019 Data Breach Investigations Report, attacks tend to come from a variety of sources: hacking was employed in 62% of the attacks, malware was used in 51%, and social attacks were used in 43%, while human error accounted for 14% of the total. The most popular types of these attacks are email scams and phishing. According to recent research, nearly half of effective attacks come through email and get their targets to click within just an hour, and 30% get potential targets to click within 10 min [17, 18]. Another work suggested a big-data-enabled system with the goal of protecting against spam emails by leveraging a honeypot [19]. For analysis, this approach gathers data from many sources such as pcap files, honeynets, blacklisted sites, and social networks. The framework processed the obtained heterogeneous data, which was kept in the Hadoop Distributed File System (HDFS), by utilizing Hadoop and Spark. Nevertheless, this platform does not support real-time big data analysis. Advanced Persistent Threats (APTs) are another type of attack that is sophisticated and well planned [20].
APTs are extremely difficult to detect, and big data analysis may provide a solution to the problem of identifying and combating them. These approaches can play an important role in detecting dangers at an early stage, particularly when advanced pattern analysis is applied to heterogeneous data sources. The suggested framework combines deep and three-dimensional protection techniques, classifying data according to its degree of secrecy to defend against APT attacks. Botnet attacks are another application of big data and machine learning techniques: many methods have been researched for reducing botnet attacks through the use of BDA [21], and one paper offered a platform to overcome the existing botnet detection problem [22]. Another use of BDA to combat
fraud is in the financial industry, where big data is being used. In the telecom domain, one study investigated a unique cybersecurity architecture based on Network Functions Virtualization (NFV), named SHIELD, to offer a security tool in an advanced telecom environment [23]; this framework makes use of BDA to detect and eliminate risks in real time. Another study offered BDA-based Security Information Management (SIM) upgrades [24]: the authors created a template for an enhanced SIM that uses big data and tested it in the field with actual network security records. A further work presented a big data analytics methodology for cloud computing [25], in which HDFS was utilized to gather and store application and network logs from a guest virtual machine. SIEM is a useful tool for cybersecurity analytics and a valuable data source; one technology examines large data (from a Fortune 500 company's SIEM) to gather insights related to security vulnerabilities [26]. BDA has been utilized as a cybersecurity strategy to minimize assaults, since they are diversified and come in a variety of ways. Intrusion detection and prevention system (IDS/IPS) research is one area of cybersecurity where big data is heavily exploited. IDSs are used to assess whether there is any type of security compromise [27]; after authentication, firewall, encryption, and authorization mechanisms, an intrusion detection system is frequently considered the second line of security.
2.2 Cybersecurity and Machine Learning

BDA and ML algorithms complement one another: to provide security by generating valuable intelligence, ML algorithms must learn from data. There are typically three types of ML algorithm: supervised, unsupervised, and semi-supervised learning. Depending upon whether the outcome of each training example is known, machine learning algorithms are commonly divided into supervised and unsupervised learning. Unsupervised learning algorithms are used when the result for each training example is unknown; malware detection is a classic example. To do this, we extract characteristics from the malware dataset and look for malware groupings or commonalities, and the model finds its own groups by utilizing the dataset's properties. Clustering methods and principal component analysis (PCA) are commonly employed for unsupervised malware analysis. Supervised learning approaches include linear regression, support vector machines (SVM), random forests, and neural networks.
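The unsupervised grouping step can be illustrated with a minimal two-means clustering on a single invented feature (for example, an obfuscation score per sample); this is only a toy stand-in for the clustering methods mentioned above, and all data and names are our own.

```python
def two_means(values, iters=20):
    """Minimal 1-D k-means with k = 2: alternate assignment and mean update.
    Centres are initialised at the min and max of the data."""
    c0, c1 = min(values), max(values)
    for _ in range(iters):
        g0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return g0, g1

# Invented per-sample feature values: four low (benign-like), four high
# (malware-like); the model discovers the two groups on its own.
scores = [0.1, 0.2, 0.15, 0.12, 0.9, 0.95, 0.88, 0.97]
low, high = two_means(scores)
```

No labels are used anywhere, which is exactly the unsupervised setting described in the text.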
3 Cybersecurity Approaches in Big Data

Encryption is typically used to protect the secrecy of information. Data is accessed by both trusted and untrusted parties; encryption guarantees that only authenticated
users have access to data. Access control restricts who may access data. Encryption solutions must be more powerful than access control mechanisms: encryption imposes a high level of constraint on data secrecy but is a difficult process, while access control is more adaptable and easier to deploy. A security risk occurs when big data is transferred to the cloud; nobody wants their data in the hands of others, which is why encryption is required, and the use of data masking technologies is common practice. In some schemes the data is not encrypted while it is transferred, since the methods used to send the data required decrypting it, which may expose it to threats; the most significant threat to big data is a breach of confidentiality. In one such work the Spark framework was utilized. Attribute-based encryption (ABE) is another widely utilized type of encryption. In a study that used ABE, Yang et al. [28] discussed some of the difficulties of ABE for cloud storage and suggested a form of ABE that is a unique distributed, scalable, and fine-grained access control technique based on the categorization features of cloud storage objects, with the objective of improving on ABE's weaknesses by considering the links between the attributes. Another big data encryption approach presented is an encrypted MongoDB that employs a homomorphic asymmetric cryptosystem, which may be utilized for user data encryption and privacy protection [29] (Table 1).
BigCrypt uses a symmetric key to encrypt the message; the symmetric key is then encrypted with the receiver's public key and appended to the ciphertext before delivery. At the receiver end, the symmetric key is recovered by asymmetric decryption and used to decode the main message (Table 2). Sharma and Sharma [31] suggested the use of neural and quantum cryptography to safeguard massive data. Almuhammadi and Amro [32] introduced a novel encryption strategy for big data that employs double hashing rather than a single hash; they argue that double hashing removes the danger of existing cryptanalysis attacks (Fig. 5). Samuel et al. [45] offered a hybrid framework for access control and privacy of big data that composes and enforces privacy rules to encapsulate privacy requirements during access control. Gao [46] introduced a big data-based tool for cloud security. Cloud computing has been shown to increase the volume of data, and significant data breaches and losses have occurred as a result, creating a requirement to offer the requisite level of protection. To that purpose, the study analyzed the existing big data environment. Gupta et al. [47] developed a big data security compliance methodology that offers secured access in a big data architecture.
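To make the hybrid encrypt-then-wrap flow concrete, here is a minimal toy sketch in Python. The RSA parameters, the XOR-keystream "cipher", and all names are illustrative assumptions for exposition only; BigCrypt itself builds on real PGP/AES primitives, and this toy is not secure.

```python
import hashlib

# Toy hybrid encryption mirroring the BigCrypt/PGP flow described above.
# NOT secure: textbook RSA with tiny primes and a SHA-256 XOR keystream
# stand in for real asymmetric and symmetric ciphers.

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data with a SHA-256-derived keystream."""
    out, counter = bytearray(), 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

# Tiny textbook RSA key pair (illustrative primes only).
p, q, e = 1000003, 1000033, 65537
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent (Python 3.8+)

def rsa_encrypt(m: int) -> int:  # with receiver's public key (n, e)
    return pow(m, e, n)

def rsa_decrypt(c: int) -> int:  # with receiver's private key d
    return pow(c, d, n)

# Sender: encrypt message with a symmetric key, then wrap that key with RSA.
message = b"pay merchant 42 units"
sym_key = b"k3y!"  # would be random (and much longer) in practice
ciphertext = keystream_xor(sym_key, message)
wrapped_key = rsa_encrypt(int.from_bytes(sym_key, "big"))

# Receiver: unwrap the symmetric key, then decrypt the message.
recovered_key = rsa_decrypt(wrapped_key).to_bytes(len(sym_key), "big")
assert keystream_xor(recovered_key, ciphertext) == message
```

The design point is the one the surveyed papers exploit: the bulk data is handled by the fast symmetric cipher, while the slow asymmetric cipher only ever touches the short session key.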
138
F. Ahmad et al.
Table 1 Big data security tools

| Method | Objective | Data sources | Tools and technologies |
|---|---|---|---|
| A big data architecture to secure data [24] | Spam and phishing protection | Pcap, log files | Hadoop, Spark |
| Methods for detecting malicious executables via data mining [33] | Malicious malware detection | Executable files | Data mining and machine learning algorithms |
| A viable strategy for enhancing cybersecurity [34] | Tool for surveillance | Network dataset | High-functioning autistic grads, data mining techniques |
| Identifying and predicting security flaws in BDA implementation [26] | Enhance SIEM by including key features | Typical SIEM application | Data mining, graph analysis |
| Protection of big data network information [35] | Advanced persistent threat | Data set from a network | BDAs, network event gathering methods, big data correlation analysis |
| Machine learning and graph analytics for big data [36] | Combining batch and stream data operations for efficiency | Diverse big data | Lambda architecture |
| OwlSight: a platform for real-time cyberthreat identification and visualization [15] | Threat detection and visualization in real time | Diverse network data | Web services, big data analytics, data visualization |
| Classification model for machine learning [37] | Network intrusion detection for smart devices | Data from smart devices | ML algorithms |
| Anomaly detection and IDS [43] | Network security | Network flow | Spark, ML algorithms |
| Big data-driven security analytics used in cloud computing [35] | The BDA paradigm is used to safeguard cloud computing | Network and application logs | ML algorithms |
| An approach to big data privacy assessment based on computational intelligence methods [38] | Determine the desktop's security state as well as the origins and reasons of security breaches | Firewall log files | Techniques of computational intelligence |
Table 2 Investigation of big data access control and encryption techniques

| Method/Reference | Problem | Solution |
|---|---|---|
| Masked data computing: an advanced strategy for enhancing big data authenticity [39] | Data is not encrypted while being sent | Enhancing FHE by lowering the size of the public key |
| A quicker fully homomorphic encryption scheme in big data [40] | Data is not encrypted while being sent | Improving FHE by lowering the size of the public key |
| A digital envelope solution for safe data sharing in IoT settings [41] | Enhancing big data security | Improving security by combining the flexibility of attribute-based encryption with symmetric cryptography |
| CryptMDB: an encrypted tool for big data [42] | Encrypting user data and ensuring privacy protection | MongoDB is encrypted with an asymmetric cryptosystem |
| BigCrypt: a big data encryption tool [30] | Overcome asymmetric encryption systems' limitations | BigCrypt, a probabilistic take on the Pretty Good Privacy technique |
| A new group key transfer mechanism for big data security [43] | Secure big data group communication | Diffie–Hellman key agreement used in an efficient group key transfer protocol |
| Encryption using a double-hashing operation mode [32] | Cryptanalysis attacks | Use of double hashing |
| Big data security policy [44] | Big data requires a privacy policy | Material is analyzed, privacy laws are extracted, sensitive data is detected, and a fragmentation technique is applied to sensitive data |

Fig. 5 Security approaches in big data
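The group key transfer protocol of [43] builds on Diffie–Hellman key agreement. The two-party exchange underlying it can be sketched as follows; the prime and generator are illustrative toy parameters (real deployments use standardized groups thousands of bits long), and this is not the protocol from [43] itself:

```python
import secrets

# Toy Diffie-Hellman key agreement over a small published group.
p = 0xFFFFFFFFFFFFFFC5  # the largest 64-bit prime (illustrative size only)
g = 5

a = secrets.randbelow(p - 2) + 2      # Alice's private exponent
b = secrets.randbelow(p - 2) + 2      # Bob's private exponent

A = pow(g, a, p)                      # Alice's public value, sent to Bob
B = pow(g, b, p)                      # Bob's public value, sent to Alice

# Each side combines its private exponent with the other's public value.
shared_alice = pow(B, a, p)
shared_bob = pow(A, b, p)
assert shared_alice == shared_bob     # both sides derive the same group key
```

The agreed value would then be hashed into a symmetric group key; an eavesdropper seeing only `A` and `B` cannot feasibly compute it.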
4 Research Trends and Challenges The cybersecurity environment has changed since the appearance of the first virus, called "Creeper", and the first anti-virus designed to remove it, called "Reaper". The largest internal attack in history occurred between 1976 and 2006, when a retired Boeing employee stole intellectual property and passed it to China. The Edward Snowden case, which involved the revelation of secret NSA documents and caused widespread mistrust of the government, was another famous internal attack. Following that, another big cybersecurity breach occurred when Yahoo failed to inform its users that more than 2 billion accounts had been compromised. In 2019, Facebook announced that more than 5 billion accounts were compromised due to bugs in its social networking site. After 2015, the attack environment evolved from data theft to holding data for ransom, and a huge number of ransomware variants were released. The security environment is evolving, and research trends must adapt to counteract new cybersecurity threats [48]. Notable trends, new challenges, and open difficulties are discussed below (Fig. 6).
4.1 Use of Big Data in Defense The data collected by IoT devices, smartphones, the Internet, social media, and the cloud has rendered organizations vulnerable to a variety of attack vectors. All of these devices generate data, so enterprises are beginning to adopt BDA as part of their security strategy. Observation of network data is critical for organizational security. However, because big data analytics is a costly endeavor, several businesses are still hesitant to implement it. BDA is likewise a complicated area that needs specialized knowledge. Employees are also uneasy about the collection of personal information, since it may include tracking user behavior. Open difficulties remain in how to distinguish IoT system data from other critical data, and in securing the BDA deployment itself.
Fig. 6 Research challenges and problems
4.2 Laws Regulating Big Data In response to the huge amounts of data compromised recently, many countries have passed new legislation to protect data. The right to be forgotten is an important feature of the regulations enacted, since it allows an individual to compel an authority to erase any information belonging to him or her. Self-destructing data is a research trend that we anticipate. There are still open issues in big data law and policy, such as what happens when data leaves an enterprise to be stored in the cloud.
4.3 Distribution of Data for Storage Right now, data is the most precious commodity. Big companies such as Apple, Facebook, and Amazon have monopolized data and hence generate the greatest income. New blockchain businesses are now focusing on breaking these monopolies by emphasizing the value of data to the general population. Distributed data storage attempts to move data out of these silos, thereby removing numerous security threats. Furthermore, the combination of blockchain and big data can ensure that the data created on the blockchain is reliable. Much research and discussion is ongoing on distributing the storage of data.
4.4 Security Technique Scalability in Big Data It is difficult to safeguard everything when data is massive. The simpler approach is to identify and defend what is important. Traditional data security techniques may not scale in a straightforward manner. As a result, finding an appropriate scalable solution for big data-driven applications remains an open research area.
5 Conclusions In this work, we reviewed the most recent research on big data and its application in cybersecurity. We divided the work into two parts. The first part studied the application of big data for cybersecurity objectives. In Sect. 2, the different aspects of the security of big data were discussed. We also discussed recent trends in the usage of BDA as a security tool, along with the relevance of machine learning in that context. Furthermore, we examined recent research on big data security strategies. Because big data confidentiality is frequently the primary concern, encryption is the primary study topic when considering big data security. We also anticipate substantial research and development in the coming years, both on the security of big data and on its use as a security tool.
References

1. D. Laney, 3d data management: controlling data volume, velocity and variety. META Group Res. Note 6(70), 1 (2001)
2. N. Miloslavskaya, A. Tolstoy, Application of big data, fast data, and data lake concepts to information security issues, in 2016 IEEE 4th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW) (IEEE, 2016), pp. 148–153
3. D. Rawat, K.Z. Ghafoor, Smart Cities Cybersecurity and Privacy (Elsevier, Amsterdam, The Netherlands, 2018)
4. E. Bertino, Big data-security and privacy, in 2015 Proceedings on IEEE International Congress on Big Data (IEEE, 2015), pp. 757–761
5. S. Abidin, V.R. Vadi, V. Tiwari, Big data analysis using R and hadoop, in Springer 2nd International Conference on Emerging Technologies in Data Mining and Information Security (IEMIS 2020), Kolkata, 2–4 July 2020 (Advances in Intelligent Systems and Computing, Springer AISC, ISSN: 2194-5357), pp. 50–53
6. S. Abidin, V.R. Vadi, A. Rana, On confidentiality, integrity, authenticity and freshness (CIAF) in WSN, in 4th Springer International Conference on Computer, Communication and Computational Sciences (IC4S 2019), Bangkok, Thailand, 11–12 October 2019 (Advances in Intelligent Systems and Computing, ISSN: 2194-5357), pp. 952–957
7. T. Mahmood, U. Afzal, Security analytics: big data analytics for cybersecurity: a review of trends, techniques and tools, in 2013 2nd National Conference on Information Assurance (NCIA) (IEEE, 2013), pp. 129–134
8. S. Rao, S. Suma, M. Sunitha, Security solutions for big data analytics in healthcare, in 2015 2nd International Conference on Advances in Computing and Communication Engineering (IEEE, 2015), pp. 510–514
9. I. Olaronke, O. Oluwaseun, Big data in healthcare: prospects, challenges and resolutions, in 2016 Future Technologies Conference (FTC) (IEEE, 2016), pp. 1152–1157
10. H.-T. Cui, Research on the model of big data serve security in cloud environment, in 2016 First IEEE International Conference on Computer Communication and the Internet (ICCCI) (IEEE, 2016), pp. 514–517
11. E. Damiani, Toward big data risk analysis, in 2015 IEEE International Conference on Big Data (Big Data) (IEEE, 2015), pp. 1905–1909
12. Sinclair, L. Pierce, S. Matzner, An application of machine learning to network intrusion detection, in Proceedings 15th Annual Computer Security Applications Conference (ACSAC'99) (IEEE, 1999), pp. 371–377
13. E. Chickowski, A Case Study in Security Big Data Analysis, vol. 9 (Dark Reading, 2012). https://www.darkreading.com/analytics/security-monitoring/a-case-study-in-securitybig-data-analysis/d/d-id/1137299
14. M.C. Raja, M.A. Rabbani, Big data analytics security issues in data driven information system. IJIRCCE 2(10), 6132–6134 (2014)
15. V.S. Carvalho, M.J. Polidoro, J.P. Magalhaes, Owlsight: platform for real-time detection and visualization of cyber threats, in 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS) (IEEE, 2016), pp. 61–66
16. Y. Yao, L. Zhang, J. Yi, Y. Peng, W. Hu, L. Shi, A framework for big data security analysis and the semantic technology, in 2016 6th International Conference on IT Convergence and Security (ICITCS) (IEEE, 2016), pp. 1–4
17. S. Abidin, Encryption and database security. Int. J. Comput. Eng. Appl. 11(8), 116–121 (2017). ISSN: 2321-3469
18. T. Zaki, M.S. Uddin, M.M. Hasan, M.N. Islam, Security threats for big data: a study on enron e-mail dataset, in 2017 International Conference on Research and Innovation in Information Systems (ICRIIS) (IEEE, 2017), pp. 1–6
19. P.H. Las-Casas, V.S. Dias, W. Meira, D. Guedes, A big data architecture for security data and its application to phishing characterization, in 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS) (IEEE, 2016), pp. 36–41
20. A.A. Cardenas, P.K. Manadhata, S. Rajan, Big data analytics for security intelligence. Technical Report (Big Data Working Group of Cloud Security Alliance, 2013), pp. 1–22. https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Big_Data_Analytics_for_Security_Intelligence.pdf
21. B.G.-N. Crespo, A. Garwood, Fighting botnets with cyber-security analytics: dealing with heterogeneous cyber-security information in new generation siems, in 2014 9th International Conference on Availability, Reliability and Security (IEEE, 2014), pp. 192–198
22. D.C. Le, A.N. Zincir-Heywood, M.I. Heywood, Data analytics on network traffic flows for botnet behaviour detection, in 2016 IEEE Symposium Series on Computational Intelligence (SSCI) (IEEE, 2016), pp. 1–7
23. G. Gardikis, K. Tzoulas, K. Tripolitis, A. Bartzas, S. Costicoglou, A. Lioy, B. Gaston, C. Fernandez, C. Davila, A. Litke, et al., SHIELD: a novel NFV-based cybersecurity framework, in 2017 IEEE Conference on Network Softwarization (NetSoft) (IEEE, 2017), pp. 1–6
24. F. Gottwalt, A.P. Karduck, SIM in light of big data, in 2015 11th International Conference on Innovations in Information Technology (IIT) (IEEE, 2015), pp. 326–331
25. T.Y. Win, H. Tianfield, Q. Mair, Big data-based security analytics for protecting virtualized infrastructures in cloud computing. IEEE Trans. Big Data 4(1), 11–25 (2018)
26. C. Puri, C. Dukatz, Analyzing and predicting security event anomalies: lessons learned from a large enterprise big data streaming analytics deployment, in 2015 26th International Workshop on Database and Expert Systems Applications (DEXA) (IEEE, 2015), pp. 152–158
27. S. Mukkamala, A. Sung, A. Abraham, Cyber security challenges: designing efficient intrusion detection systems and antivirus tools, in Enhancing Computer Security with Smart Technology, ed. by V. Rao (CRC Press, USA, 2005, ISBN 0849330459), pp. 125–161
28. T. Yang, P. Shen, X. Tian, C. Chen, A fine-grained access control scheme for big data based on classification attributes, in 2017 IEEE 37th International Conference on Distributed Computing Systems Workshops (ICDCSW) (IEEE, 2017), pp. 238–245
29. S. Pérez, J.L. Hernández-Ramos, D. Pedone, D. Rotondi, L. Straniero, A.F. Skarmeta, A digital envelope approach using attribute-based encryption for secure data exchange in IoT scenarios, in 2017 Global Internet of Things Summit (GIoTS) (IEEE, 2017), pp. 1–6
30. A. Al Mamun, K. Salah, S. Al-Maadeed, T.R. Sheltami, BigCrypt for big data encryption, in 2017 4th International Conference on Software Defined Systems (SDS) (IEEE, 2017), pp. 93–99
31. A. Sharma, D. Sharma, Big data protection via neural and quantum cryptography, in 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom) (IEEE, 2016), pp. 3701–3704
32. S. Almuhammadi, A. Amro, Double-hashing operation mode for encryption, in 2017 IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC) (IEEE, 2017), pp. 1–7
33. M.G. Schultz, E. Eskin, F. Zadok, S.J. Stolfo, Data mining methods for detection of new malicious executables, in Proceedings 2001 IEEE Symposium on Security and Privacy (IEEE, 2001), pp. 38–49
34. V. Patel, A practical solution to improve cyber security on a global scale, in 2012 3rd Worldwide Cybersecurity Summit (WCS) (IEEE, 2012), pp. 1–5
35. W. Jia, Study on network information security based on big data, in 2017 9th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA) (IEEE, 2017), pp. 408–409
36. H.H. Huang, H. Liu, Big data machine learning and graph analytics: current state and future challenges, in 2014 IEEE International Conference on Big Data (Big Data) (IEEE, 2014), pp. 16–17
37. S. Kumar, A. Viinikainen, T. Hamalainen, Machine learning classification model for network-based intrusion detection system, in 2016 11th International Conference for Internet Technology and Secured Transactions (ICITST) (IEEE, 2016), pp. 242–249
38. N. Naik, P. Jenkins, N. Savage, V. Katos, Big data security analysis approach using computational intelligence techniques in R for desktop users, in 2016 IEEE Symposium Series on Computational Intelligence (SSCI) (IEEE, 2016), pp. 1–8
39. J. Kepner, V. Gadepally, P. Michaleas, N. Schear, M. Varia, A. Yerukhimovich, R.K. Cunningham, Computing on masked data: a high-performance method for improving big data veracity, in 2014 IEEE High Performance Extreme Computing Conference (HPEC) (IEEE, 2014), pp. 1–6
40. D. Wang, B. Guo, Y. Shen, S.-J. Cheng, Y.-H. Lin, A faster fully homomorphic encryption scheme in big data, in 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA) (IEEE, 2017), pp. 345–349
41. S. Perez, J.L. Hernandez-Ramos, D. Pedone, D. Rotondi, L. Straniero, A.F. Skarmeta, A digital envelope approach using attribute-based encryption for secure data exchange in IoT scenarios, in 2017 Global Internet of Things Summit (GIoTS) (IEEE, 2017), pp. 1–6
42. G. Xu, Y. Ren, H. Li, D. Liu, Y. Dai, K. Yang, Cryptmdb: a practical encrypted mongodb over big data, in 2017 IEEE International Conference on Communications (ICC) (IEEE, 2017), pp. 1–6
43. C. Zhao, J. Liu, Novel group key transfer protocol for big data security, in 2015 IEEE Advanced Information Technology, Electronic and Automation Control Conference (IAEAC) (IEEE, 2015), pp. 161–165
44. A. Al-Shomrani, F. Fathy, K. Jambi, Policy enforcement for big data security, in 2017 2nd International Conference on Anti-Cyber Crimes (ICACC) (IEEE, 2017), pp. 70–74
45. A. Samuel, M.I. Sarfraz, H. Haseeb, S. Basalamah, A. Ghafoor, A framework for composition and enforcement of privacy-aware and context-driven authorization mechanism for multimedia big data. IEEE Trans. Multimed. 17(9), 1484–1494 (2015)
46. F. Gao, Research on cloud security control mechanism based on big data, in 2017 International Conference on Smart Grid and Electrical Automation (ICSGEA) (IEEE, 2017), pp. 366–370
47. A. Gupta, A. Verma, P. Kalra, L. Kumar, Big data: a security compliance model, in 2014 Conference on IT in Business, Industry and Government (CSIBIG) (IEEE, 2014), pp. 1–5
48. E. Damiani, C. Ardagna, F. Zavatarelli, E. Rekleitis, L. Marinos, Big Data Threat Landscape (European Union Agency for Network and Information Security, 2017). https://www.enisa.europa.eu/publications/bigdata-threat-landscape
QR Code-Based Digital Payment System Using Visual Cryptography Surajit Goon, Debdutta Pal, Souvik Dihidar, and Subham Roy
Abstract Quick response codes (QR codes) are widely employed due to their benefits, especially in the area of mobile payments. In the transaction process, however, there is an unavoidable risk: it is challenging for the merchant to detect an attacker's tampering with the QR code carrying the recipient account. As a result, verifying QR codes is critical. In this study, we propose a methodology to secure the payment method. This method encrypts the payment process using the visual cryptography scheme (VCS) and AES secret key techniques. The first step is to use the (2,2) VCS to divide the original QR code into two shares; second, we apply the AES algorithm to one share and add a secret key to it. Both shares are then distributed: the first is sent to the user via a cloud server, and the second goes directly to the applicant's desk. The two shares can then be properly stacked, and the same QR code is recreated using the specified VCS. Keywords Visual cryptography · AES algorithm · Secret key · QR code · Encryption and decryption
S. Goon (B) · D. Pal Department of Computer Science and Engineering, Brainware University, Barasat, West Bengal, India e-mail: [email protected] D. Pal e-mail: [email protected] S. Dihidar · S. Roy Department of Computer Science, Eminent College of Management and Technology, Barasat, West Bengal, India e-mail: [email protected] S. Roy e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_11
1 Introduction Users' daily life and work are becoming more comfortable thanks to the rapid expansion of global wireless networks and the rising popularity of mobile devices such as cell phones, tablet computers, and portable PCs. In several emerging economies, mobile payment is a common rapid payment method. Furthermore, QR codes, a novel technique for storing, transmitting, and recognizing information, can be deciphered by a mobile phone from any location, making them popular in security-sensitive applications such as payments. QR code payment has now become one of the most popular methods of mobile payment. However, this method of payment has certain security concerns. A business, for example, has a QR code on the wall that represents the merchant's beneficiary account, and QR codes do not include an anti-counterfeiting feature. Because of this, a fraudster can replace the QR code's hidden data with his own bank account details without being noticed. As a result, with each successive transaction, the money is transferred to the attacker's account. The method suggested in this work is applied in a real scenario to reduce the financial loss to merchants caused by such behavior and to strengthen the security of the consumer's mobile payment authentication procedure. The three essential entities to consider are the store, the cloud server, and the mobile device. Naor and Shamir [1] first suggested a model for the visual cryptography scheme (VCS), in which a secret image is encoded into two or more binary transparent share images; the original secret image cannot be obtained until a specific number of those shares are stacked one on top of the other. Though simple, it is quite tough to break. Most crucially, only the human visual system is needed to reveal the secret; no computing power is required at the decryption end.
The fundamental principle of k-out-of-n VCS is to create n shadow images from a concealed image; at least k of the shadows must be stacked for the secret to be revealed during decryption. In this visual cryptography method, AES and a modified (k, n) share generation method [2] are employed. To give an extra security measure to the input secret image, AES is used at the encryption level and the VCS produces the shares. In this paper, the VCS separates the original secret QR code into shares 1 and 2. We then use the AES algorithm with a secret key on share 1 and store it in the cloud server, while share 2 is pasted to the shop's wall in the traditional manner. The customer takes a picture of share 2 and scans it in order to retrieve share 1 from the distributed cloud server. The cell phone then replicates VCS's visual stacking to overlay shares 1 and 2. The original QR code is thus recovered and can be scanned. In Sect. 2, we conduct a literature review of present systems. In Sect. 3, we present our model. We describe the implementation of our concept in Sect. 4: the original QR code is divided into two pieces using visual cryptography rules, and one of the shares is encrypted using AES. The results of our experiments are presented in Sect. 5 and validated using PSNR, MSE, CC, and SSIM. Finally, Sect. 6 concludes with ideas for future work.
2 Literature Review Visual data is encrypted using VCS, a cryptography technique [3], such that only the human visual system can decrypt it. The most fundamental method is the (k, n) VCS, sometimes known as the k-out-of-n VCS [4]: it takes k or more of the n images created by the VCS encryption to reassemble the original image. Using the (2,2) VCS, each pixel P of the original picture is split into two subpixels in each of the shares. The shares of a white and a black pixel are shown in Fig. 1. Ordered dithering is a technique for swiftly transforming a gray-level image into an identically sized binary image [5]. Zhou et al. [6] first transformed the grayscale image into a halftone image and then used binary VC schemes to construct the shares. Extended visual cryptography (EVC), a technique created by Ateniese et al. [7], allows shares to contain meaningful visuals in addition to private data. Halftoning randomizes quantization errors using a form of noise also known as blue noise; the concept of tracking errors first appeared in the Floyd–Steinberg diffusion technique [8]. A secure image encryption scheme is proposed by Kalubandi et al. [9] that secures the image by fusing visual cryptographic and AES methods: the image is AES-encrypted, and a visual secret sharing-based encoding scheme turns the key into shares. Lu et al. [10] suggested a visual cryptography scheme (VCS) and attractive QR code-based method with three core schemes for varying concealment degrees; the original QR code is restored in accordance with the specified VCS, and the embedded result is replaced with the same carrier QR codes. A novel visual cryptography approach based on AES and a modified (k, n) share generation algorithm was proposed by Sapna and Sudha [2], with AES used to encrypt the input secret picture for strong security. Fu et al. [11] propose a (k, n)-VCS combined with QR codes.
A probabilistic sharing model is used to increase the maximum size of the secret picture that may be shared, and to realize a secret sharing mechanism with a large relative difference. They also use encoding redundancy to insert the initial shares onto the cover QR codes. Fig. 1 Shares of a white and a black pixel
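The (2,2) pixel expansion of Fig. 1 can be sketched in a few lines of pure Python. This is an illustrative toy on a binary image (function names and the 2x1 subpixel patterns are our assumptions, not the paper's code):

```python
import random

# (2,2) visual cryptography on a binary image (1 = black, 0 = white).
# Each secret pixel expands to a horizontal pair of subpixels per share:
# identical pairs for a white pixel, complementary pairs for a black one.
# Stacking transparencies corresponds to a pixelwise OR, so black pixels
# become fully black and white pixels stay half black (contrast loss).

PATTERNS = [(0, 1), (1, 0)]  # the two candidate subpixel patterns

def make_shares(secret, rng=random.Random(0)):
    share1, share2 = [], []
    for row in secret:
        r1, r2 = [], []
        for pixel in row:
            pat = rng.choice(PATTERNS)
            r1.extend(pat)
            # white: same pattern; black: complementary pattern
            r2.extend(pat if pixel == 0 else tuple(1 - v for v in pat))
        share1.append(r1)
        share2.append(r2)
    return share1, share2

def stack(share1, share2):
    """Simulate overlaying two transparencies: pixelwise OR."""
    return [[a | b for a, b in zip(r1, r2)]
            for r1, r2 in zip(share1, share2)]

secret = [[0, 1],
          [1, 0]]
s1, s2 = make_shares(secret)
overlay = stack(s1, s2)
# Black secret pixels yield fully black pairs; white ones half-black pairs.
for i, row in enumerate(secret):
    for j, pixel in enumerate(row):
        pair = overlay[i][2 * j:2 * j + 2]
        assert pair == [1, 1] if pixel == 1 else sum(pair) == 1
```

Each share on its own is a uniformly random pattern and reveals nothing about the secret; only the physical (or computed) stacking recovers it.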
Yuqiao et al. [12] propose a new QR code with two levels of information storage for protecting private communications; any ordinary QR reader can instantly decode the public level. In contrast to previous research, the suggested scheme's computational cost is decreased by merging it with the theory of visual cryptography (VCS). In their article, Goon et al. [13] suggested that minimizing computation, noise, and distortion in the recreated image helps provide improved contrast. They claim that the proposed method establishes the contrast of the decrypted image using particular pixel patterns; the white pixel pattern or fresh pixel patterns may be used to set the contrast for the recursive visual cryptography system. The privacy and security of the image are ensured using elliptic curve cryptography. Xiaoyang et al. [14] suggested a two-dimensional code encryption and decryption method based on elementary cellular automata state rings, which enhanced the security of data stored in two-dimensional (quick response) codes. Cellular automata are simple dynamical systems that can simulate complicated phenomena, and there are several parallels between cryptography and cellular automata, including diffusivity and integrated chaos. Based on this characteristic, the approach encrypts and decrypts QR code binary pictures using cellular automata with the following parameters: length 8, cyclic boundary condition, and state space {0,1}. A secure QR code scheme based on visual cryptography was suggested by Xiaohe et al. [15]. The QR code is made up of two independent share images that may be sent separately. Since the production of the two share images depends on a pseudo-random matrix, the pixels in the two share images are decided by the corresponding values in that matrix. To restore the information, one simply stacks the two share images.
The results of the simulation demonstrate that it is possible to successfully hide and restore the QR code picture.
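The elementary cellular automaton used in [14] iterates a local rule over a cyclic ("ring") boundary. A minimal sketch of one such evolution step follows; the rule number (30) is an illustrative choice, not necessarily the rule used in [14]:

```python
# One evolution step of an elementary cellular automaton on a ring.
# Each cell's next value is looked up from the rule number using the
# 3-bit neighbourhood (left, center, right) as an index.

def eca_step(cells, rule=30):
    n = len(cells)
    nxt = []
    for i in range(n):
        # cyclic ("ring") boundary: neighbours wrap around the ends
        left, center, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        idx = (left << 2) | (center << 1) | right
        nxt.append((rule >> idx) & 1)
    return nxt

state = [0, 0, 0, 1, 0, 0, 0, 0]   # length-8 state, single seed cell
for _ in range(3):
    state = eca_step(state)
print(state)  # → [1, 1, 0, 1, 1, 1, 1, 0]
```

Rule 30 is a classic example of a simple rule with chaotic, diffusion-like behavior, which is exactly the property such CA-based encryption schemes exploit.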
3 Methodology In our method (Fig. 2), visual cryptography rules first divide the single original QR code into two shares. We then apply AES encryption to one share with a secret key, store that share in the cloud server, and paste the other share on the shop's wall in the traditional manner. Customers download the first share from the cloud server and apply AES decryption using the same secret key. After completing these steps, customers have share 1 and scan the other share from the shop's wall. Visual cryptography is then applied: the two shares are perfectly stacked and the original QR code is recovered. Customers then scan the QR code and pay. This makes the digital payment process more secure.
Fig. 2 Proposed method
4 Implementation Our suggested model is implemented in two phases: (1) QR code encryption and (2) decryption by the user.
4.1 Encryption Step 1: Create a QR code. Step 2: Generate a halftone image out of it using Floyd–Steinberg technology. Step 3: Create shares with the 2-out-of-2 VCS. Step 4: Share 1 is encrypted using AES algorithm using a secret key. Step 5: Store Share 1 in the cloud. Step 6: Share 2 is pasted to the shop’s wall in traditional manner. An image that uses discrete dots rather than continuous tones is referred to as a halftone or halftone image. The dots blend together when seen from a distance, giving the appearance of continuous lines and forms. Less ink is used to print a picture when it is halftoned (changed from a bitmap to a halftone). Therefore, halftoning is often used in newspapers and magazines to print pages more effectively. Basically, halftoning was done mechanically by printing through a screen with a grid of holes on printers. Ink was forced through the screen’s perforations during printing, leaving dots on the paper. One method for reducing errors is the Floyd–Steinberg dithering algorithm (Figs. 3 and 4). It is intended to provide simple threshold dithering to every pixel while
150
S. Goon et al.
Fig. 3 Floyd–Steinberg error-diffusion matrix
Fig. 4 Floyd–Steinberg error diffusion
appropriately accounting for the brightness mistakes it generates. As a basic inspiring explanation, think about a 50% grayscale image (an image with every pixel exactly halfway between black and white in brightness). The final dithered image should be half black and half white, ideally with a pixel-level black-and-white checker pattern. Using solely black-and-white pixel intensities, this most closely replicates the 50% gray aspect. Because each pixel in a 50% grayscale image begins with the same intensity, the outcome of a thresholding comparison must be the same for each pixel. The approach uses error diffusion to achieve dithering. It distributes the quantization error 7 ··· ∗ 16 3 5 1 ··· · · · 16 16 16 The pixel being scanned at any instance is the one with a star (∗) and the yet to be scanned pixels are those that are blank. This process quantizes each pixel as it scans the image from left to write and top to bottom. We use a symmetric encryption method called AES (Fig. 5). The same encryption key is used for both encryption and decryption. Prior to receiving material, the recipient must have access to an encryption key, which can be obtained from a trusted and secure source. Plaintext is found in unencrypted data that is accessible to everyone. While ciphertext is private material that has been encrypted, AES-128, AES-192, and AES-256 are the three block ciphers that make up the AES algorithm. The 128-bit data block is encrypted using these block ciphers. The AES-128 block cipher uses the 128-bit encryption key, which has 10 rounds. The 192-bit encryption key used
QR Code-Based Digital Payment System Using Visual Cryptography
151
Fig. 5 AES algorithm
by the AES-192 block cipher has 12 rounds. The 256-bit encryption key used by the AES-256 block cipher has 14 rounds.
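The error-diffusion pass described above can be sketched in Python. This is an illustrative implementation of standard Floyd–Steinberg dithering (the function name is ours), not the authors' code:

```python
import numpy as np

def floyd_steinberg_dither(gray):
    """Binarize a grayscale image (values 0..255) by thresholding each
    pixel and diffusing its quantization error to the yet-unscanned
    neighbors with weights 7/16, 3/16, 5/16, 1/16."""
    img = gray.astype(float).copy()
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 255.0 if old >= 128 else 0.0
            img[y, x] = new
            err = old - new
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16      # right
            if y + 1 < h:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16  # below-left
                img[y + 1, x] += err * 5 / 16          # below
                if x + 1 < w:
                    img[y + 1, x + 1] += err * 1 / 16  # below-right
    return img.astype(np.uint8)
```

On a uniform 50% gray input this produces the near-checkerboard pattern discussed above, since each pixel's error pushes its neighbors across the threshold.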
4.2 AES Algorithm
Step 1: From the cipher key, create the set of round keys.
Step 2: Set the block data (plaintext) as the state array's initial value.
Step 3: Add the initial round key to the starting state array.
Step 4: Carry out nine rounds of state manipulation.
Step 5: Execute the tenth and final round of state manipulation.
Step 6: Copy out the final state array as the encrypted data (ciphertext).
In our research, we use the AES algorithm because it is a highly reliable security protocol, is available in both hardware and software, and is one of the most well-known and widely used open-source solutions worldwide. About 2^128 attempts are required to break 128-bit encryption, which makes it extremely hard to attack and therefore a very safe protocol. It supports key sizes of 128, 192, and 256 bits.
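The symmetric property the scheme relies on, that the same secret key both encrypts and decrypts, can be illustrated with a toy XOR stream cipher. This stand-in is for illustration only and is not AES; in practice one would use a vetted AES implementation from a cryptography library:

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    """Expand the key into n pseudo-random bytes (demo only, not AES)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def xor_crypt(key: bytes, data: bytes) -> bytes:
    """Symmetric: XORing with the same keystream twice restores the data,
    so one function serves for both encryption and decryption."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))
```

Encrypting Share 1 and decrypting it with the same key is then a round trip: `xor_crypt(key, xor_crypt(key, share1_bytes)) == share1_bytes`.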
4.3 Decryption
Step 1: Download encrypted Share 1 from the cloud and decrypt it with the AES algorithm using the same secret key.
Step 2: Scan Share 2 manually using a mobile phone or tablet.
Step 3: Overlap decrypted Share 1 and Share 2.
Step 4: Apply a filter to the output.
Step 5: Resize the filtered output to obtain the original QR code.
When the visual cryptography method encrypts a secret image, each secret pixel is expanded into two subpixels in each shadow image. These subpixels are usually made rectangular so that the blocks can be packed closely together. However, if the aspect ratio is regarded as essential information about the hidden image, distortion occurs when the picture is not square. An aspect-ratio-invariant VCS was proposed to resolve this subpixel placement issue. We therefore apply image filtering, shown in Fig. 8g, and after filtering we resize the final output, shown in Fig. 8h. In our project, we use a morphological filter. Morphological filters are essentially processes of shrinkage and expansion: "shrink" refers to rounding off large structures and removing small ones, before growing the remaining structures back by the same amount. Each matrix element in a morphological filter is referred to as a "structuring element". Through image filtering, the morphological operators dilate, erode, open, and close can expand or contract picture areas and remove or add pixels at image region boundaries. In our project, we used two operators: open and close. Opening is built from the fundamental morphological operations of erosion and dilation: an erosion followed by a dilation with the same structuring element. Like erosion, opening tends to remove some of the foreground (bright) pixels from the margins of foreground regions.
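The open and close operators just described can be sketched for binary images in pure Python. This is a toy illustration with a 3x3 structuring element (and a simplified border rule), not the authors' implementation; real projects would typically use OpenCV or scikit-image:

```python
def erode(img):
    """3x3 binary erosion: a pixel survives only if every in-bounds
    neighbor (including itself) is 1. Standard definitions pad with 0
    instead; this border rule is a simplification for the demo."""
    h, w = len(img), len(img[0])
    def all_on(y, x):
        return all(img[ny][nx]
                   for ny in range(max(0, y - 1), min(h, y + 2))
                   for nx in range(max(0, x - 1), min(w, x + 2)))
    return [[1 if all_on(y, x) else 0 for x in range(w)] for y in range(h)]

def dilate(img):
    """3x3 binary dilation: a pixel turns on if any neighbor is 1."""
    h, w = len(img), len(img[0])
    def any_on(y, x):
        return any(img[ny][nx]
                   for ny in range(max(0, y - 1), min(h, y + 2))
                   for nx in range(max(0, x - 1), min(w, x + 2)))
    return [[1 if any_on(y, x) else 0 for x in range(w)] for y in range(h)]

def opening(img):   # erosion then dilation: removes small bright specks
    return dilate(erode(img))

def closing(img):   # dilation then erosion: fills small dark holes
    return erode(dilate(img))
```

Opening a share-overlap image in this way removes isolated noise dots while leaving larger QR modules intact, which is the behavior described above.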
A simple definition of closing is a dilation followed by an erosion using the same structuring element for both operations; closing is essentially opening executed in reverse. Figure 6 depicts the GUI of the AES encryption. Here we select Share 1 with the select-cover-image option and press the encryption button, which encrypts Share 1. After this step, we can save the encrypted Share 1 image using the save-encrypted-image option. Figure 7 similarly depicts the GUI for AES decryption of Share 1. In Fig. 8, we can see all the outcomes of our project: the original QR code is split into two shares, Share 1 is encrypted and then decrypted using the AES algorithm, and finally the two shares are overlapped and the original QR code is restored.
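The 2-out-of-2 share construction and the stacking step can be sketched as follows. This is a minimal illustration of classic (2,2) VCS with 1x2 pixel expansion; the function names are ours, not the authors':

```python
import random

# 1 = dark subpixel, 0 = transparent. For a white secret pixel (0) both
# shares receive the SAME random subpixel pair; for a black pixel (1)
# they receive COMPLEMENTARY pairs. Stacking the transparencies (bitwise
# OR) then makes black pixels fully dark and white pixels half dark,
# while each share alone is uniformly random and reveals nothing.
PATTERNS = [(0, 1), (1, 0)]

def make_shares(secret, rng=random):
    h, w = len(secret), len(secret[0])
    s1 = [[0] * (2 * w) for _ in range(h)]
    s2 = [[0] * (2 * w) for _ in range(h)]
    for y in range(h):
        for x in range(w):
            p = rng.choice(PATTERNS)
            q = p if secret[y][x] == 0 else (1 - p[0], 1 - p[1])
            s1[y][2 * x], s1[y][2 * x + 1] = p
            s2[y][2 * x], s2[y][2 * x + 1] = q
    return s1, s2

def overlap(s1, s2):
    """Simulate physically stacking the two printed shares."""
    return [[a | b for a, b in zip(r1, r2)] for r1, r2 in zip(s1, s2)]
```

Note the 1x2 expansion is why the reconstructed image needs the resize step of Sect. 4.3 to recover the original aspect ratio.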
Fig. 6 Share 1 encryption using AES algorithm
5 Result
Using the parameters of MSE, PSNR, CC, and SSIM, we compared the decrypted images generated by classic (2,2) VCS with the decrypted images produced by our method. The peak signal-to-noise ratio (PSNR) is used to determine the relationship between two pictures and is represented by

PSNR = 10 · log10(255^2 / MSE)
(1)
The mean squared error (MSE) is the average of the squared difference between the original and replicated data, calculated with respect to zero (the smaller the number, the better):

MSE = (1 / (M · N)) Σ_{i=1}^{M} Σ_{j=1}^{N} (I(i, j) − Ĩ(i, j))^2
(2)
Fig. 7 Share 1 decryption using AES algorithm
A measurement of the relationship between two or more variables is correlation. When the two variables, the original image and the decrypted image, are almost identical, they are closely associated and the correlation coefficient (CC) is almost 1. The quality of the original and rebuilt images is also evaluated using the structural similarity index (SSIM). Comparing both decrypted pictures in our suggested model, we found MSE values of less than 0.05 (Fig. 9), PSNR values of 61–63 dB (Fig. 10), CC values of 0.91–0.93 (Fig. 11), and SSIM values of 0.71–0.76 (Fig. 12). This outcome demonstrates that even after an attack, our final results are better than those of the other VCSs. Table 1 presents the results. Figures 9, 10, 11, and 12 show that there is very little information loss and that the quality of the reconstructed pictures is good.
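The MSE, PSNR, and CC metrics above can be computed directly; a sketch using NumPy follows (helper names are ours; SSIM is omitted because it is usually taken from a library such as scikit-image):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images (smaller is better)."""
    return float(np.mean((a.astype(float) - b.astype(float)) ** 2))

def psnr(a, b):
    """Peak signal-to-noise ratio in dB for 8-bit images:
    PSNR = 10 * log10(255^2 / MSE)."""
    m = mse(a, b)
    return float("inf") if m == 0 else 10 * np.log10(255.0 ** 2 / m)

def cc(a, b):
    """Pearson correlation coefficient between the flattened images;
    near 1 when the two images are almost identical."""
    return float(np.corrcoef(a.ravel().astype(float),
                             b.ravel().astype(float))[0, 1])
```

For example, an MSE of 0.04 on 8-bit images corresponds to a PSNR of about 62 dB, which matches the range reported in Table 1.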
6 Conclusion
With the help of visual cryptography and AES-based encryption, we have implemented a digital payment system that protects QR code-based digital payments. At present, it is difficult to detect when an attacker tampers with or alters the QR code containing the merchant's beneficiary account. As a result,
Fig. 8 From top left to right a QR code, b Halftone image, c Share 1, d Share 2, e AES-encrypted share 1, f Decrypted share 1, g Overlapping output, h Filtered output, i Resized output

Table 1 Comparisons of reconstructed QR images (input QRs versus QRs reconstructed using the proposed method)

MSE    | PSNR    | CC     | SSIM
0.0388 | 62.2480 | 0.9157 | 0.7283
0.0337 | 62.8505 | 0.9266 | 0.7530
0.0415 | 61.9479 | 0.9199 | 0.7110
Fig. 9 MSE values of reconstructed QRs
Fig. 10 PSNR values of reconstructed QRs
Fig. 11 CC values of reconstructed QRs
Fig. 12 SSIM values of reconstructed QRs
we partitioned the QR code carrying the original secret into two shares in this work using VCS. Our proposed method ensures that tampering by hackers or intruders is not possible. The existing visual cryptography scheme can be optimized using suitable bio-inspired algorithms such as the grasshopper algorithm [13]. In the future, convolutional neural networks (CNNs) could also be applied to select optimized patterns, achieving better PSNR and MSE values. The only disadvantage of this hybrid method is that, due to the additional security measures, its time complexity is marginally higher than that of other digital payment systems. In the future, more emphasis can be given to QR code transactions, and color visual cryptography may be used to secure any color QR code system directly.
References
1. M. Naor, A. Shamir, Visual cryptography, in Proceedings of EUROCRYPT 1994, Lecture Notes in Computer Science, vol. 950 (1994), pp. 1–12. https://doi.org/10.1007/bfb0053419
2. B.K. Sapna, K.L. Sudha, Efficient visual cryptographic algorithm using AES and modified (K, N) share generation technique. Int. J. Sci. Technol. Res. 8(12), 2135–2141 (2019). ISSN 2277-8616
3. S. Goon, D. Pal, S. Dihidar, Distribution of internet banking credentials using visual cryptography and watermarking techniques, in Cyber Intelligence and Information Retrieval, Lecture Notes in Networks and Systems, vol. 291, eds. by J.M.R.S. Tavares, P. Dutta, S. Dutta, D. Samanta (Springer, Singapore, 2022), pp. 59–67. https://doi.org/10.1007/978-981-16-4284-5_6
4. S. Goon, Major developments in visual cryptography. Int. J. Eng. Adv. Technol. 9(1S6), 81–88 (2019). https://doi.org/10.35940/ijeat.A1016.1291S619
5. B. Bryce, An optimum method for two-level rendition of continuous-tone pictures. IEEE Int. Conf. Commun. 1, 11–15 (1973)
6. Z. Zhou, G.R. Arce, G. DiCrescenzo, Halftone visual cryptography. IEEE Trans. Image Process. 15(8), 2441–2453 (2006)
7. G. Ateniese, C. Blundo, A. De Santis, D.R. Stinson, Extended capabilities for visual cryptography. Theor. Comput. Sci. 250, 143–161 (2001)
8. R.W. Floyd, L. Steinberg, An adaptive algorithm for spatial grey scale. Proc. Soc. Inf. Disp. 17, 75–77 (1976)
9. V. Kalubandi, H. Vadd, V. Ramineni, A. Loganathan, A novel image encryption algorithm using AES and visual cryptography, in International Conference on Next Generation Computing Technologies (NGCT-2016), Dehradun, India (2016), pp. 808–813. https://doi.org/10.1109/NGCT.2016.7877521
10. J. Lu, Z. Yang, L. Li, W. Yuan, L. Li, C. Chen, Multiple schemes for mobile payment authentication using QR code and visual cryptography. Mob. Inf. Syst. 2017, Article ID 4356038 (2017). https://doi.org/10.1155/2017/4356038
11. Z. Fu, Y. Cheng, B. Yu, Visual cryptography scheme with meaningful shares based on QR codes. IEEE Access 6 (2018). https://doi.org/10.1109/access.2018.2874527
12. C. Yuqiao, Z. Fu, B. Yu, S. Gang, A new two-level QR code with visual cryptography scheme. Multimed. Tools Appl. 77, 20629–20649 (2018). https://doi.org/10.1007/s11042-017-5465-4
13. S. Goon, D. Pal, S. Dihidar, Primary color based numerous image creation in visual cryptography with the help of grasshopper algorithm, artificial neural network and elliptic curve cryptography. J. Comput. Sci. 18(5), 322–338 (2022). https://doi.org/10.3844/jcssp.2022.322.338
14. Y. Xiaoyang, S. Yang, Y. Yang, Y. Shuchun, C. Hao, G. Yanxia, An encryption method for QR code image based on ECA. Int. J. Secur. Its Appl. 7(5), 397–406 (2013)
15. C. Xiaohe, F. Liuping, C. Peng, H. Jianhua, Secure QR code scheme based on visual cryptography. Adv. Intell. Syst. Res. 133, 433–436 (2016). https://doi.org/10.2991/aiie-16.2016.99
A Study of Different Approaches of Offloading for Mobile Cloud Computing Rajani Singh, Nitin Pandey, Deepti Mehrotra, and Devraj Mishra
Abstract In today's world, mobile computing is growing rapidly. Smartphones, notebooks, computers, gaming consoles, smartwatches, and other gadgets have grown at an exponential rate. These devices are equipped with data resources such as sensors and cameras, as well as user interface features such as speakers and touchscreens. Online communication and gaming are possible because of the Internet, allowing people to connect. These functionalities require intensive computational operations and must be handled by the latest mobile devices, but mobile devices are limited in data storage and energy. Cloud computing delivers services to users over the Internet; almost anything can be delivered via the cloud. Mobile cloud computing (MCC) is a technique in which application processing and data storage are done outside the mobile device. It is a combination of mobile computing, cloud computing, and wireless networking that collaborate to provide rich computational resources to mobile users. The migration of data and computation from a mobile device toward the cloud is known as offloading. The focus of this paper is to investigate the different algorithms, frameworks, and experimental environments used for offloading data and processes from mobile devices to cloud systems. Keywords MCC · Computational offloading · Virtual machine
R. Singh (B) · N. Pandey · D. Mehrotra Amity University, Noida, India e-mail: [email protected] N. Pandey e-mail: [email protected] D. Mehrotra e-mail: [email protected] D. Mishra ICAR-IIPR, Kanpur, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_12
159
160
R. Singh et al.
1 Introduction
In recent years, mobile applications have been developed in different categories, including e-banking, business, health, social media, travel, and news, and have become more common [1]. MCC is a leading cloud computing technique for ensuring proper application execution across a wide range of applications. To use cloud services, a wireless connection is established between mobile devices and clouds [2]. Cloud computing offers clients all over the world adaptable, resource-rich, scalable, and cost-effective solutions. Cloud computing also offers great processing power, expensive hardware, multi-vendor platforms, and access to millions of applications over the Internet. The number of people who save their data and resources online is growing every day [3]. However, due to the restricted data storage, battery life, and processing speed of mobile devices, computational and data offloading and scheduling solutions have been proposed so that remote servers can support mobile devices with multiple virtual computing resources [4]. By offloading at the data, code, and application levels, the cloud benefits smart devices by relieving them of power-hungry, CPU-intensive applications whose demands exceed supply due to newly emerging applications. Computational tasks may require access to large amounts of data stored on the cloud and can also be outsourced to a cloud service provider that is relatively close to the data [5]. Numerous technical studies address these closely connected difficulties and propose new approaches to achieve the required level of offloading satisfaction. In this literature review, various offloading approaches, such as machine learning-based, energy-efficient, and server-based offloading, are presented that can improve the quality of mobile devices.
2 Research Method
• The goal of this review is to summarize MCC offloading by suggesting answers to the following questions:
Q1. What is offloading in mobile cloud computing?
Q2. What types of applications are suitable for offloading?
Q3. What approaches exist for mobile cloud offloading?
• For information sources, we searched electronic sources broadly. Most of the information comes from educational websites, relevant research papers, and journals: Springer, IEEE Xplore, Elsevier journals, and Google Scholar.
A Study of Different Approaches of Offloading for Mobile Cloud …
161
Fig. 1 Process of offloading [4]
3 Offloading for Mobile Cloud Computing
Mobile devices increase the use of cloud computing technologies via the Internet. In the present scenario, very few people in the world do not use mobile devices; almost everyone uses them in day-to-day life. But mobile devices have restrictions such as processing power, storage capacity, and battery life, and due to these limitations some computationally heavy applications cannot run on time [6]. People embrace cloud computing to store their data in the cloud and solve the storage capacity problem, but other problems with mobile devices remain. Some applications or jobs are so large that they take a long time to load, and they consume so much time and energy that it is difficult to run other programs in parallel. There should therefore be a method to execute these tasks on the cloud and then retrieve the results; this process is known as offloading [7]. To optimize effectiveness, a computational process is moved from a device to a server via the cloud, improving device performance and helping to save energy (Fig. 1).
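The offloading decision itself is often formalized as an energy comparison: offload when transmitting the input and idling during remote execution costs less energy than computing locally. The sketch below is a generic model of this idea, not taken from any one of the surveyed papers, and every parameter name is an illustrative assumption:

```python
def should_offload(cycles, data_bits, local_speed, cloud_speed,
                   bandwidth, p_compute, p_transmit, p_idle):
    """Estimate whether offloading saves energy on the mobile device.

    Local cost:   run `cycles` at `local_speed` Hz drawing `p_compute` W.
    Offload cost: send `data_bits` at `bandwidth` bit/s drawing
                  `p_transmit` W, then idle at `p_idle` W while the
                  cloud (at `cloud_speed` Hz) computes.
    """
    e_local = cycles / local_speed * p_compute
    e_offload = (data_bits / bandwidth * p_transmit
                 + cycles / cloud_speed * p_idle)
    return e_offload < e_local
```

The model makes the trade-off in Sect. 5 concrete: a heavy computation with a small payload over a fast link favors offloading, while a slow link can make local execution cheaper.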
4 Applications Suitable for Offloading in MCC
MCC provides the platform on which applications of mobile devices can run in the cloud environment. People use cloud computing to store their data in the cloud and relieve local storage capacity. Some applications are too heavy to execute on mobile devices, taking so much energy and time that it is challenging to run them in parallel with other applications. So there has to be some mechanism to run these
Fig. 2 Applications of mobile cloud computing
tasks on the cloud. That mechanism is known as data offloading and computational offloading. Some examples of applications that require a lot of processing resources and battery power are as follows:
• Playing games that need quick calculations.
• Downloading and installing large apps.
• Image manipulation.
• File search applications.
So it has become normal practice for mobile devices to send computationally intensive activities to the cloud, which is typically more powerful; hardware constraints on mobile devices necessitate this. Offloading sophisticated modules to the cloud allows us to overcome the limitations of mobile devices (Fig. 2).
5 Different Offloading Approaches
Aldmour et al. [8] proposed an offloading approach to maximize the performance of mobile devices based on energy usage and processing time. They used alternate servers for offloading, both maintained by a decision system. The factors considered include file size, time, and battery capacity. According to the experimental results, the two new servers reduce processing time while simultaneously saving energy. Ramasubbareddy et al. [9] describe mobile application processing on a remote cloud server, which minimizes battery drain and execution time. In MCC, virtual machine (VM) load balancing among cloudlets enhances application performance and response time. Two algorithms are used to reduce VM failures and improve response time. The aspects used for the experiments
are bandwidth, distance, energy, and memory. When the suggested genetic algorithm VM load balancing technique is compared to other popular scheduling algorithms in MCC, the proposed genetic VM computational offloading technique clearly outperforms the others. In [10], Rani and Pounambal addressed the issue of task management in mobile devices, which is closely tied to service quality. Their deep learning-based technique for dynamic job offloading in mobile computing reduces latency; the algorithm determines which cloudlet, cloud server, or device should perform which subtasks. According to the overall analysis and comparison, the method improved mobile device performance by employing a deep learning algorithm. Tian et al. [11] describe computational offloading using reinforcement learning, in which offloading nodes learn offloading decisions based on history; the nodes are viewed as agents that learn to reduce the cost of the computation transfer policy. Lee and Lee [12] studied task offloading in a heterogeneous mobile cloud computing environment, where task delay and energy requirements may cause task offloading failures. The relationship between cloud server, cloudlet, and task offloading is taken into account, and the results demonstrate the effect of failure constraints and scheduling. They also show that employing cloudlets is a good way to improve outage probability, so adopting cloudlets is a trade-off between added costs and revenue. Guo et al. [13] discussed dynamic energy-efficient offloading for MCC, which can help mobile devices save energy and minimize energy usage. Achieving energy efficiency for computation offloading while adhering to strict completion-time limits remains a problem. For local computation, the CPU clock frequency of smart devices is managed with the eDors algorithm to reduce energy usage.
The computational offloading strategy can be determined by considering the task computing workload and the task's assignment to its immediate predecessors. Ko et al. [14] used an online prefetching strategy for large programs with many tasks that seamlessly merges computation prediction with task-level real-time prefetching inside the program runtime, allowing fetching and computation to proceed in parallel. By modeling the sequence of transmissions for the offloaded software as a Markov chain, a stochastic optimization technique derives online fetching policies that minimize the mobile power consumed for transmitting fetched data under a limiting constraint. For slow- and fast-fading channels, a threshold framework chooses candidates for the coming job on a set criterion. This architecture for novel online prefetching in mobile computational offloading enables prediction of computation at the task level and its simultaneous operation with cloud computing. Mehta and Kaur [15] describe the optimization of the computational offloading problem using nature-inspired algorithms to determine whether a task should be executed locally on a smart mobile device or migrated to the cloud. Four different algorithms are considered, and they demonstrate better performance in terms of processing time and power consumption. Chen et al. [16] give a well-planned model that uses a new semidefinite relaxation offloading methodology with a new randomization mapping method. It considers a mobile computing scenario with several independent jobs, a computing access point (CAP), and a remote
server. Tasks offloaded by the user are computed at the access point or forwarded to the cloud. The model enhances the offloading decision by decreasing the weighted cost of consumption, energy, and delay, and is very helpful in resolving issues related to mobile devices. The developed model, with randomized iteration and a CAP, enhances the compute performance of the standard mobile computing system. Xia et al. [17] proposed Phone2Cloud, a computational offloading system that offloads an application at run time from a mobile device to the cloud. To improve application performance and the energy efficiency of smartphones, they suggested three important components: a resource monitor, a CPU workload prediction and bandwidth monitor, and a decision-making offloading algorithm. The algorithm determines which application processing should be offloaded to improve application performance and save energy. Wu and Wolter [18] propose queueing models that reduce response time and energy using an energy-response-time weighted sum, with different offloading policies: when a job arrives, it is processed either locally or remotely on the cloud, over a wireless or cellular network. They find that the dynamic offloading policy derived from the offloading trade-off beats alternative policies by a large margin, because the dynamic technique accounts for changes in the queue and adapts its metric as new jobs arrive. A comparative study of computational offloading is presented in Table 1.
6 Conclusion
Mobile devices should support real-time applications including e-commerce, e-banking, health care, social media, and gaming. Users of mobile devices demand the same degree of quality of service as users of desktop programs. Offloading is a technique for reducing the number of mobile tasks by moving all or some of them to resource-rich platforms such as cloud servers. We also discussed how the demand for data computing on mobile devices is growing and how data processing capacity is being considered a strategic resource. Due to a lack of storage or excessive task computation, many applications are inaccessible on the devices themselves. By offloading large modules to the cloud, MCC enables access to nearly all applications that are otherwise restricted by battery size or memory. We investigated computational offloading and presented several models for lowering costs, energy consumption, response time, and battery usage by moving processing to the cloud.
Table 1 A comparative study for computational offloading

Year | Paper | Strategy used for offloading | Description | Testing
2021 | [8] | FUR, SUR | First and second upload servers, managed by a decision algorithm, reduce energy drain; power consumption over 4G is compared with Wi-Fi | Experiments
2021 | [9] | MOGALMCC | Multi-objective genetic algorithm load balancing in the MCC environment; considers distance, memory, bandwidth, and cloudlet load to find optimal cloudlet scheduling before VM placement in another cloudlet | Simulator experiment
2019 | [10] | DLDTO | Deep learning-based dynamic task offloading for the mobile cloudlet to decrease delay arising in wireless LAN; DLDTO (deep learning dynamic task offloading) is compared with CDTO (cloudlet-based dynamic task offloading) in terms of completion time and energy consumption | Simulator experiment
2019 | [11] | RLCO | Reinforcement learning-based computational offloading in which each offloading node acts as an agent that learns offloading decisions from past channel states to satisfy computation transmission latency | Simulator experiment
2019 | [13] | eDors | CPU clock management in transmission and local computing power allocation; the eDors algorithm efficiently minimizes application completion time and energy consumption of smart mobile devices | Simulator experiment
2019 | [15] | GA, DE, PSO, SFLA | Four algorithms, GA (genetic algorithm), DE (differential evolution), PSO (particle swarm optimization), and SFLA (shuffled frog leaping algorithm), are compared to find the best result for offloading optimization | Simulator experiment
2018 | [12] | HMCC with TMDs | Heterogeneous mobile cloud computing system with local cloudlets, remote cloud servers, TMDs (task-offloading mobile devices), NMTDs (non-task-offloading mobile devices), and radio access networks (cellular and WLAN) to enhance processing power and save energy | Simulator experiment
2017 | [14] | MCO | A novel online prefetching technique for large-scale programs with numerous tasks, seamlessly integrating real-time prefetching and computational prediction within the program runtime | Simulator experiment
References
1. N. Fernando, S.W. Loke, W. Rahayu, Mobile cloud computing: a survey. Future Gener. Comput. Syst. 29(1), 84–106 (2013). https://doi.org/10.1016/j.future.2012.05.023
2. A.S. AlAhmad, H. Kahtan, Y.I. Alzoubi, O. Ali, A. Jaradat, Mobile cloud computing models security issues: a systematic review. J. Netw. Comput. Appl. 190, 103152 (2021). https://doi.org/10.1016/j.jnca.2021.103152
3. M. Babar, M.S. Khan, F. Ali, M. Imran, M. Shoaib, Cloudlet computing: recent advances, taxonomy, and challenges. IEEE Access 9, 29609–29622 (2021). https://doi.org/10.1109/ACCESS.2021.3059072
4. K. Akherfi, M. Gerndt, H. Harroud, Mobile cloud computing for computation offloading: issues and challenges. Appl. Comput. Inform. 14(1), 1–16 (2018). https://doi.org/10.1016/j.aci.2016.11.002
5. A.M. Rahmani et al., Towards Data and Computation Offloading in Mobile Cloud Computing: Taxonomy, Overview, and Future Directions, vol. 119, no. 1 (Springer US, 2021)
6. M.M. Alqarni, A. Cherif, E. Alkayal, A survey of computational offloading in cloud/edge-based architectures: strategies, optimization models and challenges. KSII Trans. Internet Inf. Syst. 15(3), 952–973 (2021). https://doi.org/10.3837/tiis.2021.03.008
7. Offloading approach in mobile cloud computing. Int. J. Eng. Sci. Manag. Res. (IJESMR) 4(3), 1–6 (2017)
8. R. Aldmour, S. Yousef, T. Baker, E. Benkhelifa, An approach for offloading in mobile cloud computing to optimize power consumption and processing time. Sustain. Comput. Inform. Syst., 100562 (2021)
9. S. Ramasubbareddy, E. Swetha, A.K. Luhach, T.A.S. Srinivas, A multi-objective genetic algorithm-based resource scheduling in mobile cloud computing. Int. J. Cogn. Inform. Nat. Intell. 15(3), 58–73 (2021). https://doi.org/10.4018/IJCINI.20210701.oa5
10. D.S. Rani, M. Pounambal, Deep learning based dynamic task offloading in mobile cloudlet environments (2019)
11. B. Tian, L. Wang, Y. Ai, A. Fei, Reinforcement learning based matching for computation offloading in D2D communications, in 2019 IEEE/CIC International Conference on Communications in China (ICCC) (2019), pp. 984–988. https://doi.org/10.1109/ICCChina.2019.8855817
12. H.S. Lee, J.W. Lee, Task offloading in heterogeneous mobile cloud computing: modeling, analysis, and cloudlet deployment. IEEE Access 6, 14908–14925 (2018). https://doi.org/10.1109/ACCESS.2018.2812144
13. S. Guo, J. Liu, Y. Yang, B. Xiao, Z. Li, Energy-efficient dynamic computation offloading and cooperative task scheduling in mobile cloud computing. IEEE Trans. Mob. Comput. 18(2), 319–333 (2019). https://doi.org/10.1109/TMC.2018.2831230
14. S.W. Ko, K. Huang, S.L. Kim, H. Chae, Energy efficient mobile computation offloading via online prefetching, in IEEE International Conference on Communications (2017). https://doi.org/10.1109/ICC.2017.7997341
15. S. Mehta, P. Kaur, Efficient computation offloading in mobile cloud computing with nature-inspired algorithms. Int. J. Comput. Intell. Appl. 18(4), 1–21 (2019). https://doi.org/10.1142/S1469026819500238
16. M.H. Chen, B. Liang, M. Dong, A semidefinite relaxation approach to mobile cloud offloading with computing access point, in IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC) (2015), pp. 186–190. https://doi.org/10.1109/SPAWC.2015.7227025
17. F. Xia, F. Ding, J. Lie, X. Kong, L.T. Yang, J. Ma, Phone2Cloud: exploiting computational offloading for energy saving on smartphones in mobile cloud computing. Nature 388, 539–547 (2018)
18. H. Wu, K. Wolter, Tradeoff analysis for mobile cloud offloading based on an additive energy-performance metric, in 8th International Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS 2014 (2014), pp. 90–97. https://doi.org/10.4108/icst.valuetools.2014.258222
Use of Machine Learning Models for Analyzing the Accuracy of Predicting the Cancerous Diseases Shanthi Makka, Gagandeep Arora, Sai Sindhu Theja Reddy, and Sunitha Lingam
Abstract One of the diseases that most frequently causes deaths, and in large numbers, is breast cancer. Global statistics show that breast cancer (BC) impacts women worldwide, seriously harming women's health and causing many deaths; it is a threat to society, and the majority of new cancer cases involve breast cancer. The tumors responsible for the deaths of women around the globe are malignant. As is widely agreed, early diagnosis of medical disorders such as cancer or malignant disease always increases the likelihood of survival, since patients can receive early or preemptive clinical treatment, and classifying benign tumors more precisely helps patients avoid unnecessary therapies. It is therefore imperative to precisely diagnose tumor status, whether benign or malignant, for the detection of breast cancer, which has become an extremely rapidly developing field of study. On a complex dataset, machine learning models are very useful for predicting or forecasting BC patterns, as they can classify different patterns more accurately than general-purpose algorithms. Artificial intelligence models are useful for properly grouping datasets, particularly in the healthcare arena, where they are frequently used to reach conclusions and make predictions. We calculated the exactness of prediction using Logistic Regression and compared it with a Decision Tree and a Random Forest Classifier to determine the best method for predicting breast cancer on the dataset available to us. The main goal is to evaluate each model's productivity and exactness in terms of accuracy, precision, F1 score, and support. According to the findings, the Random Forest Classifier has the highest precision (96.50%) when assessing the data, followed by Logistic Regression (95.10%) and Decision Tree (93.70%).
All of the trials are carried out in simulation using AI tools.

Keywords Breast cancer · Decision tree · Random forest classifier · Logistic regression · Machine learning models · Artificial intelligence

S. Makka (B) · G. Arora · S. S. T. Reddy · S. Lingam Vardhaman College of Engineering, Kacharam, Shamshabad, Hyderabad, Telangana 501218, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_13
S. Makka et al.
1 Introduction Breast cancer affects millions of people worldwide, with an estimated 609,360 fatalities in 2022. By the end of 2020, 7.8 million women had received a diagnosis of breast cancer within the previous five years, making it the most prevalent cancer worldwide. Breast cancer is the most common disease affecting women worldwide, in both developed and developing nations. Estimates indicate that more than 508,000 women died from the disease worldwide in 2012 and that 609,360 women will die from it by the end of 2022. Despite the misconception that breast cancer is a disease primarily affecting industrialized nations, less developed nations account for more than half of all cases and 58% of fatalities. Prevalence varies widely, from 19.3 per 100,000 women in Eastern Africa to 89.7 per 100,000 women in Western Europe; in the majority of emerging regions, rates are below 40 per 100,000. Breast cancer rates are rising, even though they remain lowest in the majority of African nations [1]. Medical care providers are using artificial intelligence technologies to help them with clinical care, clinical operations, and billing. AI predictions can influence the diagnosis, the clinical care strategy, and patient risk stratification, among other factors; as a result, clinicians and others should be aware of the "cause" of a prediction [2]. This research compares the accuracy of three distinct, currently popular models: Logistic Regression, Decision Tree, and Random Forest Classifier. The goal is to identify the superior machine learning model in terms of accuracy.
2 Related Work Supervised machine learning includes classification algorithms, for which the data are divided into two parts: training data and testing data. The system uses the training data to learn and then identifies the category of fresh observations. Classification is the act of assigning fresh observations to one of several classes or groups after learning from a dataset or prior observations. The classes are binary-like values: Yes or No, 0 or 1, Spam or Ham, and so forth; they may also be called targets, labels, or categories. Unlike regression, classification yields a category rather than a value, such as "Red or Blue", "Fruit or Vegetable", or "M or B". As a supervised learning technique, classification employs labeled input data, meaning it incorporates both input and output. Classification is one of the most crucial and fundamental tasks in data mining and machine learning. Several studies have characterized breast cancer by applying artificial intelligence and data mining to various clinical datasets. One study considered Naive Bayes, K-Nearest Neighbor, and Support Vector Machine (SVM) to obtain the preeminent classifier on the WBC dataset [3]. SVM emerges
Use of Machine Learning Models for Analyzing the Accuracy …
as the most accurate classifier [4], with a precision of 96.99%. Decision tree classifiers have been used to attain a precision of 69.23% on breast cancer datasets. Another study considered Naive Bayes, SVM-RBF, RBF NNs, Decision Tree, and CART to discover the most precise classifier on the Wisconsin Breast Cancer dataset [5]; it achieved a precision of 96.84%. In our research, we consider the use of data mining techniques, namely Decision Tree, Logistic Regression, and Random Forest Classifiers, with the Breast Cancer (original) dataset for decision-making in both diagnosis and exploration. When analyzing data, the objective is to obtain the best level of accuracy while making the fewest errors feasible. To achieve this, we analyze the efficacy and viability of the approaches using a range of parameters, such as accuracy, precision, sensitivity, and specificity, correctly and incorrectly classified instances, and model creation time. The Random Forest Classifier achieved the highest degree of accuracy in our tests (96.50%).
3 Machine Learning Models A machine learning model is software that recognizes patterns or behaviors based on prior knowledge or data. The model that the learning algorithm develops captures the patterns it finds in the training data [6, 7] and makes predictions on new data.
3.1 Logistic Regression Logistic regression comes in two forms: simple regression, in which the model depends on a single independent variable, and multiple regression, in which it depends on several independent variables. A logistic regression model predicts the probability of each category of the dependent variable: it finds the linear relationship between the independent variables and the link function of this probability (Fig. 1), and the link function with the best goodness of fit for the provided data is then chosen. Linear regression and logistic regression are fairly similar except for how they are used: logistic regression solves classification problems, while linear regression solves regression problems. Instead of fitting a regression line, logistic regression fits an "S"-shaped logistic function that forecasts two extreme values (0 or 1). The logistic function's curve represents [8] the probability of an outcome, such as whether malignant cells are present, or whether a mouse is obese based on its weight. From real-valued inputs, the logistic regression model [9] forecasts whether an observation belongs to the default class (class 0). If the probability is greater than 0.5, the alternate class (class 1) is predicted; otherwise, the default class (class 0) is predicted [6]. Similar to linear regression, logistic regression also
Fig. 1 Logistic regression
has three coefficients for this dataset:

Output = B0 + B1 × X1 + B2 × X2    (1)

The learning algorithm [10] will choose the ideal coefficient values (B0, B1, and B2) based on the training data. In contrast to linear regression, the logistic function is used to transform the result into a probability:

p(class = 0) = 1 / (1 + e^(−Output))    (2)
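As a rough illustration of Eqs. (1) and (2), the following sketch computes the class probability from the linear output. The coefficient values and inputs are made up for illustration and are not fitted to the chapter's dataset.

```python
import math

def class_probability(x1, x2, b0, b1, b2):
    """Eq. (1): linear combination of inputs; Eq. (2): logistic transform."""
    output = b0 + b1 * x1 + b2 * x2
    # Following the chapter's notation this is p(class = 0); the alternate
    # class is predicted when the probability falls below 0.5.
    return 1.0 / (1.0 + math.exp(-output))

# Hypothetical coefficients and inputs, for illustration only
p = class_probability(x1=2.0, x2=1.0, b0=-0.5, b1=0.3, b2=0.2)
predicted_class = 0 if p > 0.5 else 1
```

The learned coefficients simply shift where the "S"-shaped curve crosses the 0.5 threshold.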
3.2 Decision Tree When building a binary decision tree, the input space must be divided. A greedy technique called recursive binary splitting is used to divide the space: a numerical procedure lines up the values of each variable and evaluates candidate split points using a cost function. The split with the lowest cost is selected. All possible split points and input variables are assessed and chosen greedily (that is, the best split point is always chosen at each step). For regression predictive modeling problems, the cost function minimized to identify split points is the sum-squared error across all training samples that fall within the rectangle [11]:
Σ_{i=1}^{n} (y_i − prediction_i)²    (3)
where y_i is the output of the training sample and prediction_i is the value the rectangle is expected to produce. The Gini index function [12], which reveals how pure the leaf nodes are (how mixed the training data assigned to each node is), is used for classification:

G = Σ_{k=1}^{n} p_k × (1 − p_k)    (4)
where p_k is the proportion of training examples with class k in the rectangle of interest, and G is the Gini index over all classes. In a binary classification problem, G = 0 for a node whose instances all belong to one class and G = 0.5 for a node with a 50/50 split of classes. For a binary classification problem, this can be written as follows [13]:

G = 2 × p1 × p2    (5)

G = 1 − (p1² + p2²)    (6)
Each node's Gini index is determined and weighted according to the number of instances within the parent node. The Gini score for a particular split point in a binary classification problem is calculated as:

G = (1 − (g11² + g12²)) × (n_g1 / n) + (1 − (g21² + g22²)) × (n_g2 / n)    (7)

where G is the Gini index for the split point, n is the total number of instances being split at the parent node, n_g1 and n_g2 are the numbers of instances in groups 1 and 2, and g11, g12 (respectively g21, g22) are the proportions of instances of classes 1 and 2 in group 1 (respectively group 2) [7].
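Equation (7) can be sketched in a few lines of Python. The encoding here is an assumption made for illustration: labels are 0 or 1, and each group is the list of class labels that a candidate split sends to one side.

```python
def gini_for_split(group1, group2):
    """Weighted Gini index of a binary split, as in Eq. (7)."""
    n = len(group1) + len(group2)
    score = 0.0
    for group in (group1, group2):
        if not group:
            continue  # an empty group contributes nothing
        p1 = group.count(1) / len(group)  # proportion of class 1
        p2 = group.count(0) / len(group)  # proportion of class 2 (label 0)
        score += (1.0 - (p1 ** 2 + p2 ** 2)) * (len(group) / n)
    return score

# A pure split scores 0; a 50/50 mix in both groups scores 0.5
pure = gini_for_split([1, 1, 1], [0, 0, 0])
mixed = gini_for_split([1, 0], [1, 0])
```

The tree-building loop would evaluate this score for every candidate split and keep the one with the lowest value.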
3.3 Random Forest Classifier Random Forest [12] is a popular machine learning algorithm that employs the supervised learning technique [14]. It can be used to solve both classification and regression problems. It is founded on ensemble learning, a method of combining multiple classifiers to solve a complex problem and improve model performance. As the name suggests, the Random Forest classifier "contains a number of decision trees on various subsets of a given dataset and takes the average" to increase the predictive accuracy [15] on a dataset. Rather than relying on a single decision tree, the random forest takes the predictions from every tree and predicts the final result based on the majority vote [16, 17].
Breiman's (2001) Random Forest [18] is an ensemble learning technique that corrects for each tree's tendency to overfit the training set. A large number of "decorrelated" decision trees are created using bagging and a tree-learning algorithm resembling CART. Bagging was proposed as an ensemble technique by Breiman (1996a) to increase model accuracy [16, 17]. Given a training set

Z = {(x1, y1), (x2, y2), …, (xN, yN)}    (8)

B bootstrapped training sets Z1, Z2, …, ZB are generated; each Zb is created by repeatedly sampling N′ observations from Z with replacement, where N′ is typically equal to N. A separate model is trained on each bootstrapped [19, 20] training set Zb, and its prediction on an observation x is denoted f̂_b(x). Then, in the case of regression, the overall bagged estimate f̂(x) for an observation x is obtained by averaging over all of the individual predictions:

f̂(x) = (1/B) Σ_{b=1}^{B} f̂_b(x)    (9)
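A minimal bagging sketch of Eqs. (8)–(9), using trivial constant models so the mechanics stay visible; the toy data and the model choice are purely illustrative.

```python
import random

def bootstrap_sample(Z, rng):
    """Draw one bootstrapped training set Z_b: N samples with replacement."""
    return [rng.choice(Z) for _ in range(len(Z))]

rng = random.Random(0)
Z = [(1, 2.0), (2, 2.5), (3, 3.0), (4, 3.5)]  # toy training set of (x, y)

# Fit one trivial "model" per bootstrap: predict the mean of its y values
models = []
for _ in range(25):  # B = 25 bootstrapped training sets
    Zb = bootstrap_sample(Z, rng)
    mean_y = sum(y for _, y in Zb) / len(Zb)
    models.append(lambda x, m=mean_y: m)

def bagged_predict(models, x):
    """Eq. (9): average the B individual predictions."""
    return sum(f_b(x) for f_b in models) / len(models)

estimate = bagged_predict(models, x=2)
```

A real random forest replaces the constant models with decision trees and additionally decorrelates them by subsampling the features at each split.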
4 Experiment We conducted an experiment to compare the behavior of the Decision Tree, Logistic Regression, and Random Forest Classifier, with the goal of more precisely assessing the algorithms' effectiveness [21] and efficiency. The experiment's research questions are: Which model is more efficient? Which model is most accurate?
4.1 Experiment Environment All of the experiments in this research were carried out in Google Colaboratory, with libraries such as numpy, pandas, matplotlib, and seaborn used for data preprocessing, clustering, classification [22], and plotting. The platform's well-defined structure lets experimenters and developers design and analyze their models.
4.2 Breast Cancer Dataset The experimental environment was created using a dataset from the Kaggle machine learning repository. The data set contains 569 instances (Benign: 357, Malignant: 212). Figure 2 shows a count plot with the value '0' encoded as benign and the value '1' encoded as malignant.

Fig. 2 Experimental environment
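The class counts above can be reproduced with a short pandas sketch. The column name `diagnosis` and the 'B'/'M' coding follow the common Kaggle version of this dataset and are assumptions here; the `read_csv` call is shown only as a comment.

```python
import pandas as pd

# df = pd.read_csv("data.csv")  # hypothetical path to the Kaggle file
# Stand-in frame with the class balance reported above
df = pd.DataFrame({"diagnosis": ["B"] * 357 + ["M"] * 212})

# Encode benign as 0 and malignant as 1, then count each class
df["target"] = df["diagnosis"].map({"B": 0, "M": 1})
counts = df["target"].value_counts()  # 357 benign, 212 malignant
```

Plotting `counts` (for example with seaborn's `countplot`) yields the count plot described for Fig. 2.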
5 Experimental Results In this section, the data analysis findings are presented. The entire data set is divided into two sections: 75% training data and 25% testing data, and the models are indexed as [0], [1], and [2]. During training, the Decision Tree (100%) outperformed Logistic Regression (99.06%) and the Random Forest Classifier (99.53%) (Tables 1 and 2). The model's performance is evaluated [23] using a confusion matrix (Table 3) constructed from the three models [24]. The third model, the Random Forest, has the best performance result among the models, with TP = 87, TN = 51, FP = 3, and FN = 2. (In the false negative cases the model predicts that a patient does not have cancer when in fact they do, just as in the false positive cases it predicts cancer for a patient who does not have it.)

Table 1 Training accuracy with different models

| Model | Accuracy (%) |
|---|---|
| [0] Logistic regression | 99.06 |
| [1] Decision tree | 100 |
| [2] Random forest classifier | 99.53 |

Table 2 Testing accuracy with different models

| Model | Accuracy (%) |
|---|---|
| [0] Logistic regression | 95.10 |
| [1] Decision tree | 93.70 |
| [2] Random forest classifier | 96.50 |

Table 3 Confusion matrix

| Model | TP | TN | FP | FN | Accuracy (%) |
|---|---|---|---|---|---|
| [0] Logistic regression | 86 | 50 | 4 | 3 | 95.10 |
| [1] Decision tree | 83 | 51 | 7 | 2 | 93.70 |
| [2] Random forest classifier | 87 | 51 | 3 | 2 | 96.50 |

The predicted labels of the Random Forest Classifier were printed next to the actual test labels; the two arrays of 0s and 1s agree except at the handful of positions corresponding to the misclassified samples. The performance was tested again using the classification report, which gives exactly the same results as the confusion matrix; here we obtain the precision, recall, f1 score, and support metrics (Fig. 4). A heat map represents data through color variations (Fig. 3). Whenever you are examining multivariate data, a heat map is very useful when applied to a tabular format: the variables are placed on the rows and columns, and the cells of the table are colored. We visualize the correlation matrix as a heatmap to find the correlation between each feature and the target. A correlation heatmap depicts a two-dimensional matrix of correlation between two discrete dimensions, where the data are represented by colored cells; the rows of the table represent the values of the first dimension (Table 4), and the columns represent the second. In Fig. 3, cancer diagnosis is represented by dot color (blue dot: no cancer; red dot: cancer) while comparing the different physical features of breast cancer.
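The accuracy in Table 3 follows directly from the four cell counts. A small pure-Python helper, written here for this chapter's numbers, makes the relationship explicit.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and f1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Random Forest cell counts from Table 3: 138 of 143 test samples correct
acc, prec, rec, f1 = metrics(tp=87, tn=51, fp=3, fn=2)
```

Here `acc` works out to 138/143 ≈ 0.9650, matching the 96.50% reported in Tables 2 and 3.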
Fig. 3 Heat map
6 Conclusion Various data mining and artificial intelligence methods can be used to analyze clinical data. The creation of precise and computationally efficient models for medical applications is a significant challenge in the AI realm. In this study, we applied three fundamental models, Logistic Regression, Decision Tree, and Random Forest Classifier, to the Breast Cancer dataset. We considered the productivity and viability of these computations in terms of exactness, in order to find the best classification precision; the Random Forest Classifier achieves a precision of
Fig. 4 Diagnostic analysis
97% and thus outperforms all other models. Taking everything into account, the Random Forest Classifier has proven its ability to forecast and predict breast cancer, achieving the best results in terms of accuracy and low error rate.
Table 4 Accuracy findings on different models

| Model | Attrib | Prec | Recall | f1-S | Support | Final accuracy |
|---|---|---|---|---|---|---|
| Logistic regression | 0 | 0.973 | 0.964 | 0.961 | 90 | 0.951048951048951 |
| | 1 | 0.934 | 0.946 | 0.932 | 53 | |
| | Accuracy | | | 0.951 | 143 | |
| | Macro Avg. | 0.953 | 0.951 | 0.952 | 143 | |
| | Weighted Avg. | 0.954 | 0.953 | 0.957 | 143 | |
| Decision tree | 0 | 0.983 | 0.924 | 0.953 | 90 | 0.9370629370629371 |
| | 1 | 0.883 | 0.965 | 0.925 | 53 | |
| | Accuracy | | | 0.94 | 143 | |
| | Macro Avg. | 0.936 | 0.944 | 0.934 | 143 | |
| | Weighted Avg. | 0.945 | 0.945 | 0.946 | 143 | |
| Random forest classifier | 0 | 0.986 | 0.975 | 0.975 | 90 | 0.965034965034965 |
| | 1 | 0.943 | 0.964 | 0.957 | 53 | |
| | Accuracy | | | 0.97 | 143 | |
| | Macro Avg. | 0.96 | 0.96 | 0.96 | 143 | |
| | Weighted Avg. | 0.97 | 0.97 | 0.97 | 143 | |
References

1. https://www.who.int/cancer/detection/breastcancer/en/index1.html
2. M. Aurangzeb, BCB'18: 9th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (Washington, DC, USA)
3. A. Christobel, Y. Sivaprakasam, An empirical comparison of data mining classification methods. Int. J. Comput. Inf. Syst. 3(2), 24–28 (2011)
4. V. Chaurasia, S. Pal, Data mining techniques: to predict and resolve breast cancer survivability. Int. J. Comput. Sci. Mob. Comput. IJCSMC 3(1), 10–22 (2014)
5. J. Brownlee, Machine Learning Algorithms: Discover How They Work and Implement Them from Scratch, pp. 52–56
6. J. Brownlee, Machine Learning Algorithms: Discover How They Work and Implement Them from Scratch, pp. 74–75
7. C. Song, T. Ristenpart, V. Shmatikov, Machine learning models that remember too much, in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), pp. 587–601
8. P.H.C. Chen, Y. Liu, L. Peng, How to develop machine learning models for healthcare. Nat. Mater. 18(5), 410–414 (2019)
9. A. Mosavi, M. Salimi, S. Faizollahzadeh Ardabili, T. Rabczuk, S. Shamshirband, A.R. Varkonyi-Koczy, State of the art of machine learning models in energy systems, a systematic review. Energies 12(7), 1301 (2019)
10. M. Yin, J. Wortman Vaughan, H. Wallach, Understanding the effect of accuracy on trust in machine learning models, in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (2019), pp. 1–12
11. T.J. Cleophas, A.H. Zwinderman, H.I. Cleophas-Allers, Machine Learning in Medicine, vol. 9 (Springer, Dordrecht, The Netherlands, 2013)
12. N.H. Shah, A. Milstein, S.C. Bagley, Making machine learning models clinically useful. JAMA 322(14), 1351–1352 (2019)
13. D. Assaf, Y.A. Gutman, Y. Neuman, G. Segal, S. Amit, S. Gefen-Halevi, A. Tirosh, Utilization of machine-learning models to accurately predict the risk for critical COVID-19. Intern. Emerg. Med. 15(8), 1435–1443 (2020)
14. A. Bella, C. Ferri, J. Hernández-Orallo, M.J. Ramírez-Quintana, Calibration of machine learning models, in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques (IGI Global, 2010), pp. 128–146
15. J. Zhang, Y. Wang, P. Molino, L. Li, D.S. Ebert, Manifold: a model-agnostic framework for interpretation and diagnosis of machine learning models. IEEE Trans. Vis. Comput. Graph. 25(1), 364–373 (2018)
16. M. Montazeri, M. Montazeri, M. Montazeri, A. Beigzadeh, Machine learning models in breast cancer survival prediction. Technol. Health Care 24(1), 31–42 (2016)
17. M. Benedetti, E. Lloyd, S. Sack, M. Fiorentini, Parameterized quantum circuits as machine learning models. Quantum Sci. Technol. 4(4), 043001 (2019)
18. L. Breiman, Random forests. Mach. Learn. 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324
19. Z.H. Zhou, Learnware: on the future of machine learning. Front. Comput. Sci. 10(4), 589–590 (2016)
20. X. Dastile, T. Celik, M. Potsane, Statistical and machine learning models in credit scoring: a systematic literature survey. Appl. Soft Comput. 91, 106263 (2020)
21. A. Chatzimparmpas, R.M. Martins, I. Jusufi, K. Kucher, F. Rossi, A. Kerren, The state of the art in enhancing trust in machine learning models with the use of visualizations. Comput. Graph. Forum 39(3), 713–756 (2020)
22. Y. Raita, T. Goto, M.K. Faridi, D.F. Brown, C.A. Camargo, K. Hasegawa, Emergency department triage prediction of clinical outcomes using machine learning models. Crit. Care 23(1), 1–13 (2019)
23. W. Wang, M. Kiik, N. Peek, V. Curcin, I.J. Marshall, A.G. Rudd, B. Bray, A systematic review of machine learning models for predicting outcomes of stroke with structured data. PLoS One 15(6), e0234722 (2020)
24. S. Aruna, L.V. Nandakishore, Knowledge Based Analysis of Various Statistical Tools in Detecting Breast Cancer (2011), pp. 37–45
Predict Employee Promotion Using Supervised Classification Approaches Mithila Hoq, Papel Chandra, Shakayet Hossain, Sudipto Ghosh, Md. Ifran Ahamad, Md. Shariar Rahman Oion, and Md. Abu Foyez
Abstract In today's massive corporate offices and industries, it is difficult to personally evaluate every employee's performance and recommend them for promotion. The topic has been the subject of only a few research projects, but those who have worked on it have created algorithms for predicting promotions from the fundamental traits and job qualities of each person. The incorporation of extra attributes allows our model to do more with fewer strategies. The goal of this research is to provide a machine learning-based system for predicting whether or not an employee will be promoted, by offering a linear model with a respectable level of accuracy at a low cost. The process takes into account the employee's training record, annual performance review score, length of service, key performance indicators, and other attributes. Due to the limited number of classes, a linear classifier was used to train the model. The linear classifier completes 50 iterations with an accuracy rate of 92.6%. Using this method, one can obtain the likelihood of promotion for any employee. This software could prove to be the saving grace for an organization's HR department. Keywords ML · Epochs · Linear classifier · Key performance indicators · Accuracy
1 Introduction An individual may be highly motivated to perform well in their current position because of the promotion opportunities available to them. Human resources departments face a huge task in manually categorizing employee data and identifying the worthy candidates; a global corporation faces extreme hardship here. The procedure may take a long time and stifle productivity. Nevertheless, a corporation should promote qualified
M. Hoq et al.
individuals [1]. Promotion is also important for gaining self-worth and progressing in one's present job [2]. How will a corporation discover employees who are suitable for advancement? Our article is concerned with solving a firm's prediction of its personnel's advancement. A machine learning model may predict professional growth and related processes [1–6], built from a company's dataset. The use of big data analytic technology in human resources has shown encouraging outcomes [2]. The relationships between promotion and various qualities or attributes are first investigated using correlation analysis [2]. What kinds of workers are more likely to be given this opportunity? This effort models HR data in order to categorize it; such a classifier might aid a business in predicting staff promotions. In this study, predicting employee advancement is made straightforward by employing a fast algorithm and analyzing the targeted accuracy, applying feature engineering and linear classifiers over a large dataset. Some earlier research used algorithms that are sluggish and less precise; such constraints are circumvented in this study. That has a major impact on cost, time, and accuracy, all of which are important aspects of any business. The rest of the paper is structured as follows: related work is grouped and discussed in Sect. 2, Sect. 3 describes in further depth how this study is put into action, Sect. 4 presents the performance evaluation, and the last section displays the model's findings.
2 Related Works A suggested overview of relevant models and other work is given in Table 1. Research on the topic of foreseeing promotions in the workplace is limited. In this part, we looked at a few machine learning models and techniques that are closely linked to ours. Machine learning algorithms have been used to predict attrition [4], turnover [5–7], promotions [1], personal basic traits [2], and other phenomena. Scientists used the most effective methodology for their research project to get a desirable outcome. Each of [1–10] uses a unique set of algorithms and classifiers to arrive at its conclusions. Some models use data mining and data-driven techniques to foretell future events, for example to anticipate the success of a process or the productivity of an individual [9], or to determine the significance of a company's location [10]. The machine learning method is used to address all these issues and improve prediction and accuracy. On the other hand, many methods cause the model to be sluggish and inaccurate.
Table 1 Related work on machine learning prediction approaches

| Authors of studies | Researched the issue | Approaches to machine learning |
|---|---|---|
| Jagan Mohan Reddy | Recruitment prediction using machine learning | Naive Bayes classifier and the K-nearest neighbor algorithm |
| Yuxi Long | The likelihood of branding an employee varies greatly from person to person and from position to position | Random forest model, correlation analysis |
| Rohit Punnoose | Incorporating neural network models to make predictions about staff turnover | Logistic regression, extreme gradient-boosting |
| Sarah S. Alduayj | Predicting employee turnover using machine learning | SVM and the K-nearest neighbor algorithm |
| Heng Zhang | Predicting the likelihood of an employee's departure using ML algorithms | GBDT algorithm and LR algorithm |
| Yue Zhao | Predicting employee turnover using machine learning is a reliable method | Random forests, gradient-boosting, and decision trees |
| Fredrik Norrman | An organization's repercussions upon learning that a user's staff churn was calculated using ML algorithms | Random forest method, support vector machine |
| Ibrahim Onuralp Yigit | Data mining methods for forecasting churn in the workplace | Decision tree, logistic regression, SVM, KNN |
| John M. Kirimi | Applying data analysis classifier to predict the performance of the employee | Standard procedure for data mining throughout the sector |
| Jiamin Liu | Using data to understand promotions in the workplace: the importance of job position | Random forest, logistic regression |
3 Implementation 3.1 Proposed Model Our proposed approach creates a novel model that strikes a good balance between accuracy and cost for employee promotion prediction. We provide figures to demonstrate how the system works and to characterize the whole process. To begin, we needed to amass the data that would
Fig. 1 System design
serve as our input [1]. After the dataset is analyzed, feature engineering is applied: the dataset is split into categorical columns and numeric columns, and the feature column is formed by combining the numerical columns with the categorical columns mapped to assigned numbers. Missing values in the numerical columns are calculated and filled in. This preprocessing gets the data ready for the linear classifier. The outcome of this strategy can then be used for performance analysis: with this method, we can calculate an individual's chances of being promoted. After the system's performance has been assessed, it provides data on the system's precision, loss, and f1 value. Also, by entering an employee's ID, the system can estimate how likely they are to get a promotion (Fig. 1).
3.2 Data Collection The purpose of our research is to determine what factors, if any, may be used to predict an employee's promotion. Therefore, we need a large sample of employee data that includes attributes such as training hours, years of service, evaluations from the previous year, and key performance indicators. This sample was selected because of these characteristics, plus Kaggle's validation. The data collection contains information on 54,808 workers. We divided the dataset into two parts: training and testing. Analysis and evaluation rely on certain features or attributes; the collected dataset is sent to the train-test-split phase with a number of different features.
3.3 Data Analysis The examination of the data is the initial step in arriving at the ultimate conclusion. The section below details the promotion rates for each division. To evaluate our data,
Fig. 2 Estimated percentage of workers promoted from each division
we have incorporated the efforts of 9 distinct divisions. The total number of personnel varies widely throughout the various departments. Every employee counts in each department.
3.4 Engineered Functions Our data store categorical and quantitative information separately: the categorical columns hold string values, while the numeric columns hold the corresponding numerical values. Knowing which columns carry which type, we create the list we call "f-col" by grouping the data in the appropriate columns; the categorical columns' data are now represented numerically. The information is then transferred to the feature columns, which contain all of the columns that our machine learning algorithm uses.
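A possible pandas rendering of the steps just described; the column names are invented for illustration and do not come from the actual HR dataset.

```python
import pandas as pd

# Toy frame standing in for the HR data
df = pd.DataFrame({
    "department": ["Sales", "HR", "Technology", "Sales"],
    "avg_training_score": [60.0, 72.0, None, 81.0],
})

# Categorical column -> assigned numbers (the numeric codes described above)
df["department_code"] = df["department"].astype("category").cat.codes

# Numeric column: compute and fill in the missing value (column mean here)
df["avg_training_score"] = df["avg_training_score"].fillna(
    df["avg_training_score"].mean())

# The feature columns handed to the model
feature_cols = ["department_code", "avg_training_score"]
```

Filling with the column mean is one reasonable imputation choice; the chapter does not specify which statistic is used.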
3.5 Train-Test-Split Afterward, as an outcome of the feature engineering process, we split the dataset into a training dataset and a testing dataset. Data from all 54,808 employees are analyzed using 13 different traits or qualities. The features from our gathered dataset are sent to the train-test-split phase: 66% of the primary dataset is used for training, and to ensure reliability we test using the remaining 34% of the dataset (Figs. 2, 3, 4, 5, 6, 7, 8, 9 and 10).
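The 66/34 split can be sketched with scikit-learn's `train_test_split`; this is a stand-in for illustration, and the toy arrays replace the real 54,808 × 13 dataset.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the 54,808 rows x 13 features of the real dataset
X = [[i, i % 5] for i in range(100)]
y = [i % 2 for i in range(100)]

# 66% of rows train the model; 34% are held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=42)
```

Fixing `random_state` keeps the split reproducible across runs.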
Fig. 3 A global average percentage of promotions
Fig. 4 Rank advancement equity for workers of varying educational levels
3.6 Model Building Our system has a binary outcome, given that there are only two possible results: an employee is either flagged for promotion or not. Based on our analysis, we concluded that a linear model is suitable for representing our data. Because the outcome is a categorical (binary) variable, we are dealing with a classification problem and therefore need a classifier. Overall, our model relies on linear classifiers.
Fig. 5 Varying proportion of promotions from year to year
Fig. 6 Employee training success rate
Fig. 7 Ratings averaged for workers with varying levels of education
Fig. 8 Average training score
Fig. 9 Train data
Linear Classifier To feed data into our machine learning model, we developed an input function that takes in a data frame, a data label, and the number of times to feed the data. Our machine capacity dictates the batch size: we use a batch size of 32 and repeatedly feed in the data fifty times. TensorFlow processes the whole dataset, creating a dictionary in the process; it shuffles our input data with a buffer of 10,000 records before feeding it to the network in batches of 32, and then repeats this process 50 times. This is the data-processing step.
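The shuffle-batch-repeat pipeline described above can be imitated in plain Python. This is a sketch of the behavior, not the actual TensorFlow input function; a real `tf.data` pipeline would use a 10,000-record shuffle buffer rather than the full shuffle shown here.

```python
import random

def make_input_fn(features, labels, batch_size=32, epochs=50, seed=0):
    """Yield (feature_batch, label_batch) pairs: shuffle, batch 32, repeat 50x."""
    def input_fn():
        rng = random.Random(seed)
        rows = list(zip(features, labels))
        for _ in range(epochs):
            rng.shuffle(rows)  # stand-in for the shuffle buffer
            for i in range(0, len(rows), batch_size):
                batch = rows[i:i + batch_size]
                yield [f for f, _ in batch], [l for _, l in batch]
    return input_fn

# 64 toy rows -> 2 batches per epoch x 50 epochs = 100 batches in total
batches = list(make_input_fn(list(range(64)), [0, 1] * 32)())
```

Reshuffling each epoch keeps the gradient updates from seeing the rows in the same order twice.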
Fig. 10 Test data
Training Model We built a linear estimator (a linear classifier) over the feature columns as our training model. Next, we trained the linear estimator with the input function. After training, the linear estimator is evaluated, yielding the final result.
4 Performance Evaluation Promotion prediction over a large data collection was a challenging undertaking, and with further training our system could make even more precise judgments. Even so, our methods achieve high accuracy: given any worker, we can calculate their potential for advancement, and metrics such as accuracy, precision, and loss are all reported. There are situations in which a large corporation might benefit from using our technology (Fig. 11). Accuracy The overall test accuracy of our model in this system is 92.6%, so its predictions can be relied on heavily.
Fig. 11 Final result
Fig. 12 The likelihood that a worker will be promoted
Precision Our classification model filters out irrelevant information so that the most important information can be found. Precision is the ratio of true positives to the sum of true positives and false positives; 91.6% of our positive predictions are accurate. Recall There was a recall of 16%. Prediction of promotion With this method, we can easily determine an employee's potential for advancement: feeding in an employee ID yields the likelihood of promotion. As a test, we used the employee with ID number 0 and obtained the odds of that worker being promoted, which work out to 56%. Holding all other features fixed, this probability is obtained by aggregating their contributions (Fig. 12).
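These metrics follow directly from the confusion-matrix counts. The sketch below illustrates the definitions; the counts are invented for illustration (chosen only so the accuracy mirrors the reported 92.6%), not taken from the paper:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)    # true positives among predicted positives
    recall = tp / (tp + fn)       # true positives among actual positives
    return accuracy, precision, recall

# Illustrative counts only.
acc, prec, rec = classification_metrics(tp=110, fp=10, fn=64, tn=816)
print(round(acc, 3), round(prec, 3), round(rec, 3))
```

Note how a model can score high on accuracy and precision while its recall stays low: it rarely predicts a promotion, but is usually right when it does.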
5 Conclusion Our findings suggest that this approach performs well at forecasting employee advancement. Our goals were to streamline the HR department's administrative processes and to save both time and money. Other researchers are investigating the same issues or comparable features; in any case, we aimed for a straightforward procedure and a reasonable level of accuracy. We have labored over the data collection, examining it from several angles.
Our approach is future-proof and can be used with data from any firm. This approach, or one that improves upon it, may be used to make projections about the future productivity both of recently recruited workers and of talented individuals in line for promotion. However, further work is needed to make the system more user-friendly, and more engineering effort is required to speed it up. Our method can help predict employee advancement reliably, but it has some shortcomings that may be addressed in the future.
Smart Grid Analytics—Analyzing and Identifying Power Distribution Losses with the Help of Efficient Data Regression Algorithms Amita Mishra, Babita Verma, Smita Agarwal, and K. Subhashini Spurjeon
Abstract In recent years, energy consumption has been growing rapidly, and distribution networks are dealing with huge energy losses, both technical and non-technical, which can cost companies a great amount of energy. A non-technical loss is any energy consumed but not billed because of some type of irregularity. Detecting it requires computationally intensive machine learning algorithms over measured data, and the data collected from metering systems has characteristics similar to those of big data. The NTL problem is perhaps the most difficult problem in energy data management, and it tests whether emerging advanced technologies are up to the task. This paper proposes a methodology for NTL recognition based on regression analysis that shows how to overcome these drawbacks. The framework combines energy consumption data, smart meter data, and customer registration data to identify losses. The results obtained from the regression demonstrate the effectiveness of the proposed technique in detecting fraudulent electricity consumers and energy losses. Keywords Non-technical losses · Regression · Artificial neural networks
A. Mishra Department of Computer Science, CMREC, Hyderabad, India B. Verma (B) · K. S. Spurjeon BIT-Durg, Department of Information Technology, Durg, India e-mail: [email protected] K. S. Spurjeon e-mail: [email protected] S. Agarwal CSE Department, Chandigarh University, Ludhiana, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_15
1 Introduction Modern culture and daily activity are heavily dependent on the availability of electricity in various ways (Fig. 2). Electricity grids enable the distribution and delivery of electricity from production infrastructures, such as solar farms or power plants, to customers such as industries, residences, or factories. Power grids are the foundation of contemporary society. Generation and distribution losses are increasing, bringing financial losses for electricity suppliers and decreased power quality and reliability. One of the most frequently encountered problems is losses in energy networks: the difference between produced or purchased energy and billed energy, which can be categorized into two main parts, technical and non-technical losses. Regression (Fig. 1), one of the data mining techniques, is used to predict a range of continuous values given a specific dataset; for example, it might be used to forecast a product's service cost given other variables. Reference [1] states that the first category is connected to issues in the system via the physical characteristics of the equipment; specifically, technical losses are the energy lost in transportation, transformation, and measurement equipment, which results in very high costs for energy firms. Non-technical losses, which apply to energy that is delivered but not billed and cause a loss of profit, are losses associated with the commercialization of the delivered energy to the user. They are additionally described as the difference between overall losses and technical losses, and they are closely related to unauthorized connections in the distribution system. The problem of detecting non-technical losses in distribution networks has recently gained attention. The main causes of non-technical losses for energy firms are theft and tampering with electricity meters with the intention of altering the measured energy use.
From this point on, routine checks to reduce such fraud can be exceedingly expensive, the loss rates are difficult to identify or measure, and it is ultimately nearly impossible to determine where they are occurring. Many power providers recognize that illicit consumption needs to be accurately characterized in order to reduce fraud and power theft. Although fraud will always exist, power
Fig. 1 Regression
Fig. 2 Electricity transmission and distribution
providers can sustain investment in power quality initiatives by minimizing such losses. Currently, some improvement can be observed in this area through the use of various artificial intelligence techniques to recognize non-technical losses [2], with real applications in Smart Grids. The main problem is the selection of the most useful and salient features, which have not been thoroughly explored with respect to the various causes of non-technical losses:
• Bypassing the meter or otherwise forming unlawful connections
• Simply disregarding delinquent bills
• Malfunctioning energy meters or an unmetered supply
• Inaccuracies and delays in meter reading and charging
• Customer non-payment.
Among the main concerns regarding power distribution are non-technical losses (NTL) for electricity distribution companies around the world. In 2019, the National Energy Grid (NEGI), the sole electricity provider in Peninsular India, experienced revenue losses of up to 27,000 rupees per year due to electricity theft, mismetering, and billing errors [3–5]. Compared to the losses endured by utilities in poor nations, the NTLs experienced by electric utilities in the United States have been estimated at between 0.5 and 3.5% of gross yearly revenue [6]. To date, the best strategy to reduce NTLs and commercial losses is to deploy smart electronic meters that make fraudulent activities more difficult and simpler to identify [7, 8]. From an electrical engineering perspective, one strategy to detect losses is to compute the energy balance reported in [9], which requires topological information about the network.
This is not realistic in developing economies, which are particularly concerning due to their high share of NTL, for the following reasons: the network topology changes constantly to meet the rapidly increasing demand for electricity; infrastructure may fail and cause errors in the energy balance calculation; and transformers, feeders, and associated meters all need to operate smoothly and continuously. To identify NTLs, customer inspections are carried out based on predictions of where NTLs might exist. Results of the inspections are then fed back into learning algorithms to enhance the predictions. However, conducting inspections is costly because it requires the physical presence of experts; it is therefore critical to make precise predictions in order to lower the incidence of false positives. Numerous data mining, fraud identification, and prediction studies have been carried out recently in the energy distribution industry. These include artificial neural networks (ANN); decision tree strategies; knowledge discovery in databases (KDD) clustering procedures; support vector machines [10, 11]; and multiple classifiers combined via cross-validation and voting schemes [9]. Among these techniques, load profiling is one of the most widespread [9]; it is defined as a model of the electricity consumption of a customer or a group of customers over a certain period [9]. NTL recognition is complicated by the wide variety of possible NTL causes, such as different types of fraudulent customers, making it a supervised learning task for anomaly detection.
2 Related Work NTL detection can be thought of as a specific case of fraud detection, for which generic surveys are provided in Refs. [11, 12]. In both, the fundamental techniques for detecting dishonest conduct in applications like credit card fraud, computer intrusion, and telecommunications fraud are expert systems and artificial intelligence. An overview of existing AI methods for NTL identification is the main topic of this section. Existing concise summaries of prior work in this field, such as those in Refs. [3, 13–15], offer only a limited comparison of the pertinent publications. The uniqueness of this survey is that it not only reviews and compares the many results that have been published in the literature but also identifies the noteworthy challenges of NTL detection. Numerous NTL detection works use traditional meters that are read monthly or annually by meter readers. From these data, average-consumption features are used in Refs. [1, 7, 16, 17]. In those works, the feature computation can be summarized as follows: for M customers {0, 1, ..., M − 1} over the last N months {0, 1, ..., N − 1}, a feature matrix F is computed, in which component F_{m,n} is the daily average kWh consumption of customer m during month n. Further features include the consumption in the past three months and the customer's consumption over the past few years. The average consumption, the maximum consumption, the standard deviation, the number of inspections, and the average consumption of the
residential area are all deduced from the prior six months of meter readings. Additionally, average consumption is employed as a feature. The works cited in Refs. [1, 12] use each customer's credit worthiness rating (CWR) as a feature. It is calculated from the electricity provider's billing system and reflects whether the customer is late or delayed in paying invoices. CWR ranges from 0 to 5, with 5 being the highest score, and condenses various information about the customer, such as payment performance, salary, and neighborhood prosperity, into a single feature. In Ref. [14], pre-filtering the customers allows unusual and recurrent clients to be identified; because they accurately represent the two distinct classes, these clients are then used for training, which mitigates the noise in inspection labels. In the classification stage, a neuro-fuzzy hierarchical system is employed; in contrast to the stochastic gradient technique employed in Ref. [13], the fuzzy membership parameters here are optimized using a neural network. On the test set, precision values of 0.512 and 0.682 are attained. In Ref. [11], five consumer consumption features over the last six months are derived: average consumption, maximum consumption, standard deviation, number of inspections, and average consumption in the residential area. The fuzzy c-means clustering algorithm is then applied to these features to group clients into c clusters. The Euclidean distance measure is then applied to the fuzzy membership values to classify customers into NTL and non-NTL groups. On the test set, the accuracy was assessed to be 0.745 on average. Neural Networks Neural networks, which allow complex concepts to be learned from data, are loosely inspired by how the human brain functions; for instance, Refs. [14, 15] describe such works.
Extreme learning machines (ELMs) are single-hidden-layer neural networks in which the weights from the inputs to the hidden layer are set randomly and never updated. Support vector machines (SVMs), a sophisticated classification technique that is less prone to overfitting, were first applied to this problem in Ref. [13]. Reference [12] uses fewer than 400 labeled electricity-customer records out of 260,000 users in Kuala Lumpur, Malaysia. From June 2006 to June 2008, each client's meter was read every month, for a total of 25 months. Daily average features for each month are computed from these readings. Following standardization, these features are used to train an SVM with a Gaussian kernel. A credit worthiness rating (CWR) is also utilized; it is computed electronically from the supplier's billing system and indicates whether the consumer pays bills on time or avoids them altogether. CWR has a scale from 0 to 5, with 5 being the highest value, and it has been demonstrated to be a crucial sign of whether or not customers are stealing electricity. A recall of 0.53 was achieved on the test set. The corresponding setting is used in Ref. [1], which reports a test accuracy of 0.86 and a test recall of 0.77 on a separate dataset. It makes use of a Boolean industrial expert system and its fuzzy extension, whose parameters are fitted with stochastic gradient descent on this dataset; the fuzzy system outperforms the Boolean one.
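Several of the surveyed approaches start from the same monthly feature computation described at the beginning of this section: a daily-average kWh figure per customer per month. A minimal sketch of that computation (our own illustration; the variable names and numbers are invented):

```python
def consumption_feature_matrix(monthly_kwh, days_per_month):
    """Build F, where F[m][n] is customer m's daily-average kWh
    consumption during month n."""
    return [
        [kwh / days for kwh, days in zip(customer, days_per_month)]
        for customer in monthly_kwh
    ]

# Two customers, three months (kWh totals are invented).
monthly_kwh = [[300.0, 310.0, 290.0],     # steady consumer
               [90.0, 450.0, 95.0]]       # a spike that may merit inspection
days = [30, 31, 29]
F = consumption_feature_matrix(monthly_kwh, days)
print([round(x, 1) for x in F[0]])   # a flat daily-average profile
```

Derived statistics such as the mean, maximum, and standard deviation per row of F are then what the cited classifiers consume.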
An SVM that exploits daily average features to outperform expert systems is reported in Refs. [12, 13, 16]. The three algorithms are contrasted with one another on samples of different fraud-prevention datasets containing 100,000 clients, using the area under the receiver operating curve (AUC); in total, roughly 1.1 million customer records are used. The article identifies the small proportion of inspected clients as the primary challenge for NTL detection. It therefore draws stratified samples to increase the number of tests and reduce the statistical difference between them. The sample-splitting process is formulated as an optimization problem with constraints that limit the total energy losses due to energy theft. This minimization problem is solved using two strategies: (1) a genetic algorithm and (2) simulated annealing, with the first method performing better than the second. Only a reduced variability is reported, which cannot be compared with other studies and is therefore excluded from this survey. Rough sets give lower and upper approximations of the original classical or crisp set. The first use of rough set analysis for NTL detection covered 40 thousand clients, but it requires discretization of the attributes used for each customer; it achieved a precision of 0.2. Rough set analysis has also been used in related NTL work: this supervised learning method allows a number of concepts that define fraudulent and normal usage to be derived, and a test accuracy of 0.9322 was reported.
3 Proposed Work The regression analysis for non-technical loss detection in power distribution is shown in the following figure (Fig. 3). Fig. 3 Diagrammatic representation for detecting NTL
Fig. 4 Information sharpening by uncertainty
Regression analysis is utilized in this research to detect non-technical losses in the city of Srivilliputhur in the Tamil Nadu state. The dataset includes consumers in the city of Srivilliputhur across all tariffs, as well as substations (SS), distribution transformers (DTs), poles, loads, load-related distribution transformers, and load-related poles [9]. The suggested system requires a regression analysis per DT. Load estimation is used to find evidence of non-technical losses: the degree of overall agreement between the estimated values and the customer data pooled at each DT is used to expose irregularities in electricity usage [15]. The algorithms are based entirely on a regression study of the load associated with the customers. Regression analysis is typically used for forecasting and prediction, where its use overlaps with the field of machine learning. For the examination of electrical data, regression analysis is a crucial statistical technique: it makes it possible to recognize and describe the relationships between various components. A simplified numerical model is used in regression analysis to describe the relationships between dependent and independent variables [17]. There are three parameters for the regression as follows (Fig. 4):
1. The structure of the poles, DT, and SS connections.
2. Consumers per DT—depicted by the pattern of poles per DT, consumers per pole, and consumers per DT, respectively.
3. Load per DT—depicted by the load consumption pattern connected to the DTs and poles.
The F statistic indicates whether a significant amount of the total variability in the data is explained by the fitted relationship; the larger the F value, the stronger the regression. This value rises as the data points move closer to the model line and the sum of squared errors (SSE) declines. The correlation coefficient r gauges how strong the linear relationship between y and (x1, x2) is. R Square (R²), here associated with the loading per DT, measures the proportion of variance in the dependent variable explained by the regression; R² ranges in value from 0 to 1. To ensure
that the least-squares assumptions hold, perform a residual analysis. The residuals are computed as the differences between the observed values of the dependent variable and the values predicted by the regression. For each DT and each consumer per DT, the regression plane represents the estimated load over the poles in the Euclidean subregion. Outliers: a point that deviates moderately from the rest of the data is considered a mild outlier, whereas a point that deviates strongly from the underlying data is considered an extreme outlier. Their positioning may significantly affect the regression line.
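The quantities above (fitted line, residuals, and R²) can be computed as in the following sketch, shown here for simple one-variable least squares. This is our own illustration with invented numbers, not the paper's code:

```python
def linear_regression(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                                 # slope
    a = my - b * mx                               # intercept
    # Residuals: observed minus predicted values of the dependent variable.
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)          # sum of squared errors
    sst = sum((y - my) ** 2 for y in ys)          # total sum of squares
    r_squared = 1 - sse / sst                     # in [0, 1]; higher is better
    return a, b, r_squared

# Load (kWh) per DT against the number of customers per DT (invented numbers).
customers = [10, 20, 30, 40, 50]
load = [105, 198, 305, 402, 498]
a, b, r2 = linear_regression(customers, load)
print(round(b, 2), round(r2, 4))
```

A DT whose metered consumption falls far below the load this regression predicts for its customer count would be flagged as a candidate for non-technical loss.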
4 Results In statistical testing, outliers can cause serious issues. Although outliers can appear by chance in any distribution, they typically indicate either a measurement error or a heavy-tailed population distribution. In the former case, the erroneous points should be discarded or robust statistics used; in the latter case, the distribution has high variance, and tools that assume a normal distribution should be applied with great caution. The most common cause of outliers is a mixture of two distributions, which may correspond to two distinct sub-populations or be characterized as "correct trial" versus "measurement error"; this is captured by a mixture model. To accurately identify and detect non-technical losses, the R-squared value, F statistic, and p-value of the regression are calculated and displayed in the following table (Table 1). The R Square statistic of the regression measures the strength and direction of the linear association between the poles per DT and the customers per DT; in this case, r is close to +1, which indicates that the variables are very closely related. The F statistic of the regression ranges from zero to an arbitrarily large number. A p-value > 0.05 indicates weak evidence against the null hypothesis. A comparison of various regression models is illustrated in Fig. 5. Table 1 Results of regression model
Regression model                   Efficiency   Time
p-value of regression              High         1.942
F statistic of regression          Moderate     1.916
R square statistic of regression   Low          1.213
Fig. 5 Comparison of various regression models
5 Conclusion Non-technical losses (NTL) are a basic form of losses in power energy networks. We reviewed their impact on the economy and on the sales and revenue capacity of electricity providers. A huge number of NTL detection strategies using artificial intelligence have been discussed in the literature. This document introduces a new framework for validating and estimating meter statistics, refining the estimates to increase confidence in meter data. The regression analysis for the planned distribution system was presented. The obtained results indicate that the proposed technique is an appropriate choice for state estimation (SE) in the distribution system.
References
1. S.-C. Huang, Y.-L. Lo, C.-N. Lu, Non-technical loss detection using state estimation and analysis of variance. IEEE Trans. Power Syst.
2. P. Kadurek, J. Blom, J.F.G. Cobben, W.L. Kling, Theft detection and smart metering practices and expectations in the Netherlands, in Proceedings of IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT Europe) (2010)
3. I.H. Cavdar, A solution to remote detection of illegal electricity usage via power line communication. IEEE Trans. Power Del. 19(4), 1663–1667 (2004)
4. A. Pasdar, S. Mirzakouchaki, A solution to remote detecting of illegal electricity usage based on smart metering, in Proceedings of 2nd IEEE International Workshop on Soft Computing Applications (SOFA 2007) (2007)
5. J. Nagi, A.M. Mohammad, K.S. Yap, S.K. Tiong, S.K. Ahmed, Non-technical loss analysis for detection of electricity theft using support vector machines, in Proceedings of 2nd IEEE International Conference on Power and Energy (2008)
6. D. Matheson, C. Jmg, F. Monforte, Meter data management for the electricity market, in Proceedings of 8th International Conference on Probabilistic Methods Applied to Power Systems (2004)
7. M.E. Baran, A.W. Kelley, State estimation for real-time monitoring of distribution systems. IEEE Trans. Power Syst. 9(3), 1601–1609 (1994)
8. C.N. Lu, J.H. Teng, W.E.-H. Liu, Distribution system state estimation. IEEE Trans. Power Syst. 10(1), 229–240 (1995)
9. S. McLaughlin, D. Podkuiko, P. McDaniel, Energy theft in the advanced metering infrastructure, in Proceedings of CRITIS 2009, LNCS 6027 (2010), pp. 176–187
10. VEE Standard for the Ontario Smart Metering System (3.0) (2010)
11. Release 2.0 of the NIST Framework and Roadmap for Smart Grid Interoperability Standards, NIST Special Publication 1108R2 (2012)
12. A. Monticelli, State Estimation in Electric Power Systems: A Generalized Approach (Kluwer, Norwell, MA), p. 199
13. Z.J. Simendic, V.C. Strezoski, G.S. Svenda, In-field verification of the real-time distribution state estimation, in Proceedings of 18th International Conference on Electricity Distribution (CIRED), Turin, Italy (2005)
14. R.C. Dugan, R.F. Arritt, T.E. McDermott, S.M. Brahma, K. Schneider, Distribution system analysis to support the smart grid, in Proceedings of IEEE Power Energy Society General Meeting (2010), pp. 1–8
15. M. Baran, T.E. McDermott, Distribution system state estimation using AMI data, in Proceedings of IEEE PES PSCE, Seattle, WA (2009), pp. 1–3
16. J. Wan, K.N. Miu, Weighted least squares methods for load estimation in distribution networks. IEEE Trans. Power Syst. 18(4), 1338–1345 (2003)
17. Roytelman, Real-time distribution power flow—lessons from practical implementations, in Proceedings of IEEE PES PSCE (2006), pp. 505–509
Load Balancing on Cloud Using Genetic Algorithms Kapila Malhotra, Rajiv Chopra, and Ritu Sibal
Abstract With the growth of technologies like cloud computing, the adequacy of the 'Single-Process-Single-Server' model for handling server traffic is open to question. Parallel server requests arriving every millisecond would frequently swamp a single server. To handle this scenario, many models and algorithms have been suggested in the literature. Our approach, however, is novel and concentrates on selecting the best breed of processes to balance load. Genetic Algorithms have been used in almost all fields of engineering and science; the intent of this paper is to apply a genetic algorithm to a set of processes on the server in order to manage load balancing. Genetic Algorithms have given near-optimum solutions, and there is a need to develop better techniques of load balancing. This paper focuses on improving the task of load balancing on cloud servers using genetic algorithms: we optimize the set of processes with a Genetic Algorithm to obtain a better breed of processes before deploying them on a cloud platform, which improves the overall quality of the applications deployed on the cloud. Keywords Single-Process-Single-Server (SPSS) · Cloud Service Provider (CSP) · Quality-of-Service (QoS) · Multiple Application Multiple Server (MAMS) · Genetic Algorithm (GA)
1 Introduction The load of client requests on a server is rarely distributed evenly. With the passage of time, servers experience peaks and valleys in this workload. Scheduling client requests in such a scenario is a critical task in a cloud environment, and it is done using optimum hardware and software solutions. This K. Malhotra (B) · R. Chopra Guru Tegh Bahadur Institute of Technology, Delhi, India e-mail: [email protected] R. Sibal Netaji Subhas University of Technology, Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_16
K. Malhotra et al.
Table 1 Static versus dynamic LB algorithms

| Static LB | Dynamic LB |
| (1) Static load balancing is done at compile time; it does not consider the current state of nodes | (1) Dynamic load balancing is done at run-time with no previous information required |
| (2) Once the load is assigned, modifications cannot be made dynamically | (2) It reacts to changes in system state |
| (3) It minimizes response time without considering overall system throughput | (3) It gives better throughput |
adds extra horsepower to handle such contingencies. At the same time, the extra horsepower purchased goes unused in non-peak conditions, whereas during peak conditions the load must be distributed across the processing elements; this is the load balancing problem. With load balancing, workload beyond the resource capacity is held and delayed until more resources are available to process the user request. Load unbalancing is an inimical situation for cloud service providers (CSPs) because it reduces the dependability and effectiveness of computing services while hampering the Quality-of-Service (QoS). The need for load balancing emerges in these circumstances. Load balancing (LB) is the process of disseminating the workload across multiple servers. It enhances the performance, reliability and stability of applications, databases and other services. Load balancing encompasses task restructuring in a distributed network like cloud computing so that there are no overloaded, under-loaded or idle processing elements, and it elevates cloud performance. In-depth research analysis shows that load balancing approaches are the most appropriate means of improving cloud resource utilization. Load balancing algorithms have been divided into two broad categories, compared in Table 1: (a) Static Load Balancing Algorithms. (b) Dynamic Load Balancing Algorithms. Figure 1 depicts the categories of LB algorithms. Dynamic LB algorithms can be further categorized into centralized and distributed algorithms; Table 2 compares these. Our paper provides an extensive survey of various load balancing strategies and proposes LB using a Genetic Algorithm on a set of processes competing for servers on the cloud. Section 2 gives an extensive literature survey of existing LB algorithms. Section 3 provides the methodology for achieving LB in cloud networks using Genetic Algorithms. Future work and conclusions are given in Sect. 4.
Load Balancing on Cloud Using Genetic Algorithms
Fig. 1 Classification of load balancing algorithms: static and dynamic, with dynamic algorithms further divided into centralized and distributed
Table 2 Centralized versus distributed LB algorithms

| Centralized dynamic LB | Distributed dynamic LB |
| (1) All task allocation and scheduling is made by a single node (server) | (1) There is no single node responsible for allocating resources or scheduling tasks |
| (2) This node contains the knowledge base for the entire cloud network | (2) Multiple nodes monitor the cloud network to make precise load balancing decisions |
2 Literature Survey Table 3 provides an extensive literature survey on various LB techniques along with their advantages and limitations.
3 Proposed Work From the literature review, it is evident that load balancing issues need to be addressed more appropriately. Our proposed methodology is shown in Fig. 2. The proposed new Genetic Algorithm is as follows: Step-1: Get n process requests from different clients. Step-2: Find the two best (fittest) processes. Step-3: Perform crossover (recombination) of these processes until the next generation has the required number of individual processes. Step-4: For each incoming request, check the resource usage threshold. Step-5: If it goes beyond the threshold, then check resource usage on another node.
Table 3 Review of various LB algorithms

| Authors | Algorithm used | Technique used | Advantages | Limitations |
| Lu et al. [1] | Join-idle-queue | Multiple queuing | Large-scale LB | Slow processing |
| Li et al. [2] | Ant colony optimization | Swarm-based optimization | Reduced total execution time to process a set of jobs | Tasks different from each other |
| Kapur [3] | Non-classical | Heuristic | High scalability, shorter response time | Low resource utilization, high migration time |
| Dam et al. [4] | Genetic algorithm | Optimization | Better fault-tolerance | Lack of load balancing, inefficient use of resources, high process rejection |
| Rajput et al. [5] | Genetic algorithm | Evolutionary heuristic | Reduced execution costs | Lower level of load balancing |
| Vasudevan et al. [6] | Honey bee algorithm | Optimization | Lesser execution time | Low throughput |
| Vanitha et al. [7] | Genetic algorithm | Meta-heuristic | Reduced task-rejection ratio | Lesser resource utilization |
| Xiao et al. [8] | Fairness-aware algorithm | Game theory-based optimization | Better load balancing | Higher execution time |
| Kumar et al. [9] | Conventional non-classical | Heuristic | Improved flexibility of resources | High task-rejection ratio |
| Peng et al. [10] | Ant colony optimization | Swarm-based optimization | Better resource utilization | Cost overruns |
| Jayaraj et al. [11] | Firefly algorithm | Metrics-based approach | Reduced response time | Low CPU utilization |
| Kumar et al. [12] | Firefly algorithm | Prioritizing the tasks | Higher throughput | Low level of load balancing |
| Oludayo [13] | Central-distributive load balancing | Rearrangement of nodes for load balancing | Maximization of throughput | Lesser load balancing |
Step-6: Mutation alters processes in small ways to introduce new good traits; it is applied to bring diversity. Step-7: Now we have a set of optimized processes. Step-8: These optimized processes are then deployed on the cloud server.
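The eight-step procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bit-string encoding of a process, the fitness function (lighter resource demand taken as fitter), the crossover point, and the mutation rate are all illustrative assumptions, since the paper does not fix these details.

```python
import random

def fitness(process):
    # Assumption: fewer 1-bits = lighter resource demand = fitter process.
    return -sum(process)

def select_two_best(population):
    # Step-2: pick the two fittest processes.
    return sorted(population, key=fitness, reverse=True)[:2]

def crossover(p1, p2):
    # Step-3: single-point recombination of the two best processes.
    point = random.randrange(1, len(p1))
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(process, rate=0.05):
    # Step-6: flip bits with small probability to bring diversity.
    return [bit ^ 1 if random.random() < rate else bit for bit in process]

def optimize(population, generations=10):
    # Steps 2, 3, 6 repeated until a full next generation is bred (Step-7).
    for _ in range(generations):
        best1, best2 = select_two_best(population)
        next_gen = []
        while len(next_gen) < len(population):
            c1, c2 = crossover(best1, best2)
            next_gen.extend([mutate(c1), mutate(c2)])
        population = next_gen[:len(population)]
    return population

if __name__ == "__main__":
    random.seed(42)
    population = [[random.randint(0, 1) for _ in range(8)] for _ in range(10)]
    optimized = optimize(population, generations=10)  # Step-8: deploy these
    print(len(optimized), "optimized processes ready for cloud deployment")
```

Steps 4–5 (threshold checks per incoming request) would sit in the server's dispatch loop rather than in the GA itself.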
Fig. 2 Proposed methodology for load balancing: client process requests P1, P2, …, Pn arrive at the server, the load balancer applies the GA (crossover and mutation) to produce an optimized set of processes, which is then deployed on the cloud server
4 Results Since we deploy an optimized set of processes to the cloud, processing becomes faster and all resources are utilized optimally. The problem of cost overruns is minimized by our methodology because the deployment happens on cloud servers. Hence, we achieve a better level of load balancing.
5 Conclusions and Future Work In cloud computing, load balancing is a critical issue because when a client requests a service, that service should be available to the client. When a node is overloaded with processes, the load balancer migrates load to another free node. Moreover, these process requests are usually not optimized. Our future work will focus on deploying the optimized set of processes on the cloud using tools like MS Azure and AWS. In the long term, this type of simulation has great potential for testing the optimized set of processes, thereby improving the performance and hence the quality of emerging cloud applications.
References
1. Y. Lu, Q. Xie, G. Kliot, A. Geller, J.R. Larus, A. Greenberg, Int. J. Perform. Eval. (2011)
2. K. Li, G. Xu, W.D. Zhao, Cloud task scheduling based on load balancing ant colony optimization, in Sixth Annual ChinaGrid Conference (2011), pp. 3–9
3. R. Kapur, A workload balanced approach for resource scheduling in cloud computing, in 8th International Conference on Contemporary Computing (IC3) (2015), pp. 36–41
4. S. Dam, G. Mandal, K. Dasgupta, P. Dutta, Genetic algorithm and gravitational emulation based hybrid load balancing strategy in cloud computing, in Proceedings of the 3rd International Conference on Computers, Communication, Control and Information Technology (C3IT) (2015), pp. 1–7
5. S.S. Rajput, V.S. Kushwah, A genetic based improved load balanced min-min task scheduling algorithm for load balancing in cloud computing, in 8th International Conference on Computational Intelligence and Communication Networks (CICN) (2016), pp. 677–681
6. S.K. Vasudevan, S.M. Anandaram, A. Aravinth, A novel improved honeybee based load balancing technique in cloud computing environment. Asian J. Inf. Technol. 15(9), 1425–1430 (2016)
7. M. Vanitha, P. Marikkannu, Effective resource utilization in cloud environment through a dynamic well-organized load balancing algorithm for virtual machines. Comput. Electr. Eng. 57, 199–208 (2017)
8. Z. Xiao, Z. Tong, K. Li, Learning non-cooperative game for load balancing under self-interested distributed environment. Appl. Soft Comput. 52, 376–386 (2017)
9. M. Kumar, S.C. Sharma, Deadline constrained based dynamic load balancing algorithm with elasticity in cloud environment. Comput. Electr. Eng. 69, 395–411 (2018)
10. X. Peng, H. Guimin, L. Zhenhao, Z. Zhongbao, An efficient load balancing algorithm for virtual machine allocation based on ant colony optimization. Int. J. Distrib. Sens. Netw. 14(12), 1–9 (2018)
11. T. Jayaraj, S.J. Abdul, Process optimization of big-data cloud centre using nature inspired firefly algorithm and k-means clustering. Int. J. Innov. Technol. Explor. Eng. (IJITEE) 8(12), 48–52 (2019)
12. K.P. Kumar, T. Ragunathan, D. Vasumathi, P.K. Prasad, An efficient load balancing technique based on cuckoo-search and firefly algorithm in cloud. Int. J. Intell. Eng. Syst. 13(3), 422–432 (2020)
13. O.A. Oduwole, S.A. Akinboro, O.G. Lala, M.A. Fayemiwo, S.O. Olabiyisi, Cloud computing load balancing techniques: retrospect and recommendations. J. Eng. Technol. 7(1), 17–22 (2022)
Network Security and Telecommunication
Hybrid Feature Selection Approach to Classify IoT Network Traffic for Intrusion Detection System Sanskriti Goel and Puneet Jai Kaur
Abstract Internet of Things (IoT) systems have become a major part of our daily lives. As IoT systems grow, the amount of data generated by them is also growing. It is the need of the hour to build a robust system that can protect humans, information, and infrastructure from damage. IoT network traffic classification is essential to filter out attack traffic and save the system from larger damage. This paper proposes a hybrid feature selection technique to classify incoming IoT network traffic as normal or attacked. The proposed hybrid technique combines the feature sets of two embedded feature selectors, Random Forest and LightGBM, under different mathematical set operations, i.e. union and intersection. Data under this feature set is trained and analyzed with the Extra Tree and Gradient Boost classifiers. The hybrid selector approach gives promising results with an accuracy of around 99%, and the execution time of building a model is lower than with an individual selector and is reduced by almost 50% compared with using no feature selection technique. Keywords IoT network traffic classification · Intrusion detection · Hybrid feature selection · Security · ML
1 Introduction IoT is rising globally at a very high pace. From providing connectivity to interoperability, its advanced features have brought the world closer together. Along with the benefits of this growing technology, there are also concerns associated with the security of IoT systems, which are found to be vulnerable at different IoT layers and in the protocols associated with those layers. Integration and communication of diverse things have become common practice, thanks to the growth of the Internet of Things S. Goel (B) · P. J. Kaur University Institute of Engineering and Technology, Panjab University, Chandigarh 160014, India e-mail: [email protected] P. J. Kaur e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_17
S. Goel and P. J. Kaur
(IoT). The rapid expansion of IoT devices and the diversity of IoT traffic patterns have drawn attention to traffic categorization techniques as a means of addressing a number of pressing problems in IoT applications, since IoT devices have different traffic characteristics than non-IoT devices [1]. The most basic components of IoT systems are devices that lack basic security procedures, making them vulnerable targets for hackers and attackers [2]. Various devices and techniques are used to attack and hack computers, targeting organizations for their confidential data with attacks like DoS, botnet attacks, DDoS, reconnaissance, fuzzers, and worms. Since cyberattacks are becoming more complex, it is getting harder to reliably identify breaches. If the incursions are not stopped, security services like data confidentiality, integrity, and availability may lose their trust. To combat threats to computer security, a number of intrusion detection techniques have been developed in the literature [3]. An intrusion detection system (IDS) aims to detect unusual activities from both within and outside the network system. Since they serve as an organization's first line of defense against online threats and are tasked with properly detecting possible network intrusions, intrusion detection systems (IDSs) are regarded as one of the core components of network security [4]. An effective IDS is one that quietly monitors the IoT network traffic and instantly detects malicious behavior without creating overhead on the nodes. In addition, it should provide mitigation steps when an intrusion is detected. Data collected at IoT nodes comes from various sources and has high variability in its features. Various Machine Learning (ML) and Deep Learning (DL) approaches are employed in the literature to train models that can detect any intrusion, anomaly, or malware accurately and at a very quick pace [4–6].
The significant features in a dataset correctly determine the abnormal behavior of IoT traffic; feature selection therefore becomes an essential step while classifying IoT traffic. Hence, for this connected world, it is of prime importance to build a robust IDS with high accuracy and minimum execution time that considers only the relevant features of network traffic. The main contributions of this paper are as follows: 1. A combination of embedded feature selection approaches is proposed to select the most relevant features. 2. The proposed feature selection approach is analyzed on the Gradient Boost Classifier and Extra Trees Classifier using the UNSW-NB15 dataset. 3. The performance of the proposed approach is measured in terms of accuracy and execution time. 4. The performance of the features retrieved by the hybrid technique is compared with the original feature set and the individual feature selection methods.
Hybrid Feature Selection Approach to Classify IoT Network Traffic …
2 Literature Survey To detect DoS and DDoS attacks, Nimbalkar and Kshirsagar [2] suggested a feature selection method for intrusion detection systems (IDSs) employing Information Gain (IG) and Gain Ratio (GR). The top 50% of IG- and GR-ranked features are used to create feature subsets for the proposed system with a JRip classifier, and the suggested technique is assessed and validated using the IoT-BoT and KDD Cup 1999 datasets, respectively. Gao et al. [7], in 2019, used the NSL-KDD dataset as their research object; after analyzing current issues in IDS, an adaptive ensemble learning model was suggested. A Multi-Tree algorithm and an adaptive voting algorithm with Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (KNN), and Deep Neural Networks were used for intrusion detection, with accuracies of 84.2% and 85.2%, respectively. In 2020, Panda and Panda [8] addressed the intelligent classification of IoT datasets in the healthcare sector using machine learning methods. The authors worked on two datasets, namely LSTN voice detection for Parkinson's disease and the CNAE-9 dataset, using three models: RF, DT, and Naïve Bayes (NB). Feature extraction was done using Principal Component Analysis (PCA) considering varying eigenvalues. They compared these three models in terms of accuracy and execution time and suggested which one to use when low accuracy or high execution time is tolerable. Data is classified as normal or intrusive using Support Vector Machine (SVM), KNN, LR, NB, DT, Multi-Layer Perceptron (MLP), RF, and Extra Tree Classifier (ETC) by Abrar et al. [9]. Four feature subsets taken from the NSL-KDD dataset were studied and the data was preprocessed to determine its dimensions. The observed outcomes show that RF, ETC, and DT all performed better than 99% of the time.
To address security issues such as intrusion detection, Kasongo and Sun [10] analyzed the UNSW-NB15 intrusion detection dataset and used the XGBoost algorithm to implement a filter-based feature reduction approach. They then applied the SVM, KNN, Logistic Regression (LR), Artificial Neural Network (ANN), and DT machine learning techniques to the condensed feature space. The findings showed that approaches like DT can raise their test accuracy for the binary classification scheme from 88.13 to 90.85% using the XGBoost-based feature selection method. To categorize cyberattacks in network traffic data, Doreswamy and Hooshmand [11] employ a combination of feature selection techniques and ensemble learning algorithms. The most correlated characteristics are removed first, using the pairwise correlation technique, in the feature selection process. In the next step, four individual feature selection techniques, analysis of variance (ANOVA), variance threshold (VT), Chi-square, and recursive feature elimination (RFE), were used to independently identify their own sets of features, and then the identified sets were combined using the intersection combination technique from set theory to find an optimal subset. For classification, two algorithms were used:
AdaBoost and RF. AdaBoost outperforms RF in terms of classification accuracy, accuracy detection rate (ADR), and minimizing the false alarm rate (FAR). In 2021, another effective work was presented in the field of smart systems by Vigoya et al. [12]. Their work employs five classification algorithms, Naive Bayes, AdaBoost, LR, RF, and Linear SVC, on the DAD dataset; RF and AdaBoost tend to perform better than the other three algorithms. Imrana et al. [13] present a bidirectional Long Short-Term Memory (BiDLSTM)-based intrusion detection system, trained and evaluated on the NSL-KDD dataset. The efficiency of the BiDLSTM technique is demonstrated and validated by experimental data: in terms of accuracy, precision, recall, and F-score values, it surpasses traditional LSTM and other state-of-the-art models, and it also has a substantially lower percentage of false alarms than previous models. Advanced classifier techniques, AdaBoost, XGBoost, and Random Forest with the Chi-square technique for feature selection, were proposed by Hussein et al. [14]. The datasets used in this work are NSL-KDD and UNSW-NB15, where AdaBoost performs best for NSL-KDD and XGBoost for the UNSW dataset. In 2022, Amin et al. [15] proposed an anomaly-based intrusion detection system that uses the ANOVA test as a feature selection technique and the machine learning algorithms RF and Bagging in an ensemble approach to detect unknown malicious attempts on the UNSW-NB15 dataset. For embedded sensors, Naive Bayes, KNN, and APSO-SVM were compared on NSL-KDD data by Mutyalaiah et al. [16]. Accuracy, sensitivity, and specificity were the performance metrics measured, and APSO-SVM turned out to be the most accurate. In the field of agriculture, Ikram et al. [17] worked on real-time sensory data and rainfall prediction data and analyzed them using five different algorithms: DT, KNN, RF, SVM, and NB.
It turned out that ensemble learning over all five of these algorithms yields better results than the individual ones. Finding multivariate outliers and selecting the best features together provide a novel method for enhancing intrusion detection system performance, developed and tested by Birnur and Serkan [18] using the NSL-KDD dataset, which comprises 41 features. First, 20 features were found using the ReliefF feature selection technique to determine the best features that keep classification performance at a high level. The Mahalanobis Distance and Chi-square techniques were then used to identify outliers in the dataset. The dataset was then subjected to a number of machine learning techniques, and the outcomes were compared.
3 Research Gaps Based on the analysis of existing literature, various research gaps have been identified which are as follows:
1. A majority of the work has gone into strengthening the classifiers that are trained and then used to classify IoT traffic data and detect intrusions, anomalies, and other suspicious activities. 2. Only a few works have made use of feature selection techniques, which allow a model to be trained more easily with better accuracy while removing irrelevant features and saving disk space and execution time. 3. Within the scope of this study, only filter and wrapper class methods have been used as feature selection techniques. An embedded feature selection technique on an IoT dataset for intrusion detection systems has not yet been implemented. Also, the literature has employed only single feature selection approaches; no work has been proposed on hybrid techniques. 4. Most of the literature focuses on the accuracy of a classifier. Some of the work discusses the execution time of the model, but only on a few specific datasets like the NSL-KDD dataset.
4 Proposed Work A hybrid approach is proposed for IoT network traffic classification and intrusion detection. It employs a hybrid embedded feature selection technique to select the most relevant and correlated feature set, which is then used to train ensemble classifiers to get the best results. An optimal model with high accuracy and minimum execution time is proposed to detect intrusions. The hybrid feature selection technique takes a mathematical combination of the feature sets of an Embedded Random Forest feature selector and an Embedded LightGBM feature selector. Each of these selectors is embedded in the SelectFromModel feature selection technique to select the best features. The model training time and memory requirement increase with the number of features [19]; the aim of the hybrid feature selection technique is to reduce the execution time of the IDS by selecting only relevant features. Ensemble classifiers are used to train the model with the aim of detecting any intrusion with very high accuracy.
4.1 Proposed Methodology Figure 1 describes the workflow process of the proposed methodology. The steps involved in the proposed methodology are as follows: • Step 1: Dataset preprocessing In this step, statistical analysis of features, checking for redundant data, missing values, or class imbalance is done.
Fig. 1 IoT traffic classification using the proposed hybrid embedded feature selection method
Out of 45 features, 3 columns were dropped manually: id (irrelevant), attack category (used only for multiclass classification), and label (the target variable). Of the remaining 42 features, one-hot encoding was applied to the 3 string-valued features ["protocol", "state", "service"] to convert them into numerical features. One-hot encoding turns every unique value in a category into a feature of its own, so the number of features grew to 196.
• Step 2: Feature selection
With 196 features, using all of them would add noise and redundancy to the classifiers, which might lead to low accuracy, overfitting, and misclassification. The model training time and memory requirement also increase with the number of features. In the proposed system, hybrid embedded feature selectors are used, i.e. a combination of the features selected by the Embedded Random Forest (EmRF) selector and the Embedded LightGBM (EmLGBM) selector. Each of these selectors is embedded in the SelectFromModel feature selection technique to select the best features. EmRF and EmLGBM select 35 and 33 of the 196 features, respectively. The INTERSECTION of the two selected sets (HyFS-1) contains 22 features, and the UNION (HyFS-2) contains 42 features, as shown in Table 1.

Table 1 No. of features in different feature sets

| S. No. | Feature set | No. of features |
| 1 | EmRF | 35 |
| 2 | EmLGBM | 33 |
| 3 | HyFS-1 (EmRF ∩ EmLGBM) | 22 |
| 4 | HyFS-2 (EmRF ∪ EmLGBM) | 42 |
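The hybrid embedded selector can be sketched as follows on synthetic data. This is a minimal, runnable illustration, not the authors' exact code: scikit-learn's GradientBoostingClassifier stands in for LightGBM so the example has no dependency on the lightgbm package (swap in lightgbm.LGBMClassifier to reproduce EmLGBM), and the dataset shape is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the encoded UNSW-NB15 feature matrix.
X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=42)

def embedded_features(estimator, X, y):
    # SelectFromModel uses the fitted estimator's feature importances
    # to keep the most relevant features.
    selector = SelectFromModel(estimator).fit(X, y)
    return set(np.flatnonzero(selector.get_support()))

em_rf = embedded_features(RandomForestClassifier(random_state=42), X, y)
em_gbm = embedded_features(GradientBoostingClassifier(random_state=42), X, y)  # LightGBM stand-in

hyfs1 = em_rf & em_gbm  # intersection: smallest set, fastest model build
hyfs2 = em_rf | em_gbm  # union: largest set, best accuracy in the paper
```

By default SelectFromModel keeps the features whose importance exceeds the mean importance, which is why each selected subset is much smaller than the full feature set.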
• Step 3: Feature scaling
The next step before classification is to convert all feature values into a particular range; this is called normalization. The Standard Scaler normalization technique is performed on the combined encoded feature set.
• Step 4: Train-test split
The data preprocessed in the above steps is split into training and test sets in different ratios (90-10, 80-20, 75-25) to analyze the accuracy and execution time of the machine learning classifiers. The 4:1 train-test split gives the optimal results in terms of both accuracy and execution time.
• Step 5: Classification
Two ensemble classifiers are used in the proposed technique:
• Gradient Boost Classifier (GBC): a class of machine learning techniques that combines a number of weak learning models to produce a powerful predictive model. Gradient boosting frequently makes use of decision trees [14].
• Extra Tree Classifier (ETC): a form of ensemble learning technique that combines the findings of various de-correlated decision trees gathered in a "forest" to produce its classification outcome [9].
The normalized feature sets are fed into these two classifiers and their accuracy and execution time are analyzed.
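Steps 3–5 can be sketched end-to-end as follows. The synthetic dataset and default classifier settings are placeholders, standing in for the UNSW-NB15 feature matrix and the tuned parameters reported later; only the pipeline structure mirrors the proposed methodology.

```python
import time
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, ExtraTreesClassifier

# Placeholder data; in practice X, y come from the selected feature subset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

X_scaled = StandardScaler().fit_transform(X)      # Step 3: normalization
X_tr, X_te, y_tr, y_te = train_test_split(        # Step 4: 4:1 split
    X_scaled, y, test_size=0.2, random_state=0)

results = {}
for clf in (GradientBoostingClassifier(random_state=0),
            ExtraTreesClassifier(random_state=0)):  # Step 5: ensembles
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start           # model build time
    results[type(clf).__name__] = (clf.score(X_te, y_te), elapsed)

for name, (acc, elapsed) in results.items():
    print(f"{name}: test acc={acc:.3f}, build time={elapsed:.2f}s")
```

The same loop yields both performance parameters used later in the paper: classification accuracy and execution (model build) time.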
5 Experimental Results 5.1 Dataset Description In this paper, the UNSW-NB15 dataset has been selected to classify IoT network traffic for network intrusion detection. This dataset represents a thorough portrayal of network traffic and attack situations in the modern era [20]. The dataset classifies records as normal or attacked, and nine subcategories of attacks are defined. There are about 2.5 lakh (250,000) records in the dataset with 49 features. The dataset is divided into two files, training.csv and testing.csv, with 175,341 and 82,332 records, respectively, drawn from the different attack and normal types. The four flow attributes, namely source IP, source port, destination IP, and destination port, have been ignored in these files as they tend to bias the model toward a particular IP [3]. Hence, based on its advantages, availability, and accessibility, the UNSW-NB15 dataset has been selected [21].
Table 2 Selected parameters to build classification models: GBC and ETC

| Classifier | Parameters |
| GBC | No. of estimators = 9, loss = exponential, learning_rate = 1.0, max_depth = 30 |
| ETC | No. of base estimators = 100, criterion = "Gini", max_depth = default |
5.2 System Configuration The proposed system is implemented on the Google Colaboratory platform with Python version 3.6.9 as the programming language. The ML models and techniques have been implemented using the sklearn libraries.
5.3 Model Parameters Classifiers were tuned on different parameters like the no. of base estimators in the ensemble model, the learning rate of a model, the maximum depth of the tree, etc. to get the best optimal results in terms of accuracy and execution time. Table 2 shows values of parameters that were set for trained models.
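Table 2's settings map onto scikit-learn constructor arguments roughly as follows; the exact parameter names are our assumption, since the paper lists only the values, not the API calls.

```python
from sklearn.ensemble import GradientBoostingClassifier, ExtraTreesClassifier

# Hyperparameters from Table 2 (sklearn argument names assumed).
gbc = GradientBoostingClassifier(n_estimators=9, loss="exponential",
                                 learning_rate=1.0, max_depth=30)
etc = ExtraTreesClassifier(n_estimators=100, criterion="gini")
# ETC max_depth is left at its default (None, i.e. nodes expand fully).
```

With loss="exponential", gradient boosting recovers the AdaBoost algorithm for binary targets, which fits the binary normal-versus-attacked labels used here.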
5.4 Performance Parameters The UNSW-NB15 dataset has been used to test the methodology, and results were analyzed on two parameters: • Classification accuracy, which includes test and train accuracy. Train accuracy is a metric that defines how well a model is trained on a given training dataset; test accuracy defines how correctly the trained model predicts the target value Y for given X values. • Execution time: the time taken by the processor to build the given model, train it on the training set, and further make predictions or classifications on the test dataset. Classification results for individual feature selection, the hybrid feature sets, and no feature selection, reported as test and train accuracy along with model build/execution time, are given in Tables 3 and 4 for the Gradient Boost and Extra Tree classifiers, respectively. Figure 2 graphically compares the test accuracy of GBC and ETC with the individual, hybrid, and no-feature-selection approaches, and Fig. 3 does the same for train accuracy.
Table 3 Execution time and accuracy with individual and hybrid methods for GBC

| S. No. | Feature selection | Test accuracy (%) | Train accuracy (%) | Execution time (s) |
| 1 | EmRF | 94.67 | 99.75 | 56.14 |
| 2 | EmLGBM | 94.92 | 99.70 | 52.064 |
| 3 | HyFS-2 | 94.85 | 99.78 | 68.72 |
| 4 | HyFS-1 | 94.675 | 99.71 | 44.64 |
| 5 | W/O feature selection | 94.635 | 99.70 | 108.516 |

Table 4 Execution time and accuracy with individual and hybrid methods for ETC

| S. No. | Feature selection | Test accuracy (%) | Train accuracy (%) | Execution time (s) |
| 1 | EmRF | 94.90 | 99.76 | 24.35 |
| 2 | EmLGBM | 95.14 | 99.75 | 28.48 |
| 3 | HyFS-2 | 95.13 | 99.80 | 37.61 |
| 4 | HyFS-1 | 95.00 | 99.74 | 22.33 |
| 5 | W/O feature selection | 94.78 | 99.70 | 57.738 |
Fig. 2 Comparison of test accuracy of Gradient Boost and extra tree classifiers for different feature selection methods
ETC with the hybrid feature set HyFS-2 gives the best results for both test and train accuracy, i.e. 95.13% and 99.80%, respectively. The execution times of GBC and ETC with the individual, hybrid, and no-feature-selection techniques are compared in Fig. 4.
Fig. 3 Comparison of train accuracy of Gradient Boost and extra tree classifiers with different feature selection methods
Fig. 4 Comparison of execution time of gradient boost and extra tree classifiers with individual, hybrid, and without feature selection approach
It can be concluded that HyFS-1 with ETC takes the minimum time of 22.33 s when compared with the others. Also, HyFS-1 with GBC reduces time by 14%, 20%, and 58% when compared with EmRF, EmLGBM, and no feature selection, respectively, and HyFS-1 with ETC reduces execution time by 61% compared to using no feature selection technique. The proposed approach is compared with existing work in the literature on accuracy for different classifiers, with and without feature selection. Results are summarized in Table 5, and Fig. 5 gives a graphical representation.
Table 5 Comparison with existing approaches [10, 14] on the UNSW-NB15 dataset

| Study | Feature selection technique | Classifier | Test accuracy (%) | Train accuracy (%) |
| Kasongo and Sun [10] | XGBoost | DT | 90.85 | 84.39 |
| | | ANN | 77.64 | 84.46 |
| | | LR | 94.12 | 93.75 |
| | | KNN | 89.21 | 95.86 |
| | | SVM | 60.89 | 75.42 |
| Kasongo and Sun [10] | W/O feature selection | DT | 88.13 | 86.71 |
| | | ANN | 79.59 | 83.18 |
| | | LR | 93.65 | 94.49 |
| | | KNN | 93.22 | 96.76 |
| | | SVM | 62.42 | 70.98 |
| Hussein et al. [14] | Chi-square test | AdaBoost | 82.55 | – |
| | | RF | 80.84 | – |
| | | XGBoost | 89.08 | – |
| Hussein et al. [14] | W/O feature selection | AdaBoost | 82.47 | – |
| | | RF | 77.39 | – |
| | | XGBoost | 93.54 | – |
| Proposed work | HyFS-1 | GBC | 94.67 | 99.71 |
| | | ETC | 95.00 | 99.74 |
| Proposed work | HyFS-2 | GBC | 94.85 | 99.78 |
| | | ETC | 95.13 | 99.80 |
| Proposed work | W/O feature selection | GBC | 94.63 | 99.70 |
| | | ETC | 94.78 | 99.70 |
Fig. 5 Comparison of accuracy of existing and proposed approach with and without feature selection
It can be observed that the proposed approach offers an accuracy of 94.85% with GBC and HyFS-2, higher than all the other combinations of feature selection approaches and classifiers from the literature. Also, without feature selection, ETC offers an accuracy of 94.78%, which is the highest among the compared no-feature-selection results.
6 Conclusion An Intrusion Detection System (IDS) screens out malicious traffic at the earliest stage, preventing serious loss to the system. Along with the ability to correctly identify attack or normal traffic, an IDS must also be quick in its execution so that timely detection and prevention can be performed. The proposed technique takes the union and the intersection of the best features from the embedded random forest and LightGBM selectors; trained with the Extra Tree Classifier, it achieves better accuracy than the other classifiers and reduces execution time compared to the individual feature selectors. HyFS-1 should be chosen when model build time is of prime importance and a small reduction in accuracy is acceptable; HyFS-2 should be selected when accuracy takes priority over build time. In the future, the technique could be optimized further so that the reduced feature set is very small yet still effective, and the hybrid technique could be analyzed over other ML and DL models and tested on different datasets.
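The hybrid scheme summarized above (intersection and union of the features chosen by two embedded selectors, then an Extra Tree Classifier trained on the reduced set) can be sketched as below. This is a minimal illustration on synthetic data: the LightGBM selector is stood in for by scikit-learn's GradientBoostingClassifier so the sketch needs no extra dependency, the mapping of HyFS-1 to the intersection (fewer features, faster) and HyFS-2 to the union (more features, more accurate) is inferred from the trade-off described in the conclusion, and the selector thresholds are scikit-learn defaults, not the paper's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import SelectFromModel

def embedded_select(estimator, X, y):
    """Return the indices of features kept by an embedded (importance-based) selector."""
    selector = SelectFromModel(estimator).fit(X, y)
    return set(np.flatnonzero(selector.get_support()))

# Synthetic stand-in for the UNSW-NB15 feature matrix
X, y = make_classification(n_samples=400, n_features=20, n_informative=6, random_state=0)

rf_features = embedded_select(RandomForestClassifier(n_estimators=50, random_state=0), X, y)
gb_features = embedded_select(GradientBoostingClassifier(random_state=0), X, y)  # LightGBM stand-in

hyfs_intersection = sorted(rf_features & gb_features)  # HyFS-1-style: smaller set, faster build
hyfs_union = sorted(rf_features | gb_features)         # HyFS-2-style: larger set, higher accuracy

# Train the Extra Tree Classifier on a reduced feature set
etc = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X[:, hyfs_union], y)
print(len(hyfs_intersection), len(hyfs_union))
```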
References

1. H. Tahaei, F. Afifi, A. Asemi, F. Zaki, N.B. Anuar, The rise of traffic classification in IoT networks: a survey. J. Netw. Comput. Appl. 154, 102538 (2020). https://doi.org/10.1016/j.jnca.2020.102538
2. P. Nimbalkar, D. Kshirsagar, Feature selection for intrusion detection system in Internet-of-Things (IoT). ICT Express 7(2), 177–181 (2021). https://doi.org/10.1016/j.icte.2021.04.012
3. A. Khraisat, I. Gondal, P. Vamplew, J. Kamruzzaman, Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity 2(1) (2019). https://doi.org/10.1186/s42400-019-0038-7
4. N.T. Pham, E. Foo, S. Suriadi, H. Jeffrey, H.F.M. Lahza, Improving performance of intrusion detection system using ensemble methods and feature selection, in ACM International Conference Proceeding Series (2018). https://doi.org/10.1145/3167918.3167951
5. A. Tabassum, A. Erbad, M. Guizani, A survey on recent approaches in intrusion detection system in IoTs, in 2019 15th International Wireless Communications and Mobile Computing Conference (2019), pp. 1190–1197. https://doi.org/10.1109/IWCMC.2019.8766455
6. F. Hussain, R. Hussain, S.A. Hassan, E. Hossain, Machine learning in IoT security: current solutions and future challenges. IEEE Commun. Surv. Tutor. 22(3), 1686–1721 (2020). https://doi.org/10.1109/COMST.2020.2986444
7. X. Gao, C. Shan, C. Hu, Z. Niu, Z. Liu, An adaptive ensemble machine learning model for intrusion detection. IEEE Access 7, 82512–82521 (2019). https://doi.org/10.1109/ACCESS.2019.2923640
8. S. Panda, G. Panda, Intelligent classification of IoT traffic in healthcare using machine learning techniques, in 2020 6th International Conference on Control, Automation and Robotics (ICCAR) (2020), pp. 581–585. https://doi.org/10.1109/ICCAR49639.2020.9107979
9. I. Abrar, Z. Ayub, F. Masoodi, A.M. Bamhdi, A machine learning approach for intrusion detection system on NSL-KDD dataset, in Proceedings of the International Conference on Smart Electronics and Communication (ICOSEC) (2020), pp. 919–924. https://doi.org/10.1109/ICOSEC49089.2020.9215232
10. S.M. Kasongo, Y. Sun, Performance analysis of intrusion detection systems using a feature selection method on the UNSW-NB15 dataset. J. Big Data 7(1) (2020). https://doi.org/10.1186/s40537-020-00379-6
11. Doreswamy, M.K. Hooshmand, Using ensemble learning approach to identify rare cyberattacks in network traffic data, in International Conference on Advanced Computer Science and Information Systems (ICACSIS) (2020), pp. 141–146. https://doi.org/10.1109/ICACSIS51025.2020.9263111
12. L. Vigoya, D. Fernandez, V. Carneiro, F.J. Nóvoa, IoT dataset validation using machine learning techniques for traffic anomaly detection. Electronics 10(22) (2021). https://doi.org/10.3390/electronics10222857
13. Y. Imrana, Y. Xiang, L. Ali, Z. Abdul-Rauf, A bidirectional LSTM deep learning approach for intrusion detection. Expert Syst. Appl. 185 (2021). https://doi.org/10.1016/j.eswa.2021.115524
14. S.A. Hussein, A.A. Mahmood, E.O. Oraby, Network intrusion detection system using ensemble learning approaches. Webology 18(Special Issue), 962–974 (2021). https://doi.org/10.14704/WEB/V18SI05/WEB18274
15. U. Amin, A.S. Ahanger, F. Masoodi, A.M. Bamhdi, Ensemble based effective intrusion detection system for cloud environment over UNSW-NB15 dataset. SCRS Conf. Proc. Intell. Syst., 483–494 (2021). https://doi.org/10.52458/978-93-91842-08-6-46
16. M. Paricherla et al., Towards development of machine learning framework for enhancing security in internet of things. Secur. Commun. Netw. 2022, 1–5 (2022). https://doi.org/10.1155/2022/4477507
17. A. Ikram et al., Crop yield maximization using an IoT-based smart decision. J. Sens. 2022, 1–15 (2022). https://doi.org/10.1155/2022/2022923
18. B. Uzun, S. Ballı, A novel method for intrusion detection in computer networks by identifying multivariate outliers and ReliefF feature selection. Neural Comput. Appl. (2022). https://doi.org/10.1007/s00521-022-07402-2
19. A. Hameed, J. Violos, A. Leivadeas, A deep learning approach for IoT traffic multi-classification in a smart-city scenario. IEEE Access 10, 21193–21210 (2022). https://doi.org/10.1109/ACCESS.2022.3153331
20. S.A.V. Jatti, V.J.K. Kishor Sontif, Intrusion detection systems. Int. J. Recent Technol. Eng. 8(2, Special Issue 11), 3976–3983 (2019). https://doi.org/10.35940/ijrte.B1540.0982S1119
21. The UNSW-NB15 dataset. https://research.unsw.edu.au/projects/unsw-nb15-dataset
A Deep Learning-Based Framework for Analyzing Stress Factors Among Working Women Chhaya Gupta, Sangeeta, Nasib Singh Gill, and Preeti Gulia
Abstract Occupational stress is becoming more common in a variety of sectors and, due to gender-specific job stresses, may fall disproportionately on working women. Although these pressures have received little attention until now, new research indicates that they can have a negative impact on health. Numerous diseases, such as cardiovascular disease and neurological disease, are greatly influenced by stress. Additionally, it has been asserted that high employee stress levels negatively affect an organization's productivity, which may add to society's financial burden. Handling stress is therefore essential for both personal health and society's well-being. Women employees are a vital and integral part of society, and by managing their stress and anticipating problems before they arise, they may enhance their performance both personally and organizationally. In this paper, various stressors on working women are examined. The paper provides a framework for analyzing stress in working women that considers a number of physiological aspects, including respiratory rate, heart rate, and other sociocultural assessments, by using a convolutional neural network as the base model, which is trained on the FER2013 dataset and achieves an accuracy of 97.7%. According to the severity of the stress prediction results, various corrective measures are then recommended. Keywords Deep learning · Stress factors · Working women · Convolution neural network
C. Gupta (B) · Sangeeta · N. S. Gill · P. Gulia Department of Computer Science and Applications, Maharshi Dayanand University, Rohtak, India e-mail: [email protected] Sangeeta e-mail: [email protected] N. S. Gill e-mail: [email protected] P. Gulia e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_18
1 Introduction Whenever there is an inconsistency between a situation’s demands and a person’s ability to handle them, stress is the result. Stress has an effect on everyone. Everyone has experienced a natural reaction like that at some point in their lives. The body’s response to a sensed threat or danger is stress. Numerous issues in life can make one feel stressed out. Today’s families are going through enormous changes as a result of the accelerated speed of development and technology. Women from all walks of life in India have entered a variety of professions, causing stress in both their personal and professional lives. Women’s access to educational possibilities is far greater than it was a few years ago, particularly in cities [1]. Due to difficulties such as childcare, health issues, and interpersonal relationships, women in particular face a specific set of challenges at work. Stress primarily causes feelings of despair or anxiety. Chronic stress causes permanent changes in a person’s physiological systems, and long-term disorders such as asthma, diabetes, and hypertension emerge. It has also been claimed that employee stress levels impair an organization’s performance, which can have an impact on society’s economic burden. As a result, stress management is critical for both individual health and societal well-being. Women’s experiences with stress remained largely unexplored since, until recently, attitudes and research on workplace stress were solely centered on men’s experiences. With the growing number of papers in the field of stress research, it has become clear that the traditional application of the stress concept has significant flaws. A professional’s mental status evaluation (MSE) is a defined technique for evaluating a person’s mental and emotional functioning. It entails a specific set of observations as well as a set of questions [2]. 
Healthcare is evolving, and it is critical to make use of new technology to generate fresh data and assist the adoption of precision medicine. Recent scientific developments and technological advancements have increased our understanding of illness and transformed diagnostic and treatment procedures, resulting in healthcare that is more precise, predictive, preventative, and individualized. With the rapid growth of technology, computer-aided analysis and prediction of early stress symptoms can help address these prevailing problems. The domains of applied sciences and engineering currently use a variety of artificial intelligence (AI) and deep learning (DL) technologies, which have matured significantly over time. Deep learning models scale to big datasets and improve with new data, allowing them to outperform many traditional machine learning approaches, partly because they can run on specialized computer hardware. Machines' ability to understand and manipulate data such as images, language, and speech has advanced dramatically in recent years. Deep learning in healthcare and medicine is expected to flourish as a result of the enormous amount of data being produced and the quick adoption of medical technology and digital record systems [3]. This paper presents an analysis of various stress factors in working women with the help of a deep learning-based framework. The main objectives and contributions of this paper are summarized as follows:
• The paper helps in identifying stressors that affect women's health at their workplace. • The paper presents a deep learning-based framework for analyzing stress at an early stage using convolutional neural networks trained on the FER2013 dataset, which is available in the Kaggle repository. • The paper applies convolutional neural networks to identify stress in working women by classifying facial expressions/emotions. • Various remedial actions are also suggested after early detection to reduce the impact of these factors on working women. The rest of the paper is organized as follows: Sect. 2 provides a related literature survey. Section 3 provides information about various stress factors and presents a brief background on deep learning and the proposed deep learning framework. Section 4 discusses the results, and Sect. 5 concludes the paper.
2 Literature Survey AlShorman et al. [4] analyzed the frontal lobe EEG spectrum using the Fast Fourier Transform (FFT) technique for feature extraction, and then applied Support Vector Machine (SVM) and Naïve Bayes (NB) machine learning methods for classifying mental stress. The proposed technique achieved an accuracy of 98.21%. Mohammadi et al. [5] proposed an approach for stress detection using the Kruskal–Wallis analysis technique for feature extraction, after which K-Nearest Neighbor (KNN) is applied for classification and detection. The proposed approach achieved an accuracy of 96.24%. Rajeswari et al. [6] used different machine learning methods for evaluating stress levels. The authors implemented the Kernel Support Vector Machine, KNN, AdaBoost, Random Forest, and Decision Tree methods for classification, and random forest achieved good results with an F1-score of 93.77% and an accuracy of 70.03%. Wahyuni et al. [7] analyzed stress levels in pregnant women by detecting their heart rate. With the help of project planning, research, component testing, mechanical system design, functional tests, and system optimization, the authors introduced a hardware programming approach; the results showed that the proposed model achieved good results. Rajendran et al. [8] provided a survey on mental stress detection using the Perceived Stress Scale (PSS) and EEG methods, and also discussed the pros and cons of these methods. Prasanalakshmi et al. [9] used machine learning techniques for anxiety detection and proposed a hybrid strategy, DR-CNN, for stress detection. Gil-Martin et al. [10] proposed a deep learning technique based on convolutional neural networks for human stress management; the proposed technique achieved an accuracy of 96.6%. Saeed et al. [11] proposed an algorithm for
stress detection using facial expressions. The proposed algorithm, FD-CNN, obtained an accuracy of 94% on the FER dataset. Shafi et al. [12] proposed a technique that predicts abnormalities in heart functionality. The technique employs convolutional neural networks and achieved good results in terms of F1-score, time, and ROC curves when compared with two-stage neural networks. Hingorani et al. [13] provided a survey on machine learning techniques that make use of audio features for detecting mental illness, and stated that supervised machine learning techniques are preferable with audio features rather than unsupervised techniques. Reddy et al. [14] proposed the RNN-LSTM method for identifying stress levels. The proposed method was compared with different machine learning techniques such as SVM, KNN, the Hidden Markov Model, and the Multilayer Perceptron, and achieved better results. Pinge et al. [15] compared two chest-worn heart rate monitors and one wrist-worn heart rate monitor for detecting stress levels. Sharma et al. [16] presented a review of supervised-learning soft computing techniques for stress diagnosis in human beings. Sakalle et al. [17] introduced an LSTM-based deep learning network for recognizing emotions using EEG signals. The network was compared with the multilayer perceptron, KNN, and SVM, and the LSTM network achieved an accuracy of 92.66% with 10-fold cross-validation. The literature survey revealed some limitations in the state of the art that need to be investigated. The following limitations are noted: • Mostly, the research work has been carried out on small datasets with a smaller number of images. The dataset sizes used are not sufficient to reveal the actual performance of the models.
• The research carried out so far has produced high-accuracy results using pre-trained models via transfer learning and other techniques, but little attention has been paid to creating or building a new model from scratch. Changing the complete architecture of a pre-trained model, by removing or adding new layers or even combining layers of two different models, is a tedious and difficult task. • Most of the research work has been carried out on raw images rather than pre-processed images that generalize the properties of a pre-trained model. To overcome these challenges, the framework proposed in this paper is trained from scratch on a large dataset with 35,887 images of different facial expressions/emotions. The dataset distinguishes the images into seven different facial expressions: angry, disgust, fear, happy, sad, surprise, and neutral. The dataset is further divided into training and testing sets with 28,709 and 7178 images, respectively. The framework, when trained on this large dataset, achieves an accuracy of 97.7%.
3 Stress Factors (Stressors) One needs to identify the prime underlying causes of stress before assessing the severity of stress in working women. Stress-inducing factors and demands are known as stressors. Stress is typically associated with undesirable things, such as a demanding job schedule or a turbulent relationship. On the other hand, anything that involves a lot of demands could be stressful; this category includes happy occasions such as getting married, purchasing a home, enrolling in college, or receiving a promotion. Unexpected visitors, followed by the lack of domestic support, pose substantial stress among working women under socioeconomic stressors. Similarly, being a perfectionist with excessive anxiety might lead to a psychological setback in professional women. Furthermore, fear for the future of the children and the insecurity of the husband's employment play a significant role in producing stress in the family and relationships [1]. Broadly, these stressors are classified into four categories (Fig. 1).
3.1 Deep Learning: Background Deep learning is a subset of machine learning that uses artificial neural networks exclusively. The idea of deep learning is not new; it has been in existence for a while. In its early days, the available processing power and data were not what they are today;
Fig. 1 Classification of stress factors: physical, psychological, socioeconomic, and psycho-spiritual
Fig. 2 Relationship between deep learning, ML, and AI
it’s all the rage these days. A deep learning algorithm will aim to reduce the difference between its prediction and expected output given a huge dataset of input and output pairs [18] (Fig. 2).
3.2 Deep Learning-Based Framework for Early Stress Detection This section presents the deep learning-based framework for identifying stress levels in working women. The framework is tested on the FER2013 dataset which is available on the Kaggle repository [19]. The dataset includes 48 × 48 grayscale images of different faces, each expressing a distinct emotion. The training set comprises 28,709 images and the test set consists of 7178 images. The emotions are numbered from 0 to 6 (0 = angry, 1 = disgust, 2 = fear, 3 = happy, 4 = sad, 5 = surprise, 6 = neutral). This clearly indicates it is a multi-class classification problem. The proposed framework consists of four convolutional layers, three Max pooling layers, three dropout layers, one flatten layer, and two fully connected layers. Figure 3 provides the complete architecture of the proposed framework. The framework is trained with the training set with 28,709 images and achieves an accuracy of 97.7%. The data is collected from the Kaggle repository. The dataset consists of seven different facial expressions/emotions images which are pre-processed. The convolutional neural layers in the network are used to choose features. There are four convolutional layers in the framework. Except for the final convolutional layer, each is followed by a Max pooling layer, thus the framework is composed of three max pooling layers. The Max pooling layer helps in selecting maximum elements from a feature map selected by the convolutional layer. In order to avoid the overfitting issue in neural networks, dropout layers have been used. The framework consists
Fig. 3 Stress prediction using deep learning framework
of three dropout layers (Fig. 2). The flatten layer helps in converting the 3D matrix into a vector form. Finally, there are dense layers which are also known as fully connected layers. The dense layer helps each neuron to receive input from all the previous neurons. This layer is responsible for classifying images based on output from convolutional layers. Figure 4 provides a complete layer architecture of the proposed framework.
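The layer stack just described can be sketched in Keras as follows. The layer counts (four convolutional, three max-pooling, three dropout, one flatten, two dense) follow the text; the filter counts, kernel sizes, dropout rates, and dense width are illustrative assumptions, since the paper does not list them.

```python
from tensorflow.keras import Input, Model, layers

inputs = Input(shape=(48, 48, 1))            # 48x48 grayscale FER2013 faces
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D()(x)                 # pooling after conv 1
x = layers.Dropout(0.25)(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D()(x)                 # pooling after conv 2
x = layers.Dropout(0.25)(x)
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D()(x)                 # pooling after conv 3
x = layers.Dropout(0.25)(x)
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)  # final conv, no pooling
x = layers.Flatten()(x)                      # 3D feature maps to a vector
x = layers.Dense(256, activation="relu")(x)  # fully connected layer 1
outputs = layers.Dense(7, activation="softmax")(x)  # one unit per emotion class

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```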
4 Results and Discussion The framework is tested on a dataset that consists of 7178 images featuring seven emotions, namely angry, disgust, fear, happy, sad, surprise, and neutral. Figure 5 provides the framework accuracy and loss after 15 epochs. The model is tested using a random image from the testing dataset shown in Fig. 6 and the framework
Fig. 4 The layer architecture of the proposed framework
Fig. 5 Framework accuracy and loss
predicted correctly that the image belongs to a happy emotion. In the study, the framework is used for detecting stress levels in working women with the help of facial expressions/emotions.
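The inference step around the trained network can be illustrated as follows: a 48 x 48 grayscale face is scaled to the input format and the softmax output is mapped back to an emotion label using the 0-6 numbering given in Sect. 3.2. The probability vector below is a placeholder standing in for the model's prediction.

```python
import numpy as np

# Label order follows the FER2013 numbering used above (0 = angry ... 6 = neutral)
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def preprocess(face):
    """Scale a 48x48 grayscale face to [0, 1] and add batch and channel axes."""
    face = np.asarray(face, dtype="float32") / 255.0
    return face.reshape(1, 48, 48, 1)

def decode(probs):
    """Map a softmax output vector to its emotion label."""
    return EMOTIONS[int(np.argmax(probs))]

face = np.random.randint(0, 256, size=(48, 48))  # stand-in for a test image
x = preprocess(face)                             # what would be fed to the network
probs = np.array([0.01, 0.01, 0.02, 0.90, 0.02, 0.02, 0.02])  # placeholder for model.predict(x)[0]
print(decode(probs))  # → happy
```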
Fig. 6 Prediction results
5 Conclusion Stress, being a prevailing problem all around the world, attracts researchers to use new technology capable of mitigating it. Working women are prime victims because daily routine practices are often not favorable toward them. Deep learning-based architectures have attracted many researchers in this field of study. This paper classifies various stressors into four categories, namely physical, socioeconomic, psycho-spiritual, and psychological factors, and proposes a deep learning-based framework for analyzing these stress factors and modeling them for early stress prediction. The framework is trained on the FER2013 dataset, which consists of facial images with different emotions, namely angry, happy, sad, disgust, surprise, fear, and neutral. The training set consists of 28,709 images and the testing set consists of 7178 images. The framework achieved an accuracy of 97.7%. In the future, the framework may be combined with IoT-based sensors to build useful wearable devices.
References

1. D.L. Krishnan, Factors causing stress among working women and strategies to cope up. IOSR J. Bus. Manag. 16(5), 12–17 (2014). https://doi.org/10.9790/487x-16551217
2. A. Kumar, K. Sharma, A. Sharma, Hierarchical deep neural network for mental stress state detection using IoT based biomarkers. Pattern Recognit. Lett. 145, 81–87 (2021). https://doi.org/10.1016/j.patrec.2021.01.030
3. A. Esteva et al., A guide to deep learning in healthcare. Nat. Med. 25(1), 24–29 (2019). https://doi.org/10.1038/s41591-018-0316-z
4. O. AlShorman et al., Frontal lobe real-time EEG analysis using machine learning techniques for mental stress detection. J. Integr. Neurosci. 21(1), 1–11 (2022). https://doi.org/10.31083/j.jin2101020
5. A. Mohammadi, M. Fakharzadeh, B. Baraeinejad, An integrated human stress detection sensor using supervised algorithms. IEEE Sens. J. 22(8), 8216–8223 (2022). https://doi.org/10.1109/JSEN.2022.3157795
6. V. Sasikala, T. Rajeswari, S.K. Naseema Begum, C. Divya Sri, M. Sravya, Stress detection from sensor data using machine learning algorithms, in Proceedings of the International Conference on Electronics and Renewable Systems (ICEARS 2022) (2022), pp. 1335–1340. https://doi.org/10.1109/ICEARS53579.2022.9751881
7. Y. Wahyuni, M.A. Pany, Stress detection in pregnant women. 3(1), 46–55 (2022). https://doi.org/10.30997/ijar.v3i1.182
8. V.G. Rajendran, S. Jayalalitha, K. Adalarasu, G. Usha, A review on mental stress detection using PSS method and EEG signal method. ECS Trans. 107(1), 1845–1855 (2022). https://doi.org/10.1149/10701.1845ECST/XML
9. B. Prasanalakshmi, T.A. Kumar, Deep regression hybridized neural network in human stress detection, in 1st IEEE International Conference on Smart Technologies and Systems for Next Generation Computing (ICSTSN 2022) (2022). https://doi.org/10.1109/ICSTSN53084.2022.9761305
10. M. Gil-Martin, R. San-Segundo, A. Mateos, J. Ferreiros-Lopez, Human stress detection with wearable sensors using convolutional neural networks. IEEE Aerosp. Electron. Syst. Mag. 37(1), 60–70 (2022). https://doi.org/10.1109/MAES.2021.3115198
11. S. Saeed, A.A. Shah, M.K. Ehsan, M.R. Amirzada, A. Mahmood, T. Mezgebo, Automated facial expression recognition framework using deep learning. J. Healthc. Eng. 2022 (2022). https://doi.org/10.1155/2022/5707930
12. J. Shafi, M.S. Obaidat, P.V. Krishna, B. Sadoun, M. Pounambal, J. Gitanjali, Prediction of heart abnormalities using deep learning model and wearable devices in smart health homes. Multimed. Tools Appl. 81(1), 543–557 (2022). https://doi.org/10.1007/s11042-021-11346-5
13. M. Hingorani, N. Pise, Detection of mental illness using machine learning and deep learning, 2272–2278 (2021)
14. S.M.R.S. Mahesh Reddy, Y.B.Y. Bhanusree, Acoustic based stress level identification using deep neural architecture. Int. J. Eng. Technol. Manag. Sci. 6(2), 7–17 (2022). https://doi.org/10.46647/ijetms.2022.v06i02.002
15. A. Pinge, S. Bandyopadhyay, S. Ghosh, S. Sen, A comparative study between ECG-based and PPG-based heart rate monitors for stress detection, in 2022 14th International Conference on Communication Systems and Networks (COMSNETS 2022) (2022), pp. 84–89. https://doi.org/10.1109/COMSNETS53615.2022.9668342
16. S. Sharma, G. Singh, M. Sharma, A comprehensive review and analysis of supervised-learning and soft computing techniques for stress diagnosis in humans. Comput. Biol. Med. 134, 104450 (2021). https://doi.org/10.1016/j.compbiomed.2021.104450
17. A. Sakalle, P. Tomar, H. Bhardwaj, D. Acharya, A. Bhardwaj, A LSTM based deep learning network for recognizing emotions using wireless brainwave driven system. Expert Syst. Appl. 173, 114516 (2021). https://doi.org/10.1016/j.eswa.2020.114516
18. S. Jha, C. Seo, E. Yang, G.P. Joshi, Real time object detection and tracking system for video surveillance system. Multimed. Tools Appl. 80(3), 3981–3996 (2021). https://doi.org/10.1007/s11042-020-09749-x
19. FER-2013 | Kaggle. https://www.kaggle.com/datasets/msambare/fer2013. Accessed 30 July 2022
Automatic Question Generation Abhishek Phalak, Shubhankar Yevale, Gaurav Muley, and Amit Nerurkar
Abstract It is critical to evaluate an individual's capabilities in order to assess their knowledge. Exams play a vital part in the testing process and, as a result, so do the questions that must be set. Framing questions can be a tiresome endeavor at times, necessitating automated question generation. Natural language processing (NLP) may be used to produce questions that are accurate, answerable, and efficient. We can frame questions by accepting and processing information using NLP and different question creation methods. MCQs, fill in the blanks, Wh-questions, crosswords, and a quick synopsis of the supplied material are among the questions created. Before being sent to the AQG, the data is preprocessed. Depending on the supplied content, the questions are verified for contextual relevance and answerability. Automated question generation can be beneficial in self-analysis, self-guidance, and other areas. Keywords Natural language processing · Data preprocessing · Machine learning · T5 transformer · Naive Bayes · Prediction
A. Phalak · S. Yevale · G. Muley (B) · A. Nerurkar Vidyalankar Institute of Technology, Mumbai, Maharashtra, India e-mail: [email protected] A. Nerurkar e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_19
1 Introduction In today's society, online learning has become one of the most important aspects of education, allowing students to access learning resources at any time and from any location. Conducting assessments is also important for assessing the student's understanding. The exam curriculum is extensive, but the number of questions to be asked is rather modest; as a result, manually generating questions with the same level of precision becomes laborious. A set of well-crafted questions would aid in
determining a student’s level of understanding. As a result, utilizing a model to create questions with the highest level of accuracy every time would be beneficial. The system receives text documents as input from the user and creates questions depending on the text document’s various portions and parameters. The system does some preprocessing and looks for words/phrases that might be used as responses to the questions that are created. Based on the context in which they are utilized in the paper, these essential responses would then be used to generate questions. From these replies, the system generates syntactically and semantically valid questions that may be answered in context with the material that the user has supplied. The significant and answerable questions are then utilized to create a self-evaluation quiz that allows the user to assess their grasp of the topic or idea.
2 Literature Survey We have analyzed and studied some related work to our project, i.e., automatic question generation. The inferences of all the 12 papers that have been studied are described below. In [1], a holistic and novel generator-evaluator framework has been incorporated that directly optimizes objectives that reward semantics and structure. The paper aims at generating syntactically correct and semantically structured question generation from the text. In [2], they have tried to mimic the human process of generating questions by first creating an initial draft and then refining it. The present automatic assessment metrics based on n-gram similarity do not necessarily correspond well with human judgments about the answerability of a question, hence a scoring function to capture answerability is proposed and integrated with existing metrics. This new method appears to correspond much better with human judgments [3]. In [4], a template-based approach is used where it takes input as the data table and generates random tuples and passes it to the question generator to get the desired questions. The paper [5] presents the proposed system for the automatic generation of FIB that identifies informative sentences in a passage and uses a hybrid algorithm to choose the most appropriate word/phrase that can be marked as a gap. AQG from [6] focuses on the production of Wh-questions based on the sentence’s determined class. StanfordNER is used to analyze the subject and object to obtain the class. This class is also used to generate questions by looking at the relationship between subject, verb, and object. More recent research by Blšták and Rozinajová [7] is based on the data-driven technique, which involves extracting features from text and selecting the best appropriate class of questions (based on the features). Each sentence class contains a collection of potential queries, which are picked based on the similarity of traits. 
This work [8] focuses on MCQ generation for sports, notably cricket, where a specific pattern of MCQ is described and new questions are generated based on it; this is a more precise approach to generating questions for a certain domain. Agarwal and Mannem [9] proposed a system based on first blanking keys from the sentences and then
identifying similar terms as distractors for these keys; the system automatically extracts the informative sentences from the comprehension passage and generates gaps from them. The system in [10] demonstrated a rule-based automatic question creation mechanism for reading comprehension. It selects notable phrases in a paragraph using different methods, then uses named entity recognition and constituency parsing to produce possible question–answer pairs; based on a set of rules and probable answers, the statement is then turned into an interrogative form. The computational intelligence framework that automates or semi-automates the process of quiz and exam question generation is based on information retrieval and NLP algorithms; it incorporates production rules, LSTM neural network models, and other intelligent techniques [11]. The research in [12] focuses on the production of Wh-questions based on the crucial answer selected by the user or system. Question screening is done with BERT to exclude unanswerable questions, grouping questions with similar answers and choosing one.
3 Proposed System We propose a system that helps users generate different types of questions: MCQs, fill in the blanks, crosswords, and Wh-questions. We also provide a short summary of the input text. The components of the system are as follows: 1. Input The user enters the text for which questions are to be generated in the UI. The website is integrated with Flask to analyze the text and carry out the machine learning operations. The number of questions generated is determined by the user's input. 2. Preprocessing The preprocessing techniques used during question generation are as follows: (a) POS Tagging—Part-of-speech (POS) tagging is a natural language processing step that categorizes words into their corresponding parts of speech, e.g., verb, adjective, noun. For POS tagging, we use the pos_tag function from the NLTK library, which returns the POS tag of each word. (b) NER Tagging—NER tagging categorizes a named entity into predefined classes such as location, person, organization, etc. NER tagging is crucial for generating rule-based Wh-questions, since questions are framed depending on the entity. (c) Tokenization—Tokenization is the process of breaking strings into small tokens; a token can be a substring or a word. We use both a sentence tokenizer and a word tokenizer in question generation.
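The three preprocessing steps can be illustrated with lightweight stand-ins. The real system calls NLTK's sent_tokenize, word_tokenize, and pos_tag; the regex tokenizers below are only a sketch of what those calls do, not the actual pipeline:

```python
import re

# Toy stand-ins for NLTK's sent_tokenize and word_tokenize; the real
# system uses the NLTK library (and its pos_tag function) as described.
def sentence_tokenize(text):
    # Split on whitespace that follows sentence-ending punctuation.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def word_tokenize(sentence):
    # Words and standalone punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

sents = sentence_tokenize("POS tagging labels words. NER finds entities like India.")
tokens = word_tokenize(sents[0])
```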
A. Phalak et al.
Fig. 1 Proposed system
3. Question Generation After the user inputs the text and the number of questions to be formed, text cleaning and preprocessing are performed. The fill-in-the-blank generator is trained on the SQuAD v1 dataset; it finds the pivotal answers and replaces them with blanks. The answer choices are then sent to a crossword generator, which builds a crossword in which the answers are hidden and clues are provided to find them. For MCQs, wrong answers are generated by finding words with very similar global word representations. The summary generator provides a short summary giving a brief description of the text (Fig. 1).
4 Methodology
1. Crossword
Crossword generation is an extension of fill-in-the-blank generation. The clues required to find the answers in the crossword are the fill-in-the-blank questions generated using the ML and NLP techniques stated earlier. The answers are hidden in the crossword and can be oriented in any of three directions: horizontal, vertical, or diagonal. The crossword is a 15 × 15 matrix built from a two-dimensional list. The cells of the matrix not occupied by answers are filled with random alphabets. The process of generating the crossword and placing each word in the two-dimensional matrix with a different orientation is fully randomized, and hence it improves on many rule-based crossword generation algorithms.
We initialize the 15 × 15 matrix with zeros and store the answers in a list. Every answer is a word, which we split into individual letters. We created three functions to place a word in the different orientations: top_bottom for vertical alignment, left_right for horizontal alignment, and diagonal for slant alignment. The algorithm generates a random number between 1 and 3 using the randint() function, which in turn selects one of the three placement functions at random. For example, if the random number generated is 1, the top_bottom() function is selected; if it is 2, the diagonal() function is selected; and so on. Once the function is chosen, it keeps trying to fit the word in the orientation it defines until the word fits in the crossword. All three functions work in a similar fashion, apart from the orientation. When a word split into letters is passed to a function, the function generates two random numbers, which together give a random coordinate in the 2-D space. The function tries to fit the word with its starting letter at that coordinate: it checks that every required cell is either zero or already holds the required letter. If the condition is satisfied for all letters, the function places the word; otherwise it draws another pair of random coordinates and tries again. We also restrict the range of the generated random numbers to avoid index-out-of-bounds errors. Once all the words (answers) are placed in the 2-D matrix, we replace the remaining blank cells with random alphabets. The top_bottom placement is summarized below:
1. Start
2. For each i in range(0, 225):
3.   Initialize counter to zero
4.   Generate two random numbers a (between 1 and (15 − length)) and b (between 1 and 15)
5.   For each x in range(0, length(word)):
6.     If matrix[a + x][b] == 0 or matrix[a + x][b] == array[x]:
7.       Increment counter by 1
8.   If counter == length:
9.     For each j in range(0, length(word)):
10.      matrix[a + j][b] = array[j]
11.     Return
12. End
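The vertical placement routine described above can be sketched in Python. The 0-based indexing and helper names here are ours (the pseudocode is 1-based), and left_right and diagonal would differ only in which index advances:

```python
import random

SIZE = 15  # the paper uses a 15 x 15 grid

def top_bottom(matrix, word):
    """Try to place `word` vertically (the top_bottom orientation).

    Mirrors the pseudocode: repeated random attempts, and every target
    cell must be empty (0) or already contain the required letter."""
    letters = list(word)
    for _ in range(SIZE * SIZE):  # up to 225 attempts, as in the pseudocode
        a = random.randint(0, SIZE - len(letters))  # row: word stays in bounds
        b = random.randint(0, SIZE - 1)             # column
        if all(matrix[a + x][b] in (0, letters[x]) for x in range(len(letters))):
            for j, ch in enumerate(letters):
                matrix[a + j][b] = ch
            return True
    return False

random.seed(1)
grid = [[0] * SIZE for _ in range(SIZE)]
placed = top_bottom(grid, "sweet")
```

On an empty grid the first attempt always succeeds; collisions with already-placed words trigger the retry loop, exactly as the pseudocode describes.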
We also generate an answer matrix in which the answers are capitalized and all other letters are lowercase, which helps users clearly see where the answers are. The answer words can also be entangled: for example, if the word “sweet” is placed and the next word to be placed is “train”, then “train” can be placed below the letter “t” of “sweet”. The main feature of this crossword is its dynamic nature: since both the orientation of each word and its location are chosen at random, the user gets a different crossword every time for the same input text. Hence, different crosswords can be created for the same input. The number of answers to be hidden in the crossword can also be defined by the user; accordingly, the system generates fill-in-the-blank clues and hides the corresponding words in the crossword. 2. Fill in the Blanks and MCQ Generation For fill in the blanks and MCQ generation, we used the SQuAD v1 [13] dataset (Stanford Question Answering Dataset), which contains around 21,000 paragraphs and 100,000 questions. We also extracted from the paragraphs the sentences in which the answers to the generated questions lie. We removed all stopwords before training to find pivotal answers: all words from the paragraphs except stopwords are taken, and features are generated for training. We applied POS and NER tagging to all the words to generate the NER-tag and POS-tag features, and also added features like the DEP tag and word shape. The most important feature is the binary “Answer” column, which indicates whether the word is the answer or not. With these features we trained the data on the Naïve Bayes model of the scikit-learn package, because Naïve Bayes gives the probability of each word being the answer, which helps us decide the pivotal answer.
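For the MCQ step described in this section, wrong choices are the words whose embeddings are closest to the answer's. A minimal sketch of that distractor selection follows; the vectors below are toy stand-ins for the pretrained GloVe embeddings the system loads:

```python
import math

# Toy vectors standing in for pretrained GloVe embeddings (illustrative only).
vectors = {
    "paris":  [0.90, 0.10, 0.00],
    "london": [0.85, 0.15, 0.05],
    "berlin": [0.80, 0.20, 0.10],
    "madrid": [0.88, 0.12, 0.02],
    "banana": [0.00, 0.90, 0.40],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def distractors(answer, k=3):
    """Pick the k words most similar to the correct answer as wrong choices."""
    scored = [(cosine(vectors[answer], vec), word)
              for word, vec in vectors.items() if word != answer]
    return [word for _, word in sorted(scored, reverse=True)[:k]]
```

With these toy vectors, distractors("paris") returns the other city names rather than the unrelated "banana", which is the behaviour the cosine-similarity step is meant to achieve.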
After the model is trained, we save it so that it can later be used to predict pivotal answers (Fig. 2). The user provides the number of questions to be generated; the input text is broken into sentences and preprocessed. The preprocessed input is then passed to the model to predict the pivotal answer, and the word with the maximum probability of being a pivotal answer is replaced by a blank. MCQ generation is similar to fill in the blanks, except that three wrong choices must be provided along with the answer. To generate wrong answers, we used pretrained GloVe word embeddings [14] (Global Vectors for Word Representation) and cosine similarity: GloVe provides vector representations of words, and cosine similarity measures the similarity between two vectors. The correct answers were later passed to the crossword generation algorithm to generate crosswords. 3. Wh-Question Generation We used a rule-based approach (Fig. 3) to generate Wh-questions because the structure of the framed question differs from that of the input sentence. The preprocessing of the input text is done as follows:
Fig. 2 MCQ generation process
1. Tokenize into sentences.
2. Remove the full stop.
3. Capitalize the single letter “i”, as the parser recognizes it correctly only in capital form.
4. Make the first letter lowercase.
As not every sentence in the text can be used to frame questions, the sentences need to be ranked. To rank them, we used the TextRank algorithm to filter out the important sentences. POS and NER tagging is used to determine the parts of speech and named entities. Two important aspects help in framing Wh-questions:
• Auxiliary verb—An auxiliary verb conveys information about tense, mood, or person and supports the main verb.
• Discourse markers—A discourse marker plays an important role in indicating the direction of a conversation. Usually, the part of the sentence after the discourse marker answers the earlier part, so for such sentences the question type becomes fixed; examples are because, as a result, and although.
We have used three different algorithms for generating Wh-questions, which differ in their rules and depend on the type of sentence. 3.1 NER-Based Algorithm This type of algorithm is most suitable for declarative/assertive sentences. To find nouns we used POS tagging, and then to classify them into named entities we used NER
Fig. 3 Rule-based approach
tagging. NER tagging assigns words to predefined categories: for example, India is tagged GEO, meaning it represents a geographical location, and Google is tagged ORG, meaning it represents an organization. For such tags, the Wh-question to be framed is already determined by the rules: for a location or place, the word “where” is used to generate the Wh-question, whereas for a time entity, “when” is used. After identifying the Wh-word to be used, we need to transform the sentence while ensuring semantic and syntactic correctness, according to the rules defined in the algorithm. If the named entity is found at the start of the sentence, we replace the entity with the corresponding Wh-word; if not, we look at what follows the first non-auxiliary verb: if it turns out to be a noun, we discard it; otherwise we keep it in the question part and rearrange the question part. Finally, the Wh-word is attached at the beginning and a question mark at the end.
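The tag-to-Wh-word rules can be written as a simple lookup. The tag names follow the GEO/ORG convention used above, and only the simplest rule (entity at the start of the sentence) is sketched; the table itself is illustrative:

```python
# Illustrative rule table mapping NER tags to Wh-words, per the rules above.
WH_FOR_TAG = {
    "geo": "Where",  # geographical locations, e.g. India
    "tim": "When",   # time entities
    "org": "What",   # organizations, e.g. Google
    "per": "Who",    # persons
}

def wh_question(tokens, entity, tag):
    """Apply the simplest rule: a named entity at the start of the
    sentence is replaced by the Wh-word for its tag, and a question
    mark is attached at the end."""
    if tokens and tokens[0] == entity and tag in WH_FOR_TAG:
        return " ".join([WH_FOR_TAG[tag]] + tokens[1:]) + "?"
    return None  # other cases need the verb-based rearrangement rules
```

For instance, the tokens of "Google acquired YouTube" with entity "Google" tagged org become "What acquired YouTube?".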
3.2 Discourse Marker-Based Algorithm As discussed earlier, sentences with discourse markers can be easily transformed into Wh-questions. We assigned a Wh-question word to every discourse marker. Using NER and POS tagging, we check whether an auxiliary verb exists. If it exists, we tokenize the part of the sentence before the discourse marker. If the tags indicate that the sentence is in the first person, the word “you” is appended to the question part after the Wh-word defined by the discourse marker rules, along with the verb “were”. If it is not in the first person, the auxiliary word is appended after the Wh-word. If no auxiliary verb is present in the sentence, the auxiliary verb takes a form of “do” (do, did, does, etc.), and using lemmatization we reduce the sentence's main verb to its base form. We created rules to determine which form of “do” is required depending on the combination of noun and verb: for example, if an NN/NNP and VBN combination occurs, the auxiliary verb is “did”; for NN/NNP and VBZ we use “does”; and so on. The auxiliary verb is attached to the question part, with the Wh-word at the start and a question mark at the end, to generate the question. 3.3 Non-discourse Marker-Based Algorithm Assertive or declarative sentences that do not contain any discourse marker fall into this category; they usually yield Yes/No questions. Here no Wh-word is needed: we simply find the auxiliary verb and move it to the beginning. The rest of the processing and transformation is done exactly as in the discourse marker algorithm. Finally, a question mark is added at the end of the generated question.
4. Summary Generation
A summary is a collection of the main points that conveys the idea of the whole text in very few words, which is convenient for users and saves time. For summary generation, we used the T5 transformer model [15]. A transformer is a deep learning model that can map one sequence to another with the help of encoders and decoders. We used the “t5-large” model with the min-length parameter set to 50 and the max-length parameter set to 200. The percentage of text retained in the summary (i.e., number of words in summary × 100/number of words in text) is close to 30%, which means the text size is reduced by about 70% when converted into a summary.
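The quoted retention figure follows directly from the formula stated above; a quick check (the function name is ours, the formula is the paper's):

```python
def summary_retention(text, summary):
    """Percentage of text retained, per the stated formula:
    words(summary) * 100 / words(text)."""
    return len(summary.split()) * 100 / len(text.split())

# A 30-word summary of a 100-word passage retains 30% (a 70% reduction).
retention = summary_retention(" ".join(["w"] * 100), " ".join(["w"] * 30))
```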
5 Conclusion and Future Scope Automatic question generation and evaluation are becoming more prevalent in intelligent education systems as a result of developments in online learning. The paper begins with a compilation of review articles from the previous decade. It then highlights the study's progress by discussing state-of-the-art automatic question generation techniques and creating questions in the form of MCQs, fill in the blanks, Wh-questions, crosswords, and a summary. Sometimes the model chooses wrong words as answers, so in future we would like to increase the model's efficiency by training it on a larger dataset. Also, rather than using a rule-based approach for Wh-questions, the results may be better if neural networks are used to generate them. Acknowledgements The successful completion of our project required guidance and assistance from various people, and we are extremely privileged to have received it. We sincerely thank our project guide, Prof. Amit Nerurkar, who took a keen interest in our project work.
References
1. V. Kumar, G. Ramakrishnan, Y.-F. Li, Putting the horse before the cart: a generator-evaluator framework for question generation from text (2019)
2. P. Nema, A.K. Mohankumar, M.M. Khapra, B.V. Srinivasan, B. Ravindran, Let's ask again: refine network for automatic question generation (2019)
3. P. Nema, M.M. Khapra, Towards a better metric for evaluating question generation systems (2018)
4. A. Shirude, S. Totaia, S. Nikhar, V. Attar, J. Ramanand, Automated question generation tool for structured data, in International Conference on Advances in Computing, Communications and Informatics (ICACCI) (2015)
5. S. Pannu, A. Krishna, S. Kumari, R. Patra, S.K. Saha, Automatic Generation of Fill in the Blank Questions from History Books for School-Level Evaluation (Springer Nature Singapore, 2018)
6. D. Swali, J. Palan, I. Shah, Automatic question generation from paragraph. Int. J. Adv. Eng. Res. Dev.
7. M. Blšták, V. Rozinajová, Machine learning approach to the process of question generation, in Text, Speech, and Dialogue, ed. by K. Ekštein, V. Matoušek (TSD, 2017)
8. A system for generating multiple choice questions: with a novel approach for sentence selection. https://doi.org/10.18653/v1/W15-4410
9. M. Agarwal, P. Mannem, Automatic gap-fill question generation from text books, in Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications (2011), pp. 56–64
10. H. Lovenia, F. Limanta, A. Gunawan, Automatic question-answer pairs generation from text. researchgate.net/publication/328916588 (2018)
11. A. Killawala, I. Khokhlov, L. Reznik, Computational Intelligence Framework for Automatic Quiz Question Generation (IEEE, 2018)
12. V. Kumar, S. Muneeswaran, G. Ramakrishnan, Y.-F. Li, ParaQG: a system for generating questions and answers from paragraphs
13. P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016 (Austin, Texas, USA, 2016), pp. 2383–2392
14. https://nlp.stanford.edu/projects/glove/
15. https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
Automatic Construction of Named Entity Corpus for Adverse Drug Reaction Prediction
Samridhi Dev and Aditi Sharan
Abstract Cancer is one of the major global causes of death. The scientific communities of clinical and molecular oncology have recently managed to significantly extend the life expectancy of patients with specific cancer forms. Medical health records are being published on a prodigious scale, including electronic health records, case reports, discharge summaries, patient reviews, etc. The availability of structured data, however, presents a significant stumbling block. Natural language processing (NLP) is the sole practical method to extract and encode textual data for clinical science, but the absence of standard, annotated datasets for training and testing machine learning algorithms impedes progress in clinical NLP. The intention of this research is therefore to bridge the gap identified in the literature by developing a reliable labelled dataset for the automated prediction of adverse drug reactions induced by the medications used in the treatment of cancer. Unlike other datasets, this corpus is based on the complete text of cancer-related case reports. Keywords Automatic dataset creation · Labelled corpus · Adverse drug reaction · Cancer · Drugs · Diseases · Annotations · Normalization · Machine learning
1 Introduction Biomedical literature publications are numerous and are only valuable if reliable and apposite methods for obtaining and analysing that material are accessible. For bioscientists, the surging body of literature presents a variety of difficulties. One of the fundamental biological concepts in biomedical research is the adverse drug reaction pertinent to a given drug and disease. The diversity and ambiguity in the names of adverse drug reaction entities hinder their automatic recognition from free text. The majority of data, however, is still unstructured, particularly in the case of clinical trials. The inability to acquire a labelled corpus for training, testing, and validation presents a significant obstacle for machine learning algorithms that assess the entity. Consequently, there is an unmet requirement for the development of biological dataset curation and structured labelled datasets. As far as we are aware, there is no dataset that is disease-specific and offers details on drugs associated with cancer and its side effects. The objectives of this paper are the creation of a cancer-specific dataset that contains data on cancer-related drugs and their cognate side effects, and the provision of a methodology for automatic named entity recognition (NER) corpus construction. Achieved tasks include automatic normalization of entities, assignment of properties to the entities, and employment of full-text case reports for corpus construction. Another promising attainment is the reduction of manual labour and dependency on expert knowledge required for biomedical corpus construction.
S. Dev (B) · A. Sharan, Jawaharlal Nehru University, New Delhi 110067, India. e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_20
2 Related Work 2.1 Biomedical Named Entity Recognition The goal of biomedical named entity recognition is to automatically identify biological entities in textual data, such as chemical substances, genes, proteins, viruses, ADRs, alleles, illnesses, DNAs, and RNAs. Clinical NER allows recognizing and labelling entities in clinical text, which is an essential building block for many downstream NLP applications, such as information extraction from corpora and de-identification of entities in clinical narratives. Unstructured text has been converted into meaningful, usable data by identifying entities like chemicals and proteins [1, 2], pituitary adenomas [3], breast cancer [4], and genes [5]. Entity recognition has also been used for various tasks such as classifying drug-induced liver injury [6], recognition of rare diseases and their clinical manifestations [7], extraction of medicinal plant phenotypes [8], identification of human-related biomedical terms [9], classification of chemicals into organic and inorganic [7], and biomedical question answering [10]. In comparison with other entities, very little work has been done to extract adverse drug effects. Unwanted drug reactions, or ADRs, occur as a consequence of medications used to treat a variety of diseases and can be fatal. The irreplaceable life of a patient can be saved with early drug reaction prediction. The existence of data in unstructured formats and the restricted availability of data are factors explaining why relatively little work has been done on recognizing ADRs.
2.2 Datasets for Biomedical Entity Recognition Although health experts are generating large volumes of data in the form of electronic health records, case reports, clinical reports, discharge summaries, etc., very few annotated corpora in the medical domain exist. Some of the available corpora are structured but not annotated, such as the i2b2 2010 dataset of 877 discharge summaries [11]. Many of them, such as the BC5CDR corpus, the Comparative Toxicogenomics Database, and the ProGene corpus [1, 12], focus on entities like chemicals, diseases, proteins, drugs, and genes. Several new corpora of annotated case reports were made available recently. Reference [13] presented a corpus focusing on phenotypic information for chronic obstructive pulmonary disease, while [14] presented a corpus focusing on identifying main-finding sentences in case reports. Reference [15] reviewed applications for the extraction of clinical information. Several corpora are available for adverse drug reactions; among them are SIDER, FDA-ars, DeIDNER, and ade-corpus-v2 [16, 17].
3 Proposed Work 3.1 ADR Corpus Overview There are relatively few annotated corpora in the medical field. The amount of data relevant to the prediction of adverse drug reactions is immense and growing quickly every day; however, very few useful datasets are available. The datasets used so far for ADR prediction were constructed from abstracts alone. The proposed dataset is created solely for cancer and gathers adverse drug reactions from full-text case reports. We included four distinct entity types with their properties: drugs, adverse drug reactions, dysn, and cancer types. For each entity, the dataset contains the extracted entity, starting position, entity length, normalized term, and assigned unique concept identifier. Figure 1 describes the proposed workflow for the construction of the corpus.
3.2 Corpus Construction Steps The majority of the accessible datasets have been built manually, which is a time-consuming and labour-intensive task requiring the focus of numerous subject experts. To address these issues, we present an automated way of annotating data and normalizing the extracted entities. In total, 500 PubMed case reports were selected by querying PubMed. SNOMED CT and RxNorm were utilized to map the extracted entities to standard terms. The MetaMap tool has
Fig. 1 Workflow for construction of corpus
been used for annotation and normalization, followed by the assignment of properties that elucidate the structure of each entity (abbreviation, acronym, combination, and overlapped form). Figure 2 shows the steps followed for the construction of the presented corpus. The following subsections explain the workflow for the construction of the proposed corpus in detail.
3.2.1 Corpus Collection
For dataset construction, the availability of informative and instructive data is of the utmost importance. Cancer is still one of the major causes of mortality in the world and has a large impact on healthcare, so this dataset is dedicated to cancer. The following sub-steps are involved in corpus collection. • Case report identification To restrict the corpus to cancer-related drugs and adverse effects, a PubMed search was performed, limited to the English language. Figure 3 shows the PubMed query used for retrieving cancer-related case reports. • Shortlisting case reports The PubMed query retrieved 2351 case reports, from which 500 were shortlisted based on publication date and accessibility. Each case report has the same abstract, report, and discussion structure, and each report's entire text was used for annotation. Figure 4 displays the PubMed query used. • Gathering data We used web scraping, which involves gathering data from a website programmatically, to download all textual information from the case reports. A Python script based on the Beautiful Soup module was developed to execute the web scraping, and as a result we obtained textual data for 500 case reports.
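The gathering step can be sketched with the standard library's html.parser. The authors used the Beautiful Soup module; this stand-in only illustrates the idea of pulling paragraph text out of a fetched case-report page:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element on a page."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

# In the real pipeline the HTML would be fetched from the case-report pages.
page = "<html><body><p>Case report text.</p><p>Discussion.</p></body></html>"
parser = ParagraphExtractor()
parser.feed(page)
```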
Fig. 2 Corpus construction steps: Corpus Collection (case report identification, shortlisting case reports, gathering data) → Entity Annotation (entity type selection, MetaMap input, entity extraction, entity annotation, output generation) → Entity Normalization (MetaMap input, concept mapping, CUI assignment, entity property identification, output generation) → Manual Verification
Fig. 3 PubMed query for selecting case reports
Fig. 4 Instance of entity labelling in corpus
3.2.2 Entity Annotation
Few databases have adopted automated curation techniques; the majority of bio-curation is still carried out manually. Manual annotation is a laborious, time-consuming activity that requires expertise. To reduce the time and workload of specialists and curators, we propose the creation of a cancer-specific corpus with an automatic annotation system that uses the MetaMap tool. The highly flexible MetaMap program locates biomedical entities using the UMLS Metathesaurus. Table 1 shows the entities of interest. The detailed steps for entity annotation are as follows: • Entity type selection We encoded detailed information about four entities of interest: drugs, adverse drug reactions, dysn, and cancer types. • MetaMap input Case reports collected in step 1 are given as textual input to the MetaMap tool. • Entity labelling We labelled the entities of interest using MetaMap. SNOMED CT and RxNorm were the two biomedical terminologies employed to annotate the case reports.
Table 1 Entities of interest in corpus
Entity type | Entity label | Explanation
Cancer type | Neop | Cancer develops when the body's abnormal cells proliferate in an uncontrolled manner and spread to other body regions. In this corpus, 66 cancer types were annotated
Drugs | Drug | Drugs are chemicals or substances that alter how our bodies work. This corpus contains 397 annotated drugs used for cancer treatment
Adverse drug reaction | ADR | An unintended alteration brought on by drug use. A total of 324 ADRs caused by drugs used in the treatment of cancer were annotated
Dysn | Dysn | A disease that is induced by another disease, by the consumption of a drug, or by a medication as a side effect is labelled Dysn in the proposed dataset
3.2.3 Entity Normalization
The process of mapping extracted entities to standard terminologies is known as entity normalization. Entity normalization techniques often use domain-specific dictionaries to resolve synonyms and acronyms; here, SNOMED CT has been utilized to normalize the entities. Entities are likely to be represented in diverse ways and have multiple variants, so normalization becomes a crucial task, especially in the biomedical and life-science fields. The steps involved in the normalization process are listed below. Figure 8 visualizes the result: the first column is the CUI assigned to the extracted entity, followed by the extracted entity, its starting position, the length of the entity, and the normalized term. • MetaMap input Entities annotated in the previous step are provided as input to MetaMap. • Concept mapping Based on lexical lookup, MetaMap normalizes each retrieved entity by identifying all potential candidate concepts for it and assigning each a candidate score; the ideal candidate is determined by MetaMap based on the candidate score. • CUI assignment After normalization of the meaningful entities, a UMLS concept unique identifier was assigned to each entity. • Entity property identification The property of an entity defines its form. Three properties are considered in the proposed corpus: abbreviation or acronym, overlapped entity, and combined entity. An entity that can be classified as both a disease and an adverse effect is labelled with entity type “dysn”. For the treatment of a disease, a patient is typically given a combination of drugs; it is therefore vital to capture these combined entities in order to obtain a more reliable entity recognition dataset. Figures 5, 6, and 7 show entities having these three properties. • Output generation Output was generated in XML and JSON formats. We built a Python script that automatically creates the labelled corpus from the obtained information (Fig. 8).
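The normalization output described above (CUI, extracted entity, starting position, length, normalized term) can be illustrated with a toy lookup. The dictionary and the CUIs below are invented stand-ins for what MetaMap returns from SNOMED CT/UMLS:

```python
# Toy lexicon standing in for MetaMap + SNOMED CT; the CUIs and
# normalized terms below are invented for illustration.
LEXICON = {
    "aml": ("C0000000", "acute myeloid leukemia"),
    "nausea": ("C0000001", "nausea"),
}

def normalize(text):
    """Emit rows shaped like the corpus in Fig. 8:
    (CUI, extracted entity, start position, length, normalized term).
    A naive case-insensitive substring match; MetaMap's lexical lookup
    and candidate scoring are far more sophisticated."""
    rows = []
    lowered = text.lower()
    for surface, (cui, norm) in LEXICON.items():
        start = lowered.find(surface)
        if start != -1:
            rows.append((cui, text[start:start + len(surface)],
                         start, len(surface), norm))
    return sorted(rows, key=lambda r: r[2])  # order by position in text

rows = normalize("Patient with AML reported nausea after treatment.")
```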
Fig. 5 Instance of overlapped entities in corpus
Fig. 6 Instance of combined entity in corpus
Fig. 7 Instance of abbreviation and acronym in corpus
Fig. 8 Instance of entity normalization in corpus
3.2.4 Manual Verification
After the construction of the ADR corpus, the extracted entities were manually verified by an annotator to ensure the efficacy and validity of the corpus.
4 Results This study set out to provide the biomedical community with an annotated corpus specifically created for the automated prediction of adverse drug reactions associated with cancer and its types. The presented corpus will be made freely available soon.
4.1 Contributions This section summarizes the contributions and findings. This work fills the research gaps in data curation identified in the literature. We present a cancer-specific ADR dataset that has been automatically curated to predict adverse drug reactions. The automated development of the corpus reduced the need for manual labour and saved time. Based on case reports, this dataset includes medications, adverse drug reactions, and different forms of cancer. The primary reason for adopting case reports as the corpus is to gather as much information as possible, since they provide real-time treatment details; compared with other sources like patient reviews, static databases, and scientific journals, case reports provide more meaningful information. In normalizing and annotating the corpus entities, many problems were addressed with considerable success: annotation and normalization of overlapped and combined entities, and dealing with abbreviations and acronyms.
4.2 Statistics

Table 2 explicates the number of unique and total mentions of the extracted entities. Figure 9 shows the statistics of the ADR corpus.

Table 2 Entities annotated and normalized in the corpus

Entity type    Unique mentions    Total mentions
Cancer type    66                 261
Drugs          393                3531
ADR            324                2399
Dysn           35                 261

Fig. 9 Chart showing statistics of unique entity instances annotated and normalized
5 Discussion

The primary objective of this work was to explicate the construction of the Cancer-specific ADR dataset, which comprises automatic annotation of cancer types, drugs used to treat cancer, and adverse drug reactions associated with each form of cancer. We were able to cut down on the amount of manual labour and time needed to create the dataset. Following entity annotation, the retrieved entities were normalized with the assistance of the SNOMED CT and MetaMap tools, and each extracted entity received a unique concept identifier. The annotation of the defined entities also addresses several NER difficulties, such as overlapped entities, multiple-word entities, combined entities, and abbreviations and acronyms. The ADR dataset can serve as a valuable corpus facilitating adverse drug reaction prediction and biomedical text mining.
References

1. L. Luo, Z. Yang, P. Yang, Y. Zhang, L. Wang, J. Wang, H. Lin, A neural network approach to chemical and gene/protein entity recognition in patents. J. Cheminform. 10, 65 (2018). https://doi.org/10.1186/s13321-018-0318-3
2. T. Isazawa, J.M. Cole, Single model for organic and inorganic chemical named entity recognition in ChemDataExtractor. J. Chem. Inf. Model. 62(5), 1207–1213 (2022). https://doi.org/10.1021/acs.jcim.1c01199
3. F.W. Mutinda, K. Liew, S. Yada, S. Wakamiya, E. Aramaki, Automatic data extraction to support meta-analysis statistical analysis: a case study on breast cancer. BMC Med. Inform. Decis. Mak. 22(1), 158 (2022). https://doi.org/10.1186/s12911-022-01897-4
4. E. Karatzas, F.A. Baltoumas, I. Kasionis, D. Sanoudou, A.G. Eliopoulos, T. Theodosiou, I. Iliopoulos, G.A. Pavlopoulos, Darling: a web application for detecting disease-related biomedical entity associations with literature mining. Biomolecules 12(4), 520 (2022). https://doi.org/10.3390/biom12040520
5. I. Segura-Bedmar, D. Camino-Perdones, S. Guerrero-Aspizua, Exploring deep learning methods for recognizing rare diseases and their clinical manifestations from texts. BMC Bioinform. 23(1), 1–23 (2022)
6. P. Banerjee, K.K. Pal, M. Devarakonda, C. Baral, Biomedical named entity recognition via knowledge guidance and question answering. ACM Trans. Comput. Healthc. 2(4), 1–24 (2021). https://doi.org/10.1145/3465221
7. A. Delgado, S. Stewart, M. Urroz, A. Rodríguez, A.M. Borobia, I. Akatbach-Bousaid, M. González-Muñoz, E. Ramírez, Characterisation of drug-induced liver injury in patients with COVID-19 detected by a proactive pharmacovigilance program from laboratory signals. J. Clin. Med. 10(19), 4432 (2021). https://doi.org/10.3390/jcm10194432
8. H. Cho, B. Kim, W. Choi, D. Lee, H. Lee, Plant phenotype relationship corpus for biomedical relationships between plants and phenotypes. Sci. Data 9(1), 235 (2022). https://doi.org/10.1038/s41597-022-01350-1
9. T. Almeida, J. Silva, J. Almeida, S. Matos, Chemical identification and indexing in PubMed full-text articles using deep learning and heuristics. Database (2022). https://doi.org/10.1093/database/baac047
10. M. Syed, S. Al-Shukri, S. Syed, K. Sexton, M.L. Greer, M. Zozus, S. Bhattacharyya, F. Prior, DeIDNER corpus: annotation of clinical discharge summary notes for named entity recognition using BRAT tool. Stud. Health Technol. Inform. 281, 432–436 (2021). https://doi.org/10.3233/SHTI210195
11. E. Faessler, L. Modersohn, C. Lohr, U. Hahn, ProGene—a large-scale, high-quality protein-gene annotated benchmark corpus, in Proceedings of the 12th Language Resources and Evaluation Conference (2020), pp. 4585–4596. https://aclanthology.org/2020.lrec-1.564
12. Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn, H. Liu, Clinical information extraction applications: a literature review. J. Biomed. Inform. 77, 34–49 (2018). https://doi.org/10.1016/j.jbi.2017.11.011
13. D. Demner-Fushman, S.E. Shooshan, L. Rodriguez, A.R. Aronson, F. Lang, W. Rogers, K. Roberts, J. Tonning, A dataset of 200 structured product labels annotated for adverse drug reactions. Sci. Data 5(1), 180001 (2018). https://doi.org/10.1038/sdata.2018.1
14. M. Kuhn, I. Letunic, L.J. Jensen, P. Bork, The SIDER database of drugs and side effects. Nucleic Acids Res. 44(D1), D1075–D1079 (2016). https://doi.org/10.1093/nar/gkv1075
15. H. Gurulingappa, A.M. Rajput, A. Roberts, J. Fluck, M. Hofmann-Apitius, L. Toldo, Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. J. Biomed. Inform. 45(5), 885–892 (2012). https://doi.org/10.1016/j.jbi.2012.04.008
16. Ö. Uzuner, B.R. South, S. Shen, S.L. DuVall, 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J. Am. Med. Inform. Assoc. 18(5), 552–556 (2011). https://doi.org/10.1136/amiajnl-2011-000203
17. C.-H. Wei, Y. Peng, R. Leaman, A.P. Davis, C.J. Mattingly, J. Li, T.C. Wiegers, Z. Lu, Overview of the BioCreative V Chemical Disease Relation (CDR) task
A Space Efficient Metadata Structure for Ranking Subset Sums

Biswajit Sanyal, Subhashis Majumder, and Priya Ranjan Sinha Mahapatra
Abstract The top-k variation of the Subset Sum problem can be successfully used to solve some popular problems in recommendation systems as well as in other domains. Given a set of n real numbers, we generate the k best subsets, i.e., those whose element sums are smallest, where k is a positive integer chosen by the user. Our solution methodology is based on constructing a metadata structure G for a given n. The metadata structure G is constructed as a layered directed acyclic graph in which each node keeps an n-bit vector from which a suitable subset can be retrieved. The explicit construction of the whole graph is never needed; only an implicit traversal is carried out in an output-sensitive manner to reduce the total time and space requirement. We then improve the efficiency of our algorithm by reporting each subset incrementally, doing away with the storage of the bit vector in each node. We have implemented our algorithms and compared one of the variations with an existing algorithm, which illustrates the superiority of our algorithm by a constant factor with respect to both time and space complexity.

Keywords One shift · Incremental one shift · DAG · Top-k query · Aggregation function
Pre-published versions of the results appeared in arXiv:2105.11250 [cs.DS], 2021. B. Sanyal (B) Department of Information Technology, Government College of Engineering and Textile Technology, Serampore, Hooghly, West Bengal 712201, India e-mail: [email protected] S. Majumder Department of Computer Science and Engineering, Heritage Institute of Technology, Kolkata, West Bengal 700107, India e-mail: [email protected] P. R. S. Mahapatra Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_21
B. Sanyal et al.
1 Introduction

In recent years, across different application domains, end users have become more interested in the most relevant query answers than in merely recovering a list of all the data items that satisfy a query, i.e., they want a manageable abstract of the query results in a potentially huge answer space. Generating such an abstract often requires applying aggregation functions to the query outcome. One of the most straightforward aggregation functions reports the k best objects among the ones that satisfy the query. However, several applications in recommendation systems require reporting the k best subsets of objects with the lowest (or highest) overall scores, rather than a list of the k best independent objects.

As an example, consider an online shopping site with an inventory of items, each with its own cost. Suppose a buyer needs to buy multiple items from the site but has a budget restriction. Then recommending a list of the k best subsets of items with the lowest overall costs will be helpful to the buyer, who can choose from it the subset of items he wants most. In this paper, we model the above problem as the top-k subset sums problem, which reports the k best subsets of items from an input inventory of items I, where a subset with a lower sum of costs has a higher position in the top-k list.

As another example, consider a customer looking for a vacation package who wants to visit different places on a continent. A continent has many places to visit, each with an associated cost, so a visitor with a budget constraint cannot cover all of them. In that case too, recommending a list of the k best subsets of places with the lowest overall costs will be helpful to the visitor, who can pick from it the subset of places he admires most.
This problem can also be modeled as a top-k subset sums problem that reports k best subsets of places with the lowest overall costs.
1.1 Problem Formulation

Given a finite set R of n real numbers, {r1, r2, ..., rn}, sorted in non-decreasing order, our goal is to generate the k best subsets (top-k Subset Sums) for any input value k, ranked on the basis of the summation function F, such that F(S) = Σ_{r∈S} r for any subset S ⊆ R. Clearly |S| ∈ [1...n]. In our problem, a subset Si is ranked higher than a subset Sj if F(Si) < F(Sj). Furthermore, it is assumed that the rank is unique, so that when F(Si) = F(Sj), ties are broken arbitrarily. Note that if the input set of numbers does not come sorted, an additional O(n log2 n) time can be taken to sort it first. However, since k < n makes the problem trivial, each of the generated subsets being of cardinality one, the n log2 n term is typically not mentioned even if the input set does not come sorted.
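For small n, the ranking defined above can be checked by exhaustive enumeration. The following brute-force sketch (our illustration, not the paper's algorithm) is exponential in n but serves as a correctness oracle:

```python
from itertools import combinations

def topk_subsets_bruteforce(R, k):
    """Enumerate all non-empty subsets of R, rank them by F(S) = sum(S),
    and return the k best (smallest-sum) subsets.  Exponential in n, so
    usable only as a correctness check on small inputs."""
    R = sorted(R)
    subsets = []
    for m in range(1, len(R) + 1):          # every subset size 1..n
        subsets.extend(combinations(R, m))
    subsets.sort(key=sum)                   # ties broken arbitrarily
    return subsets[:k]

# R = {3, 7, 12, 14}: the five best subsets by sum
print(topk_subsets_bruteforce([3, 7, 12, 14], 5))
# → [(3,), (7,), (3, 7), (12,), (14,)]
```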
1.2 Past Work

Top-k query processing has a rich literature in many different domains, including databases [1], combinatorial objects [2], data mining [3], and computational geometry [4, 5]. There are also extensions of top-k queries to other settings, such as ad hoc top-k queries [6] or settings where exact aggregate scores are not needed [7]. The original Subset Sum problem is a well-known NP-complete problem [8]. However, the top-k version that we are dealing with can be solved in polynomial time as long as k = O(n^c), where n is the cardinality of the input set and c is a constant. Typically, two different variations of the top-k Subset Sums problem are found in the literature: some works generate only the subset sums in the correct order, whereas others also report the respective subsets along with their sums. Clearly, the latter variation needs somewhat more resources in terms of time and/or space. Sanyal et al. [9] developed algorithms for reporting all the top-k subsets (top-k combinations), where the subsets are of a fixed size r. Their proposed algorithm runs in O(rk + k log2 k) time and some of its variants run in O(r + k log2 k) time. In the last few years, many programmers as well as researchers have been attracted to the problem of finding the sum of a particular subset whose rank is k, basically a variation of the top-k Subset Sums problem. Different solutions were proposed for reporting the top-k subset sums; the most promising among them appears to be an O(k log2 k) algorithm proposed by Eppstein [10]. However, if it has to report the k-th subset, or for that matter all k subsets, some extra pointers need to be stored in each node of the graph they use, and some additional computation is needed as well. Very recently, in the database domain, Deep et al. [11] worked on a similar problem.
Here, ranked enumeration of Conjunctive Query (CQ) results was used to enumerate the tuples of Q(D) according to the order specified by a rank function, where Q(D) denotes the result of the query Q over an input database D. The problem considered in this manuscript can also be solved using their approach by treating it as a full Union of Conjunctive Queries (UCQ) φ = φ1 ∪ ... ∪ φn, where n is the size of the input set R, i.e., |R|. The solution requires a pre-processing time of O(n^(subw+1) log2 n) and a delay of O(n log2 n), where subw is the submodular width [12] of all decompositions across all CQs φi.
1.3 Our Contribution

In this manuscript, we first propose an efficient output-sensitive algorithm to report the top-k subset sums along with their subsets, where the size of a subset lies anywhere between 1 and n, with an overall execution time of O(nk + k log2 k), and we then improve it to O(k log2 k). We have implemented both algorithms and report their runtimes on randomly generated test cases. Later, another version of the algorithm is also considered that reports only the top-k subset sums, without the subsets.
We prove that this algorithm also runs in O(k log2 k) time. Finally, on a large number of problem instances, with inputs varying from small values of n and k to very large ones, we compare our approach with a prior solution [10] and show that our approach consistently performs better in terms of time and peak memory used. This means that although the asymptotic time complexities are the same for both, the constant factor in our algorithm is smaller than that of the earlier work.
2 Outline of Our Technique

The paper is organized as follows. In Sect. 3, the full metadata structure G is first introduced and it is shown how it can be constructed from n local metadata structures G1 to Gn. It is further shown how G can be used in conjunction with a min-heap structure H to obtain the desired top-k subsets. In this section, we also highlight the problem of duplicate entries in the heap H and show how we can remove this problem by modifying the construction of G. Ultimately, a modified G is constructed that can report the desired top-k subsets efficiently. Section 3 concludes by showing that, to answer a query, the required portions of G can be generated on demand, so that creating G in its totality is never needed as part of pre-processing. Two different variations of the algorithm are presented, the latter being an improvement over the former both in terms of time and space requirement. In Sect. 4, the results of our implementation are presented and it is shown how the required run-time varies with different values of n and k for both algorithms. In the later part of Sect. 4, our first algorithm is slightly modified to report only the top-k subset sums, and we compare our solution with an existing solution [10]. Both methods are implemented and run on exactly the same inputs, and it is shown that our algorithm consistently performs better than the existing one. Finally, in Sect. 5, the article is concluded and some open problems are mentioned.
3 Generation of Top-k Subsets

In this section, we first consider the following: given any input set R of n real numbers and a positive integer k, we construct a novel metadata structure G on demand to report the top-k subsets efficiently. It is assumed that the numbers of the input set R are kept in a list R′ = (r′1, r′2, ..., r′n), sorted in non-decreasing order, and that P = {1, 2, ..., n} is the set of positions of the numbers in the list. A subset S ⊆ R is now viewed as a sorted list of |S| distinct positions chosen from P.
3.1 The Metadata Structure G

Our solution methodology is based on constructing a metadata structure G for a given n. The metadata structure G for our present scenario is constructed from n local metadata structures G1 to Gn. The local metadata structure Gi for any fixed value i can be used to generate the top-k subsets of size i ∈ [1...n], |R| = n, ranked on the basis of the summation function F [9]. In this current work, G = (V, E) is constructed as a layered Directed Acyclic Graph (DAG) in a fashion similar to that of Sanyal et al. [9]. Each node v ∈ V contains the information of the |S| positions of a subset S ⊆ R. In the DAG G, for each node, the |S| positions are stored as a bit vector B[1...n]. Note that the bit vector B has in total |S| 1s and n − |S| 0s. The bit value B[i] = 1 represents that r′i of R′ is included in the subset S, whereas B[i] = 0 means r′i is not in S. Consider the bit vector 110100 for a subset S: it says that the 1st, 2nd, and 4th elements of the list R′ are included in the subset S, where the total number of numbers in R is six. The directed edges between the nodes of a local metadata structure Gi (i ∈ [1...n], |R| = n) are drawn using the concept of "Mandatory Static One Shift", which is similar to the concept of "Mandatory One Shift" of Sanyal et al. [9]. However, the construction of the full metadata structure G from the n local metadata structures G1 to Gn using the concept of "Incremental One Shift" is novel and is different from their work. 1. Mandatory Static One Shift: Definition 1 (One Shift) Let PS = (p1, p2, ..., p|S|) denote the list of sorted positions of the numbers in S ⊆ R and let PS′ = (p′1, p′2, ..., p′|S′|) denote the list of sorted positions of the numbers in another subset S′ ⊆ R, where |S′| = |S|. Now, if for some j, p′j = pj + 1 and p′i = pi for i ≠ j, then S′ is said to be a One Shift of S.
Definition 2 (Mandatory Static One Shift) S′ is said to be a Mandatory Static One Shift of S if (i) S′ is a one shift of S, and (ii) among all subsets of which S′ is a one shift, S is the one whose list of positions is lexicographically the smallest (equivalently, whose n-bit string representation is lexicographically the largest). Now, let us consider an example of a mandatory static one shift. Let R = {3, 7, 12, 14, 25, 45} and S′ = {3, 12, 25}. Here S is either {3, 7, 25} or {3, 12, 14}, for each of which S′ is a one shift of S. So in this case, PS′ = (1, 3, 5) and PS is either (1, 2, 5) or (1, 3, 4), among which PS = (1, 2, 5) is lexicographically the smallest. It thus follows from the definition that S′ = {3, 12, 25} is a mandatory static one shift of S = {3, 7, 25}. 2. Incremental One Shift: The concept of "Incremental One Shift" is somewhat different: here the subset S′ is a one shift of the subset S with |S′| = |S| + 1, i.e., S′ contains one more element than S. Hence, the bit vector representation of the node v(S′) has one more 1 than that of v(S).
Definition 3 (Incremental One Shift) Let PS = (p1, p2, ..., p|S|) denote the list of sorted positions of the numbers in a subset S ⊆ R and let PS′ = (p′1, p′2, ..., p′|S′|) denote the list of sorted positions of the numbers in another subset S′ ⊆ R, where |S′| = |S| + 1. Now, if ∀i, 1 ≤ i ≤ |S|, the position index pi is also present in PS′, but for some j, 1 ≤ j ≤ |S′|, the position index p′j is not in PS, then we say that S′ is an Incremental One Shift of S. Now, let us consider an example. Let R = {3, 7, 12, 14, 25, 45}, S = {3, 12, 45}, and S′ = {3, 12, 25, 45}. In this case, PS = (1, 3, 6) and PS′ = (1, 3, 5, 6). Here, S′ is an Incremental One Shift of S, since ∀i, 1 ≤ i ≤ 3, the position index pi is also present in PS′, but the position index p′3 = 5 is not present in PS.
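The two kinds of shift can be expressed directly on position lists. The helper functions below (our illustration, using 1-based positions as in the text) reproduce the examples above:

```python
def is_one_shift(ps, ps2):
    """True iff ps2 (positions of S') is a One Shift of ps (positions of S):
    same length, with exactly one position moved forward by 1."""
    if len(ps) != len(ps2):
        return False
    diff = [(p, q) for p, q in zip(ps, ps2) if p != q]
    return len(diff) == 1 and diff[0][1] == diff[0][0] + 1

def one_shift_parents(ps2):
    """All position lists of which ps2 is a one shift."""
    s = set(ps2)
    return [ps2[:j] + [p - 1] + ps2[j + 1:]
            for j, p in enumerate(ps2) if p - 1 >= 1 and (p - 1) not in s]

def mandatory_static_parent(ps2):
    """The unique S of which S' is a *Mandatory* Static One Shift
    (Definition 2): the lexicographically smallest one-shift parent."""
    return min(one_shift_parents(ps2))

def is_incremental_one_shift(ps, ps2):
    """True iff S' keeps every position of S and adds exactly one new one."""
    return len(ps2) == len(ps) + 1 and set(ps) < set(ps2)

# Examples from the text, with R = {3, 7, 12, 14, 25, 45}:
print(mandatory_static_parent([1, 3, 5]))                 # → [1, 2, 5]
print(is_incremental_one_shift([1, 3, 6], [1, 3, 5, 6]))  # → True
```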
3.2 Construction of Metadata Structure G

In this subsection, for a given n, the detailed construction methodology of the metadata structure G is described, which requires the following steps to be performed: 1. Construction of local metadata structures G1 to Gn with mandatory static one shift: Note that if S is a non-empty subset of R, then |S| ∈ {1..n}. Using the concept of "Mandatory Static One Shift", we first construct a local metadata structure for each possible size of the subset S and name this local metadata structure G|S|. So G|S| is basically a directed acyclic graph where each subset of size |S| is present exactly once in some node of the graph and, in its bit vector representation, the number of 1s is also |S|. Let the node corresponding to any subset S be v(S). Then there will be a directed edge from node v(S) to node v(S′) iff the subset S′ is a mandatory static one shift of the subset S. Clearly, we will have n such local metadata structures G1 to Gn. The metadata structure Gi for any fixed value i can be used to efficiently generate the top-k subsets of size i, ranked on the basis of the summation function F [9]. Figure 1 shows the local metadata structure G3 for the case n = 6 with mandatory static one shift. Here, the node v = 101010 can be obtained by a static one shift from both of the nodes 110010 and 101100. However, v is the mandatory static one shift of only the node containing 110010, and not of 101100. 2. Construction of the complete metadata structure G with incremental one shift: In order to define the complete metadata structure G for our present scenario, for each subset size i ∈ [1...n − 1], |R| = n, two consecutive local metadata structures Gi and Gi+1 are connected using the concept of "Incremental One Shift". Here, a directed edge goes from a node v(S) ∈ V(G|S|) to a node v(S′) ∈ V(G|S|+1). Note that the bit vector representation of the node v(S′) has one extra '1' compared to v(S).
Figure 2a shows the metadata structure G for the case n = 4. Here, a directed edge goes from the node v = 0100 of G1 to the nodes v1 = 1100, v2 = 0110, and v3 = 0101 of G2, where v1, v2, and v3 are all "Incremental One Shifts" of v.
Fig. 1 G3 for the case n = 6 with mandatory static one shift
The definition of incremental one shift leads to the following two observations. Observation 1 Let (p1, p2, ..., p|S|) denote the list of sorted positions of the |S| numbers in a subset S ⊆ R and further let (p′1, p′2, ..., p′|S′|) denote the list of sorted positions of the |S′| numbers in a subset S′ ⊆ R, where |S′| = |S| + 1. Then, S′ is an incremental one shift of S if and only if for some j, 1 ≤ j ≤ |S′|, the position index p′j is not in (p1, p2, ..., p|S|) and Σi p′i − Σi pi = p′j. Observation 2 Each node v(S), S ⊆ R, of the metadata structure G has n − |S| incremental one shift children.
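Observation 2 is easy to verify on bit vectors: an incremental one shift child is obtained by flipping any single 0 to 1, so a node with |S| ones has exactly n − |S| children. A small sketch (our illustration):

```python
def incremental_children(bits):
    """All incremental one shift children of the node with bit vector
    `bits` (a string such as '0100'): flip each 0 to 1, one at a time."""
    return [bits[:i] + '1' + bits[i + 1:]
            for i, b in enumerate(bits) if b == '0']

# Fig. 2a example (n = 4): v = 0100 has the n - |S| = 3 children
# 1100, 0110 and 0101, as stated in Observation 2.
print(incremental_children('0100'))  # → ['1100', '0110', '0101']
```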
3.3 Query Answering with Heap

This subsection first describes the detailed steps of reporting the top-k subsets using the metadata structure G and a min-heap H. The later part of the subsection highlights the node duplication problem we face in reporting the outputs. The subsection concludes with a solution to the node duplication problem, which is
reflected in the pseudo-code of Algorithm 1, which is somewhat similar in principle to the query algorithm, working along with a pre-processing step, presented in our earlier work [9]. However, the on-demand version (Algorithm 2) presented later in this work is novel and totally different from that work. 1. Reporting top-k subsets using the metadata structure G and a min-heap H: Note that in G, two subsets Si and Sj are comparable if there is a directed path between the two corresponding nodes v(Si) and v(Sj). However, if there is no path in G between the nodes v(Si) and v(Sj), then the values of the summation function F(Si) and F(Sj) must be calculated explicitly to find out which one ranks higher in the output list. To facilitate this process, a min-heap H is maintained to store the candidate subsets S according to their key values F(S). Initially, the root node T of the metadata structure G is inserted into the min-heap H with key value F(T). Then, to report the desired top-k subsets, at each step the minimum element Z of H is extracted and reported as an answer, and each of its children X in G is inserted into H with key value F(X). Obviously, the above steps have to be performed k − 1 times until all the top-k subsets are reported. 2. The node duplication problem: The problem we face here is the high out-degree of each node v in G. Each node can have at most two static one shift edges [9] but a high number of incremental one shift edges. As a consequence, many nodes have multiple parents in G. Note that, for any node v(S) of G with multiple parents, we need to insert the subset S, or rather the node v(S), into the heap H right after reporting the subset stored in any one of its parent nodes, i.e., when we extract any of its parents, say u, from H for the first time.
On the other hand, v(S) can be extracted from H only after all the subsets stored in its parent nodes have been reported as part of the desired result, i.e., all their corresponding nodes have been extracted from H. So during the entire lifetime of v in H, whenever the subset stored in some other parent of v is reported, either the subset S needs to be inserted again into H, or a check must be performed to see whether a node corresponding to S is already in H. The former strategy leads to duplication in the heap H, and the latter to too much overhead, as we then have to check for prior existence in H for each and every child of any node extracted from H. Either way, the time complexity rises. 3. The solution of the node duplication problem: To avoid the problem of duplication, whenever a node is inserted into H, we also store the label of that node in a skip list or a height-balanced binary search tree (AVL tree) T, and we only insert v into H if v is not already present in T. The above step has to be performed exactly k − 1 times till all top-k subsets are generated as output. The summary of this discussion is reflected in the pseudo-code of Algorithm 1. Actually, a min-max-heap can be used instead of a min-heap, so as to limit the number of candidates in H to at most k. Alternatively, a max-heap M can be used along with H to achieve the same feat in the following way. Whenever an element is inserted into H, it is also inserted into M, and the invariant is maintained that the size of M is always less than or equal to k. If it tries to cross k, the maximum element is removed from both M and H, since such an element can
Algorithm 1 Top-k_Subsets_With_Metadata_Structure(R[1...n], G, k)
    Create an empty min-heap H;
    Create an empty binary search tree T;
    Sort the n real numbers of R in non-decreasing order;
    Root ← the root node of G (root node of G1);
    Insert Root into H with key value F(Root);
    Insert Root into T;
    for q ← 1 to k do
        Z ← extract-min(H);
        Output Z as the q-th best subset;
        Delete Z from T;
        for each child X of Z in G do
            if X is not found in T then
                Insert X into H with key value F(X);  ▷ F is the summation function, F(X) = Σ_{r∈X} r, for any subset X ⊆ R
                Insert X into T;
            end if
        end for
    end for
never come in the list of top-k elements, being the maximum among k elements. Note that Algorithm 1 can be made even more output-sensitive by dynamically limiting the number of elements in the two heaps to k − y, where y is the number of subsets already reported. This is also reflected in the pseudo-code of Algorithm 2 later. This leads to the following lemma. For brevity, we have omitted the proofs of lemmas in the current manuscript. Lemma 1 The extra working space of Algorithm 1, in addition to that for maintaining G, is O(kn). Lemma 2 Apart from the time to sort R, Algorithm 1 runs in O(nk log2 k) time.
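The query procedure can be sketched in Python. This is a simplified rendering, not the authors' implementation: `heapq` plays the role of H, a plain set plays the role of the tree T, and, for brevity, children are generated as all one shifts plus all incremental one shifts rather than only the mandatory ones, so correctness relies on the duplicate check and on the inputs being positive (every edge then goes from a smaller to a larger sum).

```python
import heapq

def topk_subsets(R, k):
    """Best-first search over the shift DAG: report the top-k subsets of R
    (positive reals) in non-decreasing order of their sums.  Nodes are
    frozensets of 0-based positions into the sorted list R."""
    R = sorted(R)
    n = len(R)
    F = lambda S: sum(R[p] for p in S)
    root = frozenset([0])                   # the subset {r1}, root of G1
    heap = [(F(root), sorted(root), root)]  # plays the role of H
    seen = {root}                           # plays the role of the tree T
    out = []
    while heap and len(out) < k:
        _, _, S = heapq.heappop(heap)
        out.append(sorted(R[p] for p in S))
        children = []
        for p in S:                         # one shifts: move p to p + 1
            if p + 1 < n and p + 1 not in S:
                children.append(S - {p} | {p + 1})
        for p in range(n):                  # incremental one shifts: add p
            if p not in S:
                children.append(S | {p})
        for C in children:
            if C not in seen:               # duplicate check
                seen.add(C)
                heapq.heappush(heap, (F(C), sorted(C), C))
    return out

print(topk_subsets([3, 7, 12, 14], 5))  # → [[3], [7], [3, 7], [12], [14]]
```

The paper's mandatory/modified shifts prune this dense edge set so that far fewer duplicate checks are needed.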
3.4 Modified Metadata Structure G

In order to reduce the overall time complexity, a natural choice is to reduce the number of incremental one shift edges between two consecutive local metadata structures Gi and Gi+1 (i ∈ [1...n − 1]), so that the duplication problem in the heap automatically gets reduced. In this subsection we define two variations of the incremental one shift, "Mandatory Incremental One Shift" and "Modified Mandatory Incremental One Shift", and introduce two modified versions of the metadata structure G. 1. The metadata structure G with Mandatory Incremental One Shift: Definition 4 (Mandatory Incremental One Shift) S′ is said to be a Mandatory Incremental One Shift of S if (i) S′ is an incremental one shift of S, and (ii) among all the
subsets of R for which S′ is an incremental one shift, S is the one whose position sequence is lexicographically the largest (equivalently, whose n-bit string representation is lexicographically the smallest). The metadata structure G constructed after applying mandatory incremental one shifts exhibits several interesting properties, which we skip in the current manuscript for brevity. Figure 2b gives an example of the metadata structure G for the case n = 4. Note that v = 1101 is a mandatory incremental one shift of 0101, but v is not a mandatory incremental one shift of 1100 or 1001. The definition of mandatory incremental one shift directly leads to the following two observations. Observation 3 Let (p1, p2, ..., p|S|) denote the list of sorted positions of the numbers in a subset S ⊆ R and let (p′1, p′2, ..., p′|S′|) denote the list of sorted positions of the numbers in another subset S′ ⊆ R, where |S′| = |S| + 1. Then, S′ is a mandatory incremental one shift of S if and only if for some j, 1 ≤ j ≤ |S′|, (i) the position index p′j is not in (p1, p2, ..., p|S|), (ii) p′j < p1, and (iii) Σi p′i − Σi pi = p′j. Observation 4 If (p1, p2, ..., p|S|) denotes the list of sorted positions of the numbers in a subset S ⊆ R, then the node v(S) of the metadata structure G has exactly (p1 − 1) mandatory incremental one shift children. The second property of Observation 3 can easily be verified from Fig. 2b. The nodes containing the subsets 1010 and 0110 are the children of the node containing the subset 0010, which means the subsets 1010 and 0110 can be obtained from the subset 0010 by a mandatory incremental one shift. This in fact leads us to the next lemma. The rationale behind refining the definition of a shift in steps is to make the graph G more sparse without disturbing the inherent topological ordering, since the complexity of the algorithm directly depends on the number of edges that G contains.
So a last enhancement is further made to the graph G by defining another type of shift below. 2. The metadata structure G with Modified Mandatory Incremental One Shift: Note that many nodes in the DAG G (version I) still have high out-degrees due to multiple mandatory incremental one shift edges. In particular, the bottommost node in each Gi (except Gn) has n − i mandatory incremental one shift edges. We can remove most of these incremental edges, decreasing the number of edges in the DAG and hence its complexity, by keeping at most one outgoing incremental edge from each node and redefining mandatory incremental one shift as follows. Definition 5 (Modified Mandatory Incremental One Shift) S′ is said to be a Modified Mandatory Incremental One Shift of S if (i) S′ is a mandatory incremental one shift
A Space Efficient Metadata Structure for Ranking Subset Sums
267
of S, and (ii) among all those subsets of R which are mandatory incremental one shifts of S, S′ is the one whose position sequence is lexicographically the smallest (equivalently, the n-bit string representation is lexicographically the largest). The metadata structure G we have after applying the modified mandatory incremental one shift has several interesting properties. For brevity, we have skipped the properties in the current manuscript. Note that in Fig. 2c, for the modified DAG G for n = 4, the subsets 1001, 0101, and 0011 are all mandatory incremental one shifts of 0001, but only 1001, according to the definition, is the modified mandatory incremental one shift of 0001. The definition of modified mandatory incremental one shift easily leads to the following observation.

Observation 5 If (p1, p2, ..., p|S|) denotes the list of sorted positions of the |S| numbers in a subset S ⊆ R, then the node v(S) of the metadata structure G has exactly one modified mandatory incremental one shift child, provided p1 > 1.

Clearly, after the introduction of the modified mandatory incremental one shift, any node v ∈ V(G) now has at most one outgoing incremental edge. However, having multiple outgoing edges is not the only issue that affects the run-time complexity; having multiple incoming edges does so as well. For example, consider Fig. 2c, where 1001 has two parents: 0001 (by modified mandatory incremental one shift) and 1010 (by static one shift). Without loss of generality, if the parent 0001 is reported first as part of the desired top-k subsets, then its child node 1001 will be inserted into the heap, and it will stay in the heap until its other parent 1010 is reported. This happens because parents are always ranked higher than their children in the DAG G. But when the parent 1010 is reported, its child node 1001 is again supposed to be added to the heap following the heap-based reporting logic.
So if added without checking, this would cause multiple insertions of the same node and thereby an unnecessary increase in run-time. The other option is to check for node duplication before inserting the child node, which would also increase the run-time complexity. However, the problem of node duplication can clearly be removed altogether if every node in G has only one parent.
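Since every admissible new position in a mandatory incremental one shift must satisfy p′j < p1, the lexicographically smallest position sequence required by Definition 5 is obtained by adding position 1; on bit strings, the modified shift therefore simply sets the leftmost bit. A small illustrative sketch (hypothetical helper name, not the paper's implementation):

```python
def modified_mand_incr_one_shift(bits):
    """Return the unique modified mandatory incremental one shift child of
    the subset encoded by the n-bit string `bits`, or None when p1 = 1
    (leftmost bit already set, so no position lies to the left of p1)."""
    if bits[0] == '1':
        return None
    return '1' + bits[1:]   # add the element at position 1

# As in Fig. 2c: among 1001, 0101, 0011, only 1001 is the modified shift of 0001.
print(modified_mand_incr_one_shift('0001'))   # '1001'
```

This matches Observation 5: exactly one such child exists whenever p1 > 1, and none otherwise.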
3.5 The Final Structure of G

In order to improve the time complexity, we need to remove the node duplication problem altogether, i.e., we want each node v of G to have only one parent. However, it must also be ensured that each valid subset can be reached from the root of G (which is also the root of G1), and that the subsets corresponding to all the children of a node v can be easily deduced from the subset corresponding to v. In the current design of G, the root of G1 has no parent. The other nodes of G1 have one parent. The roots of the other structures G2 to Gn have one parent (by modified
268

B. Sanyal et al.

Fig. 2 Different versions of the model metadata structure G for n = 4: (a) with incremental one shift; (b) with mandatory incremental one shift; (c) with modified mandatory incremental one shift; (d) with one parent for each node. Edge types shown: mandatory static one shift edges and (mandatory/modified mandatory) incremental one shift edges
mandatory incremental one shift), and all other nodes have at least one parent and at most two parents (one by static one shift and the other possibly by modified mandatory incremental one shift). The desired goal of having every node (other than the root of G) with only one parent can be achieved simply by omitting all incremental edges from G except the ones that lead to the roots of Gi for each subset size i ∈ {2, ..., n}, |R| = n. Deleting these edges reduces the graph in Fig. 2c to that of Fig. 2d, which gives us the final DAG G for n = 4. The metadata structure G we have after applying this last set of modifications exhibits several interesting properties. For brevity, we have skipped the properties in the current manuscript.

Lemma 3 Every node of this final DAG G has at most two children.

For brevity, we have omitted the proof of the lemma in this current manuscript.
3.6 Generation of Top-k Subsets by G on Demand

Note that instead of creating this final metadata structure G as a part of preprocessing, only its required portions can be constructed on demand; more importantly, the edges of this DAG G do not even need to be explicitly connected. The portion of the graph that is explored simply remains implicit among the nodes that are inserted into H. Here, we create and evaluate the subset corresponding to a node only when its parent node u in the implicit DAG gets extracted. This saves considerable storage space as well as improving the run-time complexity. See Algorithm 2 for detailed pseudo-code showing how the whole algorithm can be implemented. Lemma 4 in turn establishes that the algorithm really performs as intended. Note that, apart from the root of the metadata structure, each node can be created in O(1) incremental time. Since we create the subset corresponding to node v only when its parent u in the implicit DAG is extracted, and there is only a difference of O(1) bits between u and v, the creation as well as the evaluation of the subset for v can be done in O(1) incremental time if the subset of u is known.

Lemma 4 Given any bit-pattern of a subset and its corresponding value, the bit-pattern of the subset and the corresponding value for any of its children can be generated in O(1) incremental time. Also, the types of edges that come out of its corresponding node can be determined in O(1) time.

For brevity, we have omitted the proof of the lemma in this current manuscript. Here also, a min-max-heap or a max-heap can be used in addition to the min-heap, so as to limit the number of candidates in H to at most k. This gives the following two lemmas. For brevity, we have omitted the proofs of the lemmas in this current manuscript.

Lemma 5 The working space required for Algorithm 2 is O(kn).

Lemma 6 Apart from the time required to sort R, Algorithm 2 runs in O(n + k log2 k) time. If it is required to report the subsets, it takes O(nk + k log2 k) time.
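For concreteness, the O(1) incremental evaluation mentioned above works because a child differs from its parent by O(1) bit flips; for example, when a one shift moves a chosen element from position p to position p + 1, the child's sum follows from the parent's with one subtraction and one addition. A tiny illustrative helper (1-based positions, not the paper's implementation):

```python
def shifted_sum(parent_sum, R, p):
    """Sum of the child subset obtained by shifting the chosen element at
    position p to position p + 1 (R sorted, positions 1-based)."""
    return parent_sum - R[p - 1] + R[p]

R = [1, 2, 3, 5]               # sorted input
# parent subset {positions 1, 3} holds elements 1 and 3, sum 4;
# shifting position 1 to 2 yields {positions 2, 3}: elements 2 and 3.
print(shifted_sum(4, R, 1))    # 4 - 1 + 2 = 5
```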
3.7 Getting Rid of the Bit String

Note that Lemma 4 already establishes that the bit string of a child node differs from that of its parent in only a constant number of places, and that the subset sum of any child node can be generated from that of its parent node in O(1) time, since it requires only a constant number of additions and subtractions with the sum stored in the parent node. In fact, with a slight modification, Algorithm 2 can be run even without the explicit storage of the bit-pattern at each node. Here it is required to maintain another pointer value that points to the position of the first 1 after the position pointed to by the pointer p1. It can then be noted that the types of edges that come out from any node in G can be determined in O(1) time. Once the indices of the elements present in the subset corresponding to the parent node are known, it is possible to generate the array indices of the elements of each child subset, and from them the actual element(s) of each subset can be found using O(1) incremental time per node. In the following lemma it is shown
Algorithm 2 Top-k_Subsets_On_Demand(R[1 . . . n], k)
Structure of a node of the implicit DAG: bit-pattern B[1 . . . n], subsetSize: ls, aggregationValue: F, tuple of three array indices (p1, p2, p3)
1: Create an empty min-heap H and an empty max-heap M;
2: Sort the n real numbers of R in non-decreasing order;
▷ Create the root node Root of G
3: Root.B[1] ← 1;
4: for i ← 2 to |R| do
5:   Root.B[i] ← 0;
6: end for
7: Root.(p1, p2, p3) ← (0, 1, 1);
8: Root.ls ← 1;
9: Root.F ← R[1];
10: insert Root in H as well as M;
11: count ← 1;
12: for q ← 1 to k do
13:   currentNode ← extractMin(H);
14:   count ← count − 1;
15:   Output the subset for bit-pattern currentNode.B[1 . . . n] and also its value currentNode.F as the q-th best subset;
▷ Get one mandatory static one shift child (if any)
16:   if (childSType1 ← genManStaticType1(currentNode)) then
17:     insert childSType1 in H as well as M;
18:     count ← count + 1;
19:   end if
▷ Get the other mandatory static one shift child (if any)
20:   if (childSType2 ← genManStaticType2(currentNode)) then
21:     insert childSType2 in H as well as M;
22:     count ← count + 1;
23:   end if
▷ Get the modified mandatory incremental one shift child (if any)
24:   if (childMI ← genManIncremental(currentNode)) then
25:     insert childMI in H as well as M;
26:     count ← count + 1;
27:   end if
▷ Remove the maximum element from both heaps if required, as we need to output only k subsets; removing more than one node at a time is never required since any parent node has at most 2 children
28:   if count > k − q then
29:     extraNode ← extractMax(M);
30:     remove extraNode also from H;
31:     count ← count − 1;
32:   end if
33: end for
1: function genManStaticType1(node ParentNode)
2:   node childNode ← NULL;
3:   if ParentNode.p1 > 1 and ParentNode.p1 < n then
4:     if ParentNode.B[ParentNode.p1 + 1] = 0 then
5:       childNode ← ParentNode; ▷ copy ParentNode into childNode
6:       childNode.B[childNode.p1] ← 0;
7:       childNode.B[childNode.p1 + 1] ← 1;
8:       childNode.p1 ← childNode.p1 + 1;
9:       if childNode.p3 = childNode.p1 − 1 then
10:        childNode.p3 ← childNode.p3 + 1;
11:      end if
12:      childNode.F ← childNode.F − R[childNode.p1 − 1];
13:      childNode.F ← childNode.F + R[childNode.p1];
14:    end if
15:  end if
16:  return childNode;
17: end function

1: function genManStaticType2(node ParentNode)
2:   node childNode ← NULL;
3:   if ParentNode.p2 > 0 and ParentNode.p2 < n then
4:     childNode ← ParentNode; ▷ copy ParentNode into childNode
5:     Swap(childNode.B[childNode.p2], childNode.B[childNode.p2 + 1]); ▷ p2 now points to the leftmost 0
6:     childNode.p1 ← childNode.p2 + 1;
7:     if childNode.p3 = childNode.p2 then
8:       childNode.p3 ← childNode.p2 + 1;
9:     end if
10:    childNode.p2 ← childNode.p2 − 1;
11:    childNode.F ← childNode.F − R[childNode.p2 + 1];
12:    childNode.F ← childNode.F + R[childNode.p2 + 2];
13:  end if
14:  return childNode;
15: end function

1: function genManIncremental(node ParentNode)
2:   node childNode ← NULL;
3:   if ParentNode.p1 = 2 and ParentNode.p2 = 0 and ParentNode.p3 = ParentNode.ls + 1 then
4:     childNode ← ParentNode; ▷ copy ParentNode into childNode
5:     childNode.B[1] ← 1;
6:     childNode.p1 ← 0;
7:     childNode.p2 ← childNode.p3;
8:     childNode.ls ← childNode.ls + 1;
9:     childNode.F ← childNode.F + R[1];
10:  end if
11:  return childNode;
12: end function
that the running time improves to O(k log2 k) and hence becomes independent of n. This also shows up in the next section in the results of our implementation, where we see that the run-time falls drastically. Recall that the subset sum of any child node can also be generated from that of its parent node in O(1) time.

Lemma 7 The modified version of Algorithm 2 that works without storing the bit string runs in O(k log2 k) time.

For brevity, we have omitted the proof of the lemma in this current manuscript.
4 Experimental Results

The two algorithms, one storing the bit vector in each node explicitly (Algorithm 2) and the other being the modified version of the same algorithm in which the bit string is not stored, were implemented in C and were tested with varying values of n and k. For each specific choice of (n, k), five datasets were generated randomly, and after running each of the algorithms it was found that the run-times and other reported parameters hardly varied from one dataset to another for that specific pair of n and k. We made sure that whenever two algorithms were compared, they were run on exactly the same dataset. The experiments were conducted on a desktop powered by an Intel Xeon 2.4 GHz quad-core CPU with 32 GB RAM; the operating system loaded on the machine was Fedora Linux (version 3.3.4). The latter version clearly outperforms the former one, and the speed-up is sometimes as high as 20X for higher values of k. The modified algorithm can even run in less than a minute for k as high as 10^7. The former algorithm, however, became very slow at those values of k, and we did not wait until those runs completed. The results are presented in Table 1.

Finally, the results of our algorithm that reports only the sums of the subsets according to their ranks are compared with an existing algorithm [10] that achieves the same feat. Both algorithms were implemented in C and were run on the same machine mentioned above and on exactly the same set of randomly generated integers. The running times, the total number of entries into the heap, and the peak number of entries in the heap attained during the whole run were compared for the two algorithms, and it was found that the gains are much larger for the two latter parameters; the difference in run-time is somewhat offset by the constant-time updating of some integer values stored in each node, which is a little more in our case. The results are presented in Table 2.
It can be seen that our method consistently performs better than the existing method. For brevity, we omitted the run-times from Table 1 for k = 10,000 and k = 100,000, retaining only the speed-ups.
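For reference, the baseline [10] against which we compare can be sketched in a few lines: each extracted subset spawns exactly two children (extend by the next element, or shift the largest chosen element to the next one), which is why its heap always receives 2k + 1 entries in total. A minimal Python sketch of that scheme, assuming non-negative inputs (our C implementations and test data differ):

```python
import heapq

def top_k_subset_sums(R, k):
    """Smallest k non-empty subset sums of the non-negative numbers in R,
    in non-decreasing order, via the two-child heap scheme of [10]."""
    R = sorted(R)
    heap = [(R[0], 0)]                 # (subset sum, largest index chosen)
    out = []
    while heap and len(out) < k:
        s, i = heapq.heappop(heap)
        out.append(s)
        if i + 1 < len(R):
            heapq.heappush(heap, (s + R[i + 1], i + 1))         # extend child
            heapq.heappush(heap, (s - R[i] + R[i + 1], i + 1))  # shift child
    return out

print(top_k_subset_sums([3, 1, 4, 2], 5))   # [1, 2, 3, 3, 4]
```

Every non-empty subset has a unique parent under these two rules, so the enumeration is complete and duplicate-free; our DAG G improves on this by making most nodes spawn only a single child.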
Table 1 Speed-up of the non-bit-vector variant for different n and k (B = time in seconds with bit vector, V = time in seconds without bit vector, S = B/V)

| n | B (k=10^3) | V (k=10^3) | S (k=10^3) | S (k=10^4) | S (k=10^5) | B (k=10^6) | V (k=10^6) | S (k=10^6) | V (k=10^7) |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.0069 | 0.0020 | 3.45X | 2.76X | 2.20X | 4.4800 | 2.2010 | 2.04X | 25.8228 |
| 200 | 0.0105 | 0.0026 | 4.04X | 2.42X | 2.93X | 6.3452 | 2.1755 | 2.92X | 24.9263 |
| 300 | 0.0142 | 0.0024 | 5.92X | 6.12X | 3.89X | 8.2391 | 2.1625 | 3.81X | 24.4716 |
| 400 | 0.0173 | 0.0021 | 8.24X | 6.34X | 4.78X | 10.0176 | 2.1052 | 4.76X | 23.7235 |
| 500 | 0.0193 | 0.0019 | 10.16X | 8.74X | 5.95X | 11.7948 | 2.1720 | 5.43X | 24.0756 |
| 600 | 0.0229 | 0.0023 | 9.96X | 8.39X | 6.53X | 14.2409 | 2.2006 | 6.47X | 33.5502 |
| 700 | 0.0246 | 0.0022 | 11.18X | 6.41X | 7.32X | 20.6880 | 2.2287 | 9.28X | 33.8687 |
| 800 | 0.0281 | 0.0022 | 12.77X | 5.75X | 7.87X | 30.6208 | 2.1917 | 13.97X | 32.0629 |
| 900 | 0.0334 | 0.0022 | 15.18X | 8.80X | 8.51X | 35.0497 | 2.2686 | 15.45X | 33.5968 |
| 1000 | 0.0364 | 0.0022 | 16.55X | 11.09X | 9.74X | 49.1019 | 2.3635 | 20.76X | 39.2978 |
4.1 Analysis of the Results

Recall that in [10], each time a node is extracted from the heap, two of its children are inserted immediately, and hence the total number of entries into the heap has the fixed value 2k + 1. Also, the peak number of entries in the heap always rises to k + 1, which can be verified from Table 2. This is because each extraction from the heap is followed by a double insertion, which effectively increases the heap size by one after each extraction. The last insertion could, however, be avoided by checking before insertion whether the k-th item has already been extracted; in that case, the numbers would reduce to 2k − 1 and k, respectively.

Note that in our solution, every node of the DAG G can have at most two children (Lemma 3). However, most of the time the number of children is only one; only in a few cases is it two, and in some cases it is zero. Hence, although the number of entries into the heap in our solution lies in [k, 2k − 1], in practice the actual number of entries turns out to be far less than 2k − 1. Also, the peak number of entries in the heap at any point of time remains far lower than the value of k. This can easily be observed in Table 2. This is expected because, in our case, most of the time only one child node needs to be inserted into the heap after extracting the node with the minimum value. This could have been predicted from the sparse nature of the graph G shown in Fig. 2d, which we implicitly keep constructing during the run of the algorithm. The heap size hardly increases, as in most steps the extraction compensates for the single insertion that follows. In effect, our algorithm exploits the partial order that inherently exists in the data better than the simpler algorithm does. It is clear from the above discussion that, due to the smaller peak number of entries in the heap at any point of time, our
approach will perform much better than the solution of [10] when applied to a practical problem, especially when a lot of satellite data must be maintained in each node.

Table 2 Comparison of our algorithm with a prior work reporting only the sums (B, V = total time in seconds for the existing method and our method; P, Q = total number of entries into the heap; R, S = peak size of the heap; TI = 100(P−Q)/P; TP = 100(R−S)/R)

| n | k | B | V | B/V | P | Q | TI | R | S | TP |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 1000 | 0.0009 | 0.0007 | 1.29X | 2001 | 1132 | 43.43 | 1001 | 136 | 86.41 |
| 100 | 10000 | 0.0114 | 0.0091 | 1.25X | 20001 | 11078 | 44.61 | 10001 | 1085 | 89.15 |
| 100 | 100000 | 0.1076 | 0.0998 | 1.08X | 200001 | 109531 | 45.23 | 100001 | 9531 | 90.47 |
| 100 | 1000000 | 1.2933 | 1.0105 | 1.28X | 2000001 | 1083508 | 45.82 | 1000001 | 83519 | 91.65 |
| 100 | 10000000 | 12.7919 | 11.5919 | 1.10X | 20000001 | 10745231 | 46.27 | 10000001 | 745270 | 92.55 |
| 200 | 1000 | 0.0008 | 0.00068 | 1.18X | 2001 | 1102 | 44.93 | 1001 | 107 | 89.31 |
| 200 | 10000 | 0.0089 | 0.0072 | 1.24X | 20001 | 10848 | 45.76 | 10001 | 854 | 91.46 |
| 200 | 100000 | 0.1068 | 0.0819 | 1.30X | 200001 | 107377 | 46.31 | 100001 | 7381 | 92.62 |
| 200 | 1000000 | 1.2451 | 0.9553 | 1.30X | 2000001 | 1063607 | 46.82 | 1000001 | 63627 | 93.64 |
| 200 | 10000000 | 12.1236 | 10.6818 | 1.13X | 20000001 | 10557415 | 47.21 | 10000001 | 557428 | 94.43 |
| 300 | 1000 | 0.0007 | 0.0006 | 1.17X | 2001 | 1123 | 43.88 | 1001 | 126 | 87.41 |
| 300 | 10000 | 0.0082 | 0.0066 | 1.24X | 20001 | 10997 | 45.02 | 10001 | 997 | 90.03 |
| 300 | 100000 | 0.1073 | 0.0798 | 1.34X | 200001 | 109175 | 45.41 | 100001 | 9175 | 90.83 |
| 300 | 1000000 | 1.2064 | 0.9547 | 1.26X | 2000001 | 1081802 | 45.91 | 1000001 | 81802 | 91.82 |
| 300 | 10000000 | 12.0638 | 10.7924 | 1.12X | 20000001 | 10737463 | 46.31 | 10000001 | 737465 | 92.63 |
| 400 | 1000 | 0.0007 | 0.0006 | 1.17X | 2001 | 1069 | 46.58 | 1001 | 69 | 93.11 |
| 400 | 10000 | 0.0081 | 0.0064 | 1.27X | 20001 | 10630 | 46.85 | 10001 | 631 | 93.69 |
| 400 | 100000 | 0.1065 | 0.0762 | 1.40X | 200001 | 105278 | 47.36 | 100001 | 5280 | 94.72 |
| 400 | 1000000 | 1.2034 | 0.8952 | 1.34X | 2000001 | 1047571 | 47.62 | 1000001 | 47588 | 95.24 |
| 400 | 10000000 | 11.7885 | 9.5206 | 1.24X | 20000001 | 10427513 | 47.86 | 10000001 | 427588 | 95.72 |
| 500 | 1000 | 0.0007 | 0.00059 | 1.19X | 2001 | 1076 | 46.23 | 1001 | 77 | 92.31 |
| 500 | 10000 | 0.0082 | 0.0062 | 1.32X | 20001 | 10523 | 47.39 | 10001 | 525 | 94.75 |
| 500 | 100000 | 0.1068 | 0.0755 | 1.42X | 200001 | 104460 | 47.77 | 100001 | 4460 | 95.54 |
| 500 | 1000000 | 1.1767 | 0.9201 | 1.28X | 2000001 | 1039204 | 48.04 | 1000001 | 39215 | 96.08 |
| 500 | 10000000 | 11.9332 | 10.6270 | 1.12X | 20000001 | 10346223 | 48.27 | 10000001 | 346254 | 96.54 |
| 600 | 1000 | 0.0007 | 0.0006 | 1.17X | 2001 | 1076 | 46.23 | 1001 | 76 | 92.41 |
| 600 | 10000 | 0.0097 | 0.0088 | 1.10X | 20001 | 10613 | 46.94 | 10001 | 616 | 93.84 |
| 600 | 100000 | 0.1057 | 0.0774 | 1.37X | 200001 | 105353 | 47.32 | 100001 | 5355 | 94.65 |
| 600 | 1000000 | 1.1732 | 0.8996 | 1.30X | 2000001 | 1048164 | 47.59 | 1000001 | 48183 | 95.18 |
| 600 | 10000000 | 12.0589 | 11.0571 | 1.09X | 20000001 | 10437258 | 47.81 | 10000001 | 437265 | 95.63 |
| 700 | 1000 | 0.0007 | 0.0006 | 1.17X | 2001 | 1073 | 46.38 | 1001 | 75 | 92.51 |
| 700 | 10000 | 0.0078 | 0.0064 | 1.22X | 20001 | 10740 | 46.30 | 10001 | 747 | 92.53 |
| 700 | 100000 | 0.1060 | 0.0776 | 1.37X | 200001 | 106428 | 46.79 | 100001 | 6436 | 93.56 |
| 700 | 1000000 | 1.1557 | 0.9502 | 1.22X | 2000001 | 1058889 | 47.06 | 1000001 | 58892 | 94.11 |
| 700 | 10000000 | 12.0418 | 11.3002 | 1.07X | 20000001 | 10543429 | 47.28 | 10000001 | 543467 | 94.57 |
| 800 | 1000 | 0.0007 | 0.0006 | 1.17X | 2001 | 1053 | 47.38 | 1001 | 53 | 94.71 |
| 800 | 10000 | 0.0080 | 0.0062 | 1.29X | 20001 | 10346 | 48.27 | 10001 | 347 | 96.53 |
| 800 | 100000 | 0.1048 | 0.0721 | 1.45X | 200001 | 102680 | 48.66 | 100001 | 2682 | 97.32 |
| 800 | 1000000 | 1.1492 | 0.8254 | 1.39X | 2000001 | 1024118 | 48.79 | 1000001 | 24123 | 97.59 |
| 800 | 10000000 | 11.8349 | 9.5207 | 1.24X | 20000001 | 10213912 | 48.93 | 10000001 | 213916 | 97.86 |
| 900 | 1000 | 0.0007 | 0.0006 | 1.17X | 2001 | 1066 | 46.73 | 1001 | 68 | 93.21 |
| 900 | 10000 | 0.0079 | 0.0064 | 1.23X | 20001 | 10643 | 46.79 | 10001 | 645 | 93.55 |
| 900 | 100000 | 0.1040 | 0.0768 | 1.35X | 200001 | 106278 | 46.86 | 100001 | 6279 | 93.72 |
| 900 | 1000000 | 1.1248 | 0.8773 | 1.28X | 2000001 | 1058804 | 47.06 | 1000001 | 58804 | 94.12 |
| 900 | 10000000 | 11.7640 | 10.5033 | 1.12X | 20000001 | 10551227 | 47.24 | 10000001 | 551264 | 94.49 |
| 1000 | 1000 | 0.0007 | 0.0006 | 1.17X | 2001 | 1079 | 46.08 | 1001 | 79 | 92.11 |
| 1000 | 10000 | 0.0081 | 0.0064 | 1.27X | 20001 | 10622 | 46.89 | 10001 | 631 | 93.69 |
| 1000 | 100000 | 0.1052 | 0.0766 | 1.37X | 200001 | 105589 | 47.21 | 100001 | 5591 | 94.41 |
| 1000 | 1000000 | 1.1618 | 0.8831 | 1.32X | 2000001 | 1051421 | 47.43 | 1000001 | 51425 | 94.86 |
| 1000 | 10000000 | 12.0205 | 10.7482 | 1.12X | 20000001 | 10479269 | 47.60 | 10000001 | 479286 | 95.21 |
5 Conclusion

In this manuscript, an efficient algorithm is first proposed to compute the top-k subsets of a set R of n real numbers. The presented algorithm creates portions of an implicit DAG on demand, which removes the storage requirement of the preprocessing step altogether. In several steps, we have made the DAG as sparse as possible, so that the overall run-time complexity improves while the DAG retains its useful properties. Our algorithms were implemented in C, and the run-time plots illustrate that they perform as expected. Another efficient algorithm is also presented for reporting only the top-k subset sums (not the subsets themselves), and we have compared our results with an existing solution. These two algorithms were also implemented, and the results establish that our method consistently performs better than the existing one. Solving the problem for aggregation functions other than summation, and finding other applications of the metadata structure, remain possible directions for future research.
References

1. A. Marian, N. Bruno, L. Gravano, Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst. 29(2), 319–362 (2004)
2. T. Suzuki, A. Takasu, J. Adachi, Top-k query processing for combinatorial objects using Euclidean distance. IDEAS 209–213 (2011)
3. L. Getoor, C.P. Diehl, Link mining: a survey. ACM SIGKDD Explor. Newslett. 7(2), 3–12 (2005)
4. M. Karpinski, Y. Nekrich, Top-k color queries for document retrieval. ACM-SIAM SODA 401–411 (2011)
5. S. Rahul, P. Gupta, R. Janardan, K.S. Rajan, Efficient top-k queries for orthogonal ranges. WALCOM 110–121 (2011)
6. C. Li, K.C.C. Chang, I.F. Ilyas, Supporting ad-hoc ranking aggregates. SIGMOD 61–72 (2006)
7. I.F. Ilyas, W.G. Aref, A.K. Elmagarmid, Joining ranked inputs in practice. VLDB 950–961 (2002)
8. M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness (W.H. Freeman, 1979)
9. B. Sanyal, S. Majumder, W.K. Hon, P. Gupta, Efficient meta-data structure in top-k queries of combinations and multi-item procurement auctions. Theoret. Comput. Sci. 814, 210–222 (2020)
10. D. Eppstein, K-th subset in order of increasing sum (2015), https://mathoverflow.net/questions/222333/k-th-subset-in-order-of-increasing-sum
11. S. Deep, P. Koutris, Ranked enumeration of conjunctive query results. ICDT 3:1–3:26 (2021)
12. D. Marx, Tractable hypergraph properties for constraint satisfaction and conjunctive queries. J. ACM 60(6), 42 (2013)
Brain Tumour Detection Using Machine Learning

Samriddhi Singh
Abstract Brain tumours are caused by the abnormal growth of cells. They are one of the leading causes of adult deaths worldwide, and early detection of brain tumours can prevent a large number of these deaths. Detecting brain tumours using magnetic resonance imaging (MRI) can increase the survival rate of patients. This work aims to detect tumours at an early stage. Automated detection of tumours using MRI is very important, as it is needed for treatment planning. The conventional way of detecting defects in MRI images of the brain is visual inspection by an individual expert. This approach is not suitable for large amounts of data, so automatic tumour detection methods to assist radiologists are being developed. Detecting tumours using MRI is challenging because of the complexity and variety of cancers. This article applies a machine learning approach to detect malignancies in brain MRI. The proposed procedure is divided into three parts: apply preprocessing steps to the MRI image of the brain, extract the texture features using the Grey-Level Co-occurrence Matrix (GLCM), and classify using a machine learning algorithm.

Keywords Magnetic resonance imaging · Feature extraction · Segmentation · Machine learning · Texture features
1 Introduction

Early detection of brain tumours using magnetic resonance imaging (MRI) can increase the survival rate of patients. MRI shows the tumour more clearly, which helps with the further course of treatment. This work aims to detect tumours at an early stage. Automatic defect detection in medical images has become a new field with many medical diagnostic applications.

S. Singh (B) Gautam Buddha University, Greater Noida, India e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_22
277
278
S. Singh
As the requirement for treatment planning has emerged, computerized detection of tumours using MRI is critical. The conventional method of identifying defects in MRI images of the brain is visual inspection by an individual (Figs. 1 and 2). This technique is not suitable for large amounts of data; consequently, automatic tumour detection methods to assist radiologists are being developed. This article employs a machine learning approach to recognize tumours in brain MRI. The proposed procedure is divided into three parts: apply preprocessing steps to the MRI image of the brain, extract the texture features using the Grey-Level Co-occurrence Matrix (GLCM), and classify using a machine learning algorithm. Brain tumours are solid neoplasms inside the skull. These tumours arise from uncontrolled and abnormal cell division. Usually they grow in the brain itself, but they can also grow in other places, for example in lymphatic tissue, in blood vessels, around
Fig. 1 Brain tumour classification
Fig. 2 Structure of CNN [3]
cranial nerves, and in the membranes surrounding the brain [1]. Brain tumours can also develop from the spread of tumours located in other regions of the body. The classification of a brain tumour depends on the tumour location, the type of tissue in which the tumour occurs, whether the tumour is malignant or benign, and other considerations [2]. Primary brain tumours are tumours that arise in the brain and are named after the types of cells from which they originate. A tumour can be benign (non-cancerous), meaning that it cannot spread to other places, for instance meningioma. It can also be malignant and invasive, for example lymphoma (the classic form of lymphoma is most often a ring), cystic oligodendroglioma (which consists of homogeneous, rounded cells with distinct borders and clear cytoplasm surrounding a dense central nucleus, giving them a "fried egg" appearance), ependymoma (tumours that arise from ependymal cells within the brain), and anaplastic astrocytoma (which can appear histologically benign but behaves aggressively; anaplastic astrocytomas are the most common high-grade astrocytomas) [3]. The flowchart of the proposed methodology is represented in Fig. 4. The brain tumour classification method is described in Fig. 1, and the structure of the CNN is described in Figs. 2 and 3. Secondary brain tumours, or metastatic tumours, start with cancer cells that spread to the brain from a different part of the body. As a rule, cancers that spread to the brain and cause secondary brain tumours arise in the kidney, lung and breast, or from melanomas of the skin [2]. A brain scan is an image of the internal anatomy of the brain; the most common brain scan is magnetic resonance imaging (MRI). MR images give an unmatched view of the human body [4]. Two common classes of techniques (Figs.
3, 5, and 6) are used to classify MRI images: supervised methods such as support vector machines, k-nearest neighbours, and artificial neural networks, and unsupervised techniques such as fuzzy c-means and the self-organizing map (SOM). Many studies have used both supervised and unsupervised methods to classify normal and abnormal MRI images [5]. In this paper, supervised machine learning techniques are used to classify several kinds of abnormal brain MR images, such as
Fig. 3 Brain tumour classification method [3]
Fig. 4 Flowchart
lymphoma, anaplastic astrocytoma, ependymoma, and cystic oligodendroglioma. Figure 7 displays MR images of the several types of brain tumour studied in this work. A self-regulating classification algorithm for MRI images of the brain was suggested by applying machine learning methods, including the multi-layer perceptron (MLP) and C4.5 decision tree computations.
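As a toy illustration of the supervised route mentioned above, a k-nearest-neighbour classifier over extracted feature vectors can be sketched as follows (illustrative only; the feature values and labels here are made up, not taken from any dataset in this paper):

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training vectors nearest to x
    (Euclidean distance)."""
    d = np.linalg.norm(np.asarray(train_X, float) - np.asarray(x, float), axis=1)
    votes = [train_y[i] for i in np.argsort(d)[:k]]
    return max(set(votes), key=votes.count)

X = [[0.0, 0.0], [0.1, 0.9], [5.0, 5.0], [6.0, 5.0]]   # e.g. texture features
y = ['normal', 'normal', 'tumour', 'tumour']
print(knn_predict(X, y, [5.0, 6.0]))   # 'tumour'
```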
2 Literature Survey

Natarajan et al. [1] proposed a method for detecting brain tumours in brain MRI images. The MRI brain images are preprocessed first using the median filter; then segmentation of the image is done using threshold segmentation, morphological operations are applied, and finally the tumour region is obtained using the image subtraction scheme. This approach gives the exact shape of the tumour in the MRI brain image. Joshi et al. [2] proposed a brain tumour detection and classification system for MRI images that first isolates the tumour portion from the brain image, then extracts the texture features of the detected tumour using the Grey-Level Co-occurrence Matrix (GLCM), and then classifies it using a neuro-fuzzy classifier. Amin and Mageed [3] proposed a neural network and segmentation based framework to automatically detect tumours in brain MRI images. For feature extraction, the multi-layer perceptron (MLP) is employed after principal component analysis (PCA) is used to derive the independent features of the MRI brain image. The
Fig. 5 Recognizing the data from the image inputs
Fig. 6 Checking the trainable data
average recognition rate is 88.2%, while the highest rate is 96.7%. Sapra et al. [4] developed a method for separating brain tumours from MRI images via image segmentation, after which a probabilistic neural network (PNN) is used for automated brain tumour classification in MRI scans; the proposed PNN framework handles the process of brain tumour classification accurately. Goswami and Bhaiya [5] suggested an unsupervised neural-network learning method for the classification of brain MRI images. First, preprocessing is done on
282
S. Singh
Fig. 7 Showing the results in the form of images
the brain MRI images, which includes noise removal and edge detection, before the tumour region is segmented and extracted. Texture features obtained with the Gray-Level Co-occurrence Matrix (GLCM) are then classified using Self-Organizing Maps (SOM) to label the brain as normal or abnormal, i.e., whether or not it contains a tumour. Priyanka and Singh [6] presented pre-processing approaches for improving the quality of MRI images before they are used in an application. Mean, median, and Wiener filters are used for noise removal, and an interpolation-based Discrete Wavelet Transform (DWT) technique is used for resolution enhancement. The Peak Signal-to-Noise Ratio (PSNR) is used to evaluate these methods.
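As an illustration of the GLCM texture features used by the surveyed methods [2, 5], here is a minimal NumPy sketch; the offset, number of gray levels, and the choice of the contrast statistic are illustrative, not taken from the surveyed papers:

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=4):
    # Count how often gray level i occurs next to gray level j at the
    # offset (dx, dy) -- the Gray-Level Co-occurrence Matrix.
    m = np.zeros((levels, levels), dtype=int)
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[img[y, x], img[y + dy, x + dx]] += 1
    return m

def glcm_contrast(m):
    # One standard GLCM texture statistic: sum of (i - j)^2 weighted by
    # the normalised co-occurrence probabilities.
    p = m / m.sum()
    i, j = np.indices(p.shape)
    return float(((i - j) ** 2 * p).sum())
```

Statistics such as contrast, energy, and homogeneity computed from this matrix form the texture feature vector fed to the classifier.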
3 Methodology and Materials
Algorithm
STEP 1: MR Image Acquisition
STEP 2: Pre-processing
STEP 3: Adaptive Thresholding
STEP 4: Region Detection
STEP 5: Feature Extraction
STEP 6: Classification
STEP 7: Results
MRI Image Acquisition.
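A minimal sketch of Steps 2-5 (pre-processing, adaptive thresholding, region detection, and simple feature extraction) using NumPy/SciPy; the filter sizes and the threshold offset are assumed, illustrative values, not the chapter's actual parameters:

```python
import numpy as np
from scipy import ndimage

def detect_tumour_region(img):
    # Step 2: pre-processing -- median filter to suppress salt-and-pepper noise.
    den = ndimage.median_filter(img.astype(float), size=3)
    # Step 3: adaptive thresholding -- a pixel is foreground if it exceeds
    # its local mean by an offset (offset of 10 is an arbitrary choice here).
    local_mean = ndimage.uniform_filter(den, size=15)
    mask = den > local_mean + 10
    # Step 4: region detection -- keep the largest connected component.
    labels, n = ndimage.label(mask)
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, np.arange(1, n + 1))
    region = labels == (np.argmax(sizes) + 1)
    # Step 5: simple shape features (area and bounding-box extent).
    area = int(region.sum())
    ys, xs = np.nonzero(region)
    extent = area / ((np.ptp(ys) + 1) * (np.ptp(xs) + 1))
    return {"area": area, "extent": extent, "mask": region}
```

The extracted features would then be passed to the classifier of Step 6.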
Fig. 8 Live results
4 Results and Discussion This proposed approach is evaluated in terms of Peak Signal-to-Noise Ratio (PSNR), Mean Squared Error (MSE), and Structural Similarity Index (SSIM), which gave values of 76.38, 0.037, and 0.98 on T2 and 76.2, 0.039, and 0.98 on FLAIR images. Pixels, individual features, and combined characteristics were all considered when evaluating the segmentation outcomes. The proposed approach is compared at the pixel level against ground-truth slices and is validated in terms of foreground (FG) pixels, background (BG) pixels, error rate (ER), and pixel quality (Q). The technique attained 0.96 FG and 0.99 BG precision and 0.010 ER on a local dataset. On the multi-modal brain tumour segmentation dataset BRATS 2013, 0.93 FG and 0.99 BG accuracy and 0.005 ER were obtained. Similarly, on BRATS 2015, 0.97 FG and 0.98 BG accuracy and 0.015 ER were obtained (Figs. 5, 6, and 8).
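For reference, MSE and PSNR (two of the three metrics quoted above) can be computed as follows; the 8-bit peak value of 255 is an assumption about the image format, and SSIM is omitted for brevity:

```python
import numpy as np

def mse(a, b):
    # Mean Squared Error between two images of the same shape.
    return float(np.mean((a.astype(float) - b.astype(float)) ** 2))

def psnr(a, b, peak=255.0):
    # Peak Signal-to-Noise Ratio in dB; infinite for identical images.
    e = mse(a, b)
    return float("inf") if e == 0 else 10 * np.log10(peak ** 2 / e)
```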
5 Conclusion and Future Scope This work proposes two approaches for brain tumour classification based on machine-learning algorithms. Shape features are extracted and used for classification; the shape features considered in this work are major axis length, minor axis length, Euler number, solidity, area, and circularity. For classification, C4.5 and the Multi-layer Perceptron are used. The best accuracy, of around 97%, is achieved on 165 brain MRI images using the MLP algorithm. To improve this accuracy, one can use a
larger dataset and include additional attributes, for instance texture and power-based features, and in the future there is still scope to increase accuracy with less complexity or in less time. Early detection of brain tumours in this way can help prevent deaths. The method detected tumours with a precision equivalent to or greater than that of existing approaches, so the suggested approach can be developed further (Fig. 8).
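A toy sketch of the classification step described above: scikit-learn's DecisionTreeClassifier (CART) stands in for C4.5, and the shape-feature data is synthetic, a stand-in for the chapter's real 165-image dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical shape-feature vectors [area, circularity, solidity]; the two
# well-separated clusters are synthetic stand-ins for normal vs. tumour cases.
normal = rng.normal([120.0, 0.90, 0.95], [6.0, 0.03, 0.02], size=(40, 3))
tumour = rng.normal([400.0, 0.50, 0.70], [20.0, 0.03, 0.02], size=(40, 3))
X = np.vstack([normal, tumour])
y = np.array([0] * 40 + [1] * 40)

# CART is a close relative of C4.5; an unpruned tree fits the training set exactly.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
train_acc = clf.score(X, y)
```

On real data, held-out evaluation (not training accuracy) would be needed to support a figure like the 97% reported above.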
References
1. P. Natarajan, N. Krishnan, N.S. Kenkre, S. Nancy, B.P. Singh, Tumor detection using threshold operation in MRI brain images, in IEEE International Conference on Computational Intelligence and Computing Research (2012)
2. D.M. Joshi, N.K. Rana, V.M. Misra, Classification of brain cancer using artificial neural network, in IEEE International Conference on Electronic Computer Technology, ICECT (2010)
3. S.E. Amin, M.A. Mageed, Brain tumor diagnosis systems based on artificial neural networks and segmentation using MRI, in IEEE International Conference on Informatics and Systems, INFOS (2012)
4. P. Sapra, R. Singh, S. Khurana, Brain tumor detection using neural network. Int. J. Sci. Mod. Eng. (IJISME) 1(9) (August 2013), ISSN: 2319-6386
5. S. Goswami, L.K.P. Bhaiya, Brain tumor detection using unsupervised learning based neural network, in IEEE International Conference on Communication Systems and Network Technologies (2013)
6. J. Priyanka, B. Singh, A review on brain tumor detection using segmentation. Int. J. Comput. Sci. Mob. Comput. 2(7), 48–54 (July 2013)
Implementation of a Smart Patient Health Tracking and Monitoring System Based on IoT and Wireless Technology Addagatla Prashanth, Kote Rahulsree, and Panasakarla Niharika
Abstract The health department plays a crucial role in the present pandemic situation. Today, health care is of paramount importance in every country, and here the Internet of Things (IoT), a new Internet revolution and a rising field of research, plays a big role, particularly in health care. With the increase in wearable sensors and smartphones and the evolution of a new and advanced generation of communication, i.e., 5G technology, a patient can be diagnosed swiftly, which aids in the prevention of disease transmission and the accurate identification of health concerns even when the doctor is a long distance away. Here, we continuously monitor the patient's heartbeat, temperature, and other fundamental data, assess the patient's status, and preserve the patient's information on a server using wireless communication based on a Wi-Fi module. Keywords IoT · Sensors · 5G technology · Wi-Fi
1 Overview Safety is always a major concern with any technological advancement of mankind, and in this pandemic time universal health care has risen to prominence. In order to avoid the occurrence of disease and make consultation for examination more accessible, it is usually preferable to use health-surveillance equipment to monitor such people. An Internet of Things based condition-monitoring system with wireless accessibility and connectivity is the current solution for this [1–4].
A. Prashanth (B) · K. Rahulsree · P. Niharika Institute of Aeronautical Engineering, Electronics and Communication Engineering, Hyderabad, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_23
2 Block Diagram In this healthcare system, multiple sensors are used to analyze patient data so that the caretaker can supply sufficient healthcare supervision. Mobile nodes, which are routinely used by impaired people, need additional supervision [5–7] (Fig. 1). The system is a network of devices attached to physical objects that enables remote devices to sense, analyze, and monitor: a computing mechanism connecting hardware so that sensors and intelligent devices can communicate. It links a variety of instruments, intelligent objects, and network devices; sensor data is collected and transmitted by the sensor devices. Communication options include infrared, Bluetooth, ZigBee, Wi-Fi, UMTS, and 3G/4G/5G. For this application, high-speed networking technologies such as 4G/5G and Wi-Fi LAN are required. With this advanced technology, we can capture health data for a doctor's analysis and diagnosis by
Fig. 1 Basic block diagram of monitoring system by IoT
systems of physical sensors. The biggest benefit of IoT in health care is a reduction in maintenance, followed by an increase in the likelihood of receiving health care. IoT home health monitoring automatically detects emergencies and reports them to caregivers, and it enables the monitoring of patients' vital health information through the use of wearable technology that tracks their condition [7–10].
3 Methodology and Results The models were created to read the health-monitoring parameters and produce both a physical output (LCD display) and a simulated output. The physical output is based on the hardware components with direct access to the patient parameters; similarly, the output data can be shown on smartphones via web interfaces by establishing a gateway to view the system. The displayed outputs are graphed, allowing easy access to historical data (Fig. 2).
3.1 Database Tracking and Maintenance The ThingSpeak platform is used as the medium to track and maintain the patient's information in a database without losing data. ThingSpeak is a cloud-based IoT analytics tool that allows you to aggregate, visualize, and analyze live data streams (Fig. 3) [11–13]. You can send data from your devices to ThingSpeak, create live data visualizations, and issue alerts. ThingSpeak is built around a time-series database and provides its customers with time-
Fig. 2 Sensors and working modules
Fig. 3 Health monitoring channel created in ThingSpeak
series data storage capacity in channels; each channel has eight data fields. ThingSpeak's features include continuous data collection, data processing, visualizations, apps, and plugins (Fig. 4). If the patient's Internet connection goes down, the GSM module takes over and reads the patient's data; if an emergency occurs, the GSM module sends an emergency message and the patient's location to the designated numbers (Fig. 6). The output waveform of the patient's heartbeat (ECG) is shown in Fig. 5.
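As a sketch of how a gateway might push sensor readings to a ThingSpeak channel: the `/update` endpoint with `api_key` and `field1`–`field8` parameters is ThingSpeak's documented REST interface, but the API key and the field-to-sensor assignments below are placeholders:

```python
from urllib.parse import urlencode

THINGSPEAK_UPDATE = "https://api.thingspeak.com/update"

def build_update_url(api_key, readings):
    # readings: dict mapping a field number (1-8) to a sensor value,
    # matching ThingSpeak's eight data fields per channel.
    if not all(1 <= n <= 8 for n in readings):
        raise ValueError("ThingSpeak channels expose fields 1-8 only")
    params = {"api_key": api_key}
    params.update({f"field{n}": v for n, v in sorted(readings.items())})
    return THINGSPEAK_UPDATE + "?" + urlencode(params)
```

The gateway would then issue an HTTP request to the built URL (e.g., with `urllib.request.urlopen`), which is not executed here.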
Fig. 4 Health parameters of patients and their historic information
Fig. 5 Output waveform of patient's heartbeat (ECG), with panels showing temperature, humidity, pressure, and pulse
This work shows how to use IoT servers in conjunction with sensors and tracking modules. The circuit design and measurement setup simplify parameter measurement and data analysis. The primary goal is to provide early warning of serious deterioration and to implement monitoring systems that reduce healthcare costs by reducing doctor visits, hospital stays, and diagnostic testing procedures.
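The emergency-detection logic that triggers the GSM alert can be sketched as a simple range check; the vital-sign names and "safe" ranges below are illustrative assumptions, not clinical limits:

```python
# Hypothetical safe ranges -- real limits must come from a clinician.
VITAL_RANGES = {
    "pulse_bpm": (50, 120),
    "temp_c": (35.0, 38.5),
    "spo2_pct": (92, 100),
}

def check_vitals(sample):
    # Return a list of human-readable alerts for any reading outside
    # its configured range; an empty list means no emergency.
    alerts = []
    for name, (lo, hi) in VITAL_RANGES.items():
        v = sample.get(name)
        if v is not None and not (lo <= v <= hi):
            alerts.append(f"{name}={v} outside [{lo}, {hi}]")
    return alerts
```

A non-empty result would be forwarded to the GSM module together with the patient's location.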
Fig. 6 Working of GSM module
4 Conclusion and Future Scope The purpose of this study is to present a successfully implemented health monitoring and tracking system that can gather and process physiological indicators from the human body. The main goal is to create monitoring systems that reduce physician visits, hospitalizations, and diagnostic testing procedures to save money on health care. In the future we may integrate a solar harvesting system to automatically recharge the DC power source when the user is exposed to the sun, and we can also connect a camera to allow doctors to remotely monitor activity and help incapacitated individuals with IoT arrangements that are automated, reliable, and safe. IoT can substantially improve this population's quality of life. Smart gloves, for instance, have been created with low-cost inertial sensors to help people who have lost their hearing communicate with people who do not use sign language. Acknowledgements I would like to thank Mr. A. Prashanth, Assistant Professor of Electronics and Communication Engineering at the Institute of Aeronautical Engineering, who aided the project by providing insight and expertise. I would also like to express my sincere gratitude to the Department of Electronics and Communication Engineering and the management of the Institute of Aeronautical Engineering.
References
1. J. Singh, A. Chahajed, S. Pandit, S. Weigh, GPS and IoT based soldier tracking and health indication system. Int. Res. J. Eng. Technol., ISSN: 2395-0056 (2019)
2. B. Iyer, N. Patil, IoT enabled tracking and monitoring sensor for military applications, in International Conference on Computing, Communication and Automation (ICCCA), vol. 9, no. 2 (2018), pp. 2319–7242
3. W. Walker, A.L. Praveen Aroul, D. Bhatia, Mobile health monitoring systems, in 31st Annual International Conference of the IEEE EMBS, Minneapolis, Minnesota, USA (2018), pp. 5199–5202
4. A. Gondalic, D. Dixit, S. Darashar, V. Raghava, A. Sengupta, IoT based healthcare monitoring system for war soldiers using machine learning. Int. Conf. Robot. Smart Manuf. 289, 323–467 (2018)
5. A. Mdhaffar, T. Chaari, K. Larbi, M. Jamaiel, B. Freisleben, IoT based health monitoring via LoRaWAN. Int. Conf. IEEE EUROCON 115(89), 2567–2953 (2018)
6. V. Armarkar, D.J. Punekar, M.V. Kapse, S. Kumari, J.A. Shelk, Soldier health and position tracking system. Int. J. Eng. Sci. Comput. 3(23), 1314–1743 (2017)
7. S. Nikam, S. Patil, P. Powar, V.S. Bendre, GPS based soldier tracking and health indication. Int. J. Adv. Res. Electr. Electron. Instr. Eng. 288, 161–191 (2017)
8. M.J. Zieniewicz, D.C. Johnson, D.C. Wong, J.D. Flat, The Evolution of Army Wearable Computers, vol. 1, no. 6. Research Development and Engineering Center, US Army Communication (2017), pp. 5133–5442
9. S. Shelur, N. Patil, M. Jain, S. Chaudhari, S. Hande, Soldier tracking and health monitoring system. Int. J. Soft Comput. Artif. Intell. 2532–2878 (2016)
10. A.V. Armarkar, D.J. Punekar, M.V. Kapse, S. Kumari, A. Jayashree, Soldier health and position tracking system. JESC 7(3), 235–312 (2015)
11. S.H. Almotiri, M.A. Khan, M.A. Alghamdi, Mobile health (m-health) system in the context of IoT, in 2016 IEEE 4th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW) (August 2019), pp. 39–42
12. P. Addagatla, Arduino-based-student-attendance-monitoring-system-using-GSM. Int. J. Eng. Res. Technol. (IJERT) 8(7) (2019), ISSN: 2278-0181
13. P. Rizwan, K. Suresh, Design and development of low investment smart hospital using Internet of Things through innovative approaches. Biomed. Res. 28(11) (2017)
Diabetes Disease Prediction Using KNN Makarand Shahade, Ashish Awate, Bhushan Nandwalkar, and Mayuri Kulkarni
Abstract Diabetes is a very common chronic disease that is of rising concern. According to the World Health Organization, approximately 422 million people worldwide suffer from diabetes, and by 2040 this number is estimated to increase to approximately 642 million. Due to diabetes, one person dies every six seconds (five million a year), which is more than HIV, tuberculosis, and malaria combined, and 1.6 million deaths are due to diabetes every year. In the previous part, we have covered some of the traditional ways of diabetes prediction. The use of machine learning applications for this disease can reform the approach to its diagnosis and management. Support vector machines, logistic regression, K-Nearest Neighbor (KNN), and decision tree algorithms were used to identify the model. These techniques are well suited to detecting early signs of diabetes based on nine important parameters. Accuracy, F-Measure, Recall, Precision, and Receiver Operating Curve (ROC) measures are used to define the performance of the different machine learning techniques. Keywords K-Nearest Neighbor (KNN) · Receiver Operating Curve (ROC) · Evolutionary Computation (EC)
M. Shahade (B) · A. Awate · B. Nandwalkar · M. Kulkarni Department of Computer Engineering, SVKM’s Institute of Technology, Dhule 424001, Maharashtra, India e-mail: [email protected] A. Awate e-mail: [email protected] B. Nandwalkar e-mail: [email protected] M. Kulkarni e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_24
1 Introduction Diabetes is a very common chronic disease that is of rising concern. According to the World Health Organization, approximately 422 million people worldwide suffer from diabetes, and by 2040 this number is estimated to increase to approximately 642 million. Due to diabetes, one person dies every six seconds (five million a year), which is more than HIV, tuberculosis, and malaria combined, and 1.6 million deaths are due to diabetes every year. In the previous part, we have covered some of the traditional ways of diabetes prediction. The use of machine learning applications for this disease can reform the approach to its diagnosis and management. Various machine learning models have been used for predicting the risks of developing diabetes or consequent complications related to this disease. The care of patients, healthcare professionals, and healthcare systems has been facilitated using AI and machine learning. Clinical case-based reasoning, deep learning, and neural networks enable prediction of population risks, enhanced decision-making, and self-management by individuals. Support vector machines, logistic regression, K-Nearest Neighbor (KNN), and decision tree algorithms were used to identify the model. These techniques are well suited to detecting early signs of diabetes based on nine important parameters. Accuracy, F-Measure, Recall, Precision, and Receiver Operating Curve (ROC) measures are used to define the performance of the different machine learning techniques. One of the most important real-world chronic medical problems is the early detection of diabetes in patients. With the development and advent of AI and machine learning approaches, patients can be empowered to manage their diabetes, generate data/parameters, and become their own health experts. Awareness and knowledge of early signs will be useful in the management of diabetes in patients and especially in pregnant women.
The end-users of technical advances in diabetes care include healthcare professionals, patients, diabetes care and management centers, and data science enthusiasts. AI and machine learning approaches have introduced a quantum of change in healthcare systems, especially in diabetes care, and will continue to evolve. In the future, experience generated from the system will help improve it further in terms of functionality and utility in diabetes care using concepts of reinforcement learning, and it will be used on a larger scale across the world.
2 Literature Survey In the healthcare industry, data plays a crucial role in driving improvement and innovation. Healthcare data opens avenues for several discoveries: it supports evaluation and the production of more effective drugs, and it establishes better communication
between patients and doctors, and it improves the general quality of health care by giving deeper insight into a patient's health report and how a selected drug is responding. As the healthcare industry progresses, the use of data science has increased and is considered a foremost aspect of innovation. With the collection of data through the many devices and tools that patients use, researchers, scientists, and doctors are becoming conscious of healthcare challenges and digging for solutions to provide efficient patient care. According to one study, the data generated by every human body is 2 terabytes per day, including activities of the brain, stress level, pulse, sugar level, and much more. To handle such a large amount of data we now have more advanced technologies, one of which is data science. It helps monitor patients' health using recorded data, and with its application in health care it has become possible to detect the symptoms of a disease at an early stage. Also, with the arrival of various innovative tools and technologies, doctors are able to monitor patients' conditions from remote locations.
2.1 How is Data Science Revolutionizing the Healthcare Industry? In earlier days, doctors and hospital management were not able to handle multiple patients at the same time, and owing to the shortage of proper treatment, patients' conditions would often worsen. The global artificial intelligence in healthcare market size was valued at USD 3.39 billion in 2019 and is expected to grow at a compound annual growth rate (CAGR) of 43.6% from 2019 to 2027, reaching USD 61.69 billion in 2027. The forecasted difference is more than impressive, and that is why we have prepared this chapter, in which we provide a better understanding of data science applications in health care. Now, however, the scenario has changed. With the adoption of data science and machine learning applications, doctors can be notified about the health conditions of patients through wearable devices. Hospital management can then send junior doctors, assistants, or nurses to those patients' homes, and hospitals can further install various equipment and devices for the diagnosis of those patients. These devices, built on top of data science, are capable of collecting data from patients such as their pulse, vital signs, body temperature, etc. Doctors get this real-time data and the live status of the patient's health through updates and notifications in mobile applications. They can then diagnose the conditions and assist the junior doctors or nurses to offer specific treatments to patients at home. This is how data science helps in caring for patients using technology. Data science is changing the healthcare industry in many ways: improving medical image analysis, providing predictive medication, creating a global database of medical records, drug discovery, bioinformatics, virtual assistants, etc. And that's just the beginning!
Who knows where this integration of the field of data science and health care may lead in the future.
2.2 Role of Data Science in Health Care In the healthcare industry, data plays a crucial role in driving improvement and innovation. Healthcare data opens avenues for many discoveries: it supports evaluation and the production of more effective drugs, establishes better communication between patients and doctors, and improves the general quality of health care by giving deeper insight into a patient's health report and how a selected drug is responding. As the healthcare industry advances, the usage and implementation of data science have increased and are considered one of the foremost aspects of innovation. With the growing collection of data through the many devices and tools that patients use, researchers, scientists, and doctors are becoming conscious of healthcare challenges and digging for solutions to provide efficient patient care. Health care stores data related to practices such as clinical trials, electronic medical records (EMRs), genetic information, care-management databases, billing, Internet research, and social media. Patients use tools like Babylon, DocPlanner, TataHealth, etc. to establish communication with health professionals and to book appointments; through the synchronization of these tools, it becomes easier to manage customer data. One major example from this generation is the COVID-19 pandemic. The COVID-19 pandemic is unprecedented, and experts say that there are many unknowns: the virus is moving fast and is seemingly unpredictable. "The urgency of this disease means we cannot rely only on traditional methods to understand how the disease works and the way it will spread; we must utilize all data and advanced tools at our disposal, broadly within various companies and across universities." A variety of data sources, multiple techniques, and a range of cross-functional perspectives are helping propel the vaccine program.
Data science has helped us in this pandemic in the following ways:
• Using collected data to trace the pandemic and forecast hotspots
• Harnessing data to learn more about who might be most at risk of getting infected
• Leveraging data insights to help inform decisions about returning to the workplace
As per the U.S. Emerging Jobs Report, the field of data science has expanded by 350% since 2012 and is expected to grow extensively in the future.
2.3 Diabetes Detection Using Data Science What is diabetes? Diabetes is a condition in which the blood sugar level of an individual is very high. Most of the food that we consume is broken down into glucose, a form of sugar, which is dissolved in the blood and circulated throughout the body; when this sugar content in the blood is high, the individual is said to be diabetic. Insulin is a hormone produced by the pancreas that helps break down the sugar present in the blood, providing energy to the human body. Diabetes occurs when either the pancreas does not make enough insulin or the insulin is ineffective in breaking down the sugar present in the blood. As of 2015, 30.3 million people in the United States, or 9.4% of the population, had diabetes, and more than 1 in 4 of them did not know they had the disease. Diabetes affects 1 in 4 people over the age of 65, and about 90–95% of cases in adults are Type 2 diabetes. One may think that an increase in sugar concentration is not very harmful to the human body, but the reality is the opposite: diabetes is associated with many harmful conditions such as heart disease, vision loss, kidney disease, eye problems, dental problems, nerve damage, foot problems, and stroke. A cure for diabetes is yet to be developed, but scientists believe losing weight, eating healthy meals, and being active can help alleviate a person's diabetic condition [1]. Traditional approaches for diabetes detection. The first known mention of diabetes symptoms dates to 1552 B.C., when Hesy-Ra, an Egyptian physician, documented frequent urination as a symptom of a mysterious disease that also caused emaciation. Centuries later, people referred to as "water tasters" were used to diagnose diabetes by tasting the urine of individuals suspected to have the disease: if the urine tasted sweet, diabetes was diagnosed.
Until the 1800s, scientists were unable to develop any chemical test to detect the presence of sugar in the urine. Currently, the two types of diabetes, Type 1 and Type 2, are first detected with the glycated hemoglobin (A1C) test. If the A1C test is not available, or if an individual has certain conditions that can make the A1C test inaccurate, such as pregnancy or an uncommon form of hemoglobin, the doctor may use tests such as a random blood sugar test or a fasting blood sugar test, in which blood samples are taken at intervals that differ from test to test. Two additional tests are suggested by various doctors for the diagnosis of Type 2 diabetes: the oral glucose tolerance test and the screening test. If an individual is diagnosed with diabetes, the doctor can also run blood tests to check for autoantibodies that are common in Type 1 diabetes; these tests help doctors distinguish between Type 1 and Type 2 diabetes when the diagnosis is uncertain. The first diabetes treatment, and the first proved effective, was prescribed exercise, often horseback riding, which was thought to alleviate excessive urination. Today,
Insulin and metformin are still the primary therapies used to treat Type 1 and Type 2 diabetes, respectively; other medications have since been developed to help control blood glucose levels. Diabetic patients can now test their blood glucose levels at home, follow dietary changes, carry out regular exercise, and take insulin and other medications to precisely control their blood glucose levels, thereby reducing their risk of health complications.
2.4 Flow of the Methodology Used for Diabetes Detection
2.4.1 Overview of the Basic Flow of Machine Learning
Machine learning as the name suggests means getting machines to train and learn to act and make decisions like humans. Machine learning has come a long way and has strengthened its roots in all walks of life. Data science and machine learning are being used in every field to make better data-driven decisions. Figure 1 diagrammatically explains the generic workflow of any machine learning project. A generic machine learning workflow always involves five key steps which are as follows: 1. Get Data—Getting data means gathering data from various sources that are available. The data gathered in this step should have all the characteristics that the project requires. The data can be open source and free or can be from a paid source. Many websites like Kaggle and GitHub provide free datasets to facilitate the community. Governments of various countries have also been involved in this data-centric revolution and have made various datasets available to the public free of cost. These datasets are easily accessible and can be used in a wide variety of projects. 2. Clean, Prepare, and Manipulate Data—We often see data in tabular form and it seems as if data is well organized and can be used as it is after gathering it. But in reality, this is not the case. Data exists with attached impurities, imperfections,
Fig. 1 Diagrammatic representation of workflow of any generic machine learning project
and discrepancies which can lead to inaccurate results and ultimately the failure of the project. Hence, the data needs to be cleaned by removing outliers and NaN values. A dataset sourced from a website may not always have the structure that the project requires: in some cases the dataset has extra columns, and in others new columns have to be derived from existing ones. This process is called preparing and manipulating the data, and it helps in structuring the data as per the problem statement. 3. Train Model—As machine learning means getting machines to train and learn to make decisions like humans, this step trains the machine to make human-like decisions, commonly referred to as predictions. Before training, an appropriate model is selected based on multiple attributes. The selected model is then trained with a part of the dataset called the training data. Initially, the accuracy of the predictions made by the machine may not be high, but it gradually increases as the iterations and the training data size increase. 4. Test Model—In this phase, the model is tested with the remaining part of the dataset, commonly referred to as the testing dataset. The model is not familiar with the testing dataset, which is why the testing dataset is employed to determine the actual accuracy of the model. The accuracy obtained on the testing dataset is a close estimate of how the model will perform in a real-world scenario. 5. Improve—Models are not always accurate; even if they are accurate in the training phase, they might not be accurate in the testing phase. Models have attributes of their own which can be changed as per the need.
In the improve phase, if the accuracy of the model does not reach the desired level, some tweaks and adjustments are made to the model to increase its accuracy. In some cases, the size of the dataset used in the training phase is increased, which leads to more iterations and ultimately higher accuracy. These are the five generic key steps involved in every machine learning project, and they are iterated over and over again until the desired results are obtained. This tuning process is generally referred to as hyper-parameter optimization in technical terms.
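The five-step workflow above can be sketched with scikit-learn and the KNN classifier this chapter focuses on; the synthetic nine-feature dataset is a stand-in for a real diabetes dataset, and the hyper-parameters (k = 5, a 25% test split) are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Step 1: get data -- a synthetic stand-in with nine features, mirroring the
# nine parameters the chapter mentions.
X, y = make_classification(n_samples=400, n_features=9, n_informative=5,
                           random_state=42)
# Step 2: prepare -- hold out a test set the model never sees during training.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
# Step 3: train the model.
model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
# Step 4: test the model on the unseen data.
acc = accuracy_score(y_te, model.predict(X_te))
```

Step 5 (improve) would then adjust k, the features, or the dataset size and repeat.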
2.4.2
Overview of the Flow Used for Diabetes Detection
The basic diabetes detection algorithm follows the same workflow discussed earlier. The framework remains the same, with some changes in the model selection and improvement phases. In the data-gathering stage, the data is obtained from one of many sources, as numerous datasets are available for educational purposes. Cleaning and preparation are done as per the requirement: the dataset obtained is not clean and has many missing values, which are replaced by a predefined value; outliers are then eliminated and discrepancies resolved to make the dataset more usable. The dataset has many health-related indicators, and the irrelevant ones are filtered out to prepare it for training and testing of the model.
300
M. Shahade et al.
A wide array of models is used for diabetes detection, out of which the one with the best efficiency and accuracy is chosen. The chosen model is then trained with the training dataset, tested with the testing dataset, and the overall accuracy is calculated. The model is then fine-tuned to achieve the maximum accuracy.
2.4.3
Various Existing Algorithms Used for Diabetes Detection
Researchers and scientists have employed a variety of algorithms for diabetes detection over time. Some have developed their own hybrid algorithms to overcome the accuracy bottleneck and increase efficiency. Some of the commonly used algorithms for diabetes detection are as follows.
1. Naïve Bayes Classifier—The Naïve Bayes Classifier labels an outcome with the most likely class given the features that the instance exhibits. It assumes that all the variables are independent of each other, and hence works well with data that has class-imbalance problems and missing values. At its core, the Naïve Bayes Classifier uses Bayes' theorem to calculate the label [2, 3].
2. Logistic Regression—Regression establishes a relationship between independent and dependent variables. In logistic regression, the function used to fit the model is the logistic function, hence the name. The main reason logistic regression is used in diabetes detection is that it is designed for a binary outcome variable, one with only two values such as yes/no or male/female. This fits diabetes detection and the majority of other healthcare use cases, where a disease is either detected or not detected. Hence, logistic regression suits the diabetes detection problem well and achieves the maximum accuracy here [3, 4].
3. Decision Tree—A decision tree is a classifier-based machine learning algorithm that falls under the category of supervised machine learning. It follows a flowchart-like structure in which a node represents a test or condition and a branch represents an outcome of that test.
The main reason a decision tree is used to predict the target variable in diabetes detection is that the algorithm can build up the rules, choosing the necessary conditions and variables by itself, with the help of previous data and labels. In each iteration, the attribute with the highest information gain is chosen as the node. Owing to their ease of use, automatic rule generation, and efficient handling of all kinds of data, decision trees are among the most popular tools for the classification and prediction of a variable [3, 5].
4. Random Forest—We have already discussed decision trees and their use in classification and prediction. Just as a forest is formed by grouping numerous trees, a Random Forest Classifier is formed by grouping multiple decision tree classifiers. Like the decision tree, the random forest is a supervised machine learning algorithm used for classification and prediction, with applications such as image classification, feature selection, recommendation engines, fraud detection, and disease prediction. A random forest generally runs multiple decision trees in parallel and stores all their outcomes; the final output is then chosen by a process of voting among them. The whole workflow of the Random Forest can be seen in Fig. 2 [3, 6].
5. Neural Network—A neural network is a network of interconnected neurons or circuits. A basic neural network contains an input layer, several hidden layers, and an output layer, as shown in Fig. 3. The network can be made more complex by adding further hidden layers, filters, and intermediate layers to make it more efficient and accurate. An activation function, a term used frequently when discussing neural networks, is simply a mathematical function that defines how the input is converted into output across the layers of the network. Many libraries and tools, such as Keras and TensorFlow, allow complex neural networks to be created with a few lines of code. Once a basic neural network is designed, the user can make it as simple or as complicated as desired, depending on the required accuracy [3, 7, 8]. All
Fig. 2 Random Forest classifier
Fig. 3 Generalized neural network
the above-said algorithms can be easily implemented using the Python programming language along with some libraries that are available for free.
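For instance, all five can be instantiated through scikit-learn's uniform estimator interface; the `MLPClassifier` stands in here for the neural network (the chapter mentions Keras and TensorFlow as alternatives), and the hyperparameters are illustrative, not the authors' exact settings:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# The five algorithms compared in this chapter, behind one fit/predict API
models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(16, 8),
                                    max_iter=500, random_state=0),
}

# Each model exposes the same interface, e.g.:
# for name, m in models.items():
#     m.fit(X_train, y_train)
#     print(name, m.score(X_test, y_test))
```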
3 Proposed System
The framework used to compare and obtain the results uses the PIMA dataset, created by the National Institute of Diabetes and Digestive and Kidney Diseases. The aim of the dataset is to provide diagnostic measurements for predicting whether a person has diabetes. The dataset contains multiple diagnostic measurement variables: the number of pregnancies, glucose, blood pressure, skin thickness, insulin, body mass index (BMI), diabetes pedigree function, and age. The target variable of the whole study is the binary Outcome column: if the outcome for a particular row is 1, the person has diabetes; if it is 0, the person does not. Exploratory data analysis is then performed to understand the overall structure and central tendencies of the various columns. The data is cleaned and prepared for training and testing, which is carried out with an 80:20 split: 80% of the whole dataset acts as the training dataset and the remaining 20% as the testing dataset. Since this study is about making people aware of the use of data science and machine learning in health care, especially diabetes detection, we decided to put forth a comparative study based on the above five algorithms. This will not only help in understanding the importance of data science in health care but will also present a more complete picture of how data science is revolutionizing the healthcare industry.
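A minimal sketch of this preparation and 80:20 split, assuming the PIMA column layout described above; the cleaning rule of replacing implausible zeros with the column median is our assumption, not necessarily the authors' exact procedure:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare_pima(df, test_size=0.2, seed=42):
    """Clean the PIMA frame and return the 80:20 train/test split."""
    df = df.copy()
    # In these clinical columns a zero really means "missing", so
    # replace it with the column median (one possible predefined value)
    for col in ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]:
        if col in df.columns:
            df[col] = df[col].replace(0, df[col].median())
    X, y = df.drop(columns="Outcome"), df["Outcome"]
    return train_test_split(X, y, test_size=test_size, random_state=seed)
```

Usage would be along the lines of `X_tr, X_te, y_tr, y_te = prepare_pima(pd.read_csv("diabetes.csv"))` (the filename is hypothetical), and `df.corr()` on the cleaned frame yields the correlation matrix visualized later in Fig. 4.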
Diabetes Disease Prediction Using KNN
4 Results
4.1 Experimental Results
The executed algorithms show respectable accuracies ranging from 86% to 96%. The main point to note, however, is that no single algorithm can be called superior to all the others: each algorithm has its specific use case and is chosen either because the structure of the dataset at hand favors that model or because a certain kind of output is required. Thus, accuracy alone is not the parameter by which we can decide whether one algorithm is superior to another. All the executed models can fairly classify whether a person is diabetic, with a respectable accuracy of more than 85% in every case. This enables the detection of diabetes or borderline diabetes before the situation gets out of hand, so that the person can take all the necessary precautions to keep diabetes in check.
4.2 Visualization of Obtained Results
Data science is the field of performing operations on large amounts of data to extract useful information or knowledge and apply it in various applications. Generally, the knowledge extracted from a dataset is not readable to a person who does not know how to interpret a model summary or prediction results. Hence, the knowledge obtained must be presented in a manner that anyone can understand and draw conclusions from. This process is called data visualization; it is a part of every data science project and involves several kinds of charts to represent knowledge efficiently. Visualizations were likewise produced during this comparative study of the diabetes detection system. The first visualization we examine is a correlation plot. Correlation means that one variable has a relationship with another; it may be positive or negative, and a correlation of 0.7 or greater is considered strong. Figure 4 shows the correlation plot commonly used to identify the important or relevant variables that affect the outcome variable. In the plot, the X- and Y-axes denote the variables present in the PIMA dataset; the scale on the rightmost side maps color to correlation. The numbers in the individual cells give the correlation numerically, and the cell color gives it graphically. From the correlation plot
Fig. 4 Correlation plot
in Fig. 4, no variable shows a correlation of 0.7 or more, so there is no strong correlation present in the dataset. This is the conclusion that can be drawn from Fig. 4.
a. Comparison of results
After applying the five algorithms, namely the Naïve Bayes Classifier, Logistic Regression, Decision Tree, Random Forest, and Neural Network, the accuracy of each is calculated on the testing dataset, as the testing dataset represents the real-life accuracy of the model. Table 1 lists the accuracies of the mentioned models. This comparison by no means shows the superiority of one algorithm over another; it is necessary to note that with a different dataset or a different use case, other algorithms may perform better. The accuracies in the table only denote the performance of the models on the PIMA dataset, which contains data on Indian patients. As we can see, logistic regression outperforms every other selected algorithm with 96% accuracy on the PIMA dataset. A more complex form of neural network could also be used for building a diabetes detection framework. Regarding the predictions made by the logistic regression model: 93 people were predicted as diabetic and turned out to be diabetic, 5 were predicted to be diabetic but were non-diabetic, 4 were diabetic but were classified as non-diabetic, and 138 were non-diabetic and were correctly predicted as non-diabetic. These numbers establish the efficacy of the logistic regression model in predicting whether a person is diabetic using the PIMA dataset. Table 1 Accuracy table
Algorithms
Accuracy (%)
Naïve Bayes Classifier
93
Logistic Regression
96
Decision Tree
86
Random Forest
91
Neural Network
90
Hence, the logistic regression model can be fed with a person's inputs and will successfully predict whether that person has diabetes.
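The reported counts form a 2×2 confusion matrix, from which the 96% figure in Table 1 can be reproduced:

```python
# Counts reported above for the logistic regression model:
# true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 93, 5, 4, 138

# Accuracy = correct predictions over all predictions
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(round(accuracy * 100))  # prints 96, matching Table 1
```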
5 Conclusion
We have gone through the journey of how a generic machine learning project works, with the basic steps involved in every such project. Through these steps and their concise yet accurate descriptions, readers can implement a basic machine learning project using the concepts discussed in this chapter. The algorithms taken into this comparative study successfully predict whether a person is diabetic. For an accurate prediction, the person is required to enter all the measurements, and can then obtain a prediction about his or her diabetic condition well before symptoms start to show. In this way, the diabetic condition can be mitigated well in advance and the body kept out of risk efficiently.
6 Future Scope
Technology, change, and advancement go hand in hand. Many technology leaders like Google, Amazon, and Apple are already working on diabetes detection systems; these giants have realized the importance of data science and the opportunities available in the healthcare sector. Apple, for example, is trying to build glucose monitoring into the Apple Watch so that the user can be notified of fluctuations in glucose level. Alphabet, the parent company of Google, is trying to integrate a model with a web interface and provide this product to healthcare providers for patients' self-assessment of diabetes. Many major healthcare providers are working with such tech giants to improve the system and implement it efficiently; they are also collaborating to provide real-time data from their patients directly to these companies to train the models and increase the overall accuracy of the system.
References 1. Centers for Disease Control and Prevention. National diabetes statistics report, Centers for Disease Control and Prevention website (2017) 2. www.cdc.gov/diabetes/pdfs/data/statistics/national-diabetes-statistics-report.pdf, External link (PDF, 1.3 MB) Updated July 18, 2017. Accessed August 1, 2017 3. K. Patil, S.D. Sawarkar, S. Narwane, Designing a model to detect diabetes using machine learning, Int. J. Eng. Res. Technol. (IJERT) 8(11) (November 2019)
4. B. Yadav, S. Sharma, A. Kalra, Supervised Learning technique for prediction of diseases, in Intelligent Communication, Control and Devices (Springer, Singapore, 2018), pp. 357–369 5. A. Dagliati, S. Marini, L. Sacchi, G. Cogni, M. Teliti, V. Tibollo, ... R. Bellazzi, Machine learning methods to predict diabetes complications. J. Diabet. Sci. Technol. 12(2), 295–302 (2018) 6. I. Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I. Vlahavas, I. Chouvarda, Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104– 116 (2017) 7. A. Choudhury, D. Gupta, A survey on medical diagnosis of diabetes using machine learning techniques, in Recent Developments in Machine Learning and Data Analytics (Springer, Singapore, 2019), pp. 67–78 8. J.A. Carter, C.S. Long, B.P. Smith, T.L. Smith, G.L. Donati, Combining elemental analysis of toenails and machine learning techniques as a non-invasive diagnostic tool for the robust classification of type-2 diabetes. Expert Syst. Appl. 115, 245–255 (2019)
Review: Application of Internet of Things (IoT) for Vehicle Simulation System Rishav Pal, Arghyadeep Hazra, Subham Maji, Sayan Kabir, and Pravin Kumar Samanta
Abstract For the last few years, Internet of Things (IoT) has become a familiar phrase. With the increase in the number of experiments and research works, there is exponential advancement in creating, designing, and building models that can construct a smart and advanced world around us. This can only be achieved by performing a large number of prototype-based experiments and assumptions. This review paper mainly emphasizes automobile simulation and computer-based tests, especially vehicle simulation using IoT. It presents a simple way to acquire car simulation information, surveys the presently available interfaces, and outlines future plans to utilize them. Keywords Internet of Things · IoT · Advanced and smart cars · Vehicle simulation systems · Automobile
1 Introduction In recent times, the excessive growth in 4G/5G smartphones and electronic gadgets requires stable connections among the devices. The majority of gadgets have their R. Pal · A. Hazra · S. Maji · S. Kabir School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, India e-mail: [email protected] A. Hazra e-mail: [email protected] S. Maji e-mail: [email protected] S. Kabir e-mail: [email protected] P. K. Samanta (B) School of Electronics Engineering, KIIT Deemed to be University, Bhubaneswar, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_25
own built-in sensing chips (sensors), which make the devices more technically advanced. A huge amount of digital information is employed to estimate, run, and debug the interface. These wireless devices make up the Internet of Things (IoT) environment. Different associations and corporations control the system, that is, the interface between the operating system and the devices' applications. The services can be accessed using a web browser or an application, and the OS is irrelevant to end users; this is the so-called Software as a Service (SaaS) model. The devices comprise a mixture of heterogeneous and homogeneous hardware and software designs, which increases the complexity of the system and makes it difficult to create, develop, and maintain. Hence, simulation-based testing has attracted great attention in the IoT field; the ability to simulate IoT is important and essential because of its low cost and low risk. This paper emphasizes simulating automobiles in various IoT environments. The review first describes recent developments in simulation technologies and explains how to build simulation kits using a CAN link and OBD2. The second half of the review discusses vehicles and presents a practical way to prevent theft and keep track of a vehicle.
2 Application
The Internet is becoming a storehouse of knowledge. Wireless services are built by interconnecting different types of gadgets across a wide geographic region. Raw data captured by the sensors is distributed and collected by the information system. The raw data is controlled by a data distribution service implemented within an Application Programming Interface. Sensors are cheap, and thus demand for them is high in most developing fields; sensors communicate with one another to form a strong sensor network. Huge amounts of data are generated and then stored on cloud storage platforms. These architectures have the potential to receive information or inputs from various electronic mobile terminals. The data is processed together, forming a Sensing-Based Service Model (sensor model) delivered as SaaS. This scenario explains the requirement for a good, accurate, and stable simulation system that takes into consideration the different architectures that control transmission between varieties of devices, ubiquitous computing that focuses on increasing scalability, and the reuse of existing methods.
3 Basic Concepts of IoT
Internet of Things (IoT) [1] is broadly explained as the connection of various devices with each other through the web to upload data and provide services to the globe, thus building a bridge between the physical world and the system. The
use of the entire system is developed by using sensors to gather and store information from the real world, with software techniques and electronic equipment to exchange and process the information. Broadly, the Internet of Things advances connectivity among devices working on various platforms, environments, architectures, and operating interfaces. Built-in systems attached to the web are also described as smart objects. An alternative technology pointing in the same direction is RFID, which uses cheap e-tags to distinguish various objects and identify them from a distance. By adding a few unique entities to the electronic tag, it becomes a smart, sensible object; the Internet of Things is the influence that spreads its use through such smart digital objects [2].
4 Simulation Concept
A simulation can be described as reality made visible: a visualization of the world that lets us study strain and stress on the parts we wish to focus on without consuming the real-world resources that would otherwise be required. A simulator is a device that uses motion, light, and sound to reproduce the experience of identical conditions. Online and offline games may be considered the best-known examples of simulation. Simulation methods are used not only for testing experimental theories but also in research fields such as medicine and in developing large networks.
5 Simulation-Based Internet of Things
The overall skeleton of IoT is quite large; simulations are used to build software applications and test models in the business domain around the suggested applications, instead of investing money in hardware before the model meets the user's needs. The safety precautions thus taken help in performance optimization. Different forms of simulation are supported in IoT systems [3, 4]. These are divided into two parts: generic simulators, and advanced (specialized) simulators provided by Internet of Things OSs as part of various development environments like Contiki OS and Cooja. The largest benefit of such specialized simulators is that they do not require a simulation model and can use various simulation environments, but they are not scalable [5] because the entire device gets simulated. Generic simulators (ns-2, ns-3), in contrast, require simulation models but are highly scalable. So we require a correct approach (loop simulation) that enables running in real time with a minimum average model size [6]. Techniques such as multi-level simulation are often utilized in smart territories. Another approach may be to run various simulations together, each in its own domain [7, 8]. Methods such as multi-agent simulation are used to simulate metropolitan areas. The
land-use transport model and cellular automata help in designing support systems [9, 10].
6 Application of Simulated IoT for Systems Used in Vehicles
The automobile industry is a major center of attention for IoT applications, since a vehicle represents one of the biggest expenditures/investments made by any consumer. Vehicles are complex machines incorporating multiple micro-controller chips that communicate over onboard buses. Various applications of IoT in vehicles are:
• Effective and safer driving methods that prevent accidents.
• Timely maintenance alerts, making operation safer.
• Lower maintenance and operating costs.
• Information about traffic conditions and drivers.
• Vehicle-to-vehicle communication.
• Vehicle-to-Internet communication.
7 Car Simulation Structure and Principle
As of now, the most challenging aspect of simulation is the physical connection to the vehicle's OBD system, which facilitates testing of both the vehicle's hardware and software. The last thing anyone wants to do while driving a car is to keep an eye on an untested micro-controller. We propose to resolve this major issue by creating a module that easily simulates a car under all conditions, so that the system is capable of acting like a real vehicle (Fig. 1). As part of the communication system, the system must respond to CAN bus queries by sending OBD2 data. To accomplish this, we use the BeagleBone Black (BBB). With this, the entire system is capable of emulating the connection of a real automobile. Realistic trip data from the datasets is recorded using Real Trip Runners (RTRs) (Fig. 2), which are sometimes virtualized with Virtual Trip Runners (VTRs) (Fig. 3) in order to verify whether this information corresponds to real object data. As a result, one system can serve as both a trip recorder and a simulator at the same time. OBD2 data provides much information about a vehicle's internal functions, but it has some gaps, such as location and orientation. A BBB with CAN capability is therefore used in conjunction with a custom cape. A GPS module, accelerometers, magnetometers, angular rate sensors, and a K-line interface are all included in the cape, which allows conversion of this data into OBD2 signals. VTR/RTR capes are also equipped with Bluetooth modules, which enable the generation of fault codes and control of the VTR via an OBD2 Bluetooth interface (Fig. 4) or a smartphone [11].
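For illustration, decoding two standard OBD2 (SAE J1979) mode-01 responses of the kind such a simulator must generate can be sketched as follows; this is our sketch, not the authors' implementation:

```python
def decode_obd2(pid, data):
    """Decode the payload bytes of an OBD2 mode-01 response (SAE J1979)."""
    if pid == 0x0C:                              # engine RPM
        return (256 * data[0] + data[1]) / 4.0   # (256*A + B) / 4
    if pid == 0x0D:                              # vehicle speed, km/h
        return float(data[0])
    raise ValueError(f"PID {pid:#04x} not handled in this sketch")

# e.g. a simulated response payload [0x1A, 0xF8] for PID 0x0C:
rpm = decode_obd2(0x0C, [0x1A, 0xF8])  # (256*26 + 248) / 4 = 1726.0
```

A simulator answers a CAN query by encoding values through the inverse of these formulas; a recorder (RTR) applies them to captured frames.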
Fig. 1 A sample of vehicle simulation setup Fig. 2 RTR diagram
Fig. 3 VTR diagram
Fig. 4 OBD2 connected to car simulator/recorder
8 Embedding Security Features in a Vehicle Using IoT
We use a Raspberry Pi with a 32-bit ARM-based CPU and a Linux-based operating system to handle computation within the chip, which enables transparency and cross-platform implementation. Other important components include a proximity sensor, door lock sensor, Pi camera, and GPS and GSM modules, along with a USB interface and dongle interface. Python is used to program the device's functionality, and initialization is carried out during the power-on reset process. A smartphone delivers the Center Locking command to the car: SMS communication, in conjunction with the GSM modem, is used to operate the locking mechanism. Together with the door lock sensors, proximity sensors detect any unauthorized entry into the vehicle and track the activities of intruders. An app notifies the owner immediately that the Pi cameras have been activated, and the message is transmitted via the GSM modem. With the help of the GPS module, the owner can access the camera and track the location of the vehicle. The camera captures both interior and exterior images of the car, along with the surroundings, and sends them to the owner's device for display; the rear view is obtained by moving the rear camera. As part of native integration with the operating system, the messaging system automatically sends mail to the centralized server mail ID present in the vehicle's database at the required time interval (Fig. 5).
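The SMS-driven locking logic described above can be sketched as follows; the hardware call (`actuate_lock`) and the owner's number are hypothetical stubs standing in for the real GPIO/GSM code, not the authors' implementation:

```python
OWNER_NUMBER = "+910000000000"  # assumed configuration value

def actuate_lock(engage):
    """Stub for the real GPIO-driven central-locking actuator."""
    return "LOCKED" if engage else "UNLOCKED"

def handle_sms(sender, body):
    """Act on a Center Locking command received over the GSM modem."""
    if sender != OWNER_NUMBER:
        return "IGNORED"            # reject commands from unknown numbers
    cmd = body.strip().upper()
    if cmd in ("LOCK", "UNLOCK"):
        return actuate_lock(cmd == "LOCK")
    return "UNKNOWN COMMAND"
```

In the real system the return values would instead trigger the actuator and a confirmation SMS back to the owner.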
Fig. 5 IoT car tracking system
The control system consists of an Android application built in Flutter that provides essential features such as GPS tracking and remote lock/unlock of the car. To track the vehicle, the compact GPS modem Questar-G702-001UB is used, which periodically sends the vehicle's coordinates to the owner's email address via the Raspberry Pi [12]. Google Maps integration allows users to view the coordinates within the app itself [13, 14]. Furthermore, to avoid overflow from continuous data frames, the modem is ported to a Renesas processor. Using previous knowledge gained in the lab, a simulation model can automatically activate the entire process.
9 Conclusion and Future Scope
Researchers and industry have focused much attention on new technologies incorporating IoT in recent times, and the trend is expected to keep increasing in the near future. In view of the heterogeneous nature of IoT applications and their varying scales, it is imperative that they be thoroughly tested before being deployed, which can be accomplished through simulation analysis. Using OBD2 links and converting them into real-life scenarios, this review presents an amalgamation of such methods. The IBM Bluemix IoT Foundation is one of the many efforts enhancing this concept: a collection of real-time vehicle data is available for simulation through IBM Bluemix, an open-source project. Bluemix account users may register their vehicles, which contributes to the datasets that further improve the device's functionality and efficiency. The second aspect emphasized in this paper is security, demonstrated as an easy and practical method for implementing this technology in any modern-day vehicle. Typically, such implementations use technologies like RFID, Bluetooth, and
GPS. This review paper, therefore, is intended to spark further interest in the field of smart cars, which will hopefully result in smart cars becoming more commonplace in the future.
References 1. R.H. Weber, R. Weber, Internet of Things, vol. 12 (Springer, New York, NY, USA, 2010) 2. H. Kopetz, Internet of Things, in Real-Time Systems (Springer, US, 2011), pp. 307–323 3. Simulation—Scholarly Journal|Omics Group|Industrial Engineering and Management, Website: https://www.Omicsonline.Org/simulation-scholarly-journal.php 4. O. Oransa, M. Abdel-Azim, VeSimulator a location-based vehicle simulator model for IoT applications 5. M. Brumbulli, E. Gaudin, Towards model-driven simulation of the Internet of Things, in Complex Systems Design & Management Asia (Springer International Publishing, 2016), pp. 17–29 6. G. D’Angelo, S. Ferretti, V. Ghini, Simulation of the Internet of Things, in 2016 International Conference on High-Performance Computing & Simulation (HPCS) (IEEE, 2016) 7. W. El-Medany, et al., A cost-effective real-time tracking system prototype using integrated GPS/GPRS module, in 2010 6th International Conference on Wireless and Mobile Communications (ICWMC) 8. Sending vehicle data to the IBM Watson IoT platform—developer Works Recipes, Website: https://developer.ibm.com/recipes/tutorials/sending-vehicle-data-to-the-iot-foundation/ 9. S.R. Nalawade, A.S. Devrukhkar, Bus tracking by computing cell tower information on Raspberry Pi, in International Conference on Global Trends in Signal Processing, Information Computing and Communication, pp. 87–90 10. P.V. Mistary, R.H. Chile, Real time vehicle tracking system based on ARM7 GPS and GSM technology, in 12th IEEE International Conference Electronics, Energy, Environment, Communication, Computer, Control: (E3-C3), INDICON 2015, pp. 1–6 11. S. Almishari, N. Ababtein, P. Dash, K. Naik, An energy efficient real-time vehicle tracking system, in 2017 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM) 12. S. Sivaraman, M.M. Trivedi, Integrated lane and vehicle detection, localization, and tracking: a synergistic approach. IEEE Trans. Intell. Transp. Syst. 
14(2), 906–917 13. L.C.M. Varandas, B. Vaidya, J.J.P.C. Rodrigues, mTracker: a mobile tracking application for pervasive environment, in 24th International Conference on Advanced Information Networking and Applications Workshops, pp. 962–967 14. S. Ahmed, S. Rahman, S.E. Costa, Real-time vehicle tracking system, BRAC University
Data Science and Data Analytics
Time Series Analysis and Forecast Accuracy Comparison of Models Using RMSE–Artificial Neural Networks Nama Deepak Chowdary, Tadepally Hrushikesh, Kusampudi Madhava Varma, and Shaik Ali Basha
Abstract The primary aim of our research paper is to demonstrate the time series analysis and forecast accuracy of different selected models based on neural networks. Time series modeling and forecasting is of fundamental importance to many practical applications, and as a result there have been numerous ongoing research efforts on this topic for many years. Numerous significant models have been proposed in the literature for enhancing the precision and efficacy of time series modeling and forecasting. The purpose of this research is to give a brief overview of some common time series forecasting methods that we implemented, along with their key characteristics. The most parsimonious model is chosen with great care when fitting one to a dataset of Pune precipitation data from 1965 to 2002. We have utilized the RMSE (root mean square error) as a performance index to assess forecast accuracy and to contrast the several models fitted to the time series. We applied feed-forward, time-lagged, and seasonal neural networks as well as long short-term memory models to the selected dataset. The long short-term memory model outperformed the other models. Keywords Time series · Artificial neural networks · Long short-term memory · Time-lagged neural networks · Seasonal neural networks
1 Introduction Over the past few decades, the research community’s attention has been drawn to the dynamic field of time series modeling. The basic goal of time series modeling is to meticulously gather and thoroughly analyze historical data from a time series to N. D. Chowdary · T. Hrushikesh · K. M. Varma (B) · S. A. Basha Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh 522502, India e-mail: [email protected] N. D. Chowdary e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_26
create a model that accurately captures the series' underlying structure. The series' future values are then generated using this model to produce forecasts. Thus, time series forecasting can be defined as the process of making predictions about the future based on knowledge of the past [1]. Given the crucial role that time series forecasting plays in a variety of real-world contexts, including business, economics, finance, science, and engineering [5], due attention should be paid to fitting the right model to the underlying time series. It goes without saying that a good model fit is necessary for successful time series forecasting. Researchers have put in a great deal of work over many years to create effective models that increase forecasting accuracy, and this has led to the development of numerous significant time series forecasting models. Artificial neural networks (ANNs) have been gaining more and more attention recently in the field of time series analysis. Initially inspired by biology, ANNs have been successfully used in a wide range of contexts, particularly for prediction and classification applications [2, 3]. When applied to time series forecasting problems, ANNs shine because of their innate capacity to perform non-linear modeling without making any assumptions about the statistical distribution that the observations follow. The relevant model is adaptively created from the provided data; because of this, ANNs are data driven and self-adaptive. A significant amount of research has been conducted in recent years toward using neural network models for time series modeling and forecasting. In our experimental research, we utilized a dataset of Pune precipitation from 1965 to 2002, to which we applied four neural network-based models: feed-forward artificial neural networks, time-lagged artificial neural networks (TLNN), seasonal artificial neural networks (SANN), and long short-term memory (LSTM).
2 Related Research In contrast to conventional linear methods such as ARIMA, ANNs are intrinsically non-linear, which makes them more practicable and accurate in modeling complicated data patterns [4]. Numerous examples show that ANNs analyze and predict much better than linear models. Finally, ANNs are universal function approximators, as argued by Hornik and Stinchcombe [4]: they demonstrated that any continuous function may be approximated by a network to any desired level of accuracy. ANNs employ parallel processing of the data's information to accurately estimate a broad array of functions. Additionally, they are capable of handling scenarios in which the input is faulty, incomplete, or ambiguous. To enhance ANNs' forecasting capabilities for seasonal time series data, Hamzacebi [6] proposes the SANN structure. No preprocessing of the raw data is necessary for the suggested SANN model. In contrast to certain other conventional approaches, like SARIMA, SANN can also learn the seasonal patterns in the series without eliminating them. On four real-world time series datasets, the author
Fig. 1 Feed-forward neural network
has empirically confirmed the SANN's effective predicting abilities. Additionally, we are currently using this approach in our work on two new seasonal time series, and the outcomes are quite satisfactory. Here, we give a succinct summary of the SANN model as it was put forward in [6].
3 Neural Networks Models in Time Series 3.1 Feed-Forward Neural Network Model Multilayer perceptrons (MLPs), which employ a hidden-layer feed-forward network, are the ANNs most frequently used in forecasting problems [7]. The model is defined by a network made up of three layers: input, hidden, and output. There could be several hidden layers. Here, the prior "p" time steps determine the output y_t, so the input layer comprises p units and the output layer is composed of one unit. We must train the model with various numbers of units across the layers and choose the configuration that results in a low RMSE on test data (Fig. 1).
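Framing y_t as a function of the prior p observations, as described above, amounts to a sliding-window transformation of the series. A minimal sketch in NumPy follows; this is our own illustration with hypothetical helper names, since the chapter does not show its data preparation code.

```python
import numpy as np

def make_windows(series, p):
    """Frame a univariate series as supervised pairs: the inputs are the
    previous p values and the target is the next value (illustrative
    helper, not from the chapter)."""
    X, y = [], []
    for t in range(p, len(series)):
        X.append(series[t - p:t])  # the prior p observations
        y.append(series[t])        # the value to forecast
    return np.array(X), np.array(y)

# Example: a 10-point series with p = 3 yields 7 input/target pairs
series = np.arange(10, dtype=float)
X, y = make_windows(series, p=3)
```

Any feed-forward network with p input units and one output unit can then be trained on (X, y) and scored by RMSE on a held-out portion.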
3.2 Time-Lagged Neural Networks In the FNN, y_t is a function of the earlier "p" time steps. Another popular variant of the FNN is the TLNN. In the TLNN, the time series values at specific lags serve as the input nodes. For instance, in a conventional TLNN for time series data with a seasonal period of 12, the input nodes can be the lagged values at times t − 1, t − 2, and t − 12; the values at lags 1, 2, and 12 are used, as indicated in the figure, to forecast the value at time t [8] (Fig. 2).
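The TLNN input construction differs from the FNN's only in which lags feed the network. Building inputs from arbitrary lags such as 1, 2, and 12 can be sketched as below; the function name is our own assumption, not the chapter's code.

```python
import numpy as np

def lagged_features(series, lags):
    """TLNN-style framing: for each time t, the inputs are the values at
    the given lags (e.g. t-1, t-2, t-12) and the target is series[t]
    (illustrative helper, not from the chapter)."""
    start = max(lags)  # first t where every requested lag is available
    X = np.array([[series[t - l] for l in lags]
                  for t in range(start, len(series))])
    y = np.array(series[start:])
    return X, y

# For a monthly series with seasonal period 12, the chapter uses lags 1, 2 and 12
series = np.arange(30, dtype=float)
X, y = lagged_features(series, lags=[1, 2, 12])
```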
Fig. 2 Time-lagged artificial neural networks
3.3 Seasonal Artificial Neural Networks The SANN architecture is suggested to enhance ANNs' ability to forecast seasonal time series. In contrast to certain other conventional approaches, like SARIMA, SANN can learn the seasonal pattern in the series without deleting it. The number of input and output neurons in this model is determined by the seasonal parameter "s." In this network topology, the input neurons take the observations of the i-th seasonal interval and the output neurons produce those of the (i + 1)-th. The number of input and output neurons should therefore be taken as 12 for monthly time series and 4 for quarterly time series when forecasting with SANN. By conducting adequate tests on the training data, it is possible to establish the ideal number of hidden nodes [6].
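The SANN data framing, in which one full seasonal period of s observations predicts the entire next period, can be sketched as follows. This is our own illustration of the framing described above (s = 12 for monthly data), not the chapter's code.

```python
import numpy as np

def seasonal_pairs(series, s):
    """SANN-style framing: the input is one full seasonal period of s
    values, the target is the entire next period of s values
    (illustrative helper, not from the chapter)."""
    n_periods = len(series) // s
    periods = np.asarray(series[:n_periods * s]).reshape(n_periods, s)
    return periods[:-1], periods[1:]  # period i -> period i + 1

# Three full "years" of monthly data (s = 12) give 2 input/target pairs
series = np.arange(36, dtype=float)
X, y = seasonal_pairs(series, s=12)
```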
3.4 Long Short-Term Memory (LSTM) As discussed by Wang et al. [9], RNNs can retain key details from the input they receive, allowing them to make extremely accurate predictions about what will happen next. Because of this, they are utilized for time series data, voice recognition, text prediction, etc. The LSTM model is particularly effective for time series: it has the capacity to store long-term dependencies in memory. In an RNN/LSTM, the data iterates through a loop; before deciding, the network considers the current input as well as what it has learnt from the inputs it has previously received. Here, as indicated in the figure, the current timestep is predicted using the prior "p" timesteps [9].
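The memory mechanism described here rests on the LSTM cell's gates. A single cell update can be sketched in NumPy as below; the weights are random and purely illustrative, not the chapter's trained model, and all names are our own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM cell update: the forget, input and output gates decide
    what the cell state keeps, adds and exposes. W stacks the weights of
    all four gates (illustrative sketch, not the chapter's code)."""
    z = W @ np.concatenate([x, h_prev]) + b  # joint gate pre-activations
    d = len(h_prev)
    f = sigmoid(z[0:d])        # forget gate
    i = sigmoid(z[d:2*d])      # input gate
    o = sigmoid(z[2*d:3*d])    # output gate
    g = np.tanh(z[3*d:4*d])    # candidate cell state
    c = f * c_prev + i * g     # new cell state: the long-term memory
    h = o * np.tanh(c)         # new hidden state: the output
    return h, c

# Tiny illustration: 1 input feature, hidden size 2, random weights
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))    # 4 gates x hidden 2; inputs = 1 + hidden 2
b = np.zeros(8)
h, c = lstm_step(np.array([0.5]), np.zeros(2), np.zeros(2), W, b)
```

In a forecasting setup, this update is applied once per timestep over the prior p observations, and the final hidden state feeds a linear output layer.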
4 Time Series Analysis 4.1 Program for Best Method
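The program listing for Sect. 4.1 is not reproduced in this extraction. A minimal sketch of such a driver, which scores each fitted model's forecasts by RMSE and picks the best one, might look as follows; the function names and the stand-in forecasts are our own assumptions, not the chapter's code.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error, the performance index used in this chapter."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def best_method(actual, forecasts):
    """Given one forecast series per model, return the name with the
    lowest RMSE together with all scores (illustrative driver)."""
    scores = {name: rmse(actual, pred) for name, pred in forecasts.items()}
    return min(scores, key=scores.get), scores

# Toy illustration with two stand-in "models" (not the chapter's networks)
actual = [10.0, 12.0, 11.0, 13.0]
forecasts = {
    "naive (previous value)": [11.0, 10.0, 12.0, 11.0],
    "mean of the series":     [11.5, 11.5, 11.5, 11.5],
}
best, scores = best_method(actual, forecasts)
```

In the chapter's setting, the four dictionary entries would instead hold the test-set forecasts of the FNN, TLNN, SANN, and LSTM models.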
4.2 Method-Wise Analysis
4.2.1 Feed-Forward Neural Network
See Fig. 3.
4.2.2 Time-Lagged Neural Networks
See Fig. 4.
4.2.3 Seasonal Neural Networks
See Fig. 5.
Fig. 3 Time series analysis of FNN
Fig. 4 Time series analysis of time-lagged networks
4.2.4 Long Short-Term Memory
See Fig. 6.
Fig. 5 Seasonal neural networks time series analysis
Fig. 6 Long short-term memory
5 Forecast Accuracy Comparison and Result The results of the experiment demonstrate that the long short-term memory model and the feed-forward model did quite well in comparison to the time-lagged and seasonal neural networks. The RMSE score (Table 1) of the model created with the feed-forward network is 118.4, which is less than those of the time-lagged and seasonal neural networks. In the same way, our second-best neural model for time series is outperformed by the long short-term memory model, which provides the best RMSE score of 94.3. Unfortunately, we did not find the seasonal
Table 1 Forecast accuracy comparison

Model name                       RMSE score
Feed-forward neural model        118.4
Time-lagged neural model         126.4
Seasonal neural model            138.6
Long short-term memory model     94.3
neural network model to be a good fit for our data. Figure 3 describes time series analysis of FNN. Figure 4 displays time series analysis of time-lagged networks. Figure 5 represents seasonal neural network time series analysis. Figure 6 displays long short-term memory. Long short-term memory model performed well for our dataset.
6 Conclusion We have considered the RMSE (root mean square error) performance measurement for assessing the precision of forecasting models. It is generally agreed that more than one metric should be employed in practice in order to acquire a meaningful understanding of the overall forecasting inaccuracy. The paper includes the results of our forecasting tests, which were conducted using the dataset of Pune precipitation during 1965–2002. The RMSE performance measurements and the forecast diagrams we generated for each of the four neural network architectures demonstrate our knowledge of the studied forecasting methods and the successful implementation of those models. The actual observations and the values we predicted deviate from each other. Forecasting is a rapidly expanding field of study, offering numerous opportunities for new research in the future. One of them is the "combining approach," in which a variety of different and unrelated methodologies are combined to increase forecast accuracy. Numerous studies have been conducted in this direction, and several combination techniques have been put forth in the literature [5]. We plan to look for an effective combining model in the future, if that is at all possible, along with other research in time series forecasting.
References
1. A. Tealab, Time series forecasting using artificial neural networks methodologies: a systematic review. Futur. Comput. Inform. J. 3(2), 334–340 (2018). https://doi.org/10.1016/j.fcij.2018.10.003
2. S. Athiyarath, M. Paul, S. Krishnaswamy, A comparative study and analysis of time series forecasting techniques. SN Comput. Sci. 1(3) (2020). https://doi.org/10.1007/s42979-020-00180-5
3. F. Dube, N. Nzimande, P.-F. Muzindutsi, Application of artificial neural networks in predicting financial distress in the JSE financial services and manufacturing companies. J. Sustain. Finance Invest. 1–21 (2021). https://doi.org/10.1080/20430795.2021.2017257
4. K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989). https://doi.org/10.1016/0893-6080(89)90020-8
5. G.P. Zhang, A neural network ensemble method with jittered training data for time series forecasting. Inf. Sci. 177, 5329–5346 (2007). https://doi.org/10.1016/j.ins.2007.06.015
6. C. Hamzacebi, Improving artificial neural networks' performance in seasonal time series forecasting. Inf. Sci. 178(23), 4550–4559 (2008). https://doi.org/10.1016/j.ins.2008.07.024
7. S. Abdulkarim, A.P. Engelbrecht, Time series forecasting with feedforward neural networks trained using particle swarm optimizers for dynamic environments. Neural Comput. Appl. 33(7), 2667–2683 (2020). https://doi.org/10.1007/s00521-020-05163-4
8. O. Surakhi, M.A. Zaidan, P.L. Fung, N. Hossein, S. Serhan, M. AlKhanafseh, R.M. Ghoniem, T. Hussein, Time-lag selection for time-series forecasting using neural network and heuristic algorithm. Electronics 10(20), 2518 (2021). https://doi.org/10.3390/electronics10202518
9. W.F. Wang, X.U. Qiu, C.S. Chen, B.O. Lin, H.M. Zhang, Application research on long short-term memory network in fault diagnosis, in 2018 International Conference on Machine Learning and Cybernetics (ICMLC) (2018). https://doi.org/10.1109/icmlc.2018.8527031
A Non-Recursive Space-Efficient Blind Approach to Find All Possible Solutions to the N-Queens Problem Suklav Ghosh and Sarbajit Manna
Abstract The N-Queens problem is the problem of placing N chess queens on an NxN chessboard such that none of them attack each other. A chess queen can move horizontally, vertically, and diagonally, so the neighbors of a queen have to be placed in such a way that there is no clash in these three directions. Scientists accept the fact that the branching factor increases in a nearly linear fashion. Using artificial intelligence search patterns like Breadth First Search (BFS), Depth First Search (DFS), and backtracking algorithms, many academics have studied the problem and found a number of techniques to compute possible solutions to the N-Queens problem. The solutions using a blind approach, that is, uninformed searches like BFS and DFS, use recursion, and backtracking also uses recursion for the solution to this problem. All these recursive algorithms use the system stack, which is limited; so even for a small value of N they exhaust the memory quickly, though this depends on the machine. This paper deals with the above problem and proposes a non-recursive DFS-based approach to solve the problem and save system memory. In this work, Depth First Search (DFS) is used as a blind approach, or uninformed search. This experimental study yields a noteworthy result in terms of time and space. Keywords N-Queen's problem · Non-recursive algorithm · Iterative algorithm · DFS · Artificial intelligence
S. Ghosh (B) School of Mathematical and Computational Sciences, Indian Association for the Cultivation of Science, Kolkata, West Bengal, India e-mail: [email protected] S. Manna Department of Computer Science and Electronics, Ramakrishna Mission Vidyamandira, Howrah, West Bengal, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Bhattacharya et al. (eds.), Innovations in Data Analytics, Advances in Intelligent Systems and Computing 1442, https://doi.org/10.1007/978-981-99-0550-8_27
1 Introduction The N-Queens problem was first proposed by the famous German mathematician and physicist Carl Gauss in the mid-nineteenth century, and scientists from various fields have been studying it for more than a century. The input of the problem is the NxN chess board. The output is the configuration of the board and the total number of solutions possible. The goal of the N-Queens problem is to find an arrangement of the N queens such that none of them attack each other, where a chess queen can move horizontally, vertically, and diagonally. So, the attacking positions with respect to a queen are the positions in the same row, the same column, and the same diagonal [1–3]. Additionally, there are some values of N for which no solution is possible, for example, N = 2; that needs to be tracked as well. Numerous studies in this field have been published in the literature. The N-Queens problem is traditionally solved using backtracking algorithms, which cover all potential solutions [4]. The backtracking algorithms create the solution vector one component at a time and then test it: the criterion function is used to determine whether there is still a chance for the vector being constructed to succeed. However, the large-size N-Queens problem cannot be resolved by backtracking. Consider the following example of a 4 × 4 chess board, that is, a 4-queens problem. If a queen is placed in cell (1,1), then no other queen can be placed in the same row, that is, cells (1,2), (1,3), and (1,4); in the same column, that is, cells (2,1), (3,1), and (4,1); or on the same diagonal, that is, cells (2,2), (3,3), and (4,4), as shown in Table 1. To find a solution, a column (or a row) is fixed, and the first queen is placed. The second queen is positioned in the following column (or row), while continuing to adhere to the previous constraints.
If it is discovered that there is no "safe" place available for the next queen, then backtracking is applied: the last queen is moved to a new position before the process is continued. If all the queens are successfully placed on the board and no queens are attacking one another, then a solution has been found, and it is counted [1–3, 5]. To solve the aforementioned issue, an iterative non-recursive algorithm is employed; otherwise, the memory will be quickly used up by a recursive algorithm.

Table 1 Configuration of a 4x4 chessboard
(1,1) (1,2) (1,3) (1,4)
(2,1) (2,2) (2,3) (2,4)
(3,1) (3,2) (3,3) (3,4)
(4,1) (4,2) (4,3) (4,4)
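The attack rules illustrated by Table 1 reduce to a simple predicate when the board state stores one queen per column. The following sketch is our own Python illustration of that check, not the paper's code, and all names are our own.

```python
def is_safe(rows, new_row):
    """rows[c] gives the row of the queen already placed in column c.
    One queen per column makes same-column clashes impossible, so only
    same-row and same-diagonal clashes are checked for a queen proposed
    at (new_row, len(rows)). Illustrative sketch, not the paper's code."""
    new_col = len(rows)
    for col, row in enumerate(rows):
        if row == new_row:                            # same row
            return False
        if abs(row - new_row) == abs(col - new_col):  # same diagonal
            return False
    return True

# With a queen in row 0 of column 0 (cell (1,1) in the table's notation),
# rows 0 and 1 of the next column are under attack, while row 2 is free
print(is_safe([0], 0), is_safe([0], 1), is_safe([0], 2))
# prints: False False True
```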
The N-Queens problem has traditionally been solved using search heuristics, local searches, and conflict minimization techniques [9]. Furthermore, in the past few years, it has been thought that new Artificial Intelligence (AI) techniques and tactics are preferable to conventional methods for tackling the N-Queens problem. The N-Queens problem has several applications in the technical and practical fields of traffic control, VLSI testing, parallel photonic computing, communications, and others. This paper addresses the N-Queens problem with a non-recursive solution. Here a blind approach, or uninformed search, is utilized while conducting DFS.
2 Literature Survey Since the inception of the N-Queens problem, many techniques have been made available for its solution. An assortment of strategies for tackling the problem is presented as part of this literature survey. For a particular NxN chessboard, a variety of methods have already been designed to yield all feasible solutions. Backtracking [6–8] is among the most popular methods for resolving the N-Queens problem; it produces every workable solution methodically. The fundamental concept of backtracking techniques is to develop the solution vector a single component at a time and evaluate it against the criterion function to see if the vector still has a chance of succeeding. To solve this problem, other researchers have suggested additional effective search methods. Such strategies comprise local search, conflict minimization techniques [9], and search heuristics [6]. A stochastic method makes use of gradient-based metaheuristics; for a very large value of N, this method can produce a solution. Table 2 shows the challenges in the existing solutions, and the proposed method tries to address as many of them as possible.
3 Proposed Methodology Here the same methodology is employed as discussed in the introduction, but the trick is to unfold the recursion and solve the problem in a non-recursive manner. The recursive technique uses the program's limited stack area: if the maximum number of expected recursive calls is not known beforehand, a stack overflow can happen, and if there are more calls than the stack's allotted area can manage at one instant, a stack overflow will also result. Instead of this, the proposed methodology defines an external stack, which is implemented in the heap area that the system allots to the program at the time of execution. The heap area is
Table 2 Challenges in existing solutions

Paper: A new solution for N-queens problem using blind approaches: DFS and BFS algorithms [10]
Proposed technique: A combination of DFS and BFS is used to improve performance and runtime with respect to the classical approach
Challenges: Uses the system stack; performance and runtime are poor and sometimes infeasible beyond a certain number of queens

Paper: Review on N-queen optimization using tuned hybrid technique [11]
Proposed technique: Tries to optimize performance with respect to existing algorithms using a tuned hybrid technique
Challenges: Uses the system stack; runtime is poor even for a small number of queens

Paper: A novel approach to solving N-queens problem [12]
Proposed technique: Proposes a fast algorithm using the properties of a given series; directly places queens and finds a solution very quickly
Challenges: Finds only one solution; cannot be used to find all possible solutions even for a very small number of queens

Paper: N-queens solving algorithm by sets and backtracking [13]
Proposed technique: Uses a hybrid method to improve the performance of the traditional solution
Challenges: Performance is poor even on a machine with high computing power; runtime can be improved; poor choice of data structure in the implementation

Paper: Comparative study of different algorithms to solve N-queens problem [14]
Proposed technique: Proposes a genetic algorithm to solve the N-queens problem efficiently
Challenges: Fails for larger numbers of queens; performance is good for up to 10 queens but becomes very poor beyond that
not static and not limited, and can, if necessary, grow dynamically while the program is running. In practice, programmers need not be concerned about the external stack overflowing. The search tree is a DFS-based search tree. The proposed algorithm not only returns the total number of solutions but also shows every possible solution. The following is the algorithm for the proposed methodology.
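The described approach (a DFS over board states driven by an explicit stack rather than recursion) can be sketched in a few lines of Python. This is our own high-level rendering with names of our choosing, not the paper's implementation, which is given in C++ in Sect. 4; for brevity it only counts solutions rather than printing them.

```python
def count_solutions(n):
    """Count all N-Queens solutions with a non-recursive DFS.
    rows[c] holds the row currently tried for the queen in column c; the
    list itself acts as the external stack, so no call stack is consumed.
    (Our own sketch of the described approach, not the paper's code.)"""
    def safe(placed, r):
        c = len(placed)
        return all(r != rr and abs(r - rr) != abs(c - cc)
                   for cc, rr in enumerate(placed))

    total = 0
    rows = [0]                  # first queen: row 0 of column 0
    while rows:
        r = rows[-1]
        if r >= n:              # rows exhausted in this column: backtrack
            rows.pop()
            if rows:
                rows[-1] += 1   # advance the previous column's queen
        elif safe(rows[:-1], r):
            if len(rows) == n:  # goal test: all n queens placed safely
                total += 1
                rows[-1] += 1   # keep searching for further solutions
            else:
                rows.append(0)  # next column, starting from row 0
        else:
            rows[-1] += 1       # unsafe: try the next row in this column
    return total
```

For the classic cases this yields 2 solutions for n = 4 and 92 for n = 8, and memory use stays proportional to n.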
4 Implementation of the Proposed Solution The proposed solution is implemented using C++ on a system with the following specifications. Processor: Intel® Core™ i3-7020U CPU @ 2.30 GHz ×4. Operating System: Ubuntu 20.04 LTS. RAM: 4 GB. OS Type: 64-bit. GNOME Version: 3.36.8. Graphics: Mesa Intel® HD Graphics 620 (KBL GT2). Two C++ header files are used here: one is <iostream> and the other is <vector>.
The vector from the C++ standard template library (STL) is used. Even when storing the states, only the row numbers are stored: since there will be only one queen in each row, storing the board in matrix form would leave all the remaining cells as zeros. Storing column numbers is not necessary either, because the Nth element of the vector stands for the Nth column of the board. So storing only the row numbers suffices and gives both a time- and memory-efficient solution.

//myvec vector records the row numbers only
#include <iostream>
#include <vector>
#include <cstdlib> //for abs
using namespace std;

void solvenqueen(int n);
bool ifsafe(int n, vector<int> &myvec);

int main()
{
    int n;
    cout << "Enter the number of queens: ";
    cin >> n;
    solvenqueen(n);
    return 0;
}

bool ifsafe(int n, vector<int> &myvec)
{
    int row = myvec.back();
    int col = myvec.size() - 1;
    //If the queen is being placed outside the nxn board
    if (row >= n)
        return false;
    //Otherwise check if it is being attacked by other queens
    for (unsigned int i = 0; i < myvec.size() - 1; i++)
    {
        //If a queen is already there in the same row
        if (row == myvec[i])
            return false;
        //If it is a diagonally attacking position
        if (abs(row - myvec[i]) == abs(col - int(i)))
            return false;
    }
    return true;
}

void solvenqueen(int n)
{
    //myvec vector records the row numbers only
    vector<int> myvec;
    int row, flag = 0, total = 0;
    //total counts the total number of solutions
    //flag is used to check if all possible solutions have been found
    //The initial queen is placed
    myvec.push_back(0);
    //Loop until all solutions have been found
    while (!flag)
    {
        //If the last queen is not in a safe place
        if (!ifsafe(n, myvec))
        {
            //Two cases: 1) out of bound, 2) last position was bad
            row = myvec.back(); //Record the last row tried
            //Backtrack to the last valid position without recursion (using a loop)
            if (row >= n)
            {
                do
                {
                    if (myvec.size() > 1) //All solutions have not been found
                    {
                        myvec.pop_back();         //Last position removed
                        row = myvec.back();       //Now work with the new last position
                        myvec.pop_back();         //Since this was the last bad one, remove it and
                        myvec.push_back(row + 1); //try the next row
                        row = myvec.back();       //The do-while checks if it is out of bound
                    }
                    else //Found all solutions
                    {
                        flag = 1;
                        break;
                    }
                    //Until it comes within the bound
                } while (myvec.back() >= n);
            }
            //Second case: within bound but a bad position
            else
            {
                myvec.pop_back();
                myvec.push_back(row + 1); //Next row
            }
        }
        //If the last queen is safe
        else
        {
            //Goal test
            if ((int)myvec.size() == n)
            {
                total++;
                row = myvec.back();
                myvec.pop_back();
                myvec.push_back(row + 1);
            }
            //If not a goal, proceed with the next column's initial row
            else
                myvec.push_back(0);
        }
    }
    cout << "Total number of solutions: " << total << endl;
}