Intelligent Computing and Networking: Proceedings of IC-ICN 2021 (Lecture Notes in Networks and Systems, 301) 981164862X, 9789811648625

This book gathers high-quality peer-reviewed research papers presented at the International Conference on Intelligent Computing and Networking (IC-ICN 2021).


English Pages 288 [287] Year 2022



Table of contents :
Preface
Contents
Editors and Contributors
Performance Evaluation of an IoT Edge-Based Computer Vision Scheme for Agglomerations Detection Covid-19 Scenarios
1 Introduction
2 Related Work
2.1 IoT and Edge Applications
2.2 Algorithm of Computer Vision to Detect Objects
3 Proposed Edge Architecture
4 Computer Vision Based on Difference of Frame in Single Shot Detection Algorithm
4.1 Evaluation of Frames Based on Difference of Frame
4.2 SSD Algorithm
5 Middleware Connectivity
6 Performance Evaluation and Results Analysis
7 Conclusion and Future Works
References
Comparative Analysis of Classification and Detection of Breast Cancer from Histopathology Images Using Deep Neural Network
1 Introduction
1.1 Symptoms of Breast Cancers Include
1.2 Stages of Breast Cancer
2 Related Work
3 Literature Survey
4 Conclusion
References
Determining Key Performance Indicators for a Doctoral School
1 Introduction
2 Indicators Study for Evaluating a DS Performance: State of the Art
2.1 Concept of Performance Indicator and Research Methods
2.2 Presentation of Criteria for Evaluating a Doctoral School Performance
3 Survey Procedures and Analysis of Results
3.1 Data Collection
3.2 Results Analysis
4 Determination of Performance Indicators for a Doctoral School
4.1 Doctoral School Governance
4.2 Satisfaction with Material and Immaterial Infrastructures
4.3 Results of the Defenses and Prospects of a Doctoral School
5 Discussion
6 Conclusion and Future Works
References
Cloud Attacks and Defence Mechanism for SaaS: A Survey
1 Introduction
2 Cloud Issues Attack and Defence Mechanism
2.1 SaaS Security Issues
2.2 Cloud Attacks and Defence Mechanism for SaaS
3 Conclusion and Future Work
References
“Palisade”—A Student Friendly Social Media Website
1 Introduction
2 Related Study
2.1 Security in Social Media
2.2 Disadvantages of Existing System
2.3 Proposed Model
2.4 Objectives
3 The Architecture of the Project
3.1 Methods of Implementation
3.2 Test Cases and Scenarios
3.3 How to Run MERN Application and Use It:
4 Results and Discussion
5 Conclusion and Future Scope
References
Comparative Review of Content Based Image Retrieval Using Deep Learning
1 Introduction
2 Research Background
3 Comparison of Existing Methods
4 Experimentation
4.1 Decision Tree
4.2 Feed Forward Neural Network
4.3 Convolutional Neural Network
5 Results and Evaluation
6 Conclusion
References
Fuzzy-Logic Approach for Traffic Light Control Based on IoT Technology
1 Introduction
2 Related Work
3 Proposed Approach
3.1 Design of Fuzzy Inference System
3.2 Fuzzy Inference Engine
4 Implementation and Initial Results
5 Conclusion
References
Data Clustering Algorithms: Experimentation and Comparison
1 Introduction
1.1 Working of Data Mining Algorithm
2 What is Clustering?
3 Clustering Algorithms
3.1 Partition Based Method
3.2 Hierarchical-Based Method
3.3 Density-Based Method:
3.4 Grid-Based Method
3.5 Model-Based Method
4 Algorithms
4.1 K-Means Algorithm
4.2 Agglomerative Clustering Algorithm
5 About the Dataset
5.1 First Dataset
5.2 Second Dataset
6 Validity Measures
7 Result and Discussion
7.1 K-Means Algorithm
References
Design and Development of Clustering Algorithm for Wireless Sensor Network
1 Introduction
1.1 Enhancement of an Algorithm for WSN
2 Ease of Use
2.1 K-means Generalization Benefits
2.2 Modified K-means (Initialization Method)
2.3 Clustering in WSN
3 Abbreviations and Acronyms
3.1 Equations
3.2 Complexity
4 Proposed Methodology
4.1 Dataset
4.2 Implementation
4.3 Result and Discussion
5 Conclusion
References
Mitigate the Side Channel Attack Using Random Generation with Reconfigurable Architecture
1 Introduction
2 Cryptography
3 Chaotic Circuit
4 Results and Comparison Analysis
5 Conclusion
References
A Statistical Review on Covid-19 Pandemic and Outbreak
1 Introduction
1.1 The Novel Coronavirus or COVID-19
1.2 Formation of COVID 19
1.3 Symptoms
1.4 Treatment for outbreak of virus COVID-19
2 Literature Survey
3 Statistical Analysis
4 Results and Discussion
References
Performance Evaluation of Secure Web Usage Mining Technique to Predict Consumer Behaviour (SWUM-PCB)
1 Introduction
2 Literature Review
3 Apriori Algorithm
4 Proposed Architecture
5 Proposed Algorithm
6 Experimental Results and Discussion
7 Conclusion
References
Quantum Computing and Machine Learning: In Future to Dominate Classical Machine Learning Methods with Enhanced Feature Space for Better Accuracy on Results
1 Introduction
1.1 Quantum Computing and Concepts
2 Quantum Enhanced ML
3 Quantum ML Research Domains
3.1 Survey on Quantum Literature
3.2 Comparison of Classical and Quantum Models
3.3 Motivation of Quantum Computing in ML
3.4 Platform Used for Proposed Implementation
4 Dataset Selection and Implementation
5 Conclusion
References
Design and Develop Data Analysis and Forecasting of the Sales Using Machine Learning
1 Introduction
2 Problem Statement
3 Proposed Methodology
3.1 Exploratory Data Analysis (EDA)
3.2 RFM (Recency, Frequency, Monetary) Model
3.3 Market Basket Analysis (MBA)
3.4 Time Series Forecasting
3.5 SARIMA (Seasonal Autoregressive Integrated Moving Average)
4 Proposed Algorithm
5 Conclusion
References
Prediction of Depression Using Machine Learning and NLP Approach
1 Introduction
2 Related Work
3 Problem Statement
4 Proposed System
5 Proposed Methodology Architecture
6 Results
7 Conclusion
References
Detection and Performance Evaluation of Online-Fraud Using Deep Learning Algorithms
1 Introduction
2 Literature Survey
3 Problem Statement
4 Data Collection and Visualization
5 Proposed System
6 Model Evaluation and Result
7 Conclusion
References
Data Compression and Transmission Techniques in Wireless Adhoc Networks: A Review
1 Introduction
2 Review on Recent Works for Lossless Compression
3 Conclusion
References
Message Propagation in Vehicular Ad Hoc Networks: A Review
1 Introduction
2 Survey on Recent Works for Vehicular Adhoc Networks
3 Conclusion
References
A Comparative Study of Clustering Algorithm
1 Introduction
2 Related Work
3 Proposed Work
3.1 Technology and Tools
3.2 Dataset
3.3 Clustering
3.4 Cluster Formation Methods
3.5 Clustering Algorithms:
3.6 Different Comparison Metrics
3.7 Comparing the Clustering Quality Measure
4 Implementation
4.1 K-means Algorithm
4.2 Hierarchical Clustering
4.3 DBSCAN Algorithm
5 Result
6 Future Scope
7 Conclusion
References
Refactoring Faces Under Bounding Box Using Instance Segmentation Algorithms in Deep Learning for Replacement of Editing Tools
1 Introduction
2 Methodology
3 Implementation
4 Conclusion
References
Motion Detection and Alert System
1 Introduction
2 Related Study
2.1 Advantages
2.2 Disadvantages
3 Proposed Model
3.1 Proposed Model Objectives
3.2 Proposed Model Outcomes
3.3 Proposed Model Advantages
4 Experimental Setup
4.1 Sensors from Which Data is Taken
4.2 Taking Video Input
4.3 Motion Detection
4.4 Sending Mails
4.5 Local Storage
5 Proposed Algorithm
6 Experimental Procedure
6.1 Pre-processing
6.2 Simple Thresholding
6.3 Smoothing
6.4 Finding Contour
6.5 Contour Approximation
6.6 Contour Area
6.7 Drawing Square Shapes and Output
6.8 Sending Mails
6.9 Experimental Data Used
6.10 Control Data
7 Results and Discussions
7.1 Existing System Outcomes
7.2 Proposed System Outcomes
8 Future Scope
9 Conclusion
References
Feasibility Study for Local Hyperthermia of Breast Tumors: A 2D Modeling Approach
1 Introduction
2 Mathematical Modeling and Analysis for Hyperthermia
2.1 Equations for Effective Temperature Distribution
2.2 Equations for Computational Optimization
2.3 Heat Flow Modeling for Breast Tumor
3 Results and Discussion
4 Conclusion
References
Author Index

Lecture Notes in Networks and Systems 301

Valentina Emilia Balas Vijay Bhaskar Semwal Anand Khandare   Editors

Intelligent Computing and Networking Proceedings of IC-ICN 2021

Lecture Notes in Networks and Systems Volume 301

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at https://link.springer.com/bookseries/15179

Valentina Emilia Balas · Vijay Bhaskar Semwal · Anand Khandare Editors

Intelligent Computing and Networking Proceedings of IC-ICN 2021

Editors Valentina Emilia Balas Department of Automatics and Applied Software Aurel Vlaicu University of Arad Arad, Arad, Romania

Vijay Bhaskar Semwal Department of Computer Science and Engineering National Institute of Technology Bhopal, India

Anand Khandare Department of Computer Engineering Thakur College of Engineering and Technology Mumbai, India

ISSN 2367-3370  ISSN 2367-3389 (electronic)
Lecture Notes in Networks and Systems
ISBN 978-981-16-4862-5  ISBN 978-981-16-4863-2 (eBook)
https://doi.org/10.1007/978-981-16-4863-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

The “International Conference on Intelligent Computing and Networking (IC-ICN-2021)”, sponsored by the All India Council for Technical Education (AICTE), is a global platform with the objective of strengthening the research culture by bringing together academicians, scientists, and researchers in the domain of Intelligent Computing and Networking. IC-ICN-2021, the 12th annual event in the series of international conferences organized by Thakur College of Engineering and Technology (TCET) under the umbrella of MULTICON since the first event, the International Conference and Workshop on Emerging Trends in Technology (ICWET-2010), was conducted online on 26th and 27th February 2021. The IC-ICN-2021 event was organized with the aim of providing not only a great platform to think innovatively but also of bringing theory and applications in the field of Intelligent Computing and Networking into sync for students, faculty, scientists, and researchers from industry as well as research scholars. This platform is an effective link for students, authors, and researchers to collaborate and to enhance their network with peer universities and institutions in India and abroad in the respective domain. The basic aim is to hold a conference where the participants present their research papers, technical papers, case studies, best and innovative practices, and engineering concepts and designs so that the applied study or research can be absorbed into the real world. Technological development in the domain of Intelligent Computing and Networking is the need of the hour; it will simplify our life in an eco-friendly environment with better connectivity and security, and this conference facilitates the pathway to that purpose. Beyond inculcating the research culture, IC-ICN 2021 gained wide publicity through its website and social media coverage as well as vigorous promotion by the team of faculty members to various colleges.
IC-ICN-2021 has an affiliation with a Scopus-indexed journal for intelligent systems, the leading publication house Springer, Tata McGraw Hill, IOSR, and a conference proceeding with an ISBN number. TCET's efforts in making the event successful have been applauded, and the institute is appreciated for its sound belief in building strong relationships and bonding by taking apt care of each participant's requirements throughout the event. The two-day event comprised four conferences and three workshops with multiple tracks. During these two days, there were 150 presentations by national as well as international researchers and industrial personnel. Idea presentations with deliberation by the delegates were also part of the event. We are grateful to all the members of the organizing and editorial committee for supporting the event and extending their cooperation to make it a grand success.

Arad, Romania
Bhopal, India
Mumbai, India

Valentina Emilia Balas
Vijay Bhaskar Semwal
Anand Khandare

Contents

Performance Evaluation of an IoT Edge-Based Computer Vision Scheme for Agglomerations Detection Covid-19 Scenarios . . . 1
Werner Augusto A. N. da Silveira, Samuel B. Mafra, Joel J. P. C. Rodrigues, Mauro A. A. da Cruz, and Eduardo H. Teixeira

Comparative Analysis of Classification and Detection of Breast Cancer from Histopathology Images Using Deep Neural Network . . . 13
Pravin Malve and Vijay Gulhane

Determining Key Performance Indicators for a Doctoral School . . . 24
Aminata Kane, Karim Konate, and Joel J. P. C. Rodrigues

Cloud Attacks and Defence Mechanism for SaaS: A Survey . . . 43
Akram Harun Shaikh and B. B. Meshram

“Palisade”—A Student Friendly Social Media Website . . . 53
Nithin Katla, M. Goutham Kumar, Rohithraj Pidugu, and S. Shitharth

Comparative Review of Content Based Image Retrieval Using Deep Learning . . . 63
Juhi Janjua and Archana Patankar

Fuzzy-Logic Approach for Traffic Light Control Based on IoT Technology . . . 75
Guan Hewei, Ali Safaa Sadiq, and Mohammed Adam Tahir

Data Clustering Algorithms: Experimentation and Comparison . . . 86
Anand Khandare and Rutika Pawar

Design and Development of Clustering Algorithm for Wireless Sensor Network . . . 100
Pooja Ravindrakumar Sharma and Anand Khandare

Mitigate the Side Channel Attack Using Random Generation with Reconfigurable Architecture . . . 111
A. E. Sathis Kumar and Babu Illuri

A Statistical Review on Covid-19 Pandemic and Outbreak . . . 124
Sowbhagya Hepsiba Kanaparthi and M. Swapna

Performance Evaluation of Secure Web Usage Mining Technique to Predict Consumer Behaviour (SWUM-PCB) . . . 136
Sonia Sharma and Dalip

Quantum Computing and Machine Learning: In Future to Dominate Classical Machine Learning Methods with Enhanced Feature Space for Better Accuracy on Results . . . 146
Mukta Nivelkar and S. G. Bhirud

Design and Develop Data Analysis and Forecasting of the Sales Using Machine Learning . . . 157
Vinod Kadam and Sangeeta Vhatkar

Prediction of Depression Using Machine Learning and NLP Approach . . . 172
Amrat Mali and R. R. Sedamkar

Detection and Performance Evaluation of Online-Fraud Using Deep Learning Algorithms . . . 182
Anam Khan and Megharani Patil

Data Compression and Transmission Techniques in Wireless Adhoc Networks: A Review . . . 194
V. Vidhya and M. Madheswaran

Message Propagation in Vehicular Ad Hoc Networks: A Review . . . 207
G. Jeyaram and M. Madheswaran

A Comparative Study of Clustering Algorithm . . . 219
Khyaati Shrikant, Vaishnavi Gupta, Anand Khandare, and Palak Furia

Refactoring Faces Under Bounding Box Using Instance Segmentation Algorithms in Deep Learning for Replacement of Editing Tools . . . 236
Raunak M. Joshi and Deven Shah

Motion Detection and Alert System . . . 248
M. D. N. Akash, CH. Mahesh Kumar, G. Bhageerath Chakravorthy, and Rajanikanth Aluvalu

Feasibility Study for Local Hyperthermia of Breast Tumors: A 2D Modeling Approach . . . 260
Jaswantsing Rajput, Anil Nandgaonkar, Sanjay Nalbalwar, Abhay Wagh, and Nagraj Huilgol

Author Index . . . 273

Editors and Contributors

About the Editors

Valentina Emilia Balas is currently Full Professor at “Aurel Vlaicu” University of Arad, Romania. She is the author of more than 300 research papers. Her research interests are in Intelligent Systems, Fuzzy Control, and Soft Computing. She is Editor-in-Chief of the International Journal of Advanced Intelligence Paradigms (IJAIP) and of IJCSE. Dr. Balas is a member of EUSFLAT and ACM, a Senior Member of IEEE, a member of TC-EC and TC-FS (IEEE CIS) and TC-SC (IEEE SMCS), and Joint Secretary of FIM.

Vijay Bhaskar Semwal has been working as Assistant Professor (CSE) at NIT Bhopal since 5 February 2019. Before joining NIT Bhopal he was working at NIT Rourkela. He also worked with IIIT Dharwad as Assistant Professor (CSE) for 2 years (2016–2018) and as Assistant Professor (CSE) at NIT Jamshedpur. He earned his doctorate degree in robotics from IIIT Allahabad (2017), M.Tech. in Information Technology from IIIT Allahabad (2010), and B.Tech. (IT) from College of Engineering Roorkee (2008). His areas of research are Bipedal Robotics, Gait Analysis and Synthesis, Artificial Intelligence, Machine Learning, and Theoretical Computer Science. He has published more than 15 SCI research papers. He received an Early Career Research Award from DST-SERB under the Government of India.

Dr. Anand Khandare is Dy. HOD, Computer Engineering, Thakur College of Engineering and Technology, Mumbai, with 15 years of teaching experience. He completed his Ph.D. in Computer Science and Engineering in the domain of Data Clustering in Machine Learning from Sant Gadge Baba Amravati University. He has 50+ publications in national and international conferences and journals, 1 copyright, and 2 patents. He has guided various research and funded projects. He worked as a volume editor for the Springer conference Intelligent Computing and Networking 2020 and is also a reviewer for various journals and conferences.


Contributors

M. D. N. Akash Department of CSE, Vardhaman College of Engineering, Hyderabad, India
Rajanikanth Aluvalu Department of CSE, Vardhaman College of Engineering, Hyderabad, India
S. G. Bhirud Veermata Jijabai Technological Institute, Mumbai, India
G. Bhageerath Chakravorthy Department of CSE, Vardhaman College of Engineering, Hyderabad, India
Mauro A. A. da Cruz National Institute of Telecommunications (Inatel), Santa Rita do Sapucaí, Brazil
Dalip Department of MMICT&BM, Maharishi Markandeshwar Deemed to be University, Mullana (Ambala), Ambala, Haryana, India
Werner Augusto A. N. da Silveira National Institute of Telecommunications (Inatel), Santa Rita do Sapucaí, Brazil
Palak Furia Department of Computer Science, Thakur College of Engineering and Technology, Mumbai, India
M. Goutham Kumar Department of Computer Science and Engineering, Vardhaman College of Engineering, Hyderabad, Telangana, India
Vijay Gulhane Lecturer, Department of Information Technology, Sipna College of Engineering, Amravati, Maharashtra, India
Vaishnavi Gupta Department of Computer Science, Thakur College of Engineering and Technology, Mumbai, India
Guan Hewei School of Information Technology, Monash University, Monash, Malaysia
Nagraj Huilgol Dr. Balabhai Nanavati Hospital, Mumbai, India
Babu Illuri Department of Electronics and Communication Engineering, Vardhaman College of Engineering, Hyderabad, Telangana, India
Juhi Janjua Department of Computer Engineering, Thadomal Shahani Engineering College, Mumbai, Maharashtra, India
G. Jeyaram Department of Computer Science and Engineering, M.E.T Engineering College, Kanyakumari District, India
Raunak M. Joshi Thakur College of Engineering and Technology, Mumbai, MH, India
Vinod Kadam Thakur College of Engineering and Technology, Mumbai, Kandivali(E), India

Sowbhagya Hepsiba Kanaparthi Vardhaman College of Engineering, Kacharam, Shamshabad, R.R. District, India
Aminata Kane Department of Mathematics and Computer Science, Cheikh Anta DIOP University of Dakar, Dakar, Senegal
Nithin Katla Department of Computer Science and Engineering, Vardhaman College of Engineering, Hyderabad, Telangana, India
Anand Khandare Department of Computer Science, Thakur College of Engineering and Technology, Mumbai, India; Department of Computer Engineering, Thakur College of Engineering & Technology Mumbai University, Mumbai, Maharashtra, India
Anam Khan Thakur College of Engineering and Technology, Mumbai University, Mumbai, India
Karim Konate Department of Mathematics and Computer Science, Cheikh Anta DIOP University of Dakar, Dakar, Senegal
CH. Mahesh Kumar Department of CSE, Vardhaman College of Engineering, Hyderabad, India
M. Madheswaran Department of Electrical and Communication Engineering, Muthayammal Engineering College, Namakkal, India
Samuel B. Mafra National Institute of Telecommunications (Inatel), Santa Rita do Sapucaí, Brazil
Amrat Mali Thakur College of Engineering and Technology, Mumbai University, Mumbai, India
Pravin Malve Lecturer, Department of Computer Engineering, Government Polytechnic, Arvi, Maharashtra, India
B. B. Meshram Department of Computer Engineering, Veermata Jijabai Technological Institute (VJTI), Mumbai, India
Sanjay Nalbalwar Dr. Babasaheb Ambedkar Technological University, Lonere, India
Anil Nandgaonkar Dr. Babasaheb Ambedkar Technological University, Lonere, India
Mukta Nivelkar Veermata Jijabai Technological Institute, Mumbai, India
Archana Patankar Department of Computer Engineering, Thadomal Shahani Engineering College, Mumbai, Maharashtra, India
Megharani Patil Thakur College of Engineering and Technology, Mumbai University, Mumbai, India

Rutika Pawar Department of Computer Engineering, Thakur College of Engineering and Technology, Mumbai, India
Rohithraj Pidugu Department of Computer Science and Engineering, Vardhaman College of Engineering, Hyderabad, Telangana, India
Jaswantsing Rajput Dr. Babasaheb Ambedkar Technological University, Lonere, India
Joel J. P. C. Rodrigues Federal University of Piauí (UFPI), Teresina, PI, Brazil; Instituto de Telecomunicações, Aveiro, Portugal
Ali Safaa Sadiq School of Mathematics and Computer Science, University of Wolverhampton, Wolverhampton, UK
A. E. Sathis Kumar Department of Electronics and Communication Engineering, Vardhaman College of Engineering, Hyderabad, Telangana, India
R. R. Sedamkar Thakur College of Engineering and Technology, Mumbai University, Mumbai, India
Deven Shah Thakur College of Engineering and Technology, Mumbai, MH, India
Akram Harun Shaikh Department of Computer Engineering, Veermata Jijabai Technological Institute (VJTI), Mumbai, India
Pooja Ravindrakumar Sharma Department of Computer Engineering, Thakur College of Engineering & Technology Mumbai University, Mumbai, Maharashtra, India
Sonia Sharma Department of MMICT&BM, Maharishi Markandeshwar Deemed to be University, Mullana (Ambala), Ambala, Haryana, India
Khyaati Shrikant Department of Computer Science, Thakur College of Engineering and Technology, Mumbai, India
M. Swapna Vardhaman College of Engineering, Kacharam, Shamshabad, R.R. District, India
Mohammed Adam Tahir Technology Sciences, Zalingei University, Zalingei, Sudan
Eduardo H. Teixeira National Institute of Telecommunications (Inatel), Santa Rita do Sapucaí, Brazil
Sangeeta Vhatkar Thakur College of Engineering and Technology, Mumbai, Kandivali(E), India
V. Vidhya Department of Computer Science, M.E.T Engineering College, Kanyakumari District, India
Abhay Wagh Directorate of Technical Education, Mumbai, Maharashtra, India

Performance Evaluation of an IoT Edge-Based Computer Vision Scheme for Agglomerations Detection Covid-19 Scenarios

Werner Augusto A. N. da Silveira1(B), Samuel B. Mafra1, Joel J. P. C. Rodrigues2,3, Mauro A. A. da Cruz1, and Eduardo H. Teixeira1

1 National Institute of Telecommunications (Inatel), Santa Rita do Sapucaí, Brazil
[email protected], [email protected], [email protected]
2 Federal University of Piauí (UFPI), Teresina, PI, Brazil
[email protected]
3 Instituto de Telecomunicações, Aveiro, Portugal

Abstract. Edge architectures have emerged as a solution for the development of Internet of Things (IoT) applications, especially in scenarios with ultra-low system latency requirements and a huge amount of data transmitted over the network. This architecture aims to decentralize systems and extend cloud resources to devices located at the edge of networks. Various benefits regarding local processing, lower latency, and better communication bandwidth can be highlighted. This study proposes an edge architecture that uses computer vision to detect people in agglomerations. To evaluate the performance of the proposed architecture, a use-case for agglomeration detection in Covid-19 scenarios is presented. A comparative analysis of the detection is performed on videos from a public database. The obtained results demonstrate a gain in terms of computational performance of the video analysis in comparison to the best solutions available in the literature. The proposed solution can be a powerful edge tool to support the combat against the Covid-19 Pandemic.

Keywords: Edge architecture · Internet of Things · Computer vision · Covid-19

1 Introduction

Internet of Things (IoT) has gained great attention due to the proliferation of a huge quantity of data generated by connected devices, which allows the monitoring and processing of sensor data anywhere and at any time. The internet has revolutionized the world by offering global connectivity [1], and IoT is set to sustain significant change in the evolution known as the next generation of the internet. Moreover, the combination of IoT and different technologies has contributed to raising the Quality of Service (QoS) of several modern applications, such as smart homes, connected cars, smart cities, IoT in agriculture, energy management, and IoT health care. The internet is the most important and powerful creation made by humans, and the use of IoT-based projects is taking us on a journey whose main goal is transforming an essentially [2].

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_1

The concept of edge architecture was introduced in order to mitigate this issue. Data processing is realized at the edge of the network, employing the resources available locally, and only the result of the processing is sent to the cloud. Therefore, it drastically reduces network resource occupancy, because only a small fraction of all existing data is transmitted [3]. For example, a hypothetical application where a smartphone records a video to detect an object is presented to highlight the benefits of edge processing. This device uses its own local camera to record a video, applies filters to the images, performs a detection algorithm, and only then reports the result to the cloud [4]. In this example, it is possible to save energy resources by reducing the amount of data and computing power, and to obtain the result much faster in comparison to a cloud architecture. Due to the massive production of IoT devices, the world has just arrived in the post-Cloud era [5], in which the devices located at the edge of networks consume these data and report only status to the cloud servers. Some applications allow their information to be transmitted and processed by a server in the cloud, but other applications, such as video processing, produce a large amount of data and require very short response times. Also, these applications may contain private data that need to be processed before transmission to the cloud. Cloud architectures are not recommended for applications in which uploading and processing a huge quantity of data could lead to long operation times due to transmission latency.

The year 2020 was an atypical year because the world faced a new and huge threat, the Covid-19 Pandemic, as named by the World Health Organization (WHO). In September 2020, the numbers had risen to over 27 million confirmed cases, and the number of deaths had exceeded 900 thousand.
The American continent leads in both confirmed cases and deaths, which is concerning. Brazil is the third most affected country after the United States and India [6]; because of this, many cities have been locked down due to the rapidly spreading virus [7, 8]. Technology has taken an important role in the mission of combating this virus through the creation of new mechanisms to control the spread of Covid-19. Significant efforts are being directed at the available healthcare facilities and treatment systems, and many advanced solutions are appearing in order to reduce the problems related to the pandemic [9, 10]. In this context, the main objective of this work is to carry out a performance evaluation of a computer vision algorithm that detects agglomerations using an edge architecture responsible for sending alert messages about people's proximity to the authorities, with the objective of avoiding Covid-19 spread. The proposed approach aims to minimize the network resources needed to process the images in the cloud, given the huge quantity of data produced by surveillance cameras. This is achieved by shifting the resources hosted on the servers toward the edge of the network. Moreover, a motivation for deploying the new computer vision algorithm comes from the need to achieve high accuracy in detecting people with equipment of low computational power. The most important contributions of this study are:
• The benefits of reducing the amount of transmitted data by processing at the edge of the network, in contrast to sending all data to a cloud server to be processed;

Performance Evaluation of an IoT Edge-Based Computer Vision


• The deployment of a new computer vision algorithm for detecting people in real-time applications on equipment with low computational power;
• A comparative analysis between state-of-the-art object detection algorithms and the proposed scheme;
• A performance assessment of the proposed solution, which can serve as a basis for the development of other systems in Covid-19 scenarios.
The remainder of this paper is organized as follows. The most relevant works on the topic are discussed in Sect. 2. Section 3 presents the edge scheme used to deploy the proposed system and its benefits. Section 4 describes the proposed algorithm, explaining the motivation for its usage and providing details of its deployment. Section 5 presents the middleware communication and demonstrates how data are exchanged end-to-end. In Sect. 6 the experimental results are presented, and the work is concluded in Sect. 7, including suggestions for further work.

2 Related Work

2.1 IoT and Edge Applications

Weisong Shi et al. introduce the concept of Edge Computing in [11], where its main applications are discussed according to the metrics of computing power, latency, battery life of mobile devices, bandwidth cost, security, and privacy. According to Shi, Cloud Computing is a centralized architecture that receives data from all connected devices and processes the information. In contrast, Edge Computing is a decentralized architecture that uses resources locally, so that computing happens nearer the data sources and only the results are transmitted to the cloud. Shi also describes two examples optimized by an Edge Computing implementation: online shopping, and searching for a missing child through surveillance cameras. In [12], Zhao et al. also propose an IoT-based system containing heterogeneous interconnected devices on the edge server, but the main goal of their work is the routing of data packets. Zhao et al. exploit traffic diversity and wireless diversity to increase network throughput through a three-phase approach: (i) discretization of the whole IoT network into small networks and definition of the candidate routes; (ii) analysis of each candidate route using the metrics of performance gain, link quality, and link correlation; (iii) a deployment algorithm that focuses on optimizing the utility metric. The experimental results show a great improvement in packet routing and, consequently, in throughput. Ksentini and Brik in [13] propose a new edge scheme to detect social distancing, which consists of sending the geolocation coordinates from users' smartphones to an application sitting at the edge. The Euclidean distance among everyone located under the edge coverage is calculated, and the users who are breaking the social-distance limit are warned by an app on their smartphones.
A case study utilizing the European Telecommunications Standards Institute (ETSI) multi-access edge computing (MEC) ecosystem is presented. Important parameters for an edge scheme, such as low latency, near-real-time reaction, privacy, and anonymity, are highlighted. The obtained results show that the proposed scheme is an efficient option for informing people of their distance from each other.


W. A. A. N. da Silveira et al.

2.2 Algorithms of Computer Vision to Detect Objects

Several object detection algorithms have been proposed in the literature to improve device performance and minimize human interaction in this process, among which R-CNN, Yolo, and SSD can be mentioned. In [14], an algorithm called Regions with Convolutional Neural Network (R-CNN) is presented. It segments each frame into 2000 regions found through selective search and applies a convolutional neural network (CNN) to each region to identify the target. However, low performance was identified due to the processing of multiple regions [15, 16], and an improvement was made by feeding the full frame into the CNN instead of segments. The result is a feature map used to make the target prediction, and the method was called Fast R-CNN [16]. To make this algorithm even more efficient, a further improvement replaced selective search with another CNN [17]. This new algorithm received the name Faster R-CNN. Yolo (You Only Look Once) was developed by Joseph Redmon et al. [15, 18] in 2015. Yolo analyzes a full image to perform object detection, a task that determines the location of the target in the image and classifies it according to a class. Yolo "only looks once" at the image in the sense that it requires only one forward propagation pass through the CNN to make the detection. A non-max suppression stage is then executed, and the detected objects are shown together with bounding boxes at the output of the process. Redmon et al. [19] enhanced the Yolo algorithm to detect over 9000 object classes by jointly optimizing the detection and classification stages. Another enhancement of Yolo, called YoloV3, is shown in [20]: Redmon and Farhadi propose a new network for feature extraction through a CNN of 53 layers. YoloV3-Tiny is a variant of YoloV3 which takes less time to run, but its results are less accurate. In [21], Wei Liu et al. propose the Single Shot Multibox Detector (SSD) algorithm to detect targets using a full frame. Its operation is based on the extraction of features through several CNN layers in order to identify the content of the images. Its accuracy and processing rate are higher than Yolo's and lower than Faster R-CNN's. The SSD algorithm will be further explained in Sect. 4.2. The present work adopts the SSD algorithm because it requires less processing than R-CNN and Yolo and can consequently be used on embedded hardware with low computational power. There are relevant contributions in the literature related to edge/cloud architectures, computer vision algorithms, and technology for Covid-19 scenarios. These works are considered in the Edge Architecture to detect agglomerations proposed in this study.

3 Proposed Edge Architecture

This section presents the proposed Edge Architecture for detecting agglomerations in Covid-19 scenarios through the use of a computer vision algorithm. Figure 1 illustrates the whole architecture, composed of devices at the edge of the network, the cloud, and the user application. The largest computational effort is located at the edge, where the videos are captured from surveillance cameras, the images are processed, and the people contained in


each image are checked. Then, the SSD Algorithm is run to determine the location and the confidence of the detections for the people previously checked. Finally, the number of individuals and the distances among them are calculated. The information is sent to the cloud server and then to the monitoring center. The data are exchanged through the well-known Message Queuing Telemetry Transport (MQTT) Protocol [22].

Fig. 1 Overview of the proposed edge architecture to detect agglomeration

When any distance is less than one meter, a warning event is sent to indicate an agglomeration, with information about geolocation and the number of people. This information allows the responsible person to monitor locations in order to mitigate the risk of contamination. The main advantages of the proposed Edge Architecture are the local processing of videos [23] and reduced energy consumption, latency, and bandwidth usage [24].
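The warning logic just described reduces to a pairwise Euclidean-distance check over the detected people's positions. A minimal sketch (the helper below is illustrative and assumes the centre coordinates have already been converted from pixels to metres):

```python
from itertools import combinations
from math import hypot

def find_agglomerations(centers, limit_m=1.0):
    """Return index pairs of people closer than `limit_m` metres.

    `centers` is a list of (x, y) coordinates in metres, e.g. the
    centre points of the detected bounding boxes after a
    pixel-to-metre conversion.
    """
    warnings = []
    for (i, a), (j, b) in combinations(enumerate(centers), 2):
        if hypot(a[0] - b[0], a[1] - b[1]) < limit_m:
            warnings.append((i, j))
    return warnings

# Three people: persons 0 and 1 are 0.5 m apart, person 2 is far away.
print(find_agglomerations([(0.0, 0.0), (0.5, 0.0), (10.0, 10.0)]))
# → [(0, 1)]
```

Each flagged pair would then be reported to the monitoring center together with the camera's geolocation.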

4 Computer Vision Based on Difference of Frame in Single Shot Detection Algorithm

In this section, an enhancement of the SSD Algorithm is proposed, named Computer Vision based on Difference of Frame in Single Shot Detection Algorithm (CVDF-SSD). First, the images are captured from surveillance cameras in the RGB color domain and a pre-processing step evaluates whether there are people in the current frame. Based on this analysis, running a detection algorithm on every frame can be avoided. This step is very important, as it increases the processing rate on the Raspberry Pi without losing accuracy. Then, the SSD Algorithm is applied to analyze the frame, detect people, and count the number of people in the images. When this detection is concluded, a rectangle is drawn around each person and its central coordinate (x, y) is identified. Finally, the distances among people are calculated. When there is a proximity of less than one meter, the respective distance is highlighted, which may indicate a potential risk of Covid-19 contamination. The structure of CVDF-SSD is discussed in detail as follows. Figure 2 describes the flow of the whole system.

Fig. 2 Flow chart of the end-to-end solution

4.1 Evaluation of Frames Based on Difference of Frame

This stage has the objective of enhancing the processing rate through a selective application of the SSD Algorithm. The current and previous frames are analyzed to trigger person detection through deep learning. This is an important requirement for IoT equipment with limited computational power. Initially, the video is analyzed and the frames are captured in the RGB (Red, Green and Blue) color domain. For example, black is RGB (0,0,0), white is RGB (255,255,255), yellow is RGB (180,180,0) and violet is RGB (180,0,180). Another option is using the gray color domain, where the three colors of the RGB domain have the same intensity and are processed as a single-dimension array. The gray scale starts at black, represented by RGB (0,0,0), goes through several shades of gray, and reaches saturation at white, represented by RGB (255,255,255). The equation to convert the RGB domain to the gray domain is shown in (1), where Gray, R, G and B are matrices of the same dimensions:

Gray = 0.299R + 0.587G + 0.114B.    (1)
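Equation (1) can be implemented as a single vectorized operation; a minimal NumPy sketch (in OpenCV the same conversion is provided by cv2.cvtColor):

```python
import numpy as np

def rgb_to_gray(frame):
    """Eq. (1): Gray = 0.299R + 0.587G + 0.114B, applied element-wise.
    `frame` is an HxWx3 uint8 RGB image; the result is an HxW uint8 image."""
    r, g, b = (frame[..., i].astype(np.float64) for i in range(3))
    return (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)

green = np.zeros((2, 2, 3), dtype=np.uint8)
green[..., 1] = 255                 # pure green pixels
print(rgb_to_gray(green)[0, 0])     # 149 (0.587 * 255, truncated)
```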

Despite the benefit of the gray color domain, converting from RGB to gray takes additional time. Therefore, to reduce the processing time of this task, the RGB frames are used directly and the mathematical operation of the absolute difference between the current and previous frames is performed. The result is a predominantly black frame which has colors only in the areas of variation. This approach is faster because the absolute-difference operation is simpler than the conversion to the gray domain, so the hardware processes it more quickly. Equation (2) shows the difference of the frames:

D = |C − P|.    (2)

where D, C and P represent the difference-of-frames matrix, the current frame matrix, and the previous frame matrix, respectively. Noise cancellation is then executed to mitigate false detections of people and avoid unnecessary processing time. It starts with background subtraction [25], which eliminates the static background parts and leaves only the mobile foreground parts. Then, the remaining colored regions are converted to a black-and-white image through an operation called binarization. It consists of defining a reference color threshold, in which the brightest colors are changed to white and the darkest colors are changed to black. The threshold used is RGB (100,100,100). Finally, a morphology operation on the binarized frame cancels the remaining noise: it reduces white points in the black background and black points within the detected targets, and exhibits the people well defined in the frame.

4.2 SSD Algorithm

The Single Shot Detection (SSD) Algorithm [21, 26] divides the task of people detection into two distinct stages: classifying the image into classes and detecting the objects in the image.

Classify stage [27]: The image is classified among one thousand predefined classes, where one class is defined as background to represent images that do not contain any of the other 999 classes. Each image produces a vector of 1000 elements, each containing the probability of the corresponding class being present in the image. This classification stage is quick because it divides the convolution operation into two layers: depthwise and pointwise convolutions. Depthwise convolutions apply a single filter to each input channel (input depth), splitting the input into smaller channels; for example, a depthwise convolution applied to an input of M channels produces M single-channel outputs. Pointwise convolution is a 1 × 1 convolution used to create a linear combination of the M outputs of the depthwise layer.

Detect stage [21]: This stage is responsible for detecting the objects in the image through their confidence and location. It is based on a feed-forward CNN which produces bounding boxes and scores for the objects contained in the boxes according to the pre-trained classes. After that, a non-maximum suppression operation is applied to produce the final detection. During classification, the SSD Algorithm uses 8,732 boxes to find the box that most overlaps the bounding box containing the objects, according to aspect ratio, location, and scale.
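As an aside on the classify stage above, the saving obtained by splitting a standard convolution into depthwise and pointwise layers can be quantified with the cost model from the MobileNets paper [27]. A small sketch (the layer dimensions below are illustrative, not taken from the paper; Dk is the kernel size, M/N the input/output channel counts, Df the feature-map width):

```python
def conv_cost(dk, m, n, df):
    """Multiply-accumulate cost of a standard Dk x Dk convolution."""
    return dk * dk * m * n * df * df

def separable_cost(dk, m, n, df):
    """Cost of a depthwise convolution (one Dk x Dk filter per input
    channel) followed by a pointwise 1 x 1 convolution combining the
    M channels into N outputs."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise

# Illustrative layer: 3x3 kernel, 512 channels in and out, 14x14 map.
std = conv_cost(3, 512, 512, 14)
sep = separable_cost(3, 512, 512, 14)
print(round(std / sep, 1))  # → 8.8 (close to the theoretical ~9x for 3x3)
```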
Then, a non-maximum suppression layer is used to remove the other overlapping boxes in order to keep only one bounding box per object contained in the image. The non-maximum suppression layer uses the Intersection over Union (IoU) as a parameter to decrease the quantity of boxes.
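The IoU test and the suppression loop can be sketched in a few lines (a greedy formulation, assuming boxes given as (x1, y1, x2, y2) corner pairs; the 0.5 limit is a commonly used default, not a value taken from the paper):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_limit=0.5):
    """Keep the highest-scoring box and drop the others that overlap it
    by more than `iou_limit`, repeating until no boxes remain."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_limit]
    return kept

# Two overlapping detections of the same person plus one distinct box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # → [0, 2]
```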

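Putting Sect. 4.1 together, the frame-difference trigger that decides whether the SSD stage runs at all can be sketched as follows (NumPy stand-ins for the OpenCV calls; the per-channel rule and the minimum changed-pixel count are assumptions for illustration):

```python
import numpy as np

def should_run_detector(current, previous, pixel_threshold=100, min_changed=50):
    """Sketch of the Sect. 4.1 trigger: compute D = |C - P| directly in
    RGB (Eq. 2), binarize against the RGB (100,100,100) reference, and
    report motion only when enough pixels changed. In OpenCV the same
    steps would use cv2.absdiff, cv2.threshold, and cv2.morphologyEx
    for the final noise removal (omitted here)."""
    diff = np.abs(current.astype(np.int16) - previous.astype(np.int16))
    # A pixel counts as "changed" when any RGB channel of the difference
    # exceeds the threshold (an assumed channel rule for this sketch).
    changed = (diff > pixel_threshold).any(axis=2)
    return int(changed.sum()) >= min_changed

prev = np.zeros((120, 160, 3), dtype=np.uint8)
curr = prev.copy()
curr[10:40, 10:40] = 200                 # a 30x30 moving region
print(should_run_detector(curr, prev))   # True: run the SSD stage
print(should_run_detector(prev, prev))   # False: skip this frame
```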
5 Middleware Connectivity

A connectivity middleware is important to warn public authorities when a place is identified with a people agglomeration. This warning event minimizes the probability of contamination by Covid-19 and increases health security for these places, because the monitoring is done remotely, which avoids employing people to surveil these locations. In.IoT is a middleware which performs the tasks of receiving, storing, and publishing information to subscribers [28]. The communication protocol chosen to access In.IoT is MQTT [22] because it requires low deployment complexity and low computational power. Figure 3 shows a case study of the proposed Edge Architecture utilizing the CVDF-SSD Algorithm. The agglomeration monitoring is done through video surveillance cameras installed at the Municipal Market of São Paulo and accessed from a public database of São Paulo, Brazil [29]. The original frame captured from the cameras is shown in Fig. 3a. When the CVDF-SSD Algorithm is applied to the video, the number of people and the distances between them are calculated, as depicted in Fig. 3b. The information is then published, and the responsible public authorities can plot it on a city map to know where agglomerations are occurring, as illustrated in Fig. 3c.
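As an illustration of the end-to-end exchange, the warning event can be serialized as a small JSON document and published over MQTT. The topic layout and field names below are assumptions for this sketch, not the actual In.IoT schema:

```python
import json
import time

def build_alert(camera_id, lat, lon, people, min_distance_m):
    """Build the agglomeration warning to be published to the middleware.
    Topic and field names are illustrative assumptions."""
    topic = f"city/agglomeration/{camera_id}"
    payload = json.dumps({
        "camera": camera_id,
        "geolocation": {"lat": lat, "lon": lon},
        "people": people,
        "min_distance_m": min_distance_m,
        "timestamp": int(time.time()),
    })
    return topic, payload

topic, payload = build_alert("cam-042", -23.5418, -46.6294, 7, 0.8)
print(topic)                           # city/agglomeration/cam-042
print(json.loads(payload)["people"])   # 7

# With the paho-mqtt client, the message would then be sent with
# something like:
#   client = paho.mqtt.client.Client()
#   client.connect(broker_host)
#   client.publish(topic, payload)
```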

Fig. 3 Agglomerations analysis of a specific frame. a Original frame, b Agglomerations checked, c Geolocation map

6 Performance Evaluation and Results Analysis

The proposed CVDF-SSD Algorithm for the Edge Architecture is evaluated and the results are presented in this section. The image processing is done by a Raspberry Pi 3 Model B+ board; the reason for this choice is that it is a cheap, portable, and small device that can easily be installed in many places. The In.IoT middleware is the backbone used to interconnect devices at the edge with user applications, and its role in this work is to transmit the agglomeration status. The additional communication overhead introduced by the middleware and the transport protocol is not assessed. Videos from a public database of São Paulo [29] were used for comparison purposes. The solution was carefully designed to be as unintrusive as possible and to ensure people's privacy; the faces of the people appearing in the videos were blurred to keep them anonymous, as shown in Fig. 3b.


The experimental results compare the people-detection task of the CVDF-SSD Algorithm with the traditional, state-of-the-art SSD and YoloV3-Tiny algorithms in terms of processing rate and time, using the Deep Neural Network (DNN) module from the Open Source Computer Vision (OpenCV) library. The CVDF-SSD, SSD, and YoloV3-Tiny algorithms use models pre-trained on the Common Objects in COntext (COCO) dataset [15]. Table 1 shows a notable difference in processing among the schemes. The processing time is measured by running the same part of the video through the three analyzed algorithms; the measurements are performed thirty times and the result is their average. This shows that it is possible to detect people using deep learning algorithms on equipment of low computational power without loss of detection accuracy.

Table 1 Comparison of algorithms in terms of accuracy and processing time

Algorithm     Accuracy (Person 1)   Accuracy (Person 2)
YoloV3-Tiny   60.16%                53.03%
SSD           86.86%                57.43%
CVDF-SSD      86.86%                57.43%

[The processing-rate entries of the original table are only partially recoverable; the surviving values are 98 and 1.01.]
The pre-processing stage of CVDF is evaluated in terms of reliability [30]. Following Wiedemann et al., the accuracy of the system is indicated by three statistical measures, defined as follows:

Precision = TP/(TP + FP)    (3)

Recall = TP/(TP + FN)    (4)

Quality = TP/(TP + FP + FN)    (5)

To guarantee the reproducibility of the results and mitigate any variation, CVDF is executed one thousand times. The frames are sampled, and the numbers of people detected correctly, detected wrongly, and not detected are counted manually. Table 2 shows that the CVDF pre-processing detected people in the images with a Precision of 92%, a Recall of 85%, and a Quality of 83%. Therefore, it is a powerful option to trigger the people-detection algorithm and enhances processing on equipment with low computational power.
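As a sketch, Eqs. (3)-(5) can be evaluated per frame; averaging the per-frame percentages (which appears to be how the Result row of Table 2 was obtained) reproduces the reported values:

```python
def frame_metrics(tp, fp, fn):
    """Precision, recall and quality (Eqs. 3-5) for one frame, in %.
    A measure is taken as 0 when its denominator is 0 (no detections)."""
    precision = 100 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100 * tp / (tp + fn) if tp + fn else 0.0
    quality = 100 * tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, quality

# (TP, FP, FN) counts for the sampled frames of Table 2.
counts = [(0, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 0), (1, 0, 0),
          (1, 0, 1), (1, 0, 2), (3, 0, 0), (3, 0, 0), (5, 0, 0),
          (6, 0, 0), (3, 0, 1), (5, 1, 1), (5, 0, 0), (4, 0, 1),
          (4, 0, 0), (2, 0, 0), (2, 0, 0), (2, 1, 0)]
per_frame = [frame_metrics(*c) for c in counts]
averages = [round(sum(col) / len(col)) for col in zip(*per_frame)]
print(averages)  # → [92, 85, 83]
```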

7 Conclusion and Future Works

In this paper, a new computer vision algorithm to detect agglomerations using an edge scheme was proposed. The motivation for and main benefits of employing the edge architecture were described. The role of IoT technology in Covid-19 scenarios, which motivated the deployment of a new agglomeration detection algorithm, was also shown.

Table 2 CVDF reliability results

Frame #   TP   FP   FN   Precision (%)   Recall (%)   Quality (%)
40        0    0    1    0               0            0
60        1    0    0    100             100          100
80        1    0    0    100             100          100
100       1    0    0    100             100          100
120       1    0    0    100             100          100
140       1    0    1    100             50           50
160       1    0    2    100             33           33
180       3    0    0    100             100          100
200       3    0    0    100             100          100
220       5    0    0    100             100          100
240       6    0    0    100             100          100
260       3    0    1    100             75           75
280       5    1    1    83              83           71
300       5    0    0    100             100          100
320       4    0    1    100             80           80
340       4    0    0    100             100          100
360       2    0    0    100             100          100
380       2    0    0    100             100          100
400       2    1    0    67              100          67
Result                   92              85           83

It is possible to observe from the results that the CVDF-SSD Algorithm has an excellent performance due to its efficient way of identifying humans in images. Compared to other solutions, the proposed algorithm is capable of enhancing the processing while achieving high accuracy. Finally, a case study to alert public authorities at the Municipal Market of São Paulo, Brazil, was presented, and scenarios to avoid contamination by Covid-19 in agglomerations were described. By combining a Raspberry Pi at the edge of the network with the proposed CVDF-SSD Algorithm, people's social distance is determined with high accuracy and the surveillance staff are warned in real time. Thus, the proposed architecture achieves its purpose of processing the videos locally to reduce latency, bandwidth, and energy consumption. As future work, an intelligent integration among the edge devices is suggested, to identify a metropolitan area at risk of Covid-19 contamination and alert the nearby population about this risk on their smartphones.


Acknowledgements. This work was partially supported by RNP, with resources from MCTIC, Grant No. 01250.075413/2018-04, under the Radiocommunication Reference Center (Centro de Referência em Radiocomunicações - CRR) project of the National Institute of Telecommunications (Instituto Nacional de Telecomunicações - Inatel), Brazil; by FCT/MCTES through national funds and, when applicable, co-funded EU funds under the Project UIDB/50008/2020; and by the Brazilian National Council for Scientific and Technological Development - CNPq, via Grant No. 313036/2020-9.

References

1. Weber M, Lučić D, Lovrek I (2017) Internet of Things context of the smart city. In: International conference on smart systems and technologies (SST), pp 187–193
2. Katare G, Padihar G, Quereshi Z (2018) Challenges in the integration of artificial intelligence and Internet of Things. Int J Syst Softw Eng 6(2):10–15
3. Evans D (2011) The Internet of Things: how the next evolution of the Internet is changing everything. CISCO White Paper 1(2011):1–11
4. Cisco (2013) Cisco global cloud index: forecast and methodology, 2014–2019. White Paper
5. Shi W, Cao J, Zhang Q, Li Y, Xu L (2016) Edge computing: vision and challenges. IEEE Internet Things J 3(5):637–646
6. World Health Organization (2020) Coronavirus disease (COVID-19): weekly epidemiological update, 6 September 2020
7. Manawadu L, Gunathilaka K, Wijeratne S (2020) Urban agglomeration and COVID-19 clusters: strategies for pandemic free city management. Int J Sci Res Publ 10:769–775. https://doi.org/10.29322/IJSRP.10.07.2020.p10385
8. Braga JU, Ramos AN, Ferreira AF, Lacerda VM, Freire RMC, Bertoncini BV (2020) Propensity for COVID-19 severe epidemic among the populations of the neighborhoods of Fortaleza, Brazil
9. Ting DSW, Carin L, Dzau V, Wong TY (2020) Digital technology and COVID-19. Nat Med 26(4):459–461
10. Javaid M, Haleem A, Vaishya R, Bahl S, Suman R, Vaish A (2020) Industry 4.0 technologies and their applications in fighting COVID-19 pandemic. Diabetes Metab Syndr Clin Res Rev
11. Shi W, Dustdar S (2016) The promise of edge computing. Computer (Long Beach Calif) 49(5):78–81
12. Zhao Z, Min G, Gao W, Wu Y, Duan H, Ni Q (2018) Deploying edge computing nodes for large-scale IoT: a diversity-aware approach. IEEE Internet Things J 5(5):3606–3614
13. Ksentini A, Brik B (2020) An edge-based social distancing detection service to mitigate COVID-19 propagation. IEEE Internet Things Mag 3:35–39
14. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of IEEE computer society conference on computer vision and pattern recognition, pp 580–587. https://doi.org/10.1109/CVPR.2014.81
15. Liu L et al (2020) Deep learning for generic object detection: a survey. Int J Comput Vis 128:261–318
16. Girshick R (2015) Fast R-CNN. In: Proceedings of IEEE international conference on computer vision, pp 1440–1448. https://doi.org/10.1109/ICCV.2015.169
17. Ren S, He K, Girshick R, Sun J (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
18. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 779–788. https://doi.org/10.1109/CVPR.2016.91
19. Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. In: Proceedings of 30th IEEE conference on computer vision and pattern recognition (CVPR), pp 6517–6525. https://doi.org/10.1109/CVPR.2017.690
20. Redmon J, Farhadi A (2018) YOLOv3: an incremental improvement [Online]. http://arxiv.org/abs/1804.02767
21. Liu W et al (2016) SSD: single shot multibox detector. In: Lecture Notes in Computer Science, vol 9905, pp 21–37. https://doi.org/10.1007/978-3-319-46448-0_2
22. Soni D, Makwana A (2017) A survey on MQTT: a protocol of Internet of Things (IoT)
23. Kolhar M, Al-Turjman F, Alameen A, Abualhaj MM (2020) A three layered decentralized IoT biometric architecture for city lockdown during COVID-19 outbreak. IEEE Access 8:163608–163617
24. Hu W et al (2016) Quantifying the impact of edge computing on mobile applications. In: Proceedings of the 7th ACM SIGOPS Asia-Pacific workshop on systems, pp 1–8
25. Zivkovic Z, van der Heijden F (2006) Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognit Lett 27(7):773–780
26. Biswas D, Su H, Wang C, Stevanovic A, Wang W (2019) An automatic traffic density estimation using single shot detection (SSD) and MobileNet-SSD. Phys Chem Earth 110:176–184. https://doi.org/10.1016/j.pce.2018.12.001
27. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861
28. da Cruz MA, Rodrigues JJ, Lorenz P, Korotaev V, de Albuquerque VHC (2020) In.IoT—A new middleware for Internet of Things. IEEE Internet Things J. https://doi.org/10.1109/JIOT.2020.3041699
29. City Cameras SP. https://www.citycameras.prefeitura.sp.gov.br. Accessed 3 Sep 2020
30. Wiedemann C, Heipke C, Mayer H, Jamet O (1998) Empirical evaluation of automatically extracted road axes. Empir Eval Tech Comput Vision 12:172–187

Comparative Analysis of Classification and Detection of Breast Cancer from Histopathology Images Using Deep Neural Network

Pravin Malve1(B) and Vijay Gulhane2

1 Lecturer, Department of Computer Engineering, Government Polytechnic, Arvi 442201, Maharashtra, India
2 Professor, Department of Information Technology, Sipna College of Engineering, Amravati 444607, Maharashtra, India

Abstract. Breast cancer is one of the most widely recognized diseases among women across the globe. It is a cancer that develops in breast cells. Histopathological and cytological images contain adequate phenotypic data, which gives them a vital role in the analysis and cure of breast cancer. The emergence of deep neural networks, together with rapid advancements in computational resources, enables accurate detection and classification of breast histopathological images. It assists histopathologists in achieving accurate results through progressively fast, steady, objective, and quantitative examination. In this paper, a systematic survey is conducted to trace the development history and locate the future potential of deep learning algorithms in the Breast Histopathological Image Analysis (BHIA) field. This study also involves a comparative analysis of the most recent related works referring to classical Artificial Neural Networks (ANNs) and deep ANNs.

Keywords: Breast cancer · Classification · Histopathology image · Artificial neural networks · Convolutional neural networks · Deep learning · Feature extraction

1 Introduction

Breast cancer is an ailment that begins with the uncontrolled proliferation of breast cells. Breast cancer can occur in women and, occasionally, in men. Symptoms of breast cancer include a lump in the breast, bloody discharge from the breast, and significant changes in the shape or texture of the nipple or breast. Its treatment depends on the stage of the cancer. Hematoxylin and Eosin (H&E) stained breast tissue samples from biopsies, seen under a microscope, are used for the primary diagnosis of breast cancer [1]. For identification of breast cancer, mammographic images are used [2] in most cases. However, studies have shown that thermographic images (thermal infrared images) give better results in the case of the breasts of young females [3]. The dynamic thermography technique is cheaper than the mammography and magnetic resonance imaging techniques [4]. Mammography is one of the gold-standard image capture techniques used for detecting breast-related problems. However, this technique has limitations: it may increase false positives and false negatives, because mammograms cannot adequately image denser breasts in women, which can hide tumors [5]. Also, the imaging procedure in mammography causes discomfort to the women, and the image formation depends on X-ray radiation, posing a risk for patients [6]. The histological typing of breast cancer was performed according to the WHO (World Health Organization) classification, and grading was performed according to the Modified Bloom-Richardson grading system [7]. Breast cancer is categorized into two phases: in-situ carcinoma and invasive (infiltrating) carcinoma. In-situ carcinoma contains two parts, ductal and lobular. Ductal is divided into five categories (micropapillary, cribriform, solid, comedo, and papillary) based on tumor architectural features. Invasive carcinomas are a group of tumors which are further classified into subtypes. The invasive tumor types include infiltrating ductal, mucinous, tubular, invasive lobular, ductal/lobular, medullary, and papillary carcinomas. The invasive tumor type infiltrating ductal, based on different levels of tubule formation, nuclear pleomorphism, and mitotic index, is further subdivided into Grade 1 (well differentiated), Grade 2 (moderately differentiated), and Grade 3 (poorly differentiated) [8], as represented in Fig. 1.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_2

Fig. 1 The histological characterization of breast cancer subtypes [8]


1.1 Symptoms of Breast Cancer Include

• a breast lump or tissue thickening that feels different from the surrounding tissue and has developed recently
• breast pain
• red, pitted skin over the entire breast
• swelling in the breast
• bloody discharge from the nipple other than breast milk
• peeling, scaling, or flaking of skin on the nipple or breast
• an inverted nipple
• a change in the shape or size of the breasts
• swelling in the armpit.

1.2 Stages of Breast Cancer

Breast cancer is often divided into stages based on how large the tumor or tumors are and how far the cancer has spread. Large cancers that have invaded nearby tissues or organs are at a higher stage than cancers that are small and still contained within the breast. To determine the stage of a breast cancer, we need to know:

• whether the cancer is invasive or noninvasive
• how large the tumor is
• whether the lymph nodes are involved
• whether the cancer has spread to nearby tissue or organs (Fig. 2).

Fig. 2 Fundamental stages of breast cancer: stages from 0 to 4 [9]


P. Malve and V. Gulhane

Stage 0 breast cancer
DCIS is Stage 0. The cancer cells in DCIS remain confined to the ducts in the breast and do not spread into nearby tissue.
Stage 1 breast cancer
• Stage 1A: The primary tumor is two centimeters wide or less, and the lymph nodes are not affected.
• Stage 1B: Cancer is found in nearby lymph nodes, and in the breast there is either no tumor or a tumor smaller than two cm.
Stage 2 breast cancer
• Stage 2A: The tumor is smaller than two cm and has spread to one to three nearby lymph nodes, or it is between 2 and 5 cm and has not spread to any lymph nodes.
• Stage 2B: The tumor is between 2 and 5 cm and has spread to one to three axillary (armpit) lymph nodes, or it is larger than five cm and has not spread to any lymph nodes.
Stage 3 breast cancer
• Stage 3A: The cancer has spread to four to nine axillary lymph nodes or has enlarged the internal mammary lymph nodes, and the primary tumor can be of any size; or the tumor is larger than 5 cm and the cancer has spread to one to three axillary lymph nodes or any breastbone nodes.
• Stage 3B: The tumor has invaded the chest wall or skin and may have invaded up to nine lymph nodes.
• Stage 3C: Cancer is found in ten or more axillary lymph nodes, lymph nodes near the collarbone, or internal mammary nodes.
Stage 4 breast cancer
Stage 4 breast cancer can involve a tumor of any size, and its cancer cells have spread to nearby and distant lymph nodes as well as distant organs.
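The size-and-node thresholds above can be summarized as a simple decision procedure. The following sketch is illustrative only: the function name and the deliberately simplified rules are ours, and this is not a clinical staging tool (real staging uses the full TNM criteria).

```python
def stage_breast_cancer(tumor_cm, axillary_nodes, in_situ_only=False,
                        distant_spread=False):
    """Illustrative (non-clinical) sketch of the staging rules above.

    tumor_cm        -- primary tumor size in centimeters (0 if no tumor found)
    axillary_nodes  -- number of involved axillary lymph nodes
    in_situ_only    -- True if cancer cells remain confined to the ducts (DCIS)
    distant_spread  -- True if cancer has reached distant organs
    """
    if distant_spread:
        return 4                        # Stage 4: distant metastasis
    if in_situ_only:
        return 0                        # Stage 0: DCIS
    if axillary_nodes >= 10:
        return 3                        # Stage 3C: ten or more nodes
    if axillary_nodes >= 4 or (tumor_cm > 5 and axillary_nodes >= 1):
        return 3                        # Stage 3A: 4-9 nodes, or >5 cm with nodes
    if (tumor_cm < 2 and 1 <= axillary_nodes <= 3) or \
       (2 <= tumor_cm <= 5 and axillary_nodes == 0):
        return 2                        # Stage 2A
    if (2 <= tumor_cm <= 5 and 1 <= axillary_nodes <= 3) or \
       (tumor_cm > 5 and axillary_nodes == 0):
        return 2                        # Stage 2B
    return 1                            # Stage 1A/1B otherwise
```

The ordering of the checks matters: distant spread dominates everything else, and node count is tested before tumor size, mirroring the way the prose lists the stages.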

2 Related Work

Breast cancer can be detected and diagnosed using noninvasive and biopsy methods. Noninvasive methods are imaging procedures, including diagnostic mammograms (X-ray), Magnetic Resonance Imaging (MRI) of the breast, sonography, and thermography. These procedures are used for screening, whereas a biopsy confirms the presence of cancer. Biopsy methods include Core Needle Biopsy, Fine Needle Aspiration (FNA), Vacuum-Assisted Breast Biopsy, and Surgical (open) biopsy.


For biopsy methods, samples of cells or tissue are collected on a glass microscope slide for examination, as shown in Fig. 3.

Fig. 3 Categories of related works on breast cancer types [10]

The term cytology combines two parts: cyto, derived from the Greek word kytos, meaning 'cell', and -ology, meaning 'study'. Cytology is thus the study of cells, describing how cells grow, work, and proliferate, as shown in Fig. 4.

Fig. 4 Standard cytology image of breast cancer [11]

Deep learning: Deep learning is a method based on a multilayer architecture in which the input data is passed through successive levels of abstraction, resulting in the extraction of features. Each layer forms a "distributed representation" of its inputs, expressed as a vector of units that each respond to specific features of the input. Two significant uses of deep learning are regression and classification; in each case, training data is used to adjust the weights so as to minimize a loss (objective) function. Breast cancer is the most common cancer among women, and there is a constant need for progress in clinical imaging, since early detection of cancer, followed by proper treatment, can reduce the mortality rate. AI can assist clinical experts in diagnosing the disease more accurately, and deep learning (neural networks) is one of the strategies that can be used to classify normal and abnormal breast tissue. CNNs can also be utilized for this purpose. Fig. 5 represents the deep learning approach as a subset of the Artificial Intelligence domain.

Fig. 5 Deep learning

The necessary steps of the deep learning approach are depicted in Fig. 6. The image dataset is split into train and test datasets. Training samples, together with their class labels, are used to train the deep learning model. Unknown samples are then taken from the test dataset, and their class labels are predicted by the trained model to detect and classify cancer.
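The split-train-predict loop just described can be sketched in a few lines. This is a minimal illustration on a hypothetical toy dataset, with a simple nearest-centroid classifier standing in for the deep model; all names and data here are ours, not from the surveyed papers.

```python
import random

def train_test_split(samples, labels, test_ratio=0.25, seed=0):
    """Shuffle the dataset and split it into train and test portions."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return ([samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test])

def train_centroids(samples, labels):
    """'Train' by averaging the feature vectors of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        sums.setdefault(y, [0.0] * len(x))
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums[y], x)]
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Assign the class whose centroid is nearest to x."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda y: dist(centroids[y], x))

# Toy 2-feature dataset: 'benign' clusters near (0, 0), 'malignant' near (5, 5)
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = ["benign"] * 4 + ["malignant"] * 4

Xtr, ytr, Xte, yte = train_test_split(X, y)
model = train_centroids(Xtr, ytr)
accuracy = sum(predict(model, x) == t for x, t in zip(Xte, yte)) / len(yte)
```

In practice the classifier would be a CNN trained on image features, but the workflow (split, fit on the training portion, score on the held-out portion) is the same.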

3 Literature Survey

This section presents prior techniques proposed by various researchers.

Fig. 6 Necessary automation steps for deep learning approach [12]

Wang et al. [13] investigated a breast CAD method based on feature fusion with a Convolutional Neural Network (CNN), in which features such as morphology, texture, and density were selected and fused. The result was then classified as benign or malignant using an ELM classifier. They worked with 400 mammograms containing 200 malignant and 200 benign images and measured both detection and diagnostic metrics: detection metrics such as misclassification error, area over-covered, area over-segmentation, area under-segmentation, and comprehensive metrics; and diagnostic metrics such as accuracy, specificity, sensitivity, and area under the curve. The method achieved good detection and diagnostic rates. However, the challenging part is the selection of appropriate features, and building the fused feature set increased the complexity of training the classifier.
Xu et al. [14] addressed automatic nuclei detection, a critical step in grading breast cancer tissue samples. A Stacked Sparse Autoencoder (SSAE) was introduced to detect nuclei in high-resolution histopathological images of breast cancer, using a set of 537 H&E-stained histopathological images. Based on generated ground-truth annotations, metrics such as precision, recall, and F-measure were evaluated; the autoencoder achieved 78% accuracy compared with other models. The major limitation is sensitivity to the window-size setting: the smaller the window, the greater the complexity.
Saha and Chakraborty [15] proposed Her2Net, a deep learning framework for cell membrane and nucleus detection, segmentation, and classification. It consists of multiple convolution layers, max-pooling layers, spatial pyramid pooling layers, deconvolution layers, up-sampling layers, and trapezoidal long short-term memory (LSTM) units. The LSTM was tuned to improve overall system performance through


the utilization of the Her2 image database. Image patches of size 251 × 251 were created from 2048-pixel images, and 98% accuracy was achieved. The system also achieved an exceptionally low false-positive rate; however, patch-based segmentation models should be enhanced further for decision-making systems.
Brancati et al. [16] defined a deep learning approach for breast invasive ductal carcinoma detection and lymphoma multi-classification in histological images. The authors explored automatic analysis of hematoxylin and eosin-stained breast cancer images using deep learning methods, addressing two cases: detecting the presence of invasive ductal carcinoma, and lymphoma classification. FusionNet, developed from convolutional neural networks, was applied to public datasets and compared with UNet and ResNet. Compared with previous algorithms, the detection rate increased by 5.06%. FusionNet offered a solution to the segmentation process, but feature selection remains unresolved. The class of each cell is detected from the votes received in each patch.
Qi et al. [17] developed learning models for enhancing the detection accuracy of the system. Two strategies were suggested: an entropy-based strategy and a confidence-boosting strategy. Specifically, in every query the deep model is fine-tuned with both high-confidence and high-entropy samples. Image-level and patient-level accuracy were tested on BreakHis, an annotated dataset. The limitation is that the classifier's performance drops as the sample size increases; limited labeling degrades the efficiency of the classifiers, and the convergence speed per class appears low.
Carneiro et al. [18] analyzed unregistered craniocaudal (CC) and mediolateral oblique (MLO) mammography views to estimate the risk of breast cancer.
The recommended framework can utilize the segmentation maps produced by automated mass and micro-calcification detection systems and delivers accurate results. The semi-automated approach (using manually defined mass and micro-calcification segmentation maps) was applied to two publicly available datasets (INbreast and DDSM). The results showed that the volume under the ROC surface (VUS) for the 3-class problem (normal tissue, benign, and malignant) is over 0.9, as is the area under the ROC curve (AUC) for the 2-class problem. CC and MLO mammograms from the INbreast dataset were studied, and the false-positive rate was minimized. However, since the appearance and cohesion of cancerous cells are dynamic, classifier performance tends to decline.
Zhang et al. [19] studied the classification of whole mammogram and tomosynthesis images using deep convolutional neural networks. Both 2D mammograms and 3D tomosynthesis images were used to build a classifier; in particular, transfer learning was used to reuse the information in pre-trained models. High-quality mammogram data from the University of Kentucky medical center was used to investigate the approach. The accuracy rate is greatly enhanced by the transfer learning process. The limitation is that the formation of feature


maps in transfer learning challenges the detection of key elements, and imbalanced data remains a problem for machine learning algorithms.
Kumar et al. [20] designed a framework based on VGGNet-16 that leverages data augmentation, stain normalization, and magnification of the cancer images. CMT (canine mammary tumor) histopathological images and a human breast cancer dataset were used for the study, and the system achieved 97% accuracy. However, overfitting in the classifiers increased the false-positive rate, and some higher-level discriminating features slowed down the learning models.
Kaur et al. [12] presented k-means clustering, multi-class SVM, and deep neural networks under decision tree models, which enhanced the accuracy of the decision systems. Ten-fold cross-validation models were analyzed using MLP, J48+, and k-means clustering. The system enhanced the sensitivity and specificity of the designed deep learning models, and Region of Interest (ROI) selection improved the segmentation outcomes. However, deep CNNs require more extensive training, and a larger number of decision rules decreases the performance of prediction models.
Saha et al. [21] noted that mitosis detection is one of the critical factors in cancer prediction systems. A supervised model was designed using a deep learning architecture to overcome shortcomings in mitosis detection, and it was assessed on the MITOS-ATYPIA, ICPR-2012, and AMIDA-13 datasets. By differentiating between mitosis and non-mitosis, an increase in the detection rate was observed. However, the correlation index was low during mitosis prediction, and incomplete histopathological images degrade the pre-processing stage and increase computational time. Table 1 gives a summary of the literature survey.

4 Conclusion

In this paper, we have discussed various methods for examining breast cancer cytological and histopathological images using artificial neural network (ANN) and deep neural network techniques. From the analysis, it is inferred that MLP and PNN are the most classical and widely used ANN techniques for examining histopathological images of breast cancer. Along with the classifier, appropriate feature extraction also plays a significant role: texture and morphological features are the most widely extracted features from breast cancer images. Among the deep learning based techniques discussed in this paper, deep convolutional neural networks in particular have made remarkable achievements in the detection and classification of breast histopathological images, which will help with early recognition, diagnosis, and treatment of breast cancer.


Table 1 Summary table of the literature survey

Authors | Observations
Wang et al. [13] | Worked with CNN features and classified using the ELM classifier; observed that the fused feature set increased the complexity of training the classifier
Xu et al. [14] | Detected nuclei in high-resolution histopathological images of breast cancer; smaller window sizes increased complexity, defining its limitation
Saha and Chakraborty [15] | The LSTM was tuned to improve the performance of the system
Brancati et al. [16] | Worked with two cases, invasive ductal carcinoma detection and lymphoma classification; FusionNet was developed from convolutional neural networks
Qi et al. [17] | Worked with two strategies, entropy-based and confidence-boosting; the deep model is fine-tuned with both high-confidence and high-entropy samples
Carneiro et al. [18] | Worked on a system that utilizes the segmentation maps created by automated mass and micro-calcification detection frameworks and delivers accurate results
Zhang et al. [19] | Stated the use of transfer learning, which re-uses the information in pre-trained models
Kumar et al. [20] | Described VGGNet-16, which leverages stain normalization, magnification of the cancer images, and data augmentation
Kaur et al. [12] | Using k-means clustering, multi-class SVM, and deep neural networks enhanced the accuracy of the decision systems
Saha et al. [21] | Supervised models used to overcome shortcomings in mitosis detection

References

1. Golatkar A, Anand D, Sethi A (2018) Classification of breast cancer histology using deep learning. In: International conference on image analysis and recognition. Springer, Cham, pp 837–844
2. Abdel-Nasser M, Moreno A, Puig D (2016) Temporal mammogram image registration using optimized curvilinear coordinates. Comput Methods Prog Biomed 127:1–14
3. Chiarelli AM, Prummel MV, Muradali D, Shumak RS, Majpruz V, Brown P, Yaffe MJ (2015) Digital versus screen-film mammography: impact of mammographic density and hormone therapy on breast cancer detection. Breast Cancer Res Treatment 154(2):377–387
4. de Vasconcelos JH, Dos Santos WP, De Lima RCF (2018) Analysis of methods of classification of breast thermographic images to determine their viability in the early breast cancer detection. IEEE Lat Am Trans 16(6):1631–1637
5. Jeyanathan JS, Jeyashree P, Shenbagavalli A (2018) Transform based classification of breast thermograms using multilayer perceptron back propagation neural network. Int J Pure Appl Math 118(20):1955–1961
6. Etehadtavakol M, Ng EY (2013) Breast thermography as a potential non-contact method in the early detection of cancer: a review. J Mech Med Biol 13(02)
7. Gulzar R, Shahid R, Saleem O (2018) Molecular subtypes of breast cancer by immunohistochemical profiling. Int J Pathol 16(2):129–134
8. Ahtzaz K, Ali M, Arshad A (2017) Comparative analysis of Exon 11 mutations of BRCA1 gene in regard to circulating tumor DNA (CTDNA) & genomic DNA in a cohort of breast cancer patients in Pakistan
9. https://images.app.goo.gl/8zUj3hjesev372d1A
10. http://www.inf.ufpr.br/lesoliveira/download/TeseFabioSpanhol.pdf
11. Gong Y (2013) Breast cancer: pathology, cytology, and core needle biopsy methods for diagnosis. In: Breast and gynecological cancers. Springer, New York, pp 19–37
12. Kaur P, Singh G, Kaur P (2019) Intellectual detection and validation of automated mammogram breast cancer images by multi-class SVM using deep learning classification. Inform Med Unlocked 16
13. Wang Z, Li M, Wang H, Jiang H, Yao Y, Zhang H, Xin J (2019) Breast cancer detection using extreme learning machine based on feature fusion with CNN deep features. IEEE Access, 1–1
14. Xu J, Xiang L, Liu Q, Gilmore H, Wu J, Tang J, Madabhushi A (2015) Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Trans Med Imaging 35(1):119–130
15. Saha M, Chakraborty C (2018) Her2Net: a deep framework for semantic segmentation and classification of cell membranes and nuclei in breast cancer evaluation. IEEE Trans Image Process 27(5):2189–2200
16. Brancati N, De Pietro G, Frucci M, Riccio D (2019) A deep learning approach for breast invasive ductal carcinoma detection and lymphoma multi-classification in histological images. IEEE Access 7:44709–44720
17. Qi Q, Li Y, Wang J, Zheng H, Huang Y, Ding X, Rohde GK (2018) Label-efficient breast cancer histopathological image classification. IEEE J Biomed Health Inform 23(5):2108–2116
18. Carneiro G, Nascimento J, Bradley AP (2017) Automated analysis of unregistered multi-view mammograms with deep learning. IEEE Trans Med Imag 36(11):2355–2365
19. Zhang X, Zhang Y, Han EY, Jacobs N, Han Q, Wang X, Liu J (2018) Classification of whole mammogram and tomosynthesis images using deep convolutional neural networks. IEEE Trans Nanobiosci 17(3):237–242
20. Kumar A, Singh SK, Saxena S, Lakshmanan K, Sangaiah AK, Chauhan H, Singh RK (2020) Deep feature learning for histopathological image classification of canine mammary tumors and human breast cancer. Inform Sci 508:405–421
21. Saha M, Chakraborty C, Racoceanu D (2018) Efficient deep learning model for mitosis detection using breast histopathology images. Comput Med Imaging Graph 64:29–40

Determining Key Performance Indicators for a Doctoral School

Aminata Kane1(B), Karim Konate1, and Joel J. P. C. Rodrigues2,3

1 Department of Mathematics and Computer Science, Cheikh Anta DIOP University of Dakar, Dakar, Senegal
[email protected], [email protected]
2 Federal University of Piauí (UFPI), Teresina, PI, Brazil
[email protected]
3 Instituto de Telecomunicações, Aveiro, Portugal

Abstract. In recent years, the use of Decision-Making Information Systems (DIS) has become a necessity for monitoring and maintaining a company's performance. Several studies and contributions have been carried out in various fields. However, most solutions are not suitable for all areas, and little attention has been given to research. A lack of DISs is generally noted in doctoral schools (DS), so the development of a new decision support methodology (DSM) for research institutions is crucial. This paper aims to determine the predominant performance indicators for handling overall DS performance. It considers criteria that cover all the definitive stakeholders (DSH): administration, teaching research staff (TRS), Ph.D. students, and administrative and technical service staff (ATSS). A research study was carried out on the DSHs based on data collected from public universities in Senegal. The DSHs' difficulties and their degree of satisfaction have been highlighted, and forty-eight (48) indicators that impact a DS's efficiency have been determined. The major criteria for the development of a new DSM in research are analyzed and the main findings are presented in this study.

Keywords: Decision-making · Indicator · Performance · Doctoral School · Research

1 Introduction

Decision-Making Information Systems (DIS) have become essential for a company's performance [1–4] and are therefore the subject of several works in various areas [4–12]. However, no single method can solve all problems in all areas, and the major problem generally detected is the unavailability of a DSM in research for managing the overall performance of a Doctoral School (DS) [13]. In addition, as pointed out in [12, 13], a literature review on working conditions at the DS level shows that almost all of the studies and surveys carried out have focused only on Ph.D. students or, at best, on teacher-researchers and supervisors [14–17]. They do not take into account the parameters relating to a DS's ATSS and administration. To overcome these limits, a DIS or Decision Support System (DSS)

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_3


is important and must be designed for a more complete resolution of DS performance. Therefore, the identification of explanatory parameters of the situation is necessary. The main objective of this paper is to define a DS's performance indicators. These indicators take into account the four (4) major DS components, which represent its definitive stakeholders (DSH). This study is carried out on the basis of four questionnaires sent to the DSHs of DS in Senegal. Of the approximately 2,980 people concerned, 377 answered the questionnaires. The main contributions of this paper are the following: (i) highlighting the main factors that impact the thesis defense rate, and therefore a DS's performance; (ii) determination of the main performance indicators of a DS, taking into account the definitive stakeholders. The remainder of this paper is organized as follows: Section 2 reviews the literature on DS performance indicators, presents and analyzes the different investigative approaches available, and concludes with the list of DS performance criteria found in the literature. Section 3 describes our approach, which revolves around two main stages: (i) data collection through surveys carried out at the level of the four DSH groups, using the mixed investigation approach; and (ii) processing and analysis of the results using Excel, a full-fledged Business Intelligence data analysis tool. Section 4 identifies the performance indicators deemed most relevant from the analyses performed, covering the four main components of a DS. Section 5 discusses our contributions, and Section 6 concludes the paper with perspectives.

2 Indicators Study for Evaluating a DS Performance: State of the Art

2.1 Concept of Performance Indicator and Research Methods

The indicator concept is ubiquitous and widely studied in the literature, in a variety of areas [18–20]. It is defined by the ISO standard as a "numerical, symbolic or verbal expression, used to characterize activities (events, objects, people) in quantitative and qualitative terms (yes / partially / no) with the aim of determining their value" [18, 19]. In other words, an indicator is a meaningful and objective representation that allows us to synthetically measure whether or not defined objectives are achieved. This leads to an interest in the concepts of "performance" and "performance indicators" (PI). Performance is relative to the definition of objectives: a structure's performance is its ability to implement mechanisms to achieve standardized objectives, well defined nationally and internationally, in a given time [18]. PIs are widely studied [18–25]. Lorino (2003) characterizes a PI as "information that should help an actor, individual or generally collective, to lead the course of an action towards an objective achievement or to allow him to evaluate the result" [21]. In other words, a PI is "quantified data that allows assessing a thing's


competitiveness in terms of profitability, efficiency and productivity by measurement, over a given period". In 2019, A. Fernandez underlined that, ultimately, PIs help support strategic objectives and check whether the strategy is validated in reality [26]. Mendoza et al. (2002) identify three essential questions that frame this strategic reflection: What does the service do? (Action: train doctors); Who are its stakeholders, and what are their expectations and respective contributions? (Actors); What are the expected results? (Objectives) [27]. Figure 1 below is an illustration.

Fig. 1 The indicator's "triangle": strategy translated into "objective", "action process" and "actor" (collective). Source Lorino (2001, p. 7), https://halshs.archives-ouvertes.fr/halshs-00584637, submitted on 9 Apr 2011

There are different types of PI, such as key performance indicators (KPIs), which are those chosen to monitor service performance. They are of four types: (i) productivity PIs; (ii) quality PIs; (iii) capacity PIs; and (iv) strategic PIs. Depending on their purpose, there are also: (i) financial PIs; (ii) commercial PIs; (iii) corporate social responsibility (CSR) PIs; and (iv) organizational PIs. The characteristics of a good PI are to be: (i) clear, objective and relevant; (ii) quantifiable (measurable); (iii) faithful, reliable, and sufficiently "robust"; (iv) sensitive, specific and focused; (v) informative in its content; (vi) meaningful and user-friendly; and (vii) timely [23, 24, 27].

Investigation modes are determined by the research paradigms and the researcher's ambitions. The researcher can choose between three investigation modes: (i) the quantitative approach; (ii) the qualitative approach; and (iii) the mixed approach [28]. For the investigative methods developed, several types of studies and analyses of survey data can be used. The most relevant are the following: (i) exploratory and explanatory studies; (ii) descriptive and correlational studies; (iii) correlational-explanatory studies; (iv) experimental research; (v) test research; (vi) mass research or investigation; (vii) observational research; and (viii) applied research [28, 29].

2.2 Presentation of Criteria for Evaluating a Doctoral School Performance

A literature review shows that indicators for evaluating higher education performance are proposed by various organizations such as the Agency for the Evaluation of Research and Higher Education (AERES), an independent French administrative authority created in 2006 which became the High Council for the Evaluation of Research and


Higher Education (Hcéres) in 2014, the United Nations Educational, Scientific and Cultural Organization (UNESCO), the Organization for Economic Co-operation and Development (OECD), and others [30–36]. The most relevant for DS are the PIs proposed in the CAMES Repository-Evaluation-Doctoral-Schools (REED-CAMES), produced in 2017 by UNESCO, and those expressed in the most recent editions of the OECD. They are listed below.

2.2.1 UNESCO's Performance Criteria [31]

They concern the following themes and sub-themes:

• Organization and operation: (i) Achievement degree of defined missions and objectives; (ii) Institutional relevance of the DS: coherence of DS missions with those of the institutions that carry it; (iii) Staff and student representativeness in decision-making bodies; (iv) Frequency of meetings; (v) Clarity of display of science policy and DS activities; and (vi) Others.
• Scientific life: (i) Accreditation rate of national and international DS doctoral programs at CAMES level; (ii) Regularity of student admission; (iii) Average duration of theses defended; (iv) Effectiveness of quality assurance measurement and self-evaluation of doctoral programs; (v) Compliance with the rules and standards for thesis supervision; (vi) Level of organization of scientific events; (vii) Accessibility of complementary, diversified and credited training (seminars, conferences, teaching) for Ph.D. students; (viii) Degree of contact between Ph.D. students and the professional world; and (ix) Others.
• Outreach and attractiveness: the DS's ability to make itself influential at the international level: (i) Degree of international openness of the DS; (ii) Quality and duration of partnership relationships; (iii) Selectivity and importance of national and international events; (iv) Degree of contribution to the dissemination of scientific culture; and (v) Others.
• Resources: means available to the DS to ensure its functioning, management and scientific and educational activities: (i) Quality and condition of pedagogical equipment and tools; (ii) Quality of scientific productions (papers, books, and others) drawn from the theses defended; (iii) Qualification and sufficiency of administrative and technical staff; (iv) Qualification and sufficiency of teaching and research staff responsible for doctoral supervision; (v) Existence of sustainable, sufficient and predictable funding: DS budget.
• Organizational and scientific perspectives: (i) Availability of skills and resources; (ii) Degree of coherence of the new trainings (seminars, workshops, courses, and others) with regard to the new DS objectives; (iii) Degree of engagement of new partners: presentation of partners; and (iv) Others.

2.2.2 OECD Performance Criteria

The OECD performance indicators listed cover the following themes:

• Results of educational institutions and learning impact: (i) Percentage of graduates per age group; (ii) Rate of tertiary teaching graduates; (iii) Degree of influence of training level on the employment rate; and (iv) Others [32].


• Financial and human resources invested in education: (i) Annual expenditure rate of institutions per pupil/student, per service type, from primary to tertiary teaching [33]; (ii) Percentage of GDP devoted to education; (iii) Tuition fee rates in relation to the percentage of recipients of study loans, scholarships or grants; (iv) Breakdown of operating expenses of primary, secondary and post-secondary non-tertiary education establishments; and (v) Others.
• Access to education, participation, and progression: (i) Enrollment rate per age group; (ii) Rate and profile of new entrants in tertiary education; (iii) Rate and profile of tertiary education graduates; (iv) Rate, profile and professional outlook of Doctorate holders; and (v) Others [34].
• Learning environment and school organization: (i) Supervision rate and class size; (ii) Teacher salary level; (iii) Rate of qualified research teaching staff; (iv) Rate of evaluation systems for teachers and establishment heads; (v) Existence of key decision-making systems in education systems; and (vi) Others [35].

Given the set of investigation and study methods presented, the mixed approach, using mass research (surveys), is the most appropriate to meet the mission of characterizing performance indicators for a DS.

3 Survey Procedures and Analysis of Results

This section presents the procedures used to achieve this study's purpose, from the survey to the analysis of the results. Based on the guidelines defined by the Ministry of Higher Education, Research and Innovation (MESRI), on the requests that the DS themselves had defined, and above all on those expressed by the DS actors who took part in the investigations, the following specific objectives are retained:

• Ensure good and efficient governance and organization of the DS;
• Preserve and strengthen satisfaction with material and immaterial administrative infrastructures;
• Manage and guarantee the sufficiency of material and immaterial research infrastructures;
• Propose timely initiatives.

This study process consists of two main steps: (i) data collection and (ii) analysis of the results obtained.

3.1 Data Collection

To achieve this, a strategy for collecting and studying survey information was carried out with four (4) questionnaires. These surveys were performed on ten (10) DS, representative of those in Senegal, namely:

• The seven (07) DS of Cheikh Anta DIOP University of Dakar (UCAD): (i) "DS of Arts, Cultures and Civilizations" (ED ARCIV); (ii) "DS of Water, Quality and Uses

Determining Key Performance Indicators for a Doctoral School

29

of Water” (EDEQUE); (iii) “DS of Studies on Person and Society” (ED ETHOS); (iv) “DS of Legal, Political, Economic and Management Sciences” (ED JPEG); (v) “DS of Mathematics and Computer Science” (EDMI); (vi) “DS of Life, Health and Environmental Sciences” (EDSEV); and (vii) “DS of Physics, Chemistry, Earth, Universe and Engineering Sciences” (ED-PCSTUI);
• The two (02) DSs of Gaston Berger University of Saint-Louis (UGB): (i) “DS of Science and Technology” (EDST); (ii) “DS of Sciences on Person and Society” (ED SHS);
• The “Doctoral School of Sustainable Development and Society” (E2DS) of the University of THIES.
The questionnaires are addressed to the four DSHs of Senegalese DSs (Administration, TRS, ATSS, and Ph.D. students).

3.2 Results Analysis

Processing and analysis of the results is performed using Excel, a full-fledged Business Intelligence data analysis tool, accompanied by a commentary on the impacts on DS performance. The main themes addressed in this study are the following: (i) characterization and presentation of the concerned population; (ii) working modalities and quality of services; (iii) research and work arrangements; (iv) funding; and (v) theses and defenses. The investigation covered all the components of the ten selected Senegalese DSs. It reached a target population of 1618 Ph.D. students, 1342 Supervisors and Teaching Research Staff, 10 ATSS, and 10 Directors, i.e., a total of 2980 people surveyed. However, the returns on the distributed questionnaires, as presented in Table 1 below, correspond only to: (i) 320 from Ph.D. students; (ii) 44 from Supervisors or Teaching Research Staff; (iii) 5 from the Administrative and Technical Service Staff (ATSS); and (iv) 8 from Directors. That is a total of 377 respondents, for an overall participation rate of 12.65%, detailed in Table 1.
This overall participation rate is not very satisfactory, particularly for teacher researchers and supervisors: of 1342 requests, only 44 responses were obtained, i.e., a rate of 3.28%. This testifies to the difficulty and delicacy of the collection phase. Statistics for Supervisors are not broken down per DS in this table because a teacher or supervisor may belong to multiple DSs. The table also reveals that Ph.D. students are unevenly allocated among DSs. All the DSHs were represented; but at UGB, only the EDST Director responded, and at UCAD, the ED ARCIV secretary and the then ED ETHOS Director did not respond to the questionnaires even after several reminders.
• In addition to the 65% of Ph.D. students who state that they do not have an information platform, the non-availability of classrooms and defense lecture halls and the lack of timely scheduling and information are a source of problems in following modules for 42.1% of Ph.D. students. It would therefore be interesting to offer distance courses, allowing Ph.D. students outside Dakar, or otherwise unavailable, to attend; but above all to implement new, regularly updated websites to: (i) access information in time; (ii) download documents and forms as needed; and (iii) archive and access DS articles and theses.

30

A. Kane et al.

Table 1 Presentation of responses to the questionnaires sent to the DSHs of Senegalese DSs

| Doctoral School | Directors (Quest / Resp / %) | Ph.D. students (Quest / Resp / %) | ATSS (Quest / Resp / %) |
|---|---|---|---|
| ED-ARCIV | 1 / 1 / 100 | 215 / 34 / 15.81 | 1 / 0 / 0 |
| EDEQUE | 1 / 1 / 100 | 123 / 16 / 13 | 1 / 1 / 100 |
| ED-ETHOS | 1 / 0 / 0 | 202 / 15 / 7.42 | 1 / 1 / 100 |
| ED-JPEG | 1 / 1 / 100 | 105 / 17 / 16.19 | 1 / 1 / 100 |
| ED-MI | 1 / 1 / 100 | 66 / 7 / 10.6 | – / – / – |
| ED-PCSTUI | 1 / 1 / 100 | 256 / 92 / 34.76 | 1 / 0 / 0 |
| ED-SEV | 1 / 1 / 100 | 282 / 101 / 35.81 | 2 / 2 / 100 |
| ED-ST | 1 / 1 / 100 | 169 / 9 / 5.34 | 1 / 0 / 0 |
| ED-SHS | 1 / 0 / 0 | 154 / 21 / 13.6 | 1 / 0 / 0 |
| ED-2DS | 1 / 1 / 100 | 46 / 8 / 17.39 | 1 / 0 / 0 |
| TOTAL | 10 / 8 / 80 | 1618 / 320 / 19.59 | 10 / 5 / 50 |

Grand total (all DSHs, including Supervisors): 2980 questionnaires sent, 378 responses, 12.68%.
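As a quick arithmetic cross-check, the participation rates quoted in the text can be recomputed from the questionnaire counts; a minimal Python sketch (counts hard-coded from Sect. 3.2):

```python
# Cross-check of the participation rates quoted in the text:
# questionnaires sent vs. responses returned, per DSH group.
sent = {"Ph.D. students": 1618, "Supervisors/TRS": 1342, "ATSS": 10, "Directors": 10}
returned = {"Ph.D. students": 320, "Supervisors/TRS": 44, "ATSS": 5, "Directors": 8}

def rate(resp: int, quest: int) -> float:
    """Participation rate in percent."""
    return 100 * resp / quest

overall = rate(sum(returned.values()), sum(sent.values()))
print(f"Supervisors/TRS: {rate(44, 1342):.2f}%")  # the 3.28% quoted in the text
print(f"Overall: {sum(returned.values())}/{sum(sent.values())} = {overall:.2f}%")
```

This reproduces the 3.28% supervisor rate and the 12.65% overall rate (377/2980) reported in the prose.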

• The rare participation of Ph.D. students in seminars calls for a more frequent organization of these scientific activities, and in particular of doctoriales (doctoral days) by each DS, with new initiatives: (i) mandatory participation of the Ph.D. student before being allowed to defend the thesis; (ii) involvement and presence of teachers and supervisors; (iii) publication of the best works in an international journal; (iv) motivation of Ph.D. students through a prize-giving ceremony for the best; (v) greater involvement of the socio-professional world.
• Several phenomena cause the long duration of thesis preparation, among which: (i) 60% of responding Ph.D. students are in “professional activity”; (ii) among women, 19.55% have children; (iii) the lack of funding, reported by 69%, and of subsidies blocks publications, since most journals charge fees; (iv) a number of supervised theses exceeding the standards (more than 10) on the part of the Supervisor; (v) thesis interruptions due to several factors; (vi) the low equipment level of research structures, underlined by 66.46% of Ph.D. students, supported by the TRS and confirmed by almost all DS Directors; (vii) late payment of jury members; (viii) failure to cover the expenses of external jury members.
• The financial problems suffered by Ph.D. students and Supervisors are deplored by the ATSS (who receive no compensation from the DSs), but especially by the DS Directors, who cannot meet their institute's needs in material and immaterial infrastructures on time, nor the expectations of the other three DSHs, and therefore cannot achieve their objectives: Senegalese DSs lack financial autonomy from the Rectorate. The performed analysis reveals that this dependence of almost all DSs in Senegal on the Rectorates is not only financial but also administrative. These institutes have an ATSS deemed insufficient by the Directors, due to the lack of availability


of a financial manager, a mail agent, and even an administrative assistant. Defense rooms are not available; Senegalese DSs do not have a defense amphitheater. Moreover, the number of Ph.D. students able to defend their thesis per year, given by the Directors, varies from one DS to another, and these figures are only approximate. It was therefore found appropriate to strengthen and complete this research with a second expertise, established from exact data and statistics taken at the source, in order to obtain confirmed and decisive information. This investigation was carried out at one university in Senegal: Cheikh Anta DIOP University of Dakar (UCAD). This choice is justified by the fact that it is the largest university in Senegal and one of the largest in Africa, bringing together a very large number of students of many nationalities. In addition, 70% of the DSs covered by this research are hosted there. The studies carried out therefore focused on: (i) the cohort of students registered in the first year in 2011 who reached the doctorate (a trajectory from cycle L to cycle D); (ii) all Ph.D. students due to defend their thesis, and Doctors trained, in 2018. They reveal, on the one hand, that among 29,134 students enrolled in the first year, 1,089 passed the M cycle and enrolled in a Doctorate, i.e., a rate of 3.74%. On the other hand, among these 29,134 registered students, 11,315 are female, i.e., a rate of 38.84%. Better still, of these 11,315 female students enrolled in a Senegalese public university, 400, i.e., a rate of 3.53%, enrolled in a doctoral thesis. This runs contrary to Goal No. 5 of the Sustainable Development Goals (SDGs) by 2035 for a country (Gender equality). However, this rate is almost equal to the average; hence the need to encourage research among women.
This represents an interesting finding for donors, funders, and international organizations such as UNESCO. This low rate of students entering cycle D means one of two things:
• Either students are not interested in the doctorate at all;
• Or the success rate of Master students is low, which shows a performance lack in the L and M cycles and therefore has an impact on the D cycle.
This UCAD cohort of 2011, which entered the first year of a thesis in 2015–2016, is followed along with all other registrants. Analysis of the results shows that among the 2377 Ph.D. students enrolled in 2017–2018:
• The total number of Ph.D. students with three (3) or more registrations (i.e., “who were due to defend their thesis”) is 1975, an overall rate of 83.08%; among the latter, 516 are female, i.e., 26.12%;
• The total number of Ph.D. students who defended at the end of 3 years is 19, an overall percentage of 0.96%, of which 4 are female (21.05%). Out of 100 Ph.D. students able to defend their thesis, 0.77 defend on the third registration (after 3 years);
• The total number of Ph.D. students who defended at the end of 4 years is 44, an overall percentage of 2.22%, of which 9 are female (20.45%). Out of 100 Ph.D. students able to defend their thesis, 1.74 defend on the fourth registration;


• The total number of Ph.D. students who defended at the end of 5 years is 57, an overall percentage of 2.88%, of which 11 are female (19.29%). Out of 100 Ph.D. students able to defend their thesis, 2.13 defend in five years;
• The total number of Ph.D. students who defended at the end of (AEO) 6 years is 41, an overall percentage of 2.076%, of which 14 are female (36.58%). Out of 100 Ph.D. students able to defend their thesis, 2.90 defend on their sixth registration;
• Messed up (failed) theses are those of Ph.D. students registered more than six (6) times, i.e., who have not defended after 6 years (even if the Ph.D. student returns one or two years later to defend, the thesis is considered messed up). This study shows that 101 theses failed, an overall percentage of 5.11%, of which 26 are by female students (25.74%).
This reveals that the total number of Ph.D. students who defended, compared to the 1975 who were due to defend their thesis in 2017–2018 at UCAD, is 200, i.e., an overall rate of 10.13%. These results are presented for each DS in Table 2. By categorizing this study, the results in Table 3 are obtained. Table 3 shows, on the one hand, that those who defend at the end of three (3) years are, in 69% of cases, under thirty-five (35) years old: mostly single people without children and not in professional activity (PA). On the other hand, those who defend at the end of six (6) years, or whose theses are messed up, are in 70.3% of cases over thirty-five (35) years old: often married with children, especially among women; in addition to a family to take care of, these women are in PA. This shows once again that the factors “marital status”, PA, and funding have a great impact on Ph.D. students' success. A study over four (4) years of theses defended at the DS level shows that the same supervisors complete the theses of their Ph.D.
students on time, while the same others repeatedly drag theses out to five or six years or let them fail. In this context, the example is given of a teacher researcher who: (i) before reaching the rank of “Professor”, had 9 Ph.D. students defend on time, 2 at the end of 5 to 6 years, and 1 messed up; (ii) after becoming a “Professor”, had 2 Ph.D. students defend on time, 11 at the end of 5 or 6 years, and 3 messed up. The reason is that on-time completion was demanded of them beforehand: once tenured, teacher researchers complete half as many theses on time. So, the motivation of supervisors has a big impact on Ph.D. students' success. At this stage, the decisive indicators for the performance of a DS can be specified.
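The outcome rates listed above can be recomputed from the reported counts; a minimal sketch (counts hard-coded from the text; note that some percentages in the chapter appear truncated rather than rounded, e.g. 44/1975 rounds to 2.23% but is reported as 2.22%):

```python
# Recomputing the UCAD 2017-2018 thesis-outcome rates from the counts
# reported above: 1975 Ph.D. students were able to defend their thesis.
ABLE = 1975
outcomes = {
    "defended after 3 years": 19,
    "defended after 4 years": 44,
    "defended after 5 years": 57,
    "defended after 6 years": 41,
    "messed up (> 6 registrations)": 101,
    "total having defended": 200,  # as reported in the text
}
# Percentage of each outcome relative to those able to defend.
rates = {label: 100 * n / ABLE for label, n in outcomes.items()}
for label, n in outcomes.items():
    print(f"{label}: {n} ({rates[label]:.2f}%)")
```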

4 Determination of Performance Indicators for a Doctoral School

In this section, a sampling of the main performance indicators deemed most relevant is carried out. These indicators are determined, on the one hand, from a combination of the most relevant among the existing ones and, on the other hand, from their impact, effectiveness, and efficiency on the limitations to be resolved. The most

Table 2 Presentation of Ph.D. students who defended their thesis compared to those able to defend, per DS

| DS | Able to defend (N / %) | Defended AEO 3 years (N / %) | Defended AEO 4 years (N / %) | Defended AEO 5 years (N / %) | Defended AEO 6 years (N / %) | Messed up theses (N / %) | Total having defended (N / %) |
|---|---|---|---|---|---|---|---|
| ED-ARCIV | 182 / 81.25 | 1 / 0.55 | 4 / 2.2 | 6 / 3.29 | 4 / 2.2 | 8 / 4.39 | 23 / 12.64 |
| EDEQUE | 88 / 81.48 | 0 / 0 | 0 / 0 | 3 / 3.4 | 1 / 1.13 | 5 / 5.68 | 9 / 10.23 |
| ED-JPEG | 724 / 87.86 | 8 / 1.10 | 10 / 1.38 | 8 / 1.1 | 8 / 1.1 | 58 / 8.01 | 42 / 5.8 |
| ED-ETHOS | 184 / 70.76 | 0 / 0 | 1 / 0.54 | 2 / 1.08 | 6 / 3.26 | 4 / 2.17 | 13 / 7.06 |
| ED-MI | 155 / 78.28 | 3 / 1.93 | 5 / 3.22 | 6 / 3.87 | 3 / 1.93 | 8 / 5.16 | 22 / 14.19 |
| ED-PCSTUI | 304 / 84.4 | 5 / 1.64 | 12 / 3.94 | 12 / 3.94 | 4 / 1.31 | 11 / 3.6 | 39 / 12.83 |
| ED-SEV | 338 / 83.87 | 2 / 0.59 | 12 / 3.55 | 20 / 5.91 | 15 / 4.43 | 7 / 2.07 | 52 / 15.38 |
| TOTAL | 1975 / 83.08 | 19 / 0.96 | 44 / 2.22 | 57 / 2.88 | 41 / 2.07 | 101 / 5.11 | 200 / 10.13 |

Table 3 Presentation of thesis defenses according to the field of study

| Domain (DSs) | Able to defend (N / %) | Defended AEO 3 years (N / %) | Defended AEO 4 years (N / %) | Defended AEO 5 years (N / %) | Defended AEO 6 years (N / %) | Messed up theses (N / %) | Total having defended (N / %) |
|---|---|---|---|---|---|---|---|
| Sciences and Techniques (ED-MI, ED-SEV, ED-PCSTUI, EDEQUE) | 885 / 82.78 | 10 / 1.13 | 29 / 3.27 | 41 / 4.63 | 23 / 2.6 | 31 / 3.5 | 122 / 13.78 |
| Political and Human Sciences (ED-ARCIV, ED-ETHOS, ED-JPEG) | 1090 / 83.33 | 9 / 0.82 | 15 / 1.37 | 16 / 1.46 | 18 / 1.65 | 70 / 6.42 | 78 / 7.16 |
| TOTAL | 1975 / 83.08 | 19 / 0.96 | 44 / 2.22 | 57 / 2.88 | 41 / 2.07 | 101 / 5.11 | 200 / 10.13 |
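The domain-level rows of Table 3 can be derived by aggregating the per-DS counts of Table 2; a minimal sketch (counts hard-coded from Table 2) confirming the 885/1090 split:

```python
# Deriving the domain-level rows of Table 3 from the per-DS counts of
# Table 2: (able to defend, total having defended) per doctoral school.
table2 = {
    "ED-ARCIV": (182, 23), "EDEQUE": (88, 9), "ED-JPEG": (724, 42),
    "ED-ETHOS": (184, 13), "ED-MI": (155, 22), "ED-PCSTUI": (304, 39),
    "ED-SEV": (338, 52),
}
domains = {
    "Sciences and Techniques": ("ED-MI", "ED-SEV", "ED-PCSTUI", "EDEQUE"),
    "Political and Human Sciences": ("ED-ARCIV", "ED-ETHOS", "ED-JPEG"),
}
totals = {}
for domain, ds_names in domains.items():
    able = sum(table2[ds][0] for ds in ds_names)
    defended = sum(table2[ds][1] for ds in ds_names)
    totals[domain] = (able, defended)
    print(f"{domain}: able={able}, defended={defended} ({100 * defended / able:.2f}%)")
```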


relevant are defined on the basis of the analysis carried out, which constitutes the originality of the adopted strategy. They cover the four major components of a DS. The specified indicators must meet the SMART criteria (Specific, Measurable, Achievable, Relevant and Time-oriented) and are forty-eight (48) in number. Depending on the thematics, the criteria retained in this research work are listed below.

4.1 Doctoral School Governance

4.1.1 Organization

(1) Constitution of decision-making bodies. This is an important performance criterion that influences decision-making, and therefore the governance of a DS. It is measurable by the percentage, appointment, and presence of representatives of Ph.D. students, TRS, and ATSS on the Scientific and Pedagogical Council (SPC) of the DS;
(2) Existence (Yes/No) of an admission campaign for Ph.D. students. This is a good indicator for recruiting Ph.D. students, and also very important given the marketing aspect to be developed;
(3) Existence (Yes/No) of an admission system for Ph.D. students: an indicator that facilitates the task;
(4) DS training accreditation rate with CAMES and national quality agencies: an important indicator for the DS training offer.

4.1.2 Research and Work Facilitation Devices

(1) Regular holding of meetings: average rate per number. This indicator is convincing in the context of the participation of DSHs in decision-making;
(2) Regular website updates. This is a fundamental cross-cutting indicator, measurable by the satisfaction rate or by the administrator's last access, covering: (i) information and procedures for the selection and evaluation of Ph.D. students; (ii) adequacy of conditions for managing Ph.D. students' registrations and files;
(3) Existence of an archiving system for thesis defenses. This is an interesting indicator of the performance, or otherwise, of a research institute.
It allows, on the one hand, knowing the number of defended theses with all corresponding information and, on the other hand, the availability of specialized bibliographic resources. It thus facilitates bibliographic research for Ph.D. students preparing their theses and helps avoid or reduce the possibility of plagiarism;
(4) Availability of business intelligence systems or models. This is an essential indicator for evaluating and managing the performance of a service, by helping to make adequate decisions when faced with a problem.

4.1.3 Financial Management

(1) DS administration financial reports. This is a decisive indicator for ensuring transparency in the management of financial resources, in order to know the decisions to be made regarding the use of the budget;


(2) Rate of financing of the DSHs' material needs and scientific activities. This is an indicator for estimating the DS' level of equipment and support for scientific and research activities;
(3) Level of compensation for thesis jury members: a determining indicator that impacts the motivation of the TRS and supervisors, and consequently the number of theses defended;
(4) Payment of thesis allowances on time and coverage of external jury members. This is an influential indicator, as it affects whether or not defenses are held, and therefore the number of theses defended. It is measurable by the percentage paid on the thesis defense day;
(5) Fairness and equity in personnel management in relation to responsibilities: an ATSS motivation indicator that impacts work progress;
(6) Substantial granting of compensation to the ATSS: a motivation indicator for the ATSS, influencing the DS performance.

4.2 Satisfaction with Material and Immaterial Infrastructures

4.2.1 Working Methods: Training, Supervision, Administration

(1) Internet connection infrastructure in DS premises. It is essential for administrative work and research, measurable by good accessibility to the network (throughput);
(2) Ratio of qualified Administrative and Technical Service Staff (ATSS) to Ph.D. students. This indicator is eminent insofar as it gives an idea of the number of Ph.D. students attended by one ATSS, which expresses the sufficiency, or not, of the ATSS responsible for Ph.D. students;
(3) Satisfaction with the secretariat's help with administrative procedures. This is a considerable indicator as it impacts the number of registrants. It is measurable by the percentage of satisfaction with the secretariat (Ph.D. students, TRS, visitors);
(4) Ratio of Teaching Research Staff (TRS) authorized to supervise research work (researchers of masterly rank, Professors, Lecturers and CAMES Research Masters) to Ph.D. students enrolled. This is a fundamental indicator, giving the number of Ph.D. students supervised per TRS. It therefore helps measure whether the supervision rules and standards are respected, and gives a better view of the remaining “rate of TRS authorized to supervise”;
(5) Supervisor availability. The importance of this indicator is explained by its influence on the thesis duration of Ph.D. students. It is measurable by the number of work sessions per month;
(6) Rate of Ph.D. students per unit of premises. This indicator measures the availability of premises in relation to the total number of Ph.D. students. Its importance is also justified by its direct impact on the duration of the Ph.D. student's thesis;
(7) Equipment level of premises (machines, IT equipment and supplies, office equipment and furniture, availability of logistical means (vehicles), and others). The relevance of this indicator is demonstrated by its direct impact on the duration of the tasks to be accomplished by the ATSS and the TRS, on the duration of Ph.D. students' theses, and therefore on the graduation rate. It is measurable by the percentage satisfied with the equipment;
(8) Frequency of organization of diversified additional training for Ph.D. students or employees (seminars, workshops, conferences, capacity building for the ATSS, and others). This good criterion measures the development of scientific activities from the number organized;
(9) Participation rate in scientific events. This is an interesting indicator that provides an opinion on the success of scientific activities. It is measurable by favorable responses (presence of at least 2 people who are references in the field; presence of TRS and Ph.D. students at 50 to 100%; national and international participation) or by satisfaction with the organization;
(10) Rate of DS partnerships with spinoffs (advantages) for the socio-economic development of the country: a decisive indicator for access to diversified financial resources;
(11) Degree of international openness of the DS. This is an appropriate indicator that can be measured by many criteria, such as: (i) the rate of foreign Ph.D. students received, the rate of co-supervision, the granting of open training(s); (ii) the rate of DS partnerships with national and international research structures; (iii) the proportion of Ph.D. students participating in national and/or international activities; (iv) the quota of research projects funded at the international level; (v) the rate of Ph.D. students awarded at national or international level; (vi) the number of indexed publications in international journals. The success of scientific activities is a good indicator of the degree of international openness of a DS;
(12) Quality of the supervision methodology: an important indicator measurable by: (i) the degree of participation of Ph.D. students in the life of research structures, measurable by the number (list) hosted by each research structure; (ii) the existence of a monitoring mechanism for the research work (DS internal regulations; document presenting the monitoring mechanism; meeting minutes of the monitoring mechanism);
(13) Frequency of presentation sessions for Ph.D. students: an indicator influencing the progress of Ph.D. students' theses and thus the number of thesis defenses;
(14) Availability of teachers per module: an important indicator for monitoring modules.
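The supervision-ratio indicator described above (item 4) can be computed and checked against a supervision standard; a minimal sketch with hypothetical figures (the ceiling of 10 theses per supervisor is the standard mentioned in the survey analysis of Sect. 3.2):

```python
# Sketch of the supervision-ratio indicator: average number of enrolled
# Ph.D. students per TRS authorized to supervise. Figures are hypothetical;
# the ceiling of 10 theses per supervisor follows the chapter's survey analysis.
MAX_THESES_PER_SUPERVISOR = 10

def supervision_ratio(enrolled_phd: int, authorized_trs: int) -> float:
    """Ph.D. students per authorized supervisor (lower is better)."""
    return enrolled_phd / authorized_trs

def within_standard(enrolled_phd: int, authorized_trs: int) -> bool:
    """True if the average load respects the supervision standard."""
    return supervision_ratio(enrolled_phd, authorized_trs) <= MAX_THESES_PER_SUPERVISOR

# Hypothetical DS: 215 enrolled Ph.D. students, 18 authorized supervisors.
print(f"ratio = {supervision_ratio(215, 18):.1f}, "
      f"within standard: {within_standard(215, 18)}")
```

A ratio above the ceiling flags the over-supervision problem the survey identified (supervisors with more than 10 theses).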

4.2.2 Financial Resources

(1) Existence of a sufficient and predictable budget: an essential criterion for DS performance. It is measurable by the availability of Educational Registration Rights, the distribution percentage, and the management report;
(2) Diversified financial resources. This is an influential indicator, which can be measured from: (i) research contracts and the provision of expertise or resources; (ii) support for the organization of scientific projects or activities; (iii) donations from individuals obtained with or without academic partners; (iv) funding mobilized for thesis work; and (v) others.


4.3 Results of the Defenses and Prospects of a Doctoral School

4.3.1 Theses Defended: Graduates

(1) Duration of thesis preparation: an important indicator for specifying a DS' performance;
(2) Rate of publications in ISI-indexed journals per Ph.D. student. This is a fundamental performance indicator, as it advises on the degree of international openness of a DS and gives a good idea of its effectiveness;
(3) Rate of theses defended at the end of three (3) years compared to the rate of registrants able to defend their thesis, according to: (i) the nature (or source) of funding (fellows and Ph.D. students in PA); (ii) marital status (married with children); (iii) the number of theses supervised by the supervisor; (iv) the laboratory equipment level; (v) interruptions during the doctoral course;
(4) Rate of theses defended at the end of four (4) years compared to the rate of registrants able to defend their thesis, according to the same criteria as (3);
(5) Rate of theses defended at the end of five (5) years compared to the rate of registrants able to defend their thesis, according to the same criteria as (3);
(6) Rate of theses defended at the end of six (6) years compared to the rate of registrants able to defend their thesis, according to the same criteria as (3);
(7) Rate of messed up theses (having lasted more than 6 years) compared to the rate of registrants able to defend their thesis, according to the same criteria as (3).
These are decisive result indicators, presented as ratios between primary indicators. They make it possible to measure the success rate of Ph.D. students, the average duration of defended theses, and the rate of failed theses, according to a set of criteria. Therefore, they give an idea of the decisions to be taken to remedy problems and ensure good performance of the institute.
(8) Thesis defense rate per given supervisor. This is an interesting indicator which gives an idea of the PV and the progress of the teacher researcher.

4.3.2 Objectives and Perspectives of Operation and Evolution of a Doctoral School

(1) Temporal percentage of profit of websites with publication capacity, kept updated;
(2) Advantages of increasing the regulatory deadline for a thesis;
(3) Advantages of granting doctoral degrees within the scientific scope of the DS;
(4) Time percentage of profit from independent financial management and availability: timely response to the infrastructure needs of all stakeholders; subsidies and even funding for Ph.D. students; payment of thesis jury members on time; granting of compensation to deserving ATSS; organization and financing of scientific activities;
(5) Percentage of productivity of training created over time;
(6) Advantages of work platforms for efficient programming and calculations;
(7) Percentage of profit from the establishment and compulsory follow-up of a course in “Thesis methodology”;
(8) Percentage of profit from the presence of supervisors and Ph.D. students at doctoral days;
(9) Percentage of interest time of a DIS. These are appropriate criteria which give an idea of a DS' degree of evolution;
(10) Development degree of the institute. With the rapid evolution of training and research needs in new fields, a DS must adopt new alternatives. In addition to the above-mentioned outlook indicators, the number of programs or streams created, substantially modified or closed, and the average duration of creation of new programs, on an annual basis, are considerable indicators of the development degree of an institute.

5 Discussion

To determine the performance criteria of a DS, a study of indicators relevant to this context and existing in the literature was performed. Thus, those submitted in UNESCO's REED-CAMES are listed together with those from the most recent OECD editions, which allowed a substantial collection of indicators. As the latter are not yet systematized for a decision support system (SIAD), several shortcomings arise: a limitation to be addressed. Therefore, suitable indicators, meeting the expectations of all the DSHs of a DS, need to be determined. In this context, surveys concerning the four major components were carried out at the ten largest DSs in Senegal. The analysis of these survey results allowed examining a set of questions and discovering several realities. In addition to these innovations, a second expertise with exact data, taken at the source, was carried out for reinforcement. It concerns a university cohort (UCAD, 2011, having reached the Doctorate in 2015) and the rate of Doctors trained three years later (in 2018). The analysis of these data shows that:
• The thesis defense rate in three (3) years is 0.96%, a very low percentage. Only at the end of four (4) years, or even five (5) to six (6) years, are Doctor rates of 2.22%, 2.88%, and 2.076% reached. This testifies that the regulatory deadline for a thesis, set at three (3) years, is short for Senegalese Ph.D. students; hence the need to increase it.
• However, it should also be noted that the duration of thesis preparation, the Ph.D. students' success rate, and thus DS performance do not depend entirely on the deplorable factors with which the DSHs are confronted, but also on the DS' organization and seriousness in work. This is proven firstly by the higher rates of Ph.D. students able


to defend in literary disciplines, while the thesis defense rate is higher in scientific fields, at the end of 3, 4, 5, and 6 years alike; and, conversely, by a higher rate of messed up theses in these literary fields (see Table 3).
The factors and criteria that impact a DS performance, deemed the most relevant and consistent, are as follows: (i) sufficiency of ATSS and TRS, financial autonomy, availability and equipment of premises, the holding of a “Thesis methodology” course at the DS level, organization of DSs in their work, relations with internal or external cooperation, integration of the SPC in decision-making, provision of a DIS; (ii) the marital status, professional activity, and funding of Ph.D. students; (iii) the promotion of teacher researchers, the maximum number of Ph.D. students to supervise, the level of compensation and on-time payment of jury members, and the coverage of external members; (iv) additional training and compensation for the ATSS; (v) the regulatory deadline for a thesis; and (vi) others. They constitute parameters that have a direct impact on the thesis defense rate during the year, and therefore on a DS performance.

6 Conclusion and Future Works

One of the greatest contributions of this study consists in the discovery and demonstration of the main factors which impact the thesis defense rate, and therefore DS performance. Faced with this situation, the following is proposed. Firstly:
• An increase in the regulatory deadline for a doctoral thesis in Senegal to four (4) years. This represents an important decision for DS performance, because the three-year thesis in Senegal is a “chimera” given that: (i) students here carry gaps from cycle L to cycle M; (ii) the Ph.D. student must be empowered to become an expert in his field, and not only in his subject; (iii) Senegal's working conditions, as demonstrated, are far from adequate;
• A revaluation and harmonization of the level of compensation for jury members: a fundamental and motivating initiative for the TRS, which develops thesis defenses and thus the number of Doctors trained;
• A decisive granting of indemnities to the ATSS: an act encouraging them to better fulfill their duties and tasks;
• Autonomous financial management: this will allow Directors to meet the expressed needs and thus satisfy the DS' DSHs;
• Better organization of DSs at work.
Secondly: the implementation of a tool to overcome the identified factors. This is the subject of our future work: a new decision support methodology in this direction, a Prospective Dashboard [27, 37, 38], is proposed, based on a choice of the indicators preponderant for DS performance, among those determined [24].

Determining Key Performance Indicators for a Doctoral School

41

Acknowledgements. Many thanks to all those who supported and assisted us in the publication of this article: the DS Directors and Heads of Laboratories. This study was done at the Dakar Computer Laboratory (LID), Faculty of Sciences and Techniques (FST), Cheikh Anta Diop University of Dakar (UCAD), Senegal, and it is partially funded by FCT/MCTES through national funds and, when applicable, co-funded by EU funds under the Project UIDB/50008/2020; and by the Brazilian National Council for Scientific and Technological Development (CNPq), via Grant No. 313036/2020-9.

References

1. Naeem M, Moalla N, Ouzrout Y, Bouras A (2017) A business collaborative decision making system for network of SMEs. In: International conference on product lifecycle management 2017 (PLM 2017), IFIPAICT, vol 492, pp 99–107. First online: 12 Mar 2017
2. Didier V (2021) Business intelligence. http://perso.univ-lyon1.fr/haytham.elghazel/BI/presentation.html. Accessed Mar 2021
3. Kimball R, Reeves L, Ross M, Thornthwaite W (2008) The data warehouse project conduct guide. Eyrolles, Feb 2005. "The Data Warehouse Lifecycle Toolkit", 2nd edn
4. Qushem UB, Zeki AM, Abubakar A (2017) Successful business intelligence system for SME: an analytical study in Malaysia
5. Taher Y (2014) Architecture of an election-oriented business intelligence system. J Inform Syst Technol Manag 11(3). ISSN 1809–2640. Guidance and Educational Planning Center, Azzaitoune Street
6. Niaz Arifin SM, Madey GR, Vyushkov A, Raybaud B, Burkot TR, Collins FH (2017) An online analytical processing multi-dimensional data warehouse for malaria data. Database, vol 2017, bax073. Published: 01 Jan 2017
7. Ayadi MG, Bouslimi R, Akaichi J (2016) A framework for medical and health care databases and data warehouses conceptual modeling support. Irstea, Campus des CEZEAUX, 63173 Aubiere, France
8. Boulil K, Bimonte S, Pinet F (2020) Conceptual model for spatial data cubes: a UML profile and its automatic implementation 38:113–132. Accessed Aug 2020
9. Bimonte S, Boucelma O, Machabert O, Sellami S (2016) A new spatial OLAP approach for the analysis of volunteered geographic information. Comput Environ Urban Syst 48:111–123
10. Vernier F, Miralles A, Pinet F, Gouy V, Carluer N, Molla G, Petit K (2020) EIS pesticides: an environmental information system to characterize agricultural activities and calculate agro-environmental indicators at embedded watershed scales. Accessed Nov 2020
11. http://igm.univ-mlv.fr/~dr/XPOSE2006/DELTIL_PEREIRA/processus.html. Accessed Mar 2021
12. Kim JA, Choi MH, Cho I (2020) Implementation of clinical DSS architecture. In: Proceedings of international conference on future generation information technology, LNCS, vol 7105, pp 371–377. Accessed Dec 2020
13. Order of August 7, 2006 relating to doctoral training. Off J French Repub, 24 Aug 2006, version abrogated on 01 Sep 2016. https://www.legifrance.gouv.fr/affichTexte.do?cidTexte=JORFTEXT000000267752
14. Braida B, Peyaud JB (2021) Enquête sur la situation des doctorants. http://droit.dentree.free.fr/hopfichiers/enquetesurlasituationdesdoctorants.pdf. Accessed Feb 2021

42

A. Kane et al.

15. Maresca B, Dupuy C, Cazenave A (2021) Enquête sur les pratiques documentaires des étudiants, chercheurs et enseignants chercheurs de l'Université Pierre et Marie Curie (Paris 6) et de l'Université Denis Diderot (Paris 7). CREDOC. Accessed Feb 2021
16. Calmand J, Epiphane D et al (2014) Analyse critique des indicateurs d'établissements et méthodologie des enquêtes auprès des recruteurs. GTES, Relief 47, Marseille, May 2014
17. Paris-Est University (2011) Résultats de l'enquête sur les pratiques et besoins documentaires auprès des doctorants et chercheurs du PRES Université Paris-Est
18. Galdemar V, Gilles L, Simon MO (2012) Performance, efficacité, efficience
19. BS ISO 11620 (2014) Information and documentation—Library performance indicators. Second edition 2008-08-15, revised by ISO 11620:2014. https://www.ugc.edu.co/pages/juridica/documentos/institucionales/Norma_BS_ISO_11620_informacion_indicadores_rendimiento.pdf
20. Joutei HB (2009) Guide des indicateurs de performance. UM5A
21. Lorino P (2003) Méthodes et pratiques de la performance, 3rd edn. Organization Publishing, Paris
22. Tavenas F (2003) Assurance qualité: référentiel partagé d'indicateurs et de procédures d'évaluation. Europe Latine Universitaire
23. (2012) http://tssperformance.com/les-caracteristiques-dun-bon-indicateur-de-performance/#.V-zu5IjJyM8
24. Voyer P (2021) Tableaux de bord de gestion et indicateurs de performance, 2nd edn. Presses de l'Université du Québec, p 446. Accessed Jan 2021
25. Fahd M, Bouchra L, Elaami S (2014) Designing a system of performance indicators in industrial safety. Int J Innov Appl Stud 7(2):571–587. ISSN 2028–9324
26. Fernandez Nodesway A (2014) Balanced scorecard. https://www.etudier.com/dissertations/Balanced-Scorecard/290829.html
27. Mendoza C, Delmond MH, Giraud F, Löning H (2002) Tableaux de bord et balanced scorecards. Groupe Revue Fiduciaire, Paris
28. Roger MAG, Raoul KR (2020) Cours d'initiation à la méthodologie de recherche. https://www.dphu.org/uploads/attachements/books/books_216_0.pdf. Accessed Aug 2020
29. El Morhit M (2015) The research method. In: Conference: ISAG, Hay Riad, Rabat, Morocco, vol 3. https://www.researchgate.net/publication/272579936
30. Borras I, Boudier M, Calmand J, Epiphane D, Ménard B (2014) Analyse critique des indicateurs d'établissements et méthodologie des enquêtes auprès des recruteurs. Relief 47, May 2014
31. CAMES (2017) CAMES doctoral schools evaluation repositories. https://www.lecames.org/wp-content/uploads/2019/06/Re%CC%81fe%CC%81rentiel-Evaluation-Ecoles-Doctorales-CAMES.pdf
32. Education at a Glance 2016: OECD indicators. https://www.oecd-ilibrary.org/fr/education/regards-sur-l-education-2016_eag-2016-fr
33. Education at a Glance 2017: OECD indicators. https://www.oecd-ilibrary.org/fr/education/regards-sur-l-education-2017_eag-2017-fr
34. Education at a Glance 2018: OECD indicators. https://www.oecd-ilibrary.org/fr/education/regard-sur-l-education-2018_eag-2018-fr
35. Education at a Glance 2019: OECD indicators. 10 Sep 2019. https://www.oecd-ilibrary.org/education/regards-sur-l-education-2019_6bcf6dc9-fr
36. Houssin D, Geib JM (2019) Rapport d'évaluation de l'école doctorale N° 531 Engineering and Environmental Sciences of Paris-Est University. Wave E 2015–2019, evaluation campaign 2013–2014, AERES, Training and diplomas section
37. Berland N (2009) Mesurer et piloter la performance. e-book, www.management.free.fr
38. Kaplan RS, Norton DP (2021) The balanced scorecard: translating strategy into action. HBS Press, Boston, p 35. Accessed Jan 2021

Cloud Attacks and Defence Mechanism for SaaS: A Survey Akram Harun Shaikh(B) and B. B. Meshram Department of Computer Engineering, Veermata Jijabai Technological Institute (VJTI), Mumbai, India {ahshaikh_p18,bbmeshram}@ce.vjti.ac.in

Abstract. Cloud computing systems are the de facto deployments for user data storage and processing requirements. With the wide variety of cloud systems available today, the number of services provided by these systems has increased. These services range from software-based systems to high-end hardware-based infrastructures. This variety attracts a lot of attention from malicious hackers, which makes cloud deployments among the most cyber-attacked entities. Here, we review the attacks on and issues of one cloud computing entity, namely Software as a Service (SaaS). These attacks are quantified at both the micro level and the macro level. This paper also discusses the different solutions for these attacks, identifies the gaps in each solution, and recommends methods which can be adopted to further improve the discussed solutions. Keywords: Cloud · Attacks · IaaS · PaaS · SaaS · Solutions · Deployment

1 Introduction

The number of attacks in the cloud is ever-increasing. This is because more and more businesses and customers are shifting from local server-based deployments to cloud deployments. As a result, cloud deployments have become a playing field for penetration testers, white-hat hackers, black-hat hackers and other security personnel [1]. These attacks can be software-based, platform-based or infrastructure-based. Software-based attacks can target SaaS, PaaS and IaaS deployments, while platform-based and infrastructure-based attacks usually target only the PaaS and IaaS layers, respectively. Figure 1a summarizes the attacks that can affect each of these cloud layers. To counter these attacks, researchers from different domains have suggested different techniques [2]. For instance, to remove SQL injection and XSS attacks, techniques like regular expression matching and deep-learning-based classification have been proposed. Different kinds of attacks require different kinds of approaches, and these approaches differ in computational complexity, time complexity, algorithmic complexity and other measures of complexity. It is very difficult for researchers to evaluate each and every method manually and then apply it to their own research [3]. Thus, to address this issue, in this paper SaaS attacks and their defense mechanisms have © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_4


been studied and identified. These mechanisms allow researchers to identify the best practices in each of these domains and then select the ones best suited to their particular application. Moreover, in this paper we have also identified the different cloud architectures and their components; Fig. 1b showcases these components. This text also covers the security aspects of these components. At the end, we conclude the paper with some finer observations specific to cloud security, and on how its performance and security level can be improved with the help of more sophisticated computational structures.
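The regular-expression matching mentioned above as a defence against SQL injection and XSS can be illustrated with a short sketch. The signature patterns below are deliberately small, hypothetical examples chosen for this illustration, not the rule sets of any cited work:

```python
import re

# Illustrative, deliberately small signature sets for two common SaaS attacks.
# Real deployments pair such filters with parameterized queries, output
# encoding, and the learning-based classifiers cited in Table 1.
SIGNATURES = {
    "sql_injection": re.compile(
        r"(\bUNION\b\s+\bSELECT\b|\bOR\b\s+1\s*=\s*1|--|;\s*DROP\b)",
        re.IGNORECASE),
    "xss": re.compile(
        r"(<\s*script\b|javascript\s*:|on\w+\s*=)",
        re.IGNORECASE),
}

def scan_parameter(value: str):
    """Return the attack classes whose signatures match the given value."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(value)]

print(scan_parameter("id=1 OR 1=1 --"))             # flags sql_injection
print(scan_parameter("<script>alert(1)</script>"))  # flags xss
print(scan_parameter("plain search text"))          # no match
```

Matching of this kind is cheap but brittle against obfuscated payloads, which is one reason the surveyed works repeatedly report limited detection accuracy for signature-based defences.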

Fig. 1 a Category of cloud security


Fig. 1 b Category of cloud security

2 Cloud Issues, Attacks and Defence Mechanisms

2.1 SaaS Security Issues

SaaS-based cloud architectures have different components such as web browsers, web servers, database servers, etc. Each of these components has its particular security challenges in the cloud, and these challenges often generalize across components. A detailed comparative study of these components is given in [1], wherein the authors mention cloud challenges and issues such as data loss, issues in transparency, issues in virtualization, issues in multi-tenancy, and managerial issues. The loss-of-control issues include, but are not limited to, loss of data and data breaches, regional considerations during data transmission and storage, and low-cost data sources and data analytics. Due to these data-control issues, components like web browsers have to be closely monitored by governing agencies, to make sure that no data is compromised through them. Other components such as web servers are also susceptible to data breaches, wherein the attacker might not get access to the actual data but can get access to patterns in the data, which in turn leads to privacy issues from the user's perspective. Lack-of-transparency issues generally concern the data-provision intentions of the cloud service providers. Governing agencies ask for user-specific data whenever a red flag is triggered. In the presence of such a request, it is the duty of the provider to be transparent and share full details with the requesting authority. But, due to managerial or user-based disclosure issues, cloud providers sometimes do not share this information and instead feed incorrect or inappropriate information, thereby raising transparency issues within the cloud.
To solve these issues, the authors mention techniques such as encryption of data, access-control rules, auditing data integrity using third-party auditors, physical and virtual isolation of the cloud deployment, and software-based trust-establishment techniques. Trust-establishment algorithms [2] such as relationship management, trust packets, etc. have been proposed by researchers


over the years. A second variety of trust-based solutions are hardware-based trust solutions [2], wherein researchers use hardware chips to establish trust among common hardware devices, thereby reducing the chances of attack. Moreover, government agencies can set up rules to further strengthen cloud security. Since such rules apply to a very wide audience, it is recommended that governments employ public-interest rules in order to further strengthen the presence of cloud deployments in a particular geography.

2.2 Cloud Attacks and Defence Mechanism for SaaS

Table 1 summarizes the different kinds of attacks on SaaS-based cloud systems, the defence mechanisms proposed for them, and the gaps identified in each defence mechanism.

Table 1 Component-wise classification of cloud attacks and their defence mechanisms

| SaaS component | Attack | Defence mechanism | Identified gaps |
|---|---|---|---|
| Attacks on web browser | Cross-site scripting (XSS) | Redefining cyber security with AI and machine learning [3]; Protection against cross-site scripting (XSS) attacks in cloud environments [4] | Reduced response time [3]; Not fully accurate in classification of XSS [4] |
| | Cross-site request forgery (CSRF) | Protection against cross-site scripting (XSS) attacks in cloud environments [4]; Secure and efficient data deduplication on cloud for CSRF [5] | Low accuracy of detection [4] |
| | Session management | Security against phishing attacks in MCC [6]; Privacy-preserving security solution for cloud services [7] | Storage complexity is very high [6]; Lower accuracy of detection [7] |
| Attacks on web server | SQL injection | SEPTIC: detecting SQL injection attacks and vulnerabilities inside the DBMS [8]; DIAVA: framework for detection of SQL injection attacks [9] | High complexity [8]; Pattern analysis is done on traffic, which reduces response time [9] |
| | Brute force attacks | Hybrid DESCAST algorithm for cloud security [10]; Enhanced secure content de-duplication identification and prevention (ESCDIP) algorithm in cloud environment [11]; Attribute-based convergent encryption key management for secure deduplication in cloud [12] | High complexity due to combination of multiple algorithms [10]; Key management and security is another concern [12] |
| Attacks on application server | Denial of service | SealFS: a stackable file system for tamper-evident logging [13] | |
| | Insecure direct object references | Web application vulnerability prediction (WAVP) using hybrid program analysis and machine learning [14]; Secure cloud storage system (SCSS) based on one-time password and automatic blocker protocol [15] | Slow and sluggish performance for real-time cloud data [14]; Reduced accuracy for the latest attacks, hence inflexible for real-world application [15] |
| | Security misconfiguration | Fuzzy queries (FQ) over encrypted data in cloud [16]; CauseInfer: automated end-to-end performance diagnosis with hierarchical causality graph in cloud environment [17] | Limited number of configurations are handled [16]; Inflexible to handle new misconfigurations [17] |
| | Insecure cryptographic storage | Secure cloud data deduplication with efficient re-encryption [18]; Multiple security level cloud storage (MSLCS) system based on improved proxy re-encryption [19] | High complexity and storage requirements [18]; Key management and key handling need improvement [19] |
| | Failure to restrict URL access | Trusted secure accessing protection (TCAP) framework based on cloud-channel-device cooperation [20]; Federated identity based cloud (FIBC) for the Internet of Things [21] | Limited accuracy, can be improved using ML and AI [20]; Limited to IoT, cannot be applied to all cloud applications [21] |
| | Insufficient transport layer protection | Secure disintegration protocol (SDP) for privacy-preserving cloud storage [22]; Packet encryption (PE) for securing real-time mobile cloud applications [23] | High complexity [22]; Needs a lot of data for pattern analysis [23] |
| | Unvalidated redirects and forwards | Secure virtualization environment based on advanced memory introspection [24]; Service popularity-based smart resources partitioning for fog-computing-enabled industrial Internet of Things [25] | Space complexity is high [24]; Limited to IoT applications, cannot be extended to non-IoT cloud applications [25] |
| Attacks on database server | Broken authentication | An efficient and provably secure anonymous user authentication and key agreement for mobile cloud computing [26]; Authorized client-side deduplication using CP-ABE in cloud storage [27] | Reduced response time [26]; Lack of flexibility in protocol design; highly complex; reduced speed [27] |
| | Privilege escalation | SafeChain: securing trigger-action programming from attack chains [28]; One bit flips, one cloud flops: cross-VM row hammer attacks and privilege escalation [29] | High computational complexity for larger data [28]; Limited accuracy for newer attacks [29] |
| | Loose access permissions | Reverse engineering of database security policies [30]; Facilitating secure and efficient spatial query processing on the cloud [31] | Requires exponentially large delay for large databases [30]; Key access patterns and storage are an issue [31] |
| | Excessive retention of sensitive data | A new security framework for cloud data [32]; Asymmetric encryption: the proposed model to increase security of sensitive data in cloud computing [33] | Key management is very complex due to asymmetric encryption [33] |
| | Aggregation of personally identifiable information | Security and privacy aware data aggregation on cloud computing [34]; Two secure privacy-preserving data aggregation schemes for IoT [35] | High complexity and exponentially increasing delay w.r.t. data size [35] |
| Attacks on file storage server | Denial of service | SealFS: a stackable file system for tamper-evident logging [36] | |
| | Data leakage | Validation of data integrity and loss recovery for protected data in cloud computing [37]; Data leakage detection and file monitoring in cloud computing [38]; A unique mechanism for detection of data leakage on transmission [39] | Complexity of file access pattern detection is high [37]; Low accuracy for newer types of nodes [38]; Limited accuracy [39] |
| | Access permission bypass | Access control and data sharing for dynamic groups in the cloud environment [40]; Data access control on multiple cloud systems in mobile cloud computing (MCC) [41] | High computational requirements reduce the speed of deployment [40]; Configuration needed for deployment is very high [41] |
From the above table we can see that different attacks have different security algorithms, and because these algorithms are mutually exclusive, they can be applied in a hybrid manner to any cloud deployment. Attacks like denial of service (DoS) affect all kinds of SaaS deployments; a software-defined networking (SDN)-based algorithm to counter DoS attacks in SaaS is described in [42], which uses sFlow, a traffic-sampling protocol deployed alongside OpenFlow. It maps IP addresses, ports and MAC addresses in order to gather incoming and outgoing traffic statistics from routers and cloud servers. SDN allows for an efficient implementation of the cloud layers, thereby helping to identify the malicious packets that might be causing DoS attacks. The proposed solution uses a monitor, an analyser, an external mitigator and an internal mitigator to identify misbehaving users/IPs, and certain security policies are in place which enable effective DoS attack detection. The proposed SDN solution can also be used for honeypots and other intrusion detection systems. Virtual private networks (VPNs), a major component of SaaS-based deployments, are gaining a lot of popularity due to the high level of anonymity they provide. A detailed description of attacks and vulnerabilities for VPNs is given in [43]. VPNs are susceptible to all kinds of attacks, including man-in-the-middle attacks, impersonation attacks, attacks on the browser, attacks on SSL, denial of service (DoS), eavesdropping, session hijacking, replay attacks, password-discovery attacks and reflection attacks. A general approach to mitigating these attacks is also described in [43]. Solutions to these attacks are similar to solutions for any other software component, and include encryption, trust establishment, isolation, etc., as described previously.
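The monitor/analyser pipeline of the SDN-based DoS detector described above can be approximated in a few lines: collect per-source packet counts over an observation window, as a flow sampler such as an sFlow agent would export them, and flag sources whose count exceeds a threshold. The window contents and threshold below are invented for illustration; the cited system's actual policies are more elaborate:

```python
from collections import Counter

def detect_dos_sources(sampled_packets, threshold):
    """Flag source IPs whose sampled packet count in one window exceeds threshold.

    sampled_packets: iterable of source-IP strings for one observation window,
    as a flow sampler (e.g. an sFlow agent) might export them.
    """
    counts = Counter(sampled_packets)
    return {ip for ip, n in counts.items() if n > threshold}

# Example window: one source floods while two behave normally.
window = ["10.0.0.5"] * 500 + ["10.0.0.7"] * 3 + ["10.0.0.9"] * 8
print(detect_dos_sources(window, threshold=100))  # {'10.0.0.5'}
```

A mitigator component would then install an SDN flow rule that drops or rate-limits traffic from the flagged sources.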
Another review of the cloud security process is given in [44], wherein database, website and other software-component-related attacks (injection attacks, XSS, CSRF, insecure APIs, cookie poisoning, hidden-field manipulation, backdoor and debug options, broken authentication and session management, broken access control, and sensitive data exposure) and other SaaS issues are defined along with their probable solutions. The work in [44] can be kept as a benchmark for further evaluation of these issues and solutions. One more work on SaaS security worth mentioning is [45], in which the researchers use blockchain for securing SaaS services. Blockchain is a preferred method for cloud security because both blockchain and cloud are distributed, support peer-to-peer architectures, use high-performance nodes, and reward users based on their involvement in improving the system's efficiency. From the work done in [45], the researchers argue that adding blockchain-based algorithms improves the integrity of the cloud platform without adding significant overheads to the system. While blockchain is found to be promising, its real-time implementation and scalability will always be a concern, as the Bitcoin example shows: its computational complexity has become so high that mining is nowadays no longer practical for most participants. By contrast, deep-learning and machine-learning-based security algorithms, as discussed earlier, have proven to be both computationally efficient and highly secure. Thus, there is a need to integrate machine learning algorithms with blockchain to improve the security of the entire cloud SaaS system.
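The tamper-evidence property that makes blockchain attractive for SaaS integrity, namely that altering any earlier record invalidates every later one, can be demonstrated with a minimal hash chain. This is a simplified sketch of the general idea only, not the implementation of the cited work:

```python
import hashlib

def chain(records):
    """Build a hash chain: each hash commits to its record and the previous hash."""
    hashes, prev = [], "0" * 64  # genesis value
    for record in records:
        prev = hashlib.sha256((prev + record).encode()).hexdigest()
        hashes.append(prev)
    return hashes

def verify(records, hashes):
    """Recompute the chain and compare it with the stored hashes."""
    return chain(records) == hashes

log = ["alice uploads report.pdf", "bob grants read access", "carol downloads"]
stored = chain(log)
assert verify(log, stored)             # untouched log verifies
log[0] = "alice uploads report2.pdf"   # tamper with an early record...
assert not verify(log, stored)         # ...and verification fails
```

In a distributed deployment the stored hashes would be replicated across nodes, so an attacker would have to rewrite every replica to hide a change; this replication is exactly where the scalability concern discussed above arises.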


3 Conclusion and Future Work

From this paper we can see that cloud systems and their components face a wide variety of security issues. These issues can be resolved by using strong authentication techniques (such as identity-based and attribute-based authentication), strong cryptography (hashing, BLS signatures, etc.), strong physical systems, strong attack detection and removal algorithms, and strong techniques for other parameters. "Strong" here means algorithms and techniques that satisfy the cloud's QoS requirements along with its security requirements. This paper has listed the various components of a SaaS cloud system, such as the web application server, web server and database server. It has also categorized the different attacks on each SaaS component, surveyed the defence mechanisms for each attack, and identified the limitations and gaps in each solution.

References

1. Liu Y, Sun YL, Ryoo J, Rizvi S, Vasilakos AV (2015) A survey of security and privacy challenges in cloud computing: solutions and future directions. The Korean Institute of Information Scientists and Engineers. ISSN 1976–4677, eISSN 2093–8020
2. Chandni M, Sowmiya NP, Mohana S, Sandhya MK (2017) Establishing trust despite attacks in cloud computing: a survey. In: Proceedings of IEEE WiSPNET. ISBN 978-1-5090-4442-9
3. Dhondse A, Singh S (2019) Redefining cyber security with AI and machine learning. Asian J Converg Technol 5(2)
4. Madhusudhan R (2018) Mitigation of cross-site scripting attacks in mobile cloud environments. In: Thampi S, Madria S, Wang G, Rawat D, Alcaraz Calero J (eds) Security in computing and communications
5. Jiang S, Jiang T, Wang L (2017) Secure and efficient cloud data deduplication with ownership management. IEEE Trans Serv Comput. https://doi.org/10.1109/TSC.2017.2771280
6. Munivel E, Kannammal A (2019) New authentication scheme to secure against the phishing attack in the mobile cloud computing. Security and Communication Networks (Hindawi), p 11
7. Malina L, Hajny J, Dzurenda P, Zeman V (2015) Privacy-preserving security solution for cloud services. J Appl Res Technol 13(1):20–31
8. Medeiros I, Beatriz M, Neves N, Correia M (2019) SEPTIC: detecting injection attacks and vulnerabilities inside the DBMS. IEEE Trans Reliab 68(3):1168–1188. https://doi.org/10.1109/TR.2019.2900007
9. Gu H et al (2020) DIAVA: a traffic-based framework for detection of SQL injection attacks and vulnerability analysis of leaked data. IEEE Trans Reliab 69(1):188–202. https://doi.org/10.1109/TR.2019.2925415
10. Sengupta N, Chinnasamy R (2015) Contriving hybrid DESCAST algorithm for cloud security. Elsevier, pp 47–56
11. Periasamy JK, Latha B (2020) An enhanced secure content de-duplication identification and prevention (ESCDIP) algorithm in cloud environment. Neural Comput Appl 32:485–494. https://doi.org/10.1007/s00521-019-04060-9
12. Ilambarasan E, Nickolas S, Mary Saira Bhanu S (2020) Attribute-based convergent encryption key management for secure deduplication in cloud. In: Pati B, Panigrahi C, Buyya R, Li KC (eds) Advanced computing and intelligent engineering, vol 1082. Springer, Singapore


13. Soriano-Salvador E, Guardiola-Muzquiz G. SealFS: a stackable file system for tamper-evident logging
14. Shar LK, Briand LC, Tan HBK (2015) Web application vulnerability prediction using hybrid program analysis and machine learning. IEEE Trans Depend Sec Comput 12(6):688–707. https://doi.org/10.1109/TDSC.2014.2373377
15. El-Booz SA, Attiya G, El-Fishawy N (2015) A secure cloud storage system combining time-based one-time password and automatic blocker protocol. In: 2015 11th international computer engineering conference (ICENCO), Cairo, pp 188–194. https://doi.org/10.1109/ICENCO.2015.7416346
16. El-Booz SA, Attiya G, El-Fishawy N (2015) A secure cloud storage system combining time-based one-time password and automatic blocker protocol. In: 2015 11th international computer engineering conference (ICENCO), Cairo, pp 188–194. https://doi.org/10.1109/ICENCO.2015.7416346
17. Chen P, Qi Y, Hou D (2019) CauseInfer: automated end-to-end performance diagnosis with hierarchical causality graph in cloud environment. IEEE Trans Serv Comput 12(2):214–230. https://doi.org/10.1109/TSC.2016.2607739
18. Yuan H, Chen X, Li J, Jiang T, Wang J, Deng R (2019) Secure cloud data deduplication with efficient re-encryption. IEEE Trans Serv Comput. https://doi.org/10.1109/TSC.2019.2948007
19. Shen J, Deng X, Xu Z (2019) Multi-security-level cloud storage system based on improved proxy re-encryption. J Wireless Commun Netw 2019:277. https://doi.org/10.1186/s13638-019-1614-y
20. Cheng Y, Du Y, Peng J, Fu J, Liu B (2019) Trusted secure accessing protection framework based on cloud-channel-device cooperation. In: Yun X et al (eds) Cyber security. CNCERT 2018. Communications in computer and information science, vol 970. Springer, Singapore
21. Fremantle P, Aziz B (2018) Cloud-based federated identity for the Internet of Things. Ann Telecommun 73:415–427. https://doi.org/10.1007/s12243-018-0641-8
22. Rawal BS, Vijayakumar V, Manogaran G et al (2018) Secure disintegration protocol for privacy preserving cloud storage. Wireless Pers Commun 103:1161–1177. https://doi.org/10.1007/s11277-018-5284-6
23. Ajay DM, Umamaheswari E (2019) Packet encryption for securing real-time mobile cloud applications. Mobile Netw Appl 24:1249–1254. https://doi.org/10.1007/s11036-019-01263-1
24. Zhang S, Meng X, Wang L, Xu L, Han X (2018) Secure virtualization environment based on advanced memory introspection. Security and Communication Networks (Hindawi), vol 2018, p 16. https://doi.org/10.1155/2018/9410278
25. Li G, Wu J, Li J, Wang K, Ye T (2018) Service popularity-based smart resources partitioning for fog computing-enabled industrial Internet of Things. IEEE Trans Indus Inform 14(10):4702–4711. https://doi.org/10.1109/TII.2018.2845844
26. Hu Z, Chen H, Shen W (2019) An efficient and provably secure anonymous user authentication and key agreement for mobile cloud computing. Wireless Communications and Mobile Computing (Hindawi), vol 2019, p 12. https://doi.org/10.1155/2019/4520685
27. Taek-Young Y, Nam-Su J, Rhee KH, Sang US (2019) Authorized client-side deduplication using CP-ABE in cloud storage. Wireless Communications and Mobile Computing (Hindawi), vol 2019, p 11. https://doi.org/10.1155/2019/7840917
28. Hsu K, Chiang Y, Hsiao H (2019) SafeChain: securing trigger-action programming from attack chains. IEEE Trans Inform Forensics Secur 14(10):2607–2622. https://doi.org/10.1109/TIFS.2019.2899758
29. Xiao Y, Zhang X, Zhang Y, Teodorescu R (2016) One bit flips, one cloud flops: cross-VM row hammer attacks and privilege escalation. In: Proceedings of the 25th USENIX security symposium, Austin, TX. ISBN 978-1-931971-32-4


30. Martínez S, Cosentino V, Cabot J, Cuppens F (2013) Reverse engineering of database security policies. In: Decker H, Lhotská L, Link S, Basl J, Tjoa AM (eds) Database and expert systems applications. DEXA 2013. Lecture notes in computer science, vol 8056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40173-2_37
31. Talha AM, Kamel I, Al Aghbari Z (2019) Facilitating secure and efficient spatial query processing on the cloud. IEEE Trans Cloud Comput 7(4):988–1001. https://doi.org/10.1109/TCC.2017.2724509
32. Mall S, Saroj SK (2018) A new security framework for cloud data. In: Proceedings of the 8th international conference on advances in computing and communication (ICACC-2018)
33. Hyseni D, Luma A, Selimi B, Cico B (2018) The proposed model to increase security of sensitive data in cloud computing. Int J Adv Comput Sci Appl (IJACSA), vol 9
34. Silva LV, Barbosa P, Marinho R et al (2018) Security and privacy aware data aggregation on cloud computing. J Internet Serv Appl 9:6. https://doi.org/10.1186/s13174-018-0078-3
35. Pu Y, Luo J, Hu C, Yu J, Zhao R, Huang H, Xiang T (2019) Two secure privacy-preserving data aggregation schemes for IoT. Wireless Communications and Mobile Computing (Hindawi), vol 2019, article 3985232. https://doi.org/10.1155/2019/3985232
36. Soriano-Salvador E, Guardiola-Muzquiz G (2021) SealFS: a stackable file system for tamper-evident logging. ETSIT, Rey Juan Carlos University, Madrid, Spain
37. Rejin PR, Paul RD (2019) Verification of data integrity and cooperative loss recovery for secure data storage in cloud computing. Cogent Eng 6(1):1654694
38. Kirdat N, Mokal N, Mokal J, Parkar A, Shahabade RV et al (2018) Data leakage detection and file monitoring in cloud computing. Int J Adv Res Ideas Innov Technol 4:859–866
39. Huang X, Lu Y, Li D, Ma M (2018) A novel mechanism for fast detection of transformed data leakage. IEEE Access 6:35926–35936. https://doi.org/10.1109/ACCESS.2018.2851228
40. Xu S, Yang G, Mu Y, Deng RH (2018) Secure fine-grained access control and data sharing for dynamic groups in the cloud. IEEE Trans Inform Forensics Secur 13(8):2101–2113. https://doi.org/10.1109/TIFS.2018.2810065
41. Roy S, Das AK, Chatterjee S, Kumar N, Chattopadhyay S, Rodrigues JJPC (2019) Provably secure fine-grained data access control over multiple cloud servers in mobile cloud computing-based healthcare applications. IEEE Trans Indus Inform 15(1):457–468. https://doi.org/10.1109/TII.2018.2824815
42. Punto Gutierrez J, Lee K (2018) SDN-based DoS attack detection and mitigation system for cloud environment. Int J Comput Syst 05. ISSN 2394–1065. http://www.ijcsonline.com/
43. Shyamala R, Prabakaran D (2018) A survey on security issues and solutions in virtual private network. Int J Pure Appl Math 119(15):3115–3122
44. Ravi Kumar P, Herbert Raj P, Jelciana P (2017) Exploring security issues and solutions in cloud computing services: a survey. Cybernetics and Information Technologies, vol 4, Sofia. Print ISSN 1311–9702, Online ISSN 1314–4081. https://doi.org/10.1515/cait-2017-0039
45. Dong Z, Luo F, Gaoqi L (2018) Blockchain: a secure, decentralized, trusted cyber infrastructure solution for future energy systems. J Mod Power Syst Clean Energy 6(5):958–967. https://doi.org/10.1007/s40565-018-0418-0

“Palisade”—A Student Friendly Social Media Website
Nithin Katla, M. Goutham Kumar, Rohithraj Pidugu, and S. Shitharth
Department of Computer Science and Engineering, Vardhaman College of Engineering, Hyderabad, Telangana, India

Abstract. We have developed a website where users can sign up, log in, and share their thoughts on events at their colleges or companies and their experiences at interviews; users can also like and comment on posts and follow or unfollow other users to receive their updates. Finding blood donors in an emergency is difficult, so the website also provides a list of willing blood donors with their contact details, which makes locating a donor much easier. Although there are many donors, finding them quickly is hard, and this website addresses that problem. Users can additionally create or join chat rooms and discuss topics with people around the world. Keywords: Reactjs-Frontend · RestfulApis · Posts · Blooddonors · Room-chat · Node Js-Backend · Mongodb · Expressjs-Routing

1 Introduction
Nowadays social media has become enormously popular and even a part of people's lives; its user-friendly features attract ever more users. If a single sentence had to describe social media, it would be that the entire world is in our hands, just a single click away. Among its most dominant users are the youth. Social media has negatives and disadvantages, but they are small compared to its positive aspects. On the positive side, the first and foremost is education: through social media platforms we can learn many things, and everything we need is just a click away. We can even find live lectures, where, for example, a university professor lectures from his or her office while students and anyone else interested listen from their own homes. Consider also the decline in newspaper reading: many people have abandoned newspapers because social media delivers news from around the world instantly, with no need to wait until the next day to learn what actually happened. Social media has also made distance irrelevant, because it connects people with those they are close to. It made
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_5

54

N. Katla et al.

people stay connected with their close ones, relatives, and friends, even when they are far apart. Compared to employees and elders, university and college students use social media platforms most effectively, because social media is the one platform that delivers entertainment, facts, study material, and much other information efficiently and in a single place. Students can observe how their role models lead their lives through official social media handles, and many such things attract students and the youth to these platforms. Social media can also be a great asset for learning from others and getting motivated; it helps people socialize and choose the right paths in life by following their role models. It lets people form new bonds, connect with experts in their stream, and engage with them to learn something new or to clarify their doubts. Social media is now also used as a marketing platform, helping producers reach customers directly, which in turn creates jobs for marketing agents.

2 Related Study
During the pandemic, students faced many difficulties discussing assignments, tasks, projects, and more with their friends. There are many platforms for attending classes online, many applications for chatting and video calling, and platforms such as Instagram, Facebook, and LinkedIn for updates on entertainment, news, sports, and so on. But there is no dedicated platform for discussing academic assignments and tasks with friends, or for posts about college events, experiences, and placements. We have platforms dedicated to jobs, careers, entertainment, chatting, and online classes, but none for students' needs such as learning about events, placements, exams, and hackathons happening inside or outside their institute [1]. Fig. 1 shows the number of people using social media over the years. Most people use Facebook, since teens, adults, and older people alike are drawn to it and spend most of their time there; YouTube comes next, followed by WhatsApp, Instagram, and others. The comparison shows that the number of social media users grows day by day. Fig. 2 shows the different age groups on the different social media platforms; the bar chart indicates that older teens are more active on social media than younger ones. The question on everyone's mind, then, is: does social media harm students? If yes, how can a social media platform be made fully advantageous to students?
2.1 Security in Social Media
Some applications, such as Instagram, Facebook, and LinkedIn, are specific to sharing posts related to fun or jobs and to following updates on many things including


Fig. 1 Count of people using social-media

Fig. 2 Comparison of social-media use by age group

entertainment, news, and sports [2, 3]. On those platforms one can share fun-related posts but not posts about college events, experiences at interviews, and the like. These drawbacks led us to a new idea: a separate platform that brings all of these things together. Fig. 3 shows the most common security breaches in current social media platforms [4–6]. Whenever a social media site is developed, people are highly concerned about its security [7, 8], because such a platform deals primarily with people's data, and centralized servers suffer many security breaches. In the past, security compromises have occurred in supervisory control systems, as discussed in [9–11]. The main reason to review these security concerns is that, if the platform is launched in production, the security protocols used for supervisory systems can be applied to the centralized server as well [12, 13]. The analysis graph in Fig. 3 clarifies how serious the security problems of current social sites are; they should be uprooted in this system once it is launched [14].


Fig. 3 Most security breaches in current social media platforms

2.2 Disadvantages of Existing Systems
• There are no college-specific websites where students can post their ideas, experiences, college events, etc.
• Many websites carry fake blood-donor profiles, making it hard to find donors who genuinely intend to donate blood [15, 16].
• There are no adequate platforms where students can create or join rooms and discuss study-related topics or activities.
2.3 Proposed Model
We propose a new, fully secured online web platform where users can log in securely and start chatting, posting, liking, sharing, and discussing with friends. Users can post about events happening in their organization and discuss college placements with friends and alumni, helping students make the right decisions about their future. We introduce a tool that lets users discuss their needs and tasks in a room chat: once a user creates a room, other people can join it directly and start the discussion. On our platform users can share experiences, such as what they did in interviews or at events, as well as college events and other useful material. Because the site is student friendly, students can search for events, experiences, and ideas that will help them in their careers. We also propose a new feature recording whether a user is willing to donate blood. After signing up with details such as name, mail, college, and password, a user can log in and, using the edit option, enter and edit details such as blood group, willingness to donate blood, interests, and career experience. If the user opts in as a donor, the user's details are automatically shown in a separate window, and the other users who are


in need can contact that person directly without any miscommunication. In this way the platform also acts as an intermediary for people who require blood. On the blood-donor page, all users can find the name, blood group, email, and mobile number of each donor. Finding a matching blood group for a patient in an urgent situation is very difficult; our platform gathers blood donors in one place, which is very useful to society. When a logged-in user edits their profile and ticks "donate blood", their information becomes visible on the blood-donors page. This helps save lives by making donors easy to find at hard times. We also provide features such as liking a post, commenting on a post, and editing a post when the user has the privilege to do so. If every student uses this application, we can build a community of students that helps in finding opportunities through posts, discussing topics in room chat, and locating blood donors (users who are interested in donating blood).
2.4 Objectives
• Create a social media platform for sharing posts, college events, placement experiences, and ideas.
• Maintain data privacy and build a good user interface for the platform.
• Provide secure login and user profiles.
• Provide a page where users can easily find blood donors, which helps save lives.
• Build a secure group chat application using socket.io.
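The blood-donor page described above reduces to filtering registered users by their opt-in flag and blood group. A minimal sketch follows; the record shape and field names (willingToDonate, bloodGroup, etc.) are illustrative assumptions, not the actual Palisade schema.

```javascript
// Return the donor listing (name, blood group, email, mobile) for users who
// opted in and match the requested blood group. Field names are hypothetical.
function findDonors(users, group) {
  return users
    .filter((u) => u.willingToDonate && u.bloodGroup === group)
    .map(({ name, bloodGroup, email, mobile }) => ({ name, bloodGroup, email, mobile }));
}

const users = [
  { name: "Asha", bloodGroup: "O+", email: "asha@example.com", mobile: "9000000000", willingToDonate: true },
  { name: "Ravi", bloodGroup: "O+", email: "ravi@example.com", mobile: "9000000001", willingToDonate: false },
];

console.log(findDonors(users, "O+")); // only users who opted in and match
```

In the real application this query would run against the MongoDB users collection; the in-memory version above only shows the selection logic.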

3 The Architecture of the Project
As the architecture in Fig. 4 shows, a visitor to our website must first sign up and log in. State management is handled by Reactjs, and RESTful API calls are passed from the frontend to the backend using Axios or fetch. The backend then accesses the database through mongoose, creating a new profile at signup and validating credentials at login. This is how the full-stack app works, as shown in the architectural diagram. Using MERN [17] was a major advantage for us as developers because all the code is written in JavaScript, which makes errors easy to understand. JavaScript can be used on both the client and the server side [18, 19], so developers do not need separate programming tools for each; using one language rather than two is far more convenient [20, 21]. Developing a project in multiple programming languages requires knowing how to interface them, which adds complexity, whereas with JavaScript we only need to be skilled in JavaScript and JSON. In the end, JavaScript proved more convenient for building good web applications than a mix of languages. Our website is built with Reactjs, MongoDB, Expressjs, and NodeJS. The site fetches the posts users upload through RESTful APIs. Reactjs is a library for building pages from components, which are very flexible

58

N. Katla et al.

Fig. 4 Architecture diagram of MERN

to use; MongoDB stores all details of users and posts, and Express and Node js serve as the server running the application.
3.1 Methods of Implementation
The implementation of our project proceeded as follows:
• Initially, we gathered all the user requirements.
• Problem analysis was done to decide what kind of project to build that is unique and different from similar platforms that already exist.
• The project was designed from the user requirements.
• Software and hardware requirements analysis was done, and the requirements were gathered.
• Code was developed to produce a rapid prototype.
• The prototype was tested and bugs were detected.
• Errors were debugged and the project reconstructed.
• Testing was repeated to get rid of errors.
• After the testing phase, the project was deployed in the user's environment [20, 22, 23].
3.2 Test Cases and Scenarios
We test the application using jest and enzyme for the frontend and Postman for the RESTful APIs [24–26]. Test cases involved in our project include:

• Users successfully sign up on the website.
• Users successfully log in with valid credentials.
• The posts-info API on the server is checked using Postman.
• Editing a user's profile succeeds when the update-profile button is clicked.


• Liking and commenting on posts and the blood-donors page work properly.
These test cases will be checked using Playwright end-to-end testing and related tools, and the frontend (Reactjs) components using jest and enzyme.
3.3 How to Run the MERN Application and Use It
These are the steps to run the MERN stack app:
1. Run both client and server.
2. Open localhost with the specified port to access the frontend. Users can sign up to the website, and that data is stored in MongoDB.
3. After signup, users can log in with their credentials, view other users' posts, and like or comment on posts that interest them [27, 28].
4. Users can follow or unfollow users or organizations; when a user follows a person or an organization, all updates from the followed accounts appear on a separate following page, which makes the followed posts easier to access.
5. Users can create posts from the create-posts page; after a post is created successfully, every other user can see it.
6. Users can edit their profiles to record activities, experiences, goals, blood group, and willingness to donate blood. If a user is willing to donate, his/her name becomes visible on the blood-donors page.
7. Users can find blood donors by visiting the blood-donors page; this social feature helps save lives [29].
8. Users can create or join group chats and start conversations about the things they need to discuss (Figs. 5 and 6).
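The group-chat steps above (create a room, join it, discuss) can be sketched as an in-memory room model. This is only an illustration of the behaviour, including the erase-on-empty idea the paper proposes for privacy; the real implementation uses socket.io rooms on the server, and all names here are assumptions.

```javascript
// Minimal in-memory chat-room model (illustrative; Palisade itself would
// layer this behaviour over socket.io rooms).
class ChatRooms {
  constructor() {
    this.rooms = new Map(); // roomName -> { members: Set, messages: [] }
  }
  join(room, user) {
    if (!this.rooms.has(room)) this.rooms.set(room, { members: new Set(), messages: [] });
    this.rooms.get(room).members.add(user);
  }
  send(room, user, text) {
    const r = this.rooms.get(room);
    if (!r || !r.members.has(user)) throw new Error("join the room first");
    r.messages.push({ user, text });
  }
  leave(room, user) {
    const r = this.rooms.get(room);
    if (!r) return;
    r.members.delete(user);
    if (r.members.size === 0) this.rooms.delete(room); // erase chat when room empties
  }
}

const chat = new ChatRooms();
chat.join("ds-assignment", "asha");
chat.join("ds-assignment", "ravi");
chat.send("ds-assignment", "asha", "Anyone done task 2?");
console.log(chat.rooms.get("ds-assignment").messages.length); // 1
```

Deleting the room once the last member leaves is what gives the ephemeral, privacy-preserving chats described in the future scope.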

4 Results and Discussion
Every user can sign up and log in to the website, access posts from different users, follow a particular user or page for more updates, and create or enter existing chat rooms [30, 31] to discuss the room's topic with other users. Blood donors play a crucial role in saving lives, and finding a matching blood group at hard times is difficult, so we added a page listing the users of our application who are interested in donating blood; someone who needs blood can immediately find donors according to location, which can save a life. We used the Reactjs tool to develop the website, which splits the source code into components. Huge code files confuse both the people who read the code and the developers, so small components give a better understanding of the platform and allow other components and features to be integrated directly without affecting existing ones. Components also make it easy to distribute the workload among team members.


Fig. 5 Backend Server console logs using expressjs and Nodejs

Fig. 6 Frontend Server console logs using Reactjs

5 Conclusion and Future Scope
The main idea of the project is to connect students and other people with their colleges and friends directly on a secure platform, help them discuss the various things they do together or individually, and let users learn about job experiences, events, and on-campus drives. The platform also contains a blood-donation feature: when a user enables it, the website creates a separate window with the full details the donor provided at signup, so people in need can contact the donor directly, without any third person and without delay.


We can introduce ROOM CHATS in which users communicate with each other and the chat is automatically erased when they leave the group; this increases security and preserves users' privacy. The platform is unique among existing platforms and user friendly, so users can adopt it easily. Because our project PALISADE is built from Reactjs components with a Nodejs and Expressjs backend and a MongoDB database, any new feature can be developed as a component and integrated into the main site directly. As a student-friendly application: many students want to know about events conducted by different colleges or companies, off-campus drives, and so on, but miss them for lack of information; our website is a platform where every individual can post material that helps others find such great opportunities. In the future we can develop a mobile application with more advanced features and tools and add further features around blood donation.

References
1. Chou DC, Chou AY (2000) The e-commerce revolution: a guide to the internet revolution in banking. Inform Syst Manag 51–55
2. Shitharth S, Shaik M, Ameerjohn S, Kannan S (2019) Integrated probability relevancy classification (IPRC) for IDS in SCADA. In: Design framework for wireless network, Lecture notes in networks and systems, vol 82. Springer, pp 41–64
3. Devi BT, Shitharth S An appraisal over intrusion detection systems in cloud computing security attacks. In: Proceedings of the 2nd international conference on innovative mechanisms for industry applications (ICIMIA 2020), p 122
4. Aggarwal S (2018) Web development using ReactJS
5. Purnomosidi B (2013) Pengembangan sistem informasi pengelolaan inventaris barang divisi Pustekin berbasis web. Politeknik Telkom, Bandung
6. Sidik B (2011) JavaScript. Informatika, Bandung
7. Selvarajan S, Shaik M, Ameerjohn S, Kannan S (2019) Mining of intrusion attack in SCADA network using clustering and genetically seeded flora based optimal classification algorithm. Inform Sec 14(1):1–11
8. Blasio GD (2008) Urban–rural differences in internet usage, e-commerce, and e-banking: evidence from Italy. Growth Chang 39(2):341–367
9. Wikipedia.org React (JavaScript library). [Online]. https://en.wikipedia.org/wiki/React_(JavaScript_library)
10. Shitharth D, Winston P (2015) An appraisal on security challenges and countermeasures in smart grid. Int J Appl Eng Res 10(20):16591–16597
11. Mitra A (2013) E-commerce in India: a review. Int J Market Finan Serv Manag Res 2(2):126–132
12. Express.js in action (2016). Manning Publications
13. Stefanov S (2016) React: up and running: building web applications
14. Sangeetha K, Venkatesan S, Shitharth S (2020) A novel method to detect adversaries using MSOM algorithm's longitudinal conjecture model in SCADA network. Solid State Technol 63(2):6594–6603
15. Sangeetha K, Venkatesan S, Shitharth S (2020) Security appraisal conducted on real time SCADA dataset using cyber analytic tools. Solid State Technol 63(1):1479–1491


16. Shitharth S, Winston DP (2016) A novel IDS technique to detect DDoS and sniffers in smart grid. In: Proceedings of 2016 IEEE WCTFTR world conference on futuristic trends in research and innovation for social welfare
17. Tran TX (2019) MongoDB tutorial: a basic guide with step-by-step instructions for the complete beginner
18. Bojinov V, Herron D, Resende D (2018) Node.js complete reference guide. Packt Publishing. ISBN: 9781789952117
19. MongoDB.com MongoDB official. [Online]. https://www.mongodb.com/
20. MongoDB.com MEAN and MERN stacks. [Online]. https://www.mongodb.com/blog/post/the-modern-applicationstackpart-1-introducing-the-mean-stack
21. Jennings RB, Nahum EM, Olshefski DP, Saha D, Shae Z, Waters CJ (2006) A study of internet instant messaging and chat protocols. http://ieeexplore.ieee.org/document/1668399/
22. NodeJS.org NodeJs official. [Online]. http://nodejs.org
23. ReactJS.org ReactJS official. [Online]. http://www.ReactJs.org
24. Teixeira P (2012) Hands-on Node.js. Wrox
25. Sebesta RW (2015) Programming the world wide web, 8th edn. Pearson Education
26. Selvarajan S, Shaik M, Ameerjohn S, Kannan S (2019) Integrated probability relevancy classification (IPRC) for IDS in SCADA. In: Design framework for wireless network, Lecture notes in networks and systems, vol 82. Springer, pp 41–64
27. Fischer L (2017) React for real: front-end code, untangled
28. Gelman I, Dinkevich B The complete Redux book, 2nd edn
29. Kumar S, Sebastian Albina C, Shitharth S, Manikandan T (2014) Modified TSR protocol to support trust in MANET using fuzzy. Int J Innov Res Sci Eng Technol 3(Special Issue 3):2551–2555
30. Shitharth D, Winston PD (2017) An enhanced optimization algorithm for intrusion detection in SCADA network. Comput Secur (Elsevier) 70:16–26
31. Cadenhead T (2015) Socket.IO cookbook. Packt Publishing. ISBN: 9781785880865

Comparative Review of Content Based Image Retrieval Using Deep Learning
Juhi Janjua and Archana Patankar
Department of Computer Engineering, Thadomal Shahani Engineering College, Mumbai, Maharashtra, India [email protected]

Abstract. A review of the literature is important for gaining knowledge of a specific area before doing any research. On today's world wide web, the number of pictures keeps growing, making it difficult to retrieve a relevant image. CBIR is used to search for an image in a corpus of images: it extracts features from the input image and retrieves the images in the dataset that have similar features. Plentiful techniques have been developed for CBIR, some of which are illustrated in this paper. Their results can be improved by finding significant hidden data in images; deep learning extracts these hidden features and classifies the image. Many experiments have been conducted on CBIR using deep learning, and a comparative analysis of these experiments was performed to outline the superiority of CBIR using deep learning over traditional CBIR methods. An experiment was conducted to evaluate the efficacy of CNN over traditionally used machine learning algorithms such as FNN and the decision tree classifier. Keywords: Convolution neural network · Similarity measurement · Deep learning · Features extraction · Content-based image retrieval

1 Introduction
"A picture speaks louder than words": people often prefer pictures over text. The cost of memory has also decreased, making the storage of images cost-effective, and progress in information technology has led to ever larger multimedia databases. With so many images it becomes difficult to analyze them or extract data from them. Traditionally, images were indexed manually, but growing datasets make manual indexing impractical [1]. An alternative approach, content-based image retrieval (CBIR), has been developed. CBIR has applications in many areas, such as medical diagnosis, geography, the military, and crime prevention [2], and there is scope to improve its efficiency in each of them, which motivates research in the field. CBIR searches a corpus of images by analyzing visual content such as the shape, color, and texture of an image rather than the text or metadata representing it [1]. As Fig. 1 shows, the image retrieval process has two stages. First, image features are gathered and kept in the database. These
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_6

64

J. Janjua and A. Patankar

features can take the form of color, texture, shape, objects, etc., and this feature extraction is done offline. When a query image is given, features are extracted from it, and a similarity measure such as Euclidean distance or cosine similarity is applied to the dataset and query features to retrieve the similar images [3].
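The two similarity measures named above can be written out directly; this framework-free sketch operates on plain feature vectors (the example vectors are made up):

```javascript
// Euclidean distance between two feature vectors: smaller means more similar.
function euclidean(a, b) {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

// Cosine similarity: 1 means identical direction, 0 means orthogonal features.
function cosine(a, b) {
  const dot = a.reduce((s, ai, i) => s + ai * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

const query = [1, 0, 2];
const candidate = [2, 0, 4]; // same direction, different magnitude
console.log(euclidean(query, candidate)); // ≈ 2.236
console.log(cosine(query, candidate)); // ≈ 1
```

The example shows why the choice matters: cosine similarity rates the two vectors as identical in direction, while Euclidean distance still reports a gap due to magnitude.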

Fig. 1 Process of content based image retrieval

CBIR results can be improved by finding significant hidden data in images. Machine learning algorithms help find this information, making the system intelligent through training datasets. The convolution neural network (CNN), based on the deep learning approach, is widely used in image recognition [6]. A CNN model combines feature extraction and classification [4]. CNN models are trained and tested; each input image passes through a sequence of convolution layers, followed by pooling and fully connected layers, with a softmax function applied at the end. The softmax function classifies an object with a probability value in the range 0 to 1: the higher the probability, the better the chance of correct classification [1]. Improved CNN network structures are cost-effective in terms of memory and computational complexity and lead to better application performance. The rest of this paper is organized as follows: Sect. 2 describes the research background, Sect. 3 compares existing methods, Sect. 4 describes the experimentation, Sect. 5 presents results and evaluation, and Sect. 6 states the inferences.
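The softmax step described above, which maps raw class scores to probabilities in [0, 1], can be sketched in a few lines (plain JavaScript, with the usual max-subtraction for numerical stability):

```javascript
// Softmax: maps raw class scores (logits) to probabilities summing to 1.
function softmax(logits) {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((x) => Math.exp(x - max));
  const total = exps.reduce((s, x) => s + x, 0);
  return exps.map((x) => x / total);
}

const probs = softmax([2.0, 1.0, 0.1]);
console.log(probs); // highest logit gets the highest probability
```

The class with the largest probability is taken as the CNN's prediction.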

2 Research Background
In this era of technology, the number of images is growing rapidly, making precise image retrieval a challenging task. A major concern in image retrieval is the semantic gap, which occurs when low-level visual features must describe high-level concepts [1–4]. Class-based prediction has been proposed and implemented to overcome the semantic gap [3]. Many other techniques, including vectors of locally aggregated descriptors [4], the bag of visual words


model [1, 4], fisher vector descriptors [4], and the scale invariant feature transform [1, 4], have been proposed and evaluated to represent image features effectively. The semantic gap mostly arises from the information lost when an image is represented by local features such as color and shape [3]; it can be bridged by learning features directly from the images [1–3]. The precision of image retrieval can be increased by finding the hidden information in images, and the convolution neural network (CNN), a deep learning approach, can be used to find these hidden features [1–3]. CNN-based models are highly effective for image classification, object detection, and other computer vision problems [1]. Sadeghi-Tehran et al. [2, 10] used a pretrained CNN (a residual network, ResNet) on the predefined ImageNet dataset. CNN models combined with VLAD features have proved a good fusion for extracting semantic details from images [3]. CNNs give accurate classification results when enough labeled data is provided; for limited datasets, the pretrained Overfeat model is used along with a CNN, achieving good accuracy [4, 12]. Deep learning comprises machine learning algorithms that adopt deep architectures to find high-level abstractions in data [1]. The Deep Belief Network (DBN), a deep learning framework, is employed to extract features efficiently for image retrieval; it can classify data subject to divergence, i.e., noise, displacement, smoothness, etc., and is reliable because it generates a large dataset for learning features [1, 13, 22]. One model, relevance feedback, uses user feedback to overcome the semantic gap: after images are retrieved, the user indicates whether each result is useful, giving the system information about semantically similar and dissimilar images.
The retrieval results are then re-ranked based on the user feedback, and the model stops after delivering optimal results to the user [2, 3, 14, 15]. A CNN can learn image representations and binary codes simultaneously if the data is labeled [4]. The method of Lin et al. [16] generates binary codes, which are then used for searching by computing Hamming distances; for searching images in a dataset, hashing-based search takes less computational time than linear search. Texture classification is one of the challenging tasks in image retrieval. Principal component analysis, via PCANet, has set a modest standard for texture classification and object recognition [17]. PCANet has three basic modules: cascaded PCA, binary hashing, and blockwise histograms. PCANet learns its filters from the data; binary hashing is used for indexing and histograms for pooling. Tzelepi and Tefas [8] proposed three retraining models, i.e., retraining with relevance information, fully unsupervised retraining, and relevance feedback based retraining, which have outperformed other CNN-based models.
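The Hamming-distance search used with Lin et al.'s binary codes reduces to counting differing bits. A small sketch over integer-packed codes follows (the 4-bit codes are made-up toy data; real systems use 48+ bit codes and bucketed hash lookups rather than this full scan):

```javascript
// Hamming distance between two binary codes packed into integers.
function hamming(a, b) {
  let x = a ^ b, count = 0;
  while (x) { count += x & 1; x >>>= 1; }
  return count;
}

// Rank database codes by Hamming distance to the query code.
function search(queryCode, dbCodes) {
  return dbCodes
    .map((code, i) => ({ i, d: hamming(queryCode, code) }))
    .sort((p, q) => p.d - q.d);
}

const db = [0b1010, 0b1100, 0b0000];
console.log(search(0b1011, db)); // index 0 is closest (distance 1)
```

Because distances are computed with XOR and bit counting, comparing binary codes is far cheaper than comparing the high-dimensional float features they summarize.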

3 Comparison of Existing Methods

Many issues are faced when retrieving images based on their content. Some of these issues are the semantic gap, geometrical transformations, varying illumination conditions and texture representation, as shown in Fig. 2.

66

J. Janjua and A. Patankar

Fig. 2 Different issues in CBIR

Many papers have been reviewed on content-based image retrieval, covering both traditional methods and deep learning methods. A summary of the traditional CBIR methods is given in Table 1. A CBIR system for biomedical images using a hybrid relevance-feedback approach has been proposed [14]. This model compares the query image features with the feature index of images stored in the dataset and uses graph ranking techniques to find similar images. The retrieved images are then judged by users as relevant or not to the query, to tackle the semantic gap. This model is most beneficial for repeat users and for small numbers of feedbacks; it was tested on a dataset of 100 images and gives an accuracy of 60–75% using Navigation Pattern based Relevance Feedback (NPRF). Many issues, such as multiple illuminants, arise while classifying an image. A descriptor has been proposed that combines the local binary pattern (LBP) and local colour contrast (LCC) to handle multiple illuminants [18]. LBP extracts texture features and is robust to illumination changes, whereas LCC retains colour information and discards the part affected by illumination changes. Combining LBP and LCC gives 77.1% image classification accuracy. Geometrical image transformation is yet another issue faced during image retrieval. Srivastava [19] examined two techniques, Speeded Up Robust Feature (SURF) and Scale Invariant Feature Transform (SIFT), for such transformations. It was observed that SIFT gives better results than SURF in all cases, although both techniques retrieve well only under affine transformations of images. For texture representation, a local descriptor named Local Derivative Radial Pattern was proposed [20].
This proposed method was compared with existing methods such as the local vector pattern, local binary pattern, local ternary pattern, local tetra pattern and local derivative pattern. It improves accuracy by at least 3.82% and 5.17% when tested on the Brodatz and VisTex datasets, respectively. Though it outperforms many prior methods, its accuracy is below 80% when 30 images are retrieved. The semantic gap in CBIR is the distinction between an image's low-level features and a human's high-level perception; essentially, it is the gap between the image retrieval request and the actual image features. Colour is considered an important feature for retrieval of

Comparative Review of Content Based Image Retrieval Using Deep Learning

67

Table 1 Comparison of traditional CBIR methods

Issue: Semantic gap
Paper: CBIR for biomedical image archives using efficient relevance feedback & user navigation patterns, International Journal of Computer Applications, 2017 [14]
Methods: Navigation Pattern based Relevance Feedback (NPRF)
Results: CBIR: 20%; RF: 60–65%; NPRF: 60–75%

Issue: Different illumination condition
Paper: Combining local binary patterns & local color contrast for texture classification under varying illumination, Journal of the Optical Society of America, 2014 [18]
Methods: Local Color Contrast (LCC) method along with Local Binary Pattern (LBP)
Results: Without applying any method: 10.2%; with only LBP: 71.9%; with LCC & LBP: 77.1%

Issue: Geometrical transformation
Paper: SIFT versus SURF: quantifying the variation in transformation, 2015 [19]
Methods: Speeded Up Robust Feature (SURF); Scale Invariant Feature Transform (SIFT)
Results: Less accurate when affine transformation is applied

Issue: Texture
Paper: Local derivative radial patterns: a new texture descriptor for CBIR, Elsevier, 2017 [20]
Methods: Local Derivative Radial Pattern (LDRP)
Results: Works best when the number of retrieved images is very small (5–10); less than 80% accuracy when more than 30 images are retrieved

images. However, colour is not a reliable feature due to the varying illumination conditions of an image. Texture describes the feel of a surface; the texture of an image refers to its visual surface pattern, defined by the intensity of its pixels. Texture analysis encompasses many problems such as classification and segmentation. Geometrical transformation is yet another major issue in retrieving similar images: as seen in Fig. 2, the same building is captured with variations such as a different angle. These challenges can be overcome by finding significant hidden information in images, and deep learning algorithms can be used to find this hidden data. A convolutional neural network, a deep learning approach, is a neural network


widely used for image processing tasks such as classification and segmentation. The crux of a CNN is to focus on local portions of an image rather than processing the entire image for feature extraction. Many CBIR techniques using deep learning have been implemented on different datasets; they give better accuracy than traditional CBIR methods. A summary of these techniques is shown in Table 2.

Table 2 Comparison of deep learning based CBIR techniques

Paper: Relevance feedback for content-based image retrieval using deep learning, IEEE, 2017 [15]
Method: Retrieval results keep changing based on the user feedback
Datasets: Caltech257, consisting of 30,607 object images
Results: Improved by 2–3% by combining relevance feedback with CBIR

Paper: Scalable object retrieval with compact image representation from generic object regions, ACM, 2016 [11]
Method: Framework consisting of feature extraction, indexing and querying
Datasets: Holidays, 1491 holiday images from 500 classes
Results: Accuracy: 83.7%

Paper: Scalable database indexing and fast image retrieval based on deep learning and hierarchically nested structure applied to remote sensing and plant biology, JoI, 2019 [10]
Method: Pretrained residual network (ResNet); novel nested hierarchical database indexing; recursive calculation based on local density estimation for similarity
Datasets: MalayaKew (MK) Leaf dataset, 44 classes with 52 images each; University of California Merced (UCM) dataset, 21 classes with 100 images each
Results: MK Leaf dataset accuracy: 88.1%; UCM dataset accuracy: 90.5%

Paper: Deep learning earth observation classification using ImageNet pretrained networks, IEEE, 2015 [12]
Method: Pretrained OverFeat model along with CNN
Datasets: UCM, consisting of 2100 aerial images of US cities
Results: Overall accuracy: 92.4%

Paper: Content based image retrieval using deep learning process, Springer, 2018 [13]
Method: Deep belief network (unsupervised); multi-feature retrieval technique
Datasets: SUN dataset
Results: 98.6% for small dataset (images < 1000); 96% for large dataset (images > 10,000)
(continued)


Table 2 (continued)

Paper: Medical image retrieval using deep convolutional neural network, Elsevier, 2017 [7]
Method: Class-based predictions
Datasets: 24 classes with 300 images each (7200 images in total), 70–30% training–testing split
Results: Classification accuracy: 99.77%; retrieval mean average precision: 0.69; 0.53 without using class predictions

Paper: Deep learning of binary hash codes for fast image retrieval, IEEE, 2015 [16]
Method: CNN trained with the ImageNet dataset; a hidden layer is added to learn hash codes of the images, which are used to retrieve similar images
Datasets: MNIST (60 K training and 10 K testing images); CIFAR-10 (10 object classes, 50 K training and 10 K testing images); Yahoo-1M (1,124,087 shopping item images)
Results: MNIST: 98.2 ± 0.3% retrieval precision; CIFAR: 89% precision; Yahoo-1M: 83.75% precision

Paper: PCANet: a simple deep learning baseline for image classification?, IEEE, 2015 [17]
Method: Three components: binary hashing, cascaded Principal Component Analysis (PCA), blockwise histograms; compared with RANDNet and LDANet
Datasets: FERET, 1196 distinct individuals with up to 5 images each; LFW, 13,233 face images of 5749 distinct individuals; Extended Yale B, 2414 frontal face images of 38 individuals; AR, over 4000 frontal images of 126 subjects
Results: FERET: 97.25%; LFW: 86.28% (unsupervised learning); Extended Yale B: 99.58%; AR: 95% accuracy

CBIR results have been improved by 2–3% by incorporating user relevance feedback [15]. The retrieval results keep changing based on the user feedback; once the results meet the user's expectations, the model stops running and returns the result to the user. Here, deep learning with five convolution layers is used. For similarity


measurement, Euclidean distance and cosine similarity are used. This experiment was carried out on the Caltech257 dataset, consisting of 30,607 object images. Sun et al. [11] used a framework consisting of feature extraction, indexing and querying. In feature extraction, the object is first detected using BING, and features are extracted using a fusion of CNN and VLAD: CNN specializes in describing general semantics, whereas VLAD describes local details of images. The fusion of CNN and VLAD gives an accuracy of 83.7% on the Holidays dataset, consisting of 1491 holiday images from 500 classes, and outperforms standalone CNN or VLAD features. An inverted indexing structure is used in the database to avoid exhaustive search. In another recent study [10], a pretrained CNN (a residual network, ResNet) trained on the ImageNet dataset is used. A novel nested hierarchical database indexing scheme, built with k-means clustering, makes querying the database faster. For similarity matching, a recursive calculation is done based on local density estimation. This framework gives 88.1% accuracy on the MalayaKew (MK) Leaf dataset, which has 44 classes with 52 images each, and 90.5% accuracy on the University of California Merced (UCM) dataset, which has 21 classes with 100 images each. A CNN trained on a large labelled dataset gives better accuracy than one trained on limited labelled data. Marmanis et al. [12] proposed a two-stage framework that gives better accuracy with limited labelled data. The first stage uses the pretrained OverFeat model, a higher version of the AlexNet [21] model, trained on 1.2 million images; the second is a trainable classification architecture that accepts 2D features as input. This framework gives 92.4% accuracy when tested on the UCM dataset, consisting of 2100 aerial images of US cities.
Sometimes the images in a dataset are challenging because of their nonlinearity. Such images should be pre-processed to increase system efficiency and speed up training. Random rotations and horizontal flips help the CNN become insensitive to image orientation, and ZCA (Zero Component Analysis) [29] decreases the redundancy among image pixels while emphasizing structure and features. This framework gives 92.86% accuracy on Food-11, which consists of 16,643 images in 11 major food categories [9]. Tzelepi and Tefas [8] proposed three models: (i) retraining with relevance information, used when the training dataset labels are available; (ii) fully unsupervised retraining, used when only the dataset is available; (iii) relevance-feedback-based retraining, used when user feedback is available. All three models are tested on two datasets: the UKBench dataset, comprising 10,200 object images in 2550 classes (4 images per class), and the Paris 6 K dataset, containing 6392 images. Cosine similarity and Euclidean distance are used for similarity measurement. The accuracy of CBIR depends on the feature representation of an image. The Deep Belief Network (DBN) method [23] is applied to retrieve image features and to classify data with divergence, i.e., noise, displacement, smoothness, etc. In the method of Saritha et al. [13], a multi-feature retrieval technique is introduced, combining features such as edge directions, color histogram, texture features and edge histogram. These


features are obtained and stored as a small signature of an image, and these signatures are compared for similarity measurement. The framework is tested on the SUN dataset, giving 98.6% accuracy for a small dataset (images < 1000) and 96% for a large dataset (images > 10,000). CBIR gives its best results when class-based prediction with a CNN is used. The proposed content-based medical image retrieval (CBMIR) system has been tested on a dataset [7] of 24 classes with 300 images each, prepared from a publicly available medical dataset. CBMIR gives 99.77% classification accuracy and a mean average precision of 0.69 for image retrieval (0.53 when class predictions are not used). Lin et al. [16] proposed a deep learning architecture that retrieves images by calculating binary hash codes. This framework consists of three components: (i) a CNN pretrained with supervised learning on the ImageNet dataset, (ii) a hidden layer added to learn the hash codes of the images, and (iii) retrieval of similar images using the Hamming distance between binary codes. The framework is tested on three datasets, namely MNIST, CIFAR-10 and Yahoo-1M; the retrieval accuracy on all three is shown in Table 2. PCANet, proposed by Tsung-Han Chan et al. [17], is based on principal component analysis, blockwise histograms and binary hashing. It has proved a simple yet extremely competitive baseline for object recognition and texture classification. PCANet is tested on the LFW, Extended Yale B, FERET and AR datasets, giving remarkable accuracy as shown in Table 2.
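The ZCA whitening step mentioned in the preprocessing discussion above can be sketched in NumPy as follows. This is a generic sketch of the technique, not the exact pipeline of [9]:

```python
import numpy as np

# Generic ZCA (zero-phase component analysis) whitening sketch.
# X holds flattened images, one sample per row.
def zca_whiten(X: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    X = X - X.mean(axis=0)                            # centre each pixel
    cov = np.cov(X, rowvar=False)                     # pixel covariance matrix
    U, S, _ = np.linalg.svd(cov)                      # eigendecomposition via SVD
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T     # ZCA whitening matrix
    return X @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                         # toy "images", 8 pixels each
Xw = zca_whiten(X)
# Whitened pixels are decorrelated: their covariance is close to the identity.
print(np.allclose(np.cov(Xw, rowvar=False), np.eye(8), atol=1e-2))
```

Unlike plain PCA whitening, ZCA multiplies back by U.T so the whitened data stays in the original pixel space, which is why it preserves image structure while removing pixel redundancy.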

4 Experimentation

This paper has summarized traditional CBIR techniques as well as CBIR techniques using deep learning. In addition, an experiment has been conducted on the Fashion MNIST dataset [24], comparing the results of a decision tree classifier, a feed-forward neural network (FNN) and a convolutional neural network (CNN).

4.1 Decision Tree

A decision tree classifier is a supervised machine learning algorithm used to classify data [25]. It uses a divide-and-conquer strategy: each internal node of the tree represents an attribute test, each branch represents a test outcome, and each leaf node indicates a class label [26, 27]. In CBIR, the attributes used in the decision tree are the visual features of the images [27]. In this experiment, the decision tree classifier was trained with a maximum tree depth of 50 and a random state of 42.

4.2 Feed Forward Neural Network

A feed-forward neural network is an artificial neural network used for classification in which the data moves only in the forward direction; the nodes of adjacent layers are fully connected with one another [28]. This paper reports the results of a feed-forward network built on the sequential model. In this network, a dense layer has been added


along with the ReLU activation function to introduce non-linearity. The output layer is a softmax layer, since Fashion MNIST is a multiclass dataset. The Adam optimizer is used in this experiment.

4.3 Convolutional Neural Network

A convolutional neural network is a deep learning approach used mainly for image classification [29]. It requires much less pre-processing than other classification algorithms, since the network learns its filters itself. One advantage of a CNN over other neural networks is that there is no need to flatten the image: a CNN can work directly with 2D images, which leads to better learning of image properties. In this experiment, a convolution layer has been added along with the ReLU activation function and a softmax function at the output layer.
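The decision tree setup described above can be sketched with scikit-learn. This is a hedged sketch on a small synthetic dataset rather than Fashion MNIST, but it uses the same hyperparameters stated in Sect. 4.1 (maximum depth 50, random state 42):

```python
# Sketch of the decision-tree configuration from Sect. 4.1, using scikit-learn
# on a small synthetic dataset in place of Fashion MNIST.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the image-feature matrix: 500 samples, 20 features, 3 classes.
X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Same hyperparameters as in the experiment: max depth 50, random state 42.
clf = DecisionTreeClassifier(max_depth=50, random_state=42)
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))  # held-out accuracy on the toy data
```

The FNN and CNN of Sects. 4.2 and 4.3 would be built analogously in a deep learning framework (dense or convolution layers with ReLU, a softmax output and the Adam optimizer).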

5 Results and Evaluation

To test the proposed experiment, the Fashion MNIST dataset is used. This dataset comprises Zalando's article images: 60,000 training images and 10,000 testing images with 10 class labels [24]. The evaluation measure used to compare the three algorithms above is accuracy. This measure is defined as the ratio of the number of correctly classified images to the total number of images, as shown in Eq. (1) [30].

Accuracy = (number of images classified correctly) / (total number of images)  (1)
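Equation (1) amounts to the following one-line computation (a minimal sketch):

```python
# Eq. (1): accuracy = correctly classified images / total images.
def accuracy(predicted: list, actual: list) -> float:
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))  # 3 of 4 labels correct -> 0.75
```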

Table 3 Evaluation of different models

Decision tree: 0.79
FNN: 0.87
CNN: 0.90

Table 3 shows the accuracy of the decision tree, FNN and CNN algorithms on a scale of 0–1. Using only a single convolution layer, the CNN outperforms the decision tree and FNN models. The accuracy of the CNN could be increased further by adding more convolution layers.

6 Conclusion

CBIR extracts images from a corpus of images with the help of their features. In this paper, we have summarized different research studies on content-based image retrieval (CBIR) and discussed several of its issues, with one representative study for each issue. CBIR retrieval accuracy can be improved by finding significant hidden information in images.


References

1. Zhou W, Li H, Tian Q (2017) Recent advance in content-based image retrieval: a literature survey
2. Torres R, Falcão A (2006) Content-based image retrieval: theory and applications. RITA, pp 161–185
3. Pasumarthi N, Malleswari L (2016) An empirical study and comparative analysis of content based image retrieval (CBIR) techniques with various similarity measures. In: Proceedings of the 3rd international conference on electrical, electronics, engineering trends, communication, optimization and sciences (EEECOS 2016), Tadepalligudem, pp 1–6. https://doi.org/10.1049/cp.2016.1529
4. Nwankpa C, Ijomah W, Gachagan A, Marshall S (2020) Activation functions: comparison of trends in practice and research for deep learning
5. Girish MM, Jai Shankar G, Chandan B (2019) Image recognition using convolutional neural network. IJIREEICE 7(3)
6. Tzelepi M, Tefas A (2017) Deep convolutional learning for content based image retrieval. Elsevier
7. Qayyum A, Anwar SM, Awais M, Majid M (2017) Medical image retrieval using deep convolutional neural network. Neurocomputing 266(29):8–20
8. Tzelepi M, Tefas A (2018) Deep convolutional learning for content based image retrieval. Neurocomputing 275(31):2467–2478
9. Islam MT, Siddique BN, Rahman S, Jabid T (2018) Image recognition with deep learning. In: International conference on intelligent informatics and biomedical sciences
10. Sadeghi-Tehran P, Angelov P, Virlet N, Hawkesford MJ (2019) Scalable database indexing and fast image retrieval based on deep learning and hierarchically nested structure applied to remote sensing and plant biology. J Imaging
11. Sun S, Zhou W, et al (2016) Scalable object retrieval with compact image representation from generic object regions. ACM Trans Multimedia Comput Commun Appl 12(2):29
12. Marmanis D, Mihai D, Esch T, Stilla U (2016) Deep learning earth observation classification using ImageNet pretrained networks. IEEE Geosci Remote Sens Lett 13(1)
13. Saritha RR, Paul V, Kumar GP (2018) Content based image retrieval using deep learning process. Clust Comput
14. George MP, Jayanthi S (2017) CBIR for biomedical image archives using efficient relevance feedback and user navigation patterns. IJCESR 4(10)
15. Xu H, Wang JY, Mao L (2017) Relevance feedback for content-based image retrieval using deep learning. In: Proceedings of the 2nd international conference on image, vision and computing (ICIVC). IEEE
16. Lin K, Yang HF, Hsiao JH, Chen CS (2015) Deep learning of binary hash codes for fast image retrieval. In: IEEE conference on computer vision and pattern recognition (CVPR) workshops, pp 27–35
17. Chan T, Jia K, Gao S, Lu J, Zeng Z, Ma Y (2015) PCANet: a simple deep learning baseline for image classification? IEEE Trans Image Process 24(12)
18. Cusano C, Napoletano P, Schettini R (2014) Combining local binary patterns and local color contrast for texture classification under varying illumination. J Opt Soc Am A 31:1453–1461
19. Srivastava S (2014) SIFT vs SURF: quantifying the variation in transformations
20. Fadaei S, Amirfattahi R, Ahmadzadeh MR (2017) Local derivative radial patterns: a new texture descriptor for content-based image retrieval. Elsevier, vol 137, pp 274–286
21. Paheding S, Alom MZ, Tarek T, Asari V (2018) The history began from AlexNet: a comprehensive survey on deep learning approaches


22. Xin M, Wang Y (2019) Research on image classification model based on deep convolution neural network. EURASIP J Image Video Process. https://doi.org/10.1186/s13640-019-0417-8
23. Khan A, Islam M (2016) Deep belief networks. IEEE. https://doi.org/10.13140/RG.2.2.17217.15200
24. https://www.kaggle.com/zalando-research/fashionmnist
25. Patel HH, Prajapati P (2018) Study and analysis of decision tree based classification algorithms. IJCSE 6(10)
26. Sharma H, Kumar S (2016) A survey on decision tree algorithms of classification in data mining. IJSR 5(4)
27. Kusrini M, Iskandar D, Wibowo FW (2016) Multi features content-based image retrieval using clustering and decision tree algorithm. Telkomnika 14(4):1480–1492
28. Le-Hong P, Le AC (2018) A comparative study of neural network models for sentence classification. IEEE
29. Albawi S, Mohammed TA, Al-Zawi S (2017) Understanding of a convolutional neural network. IEEE, Turkey
30. Novaković JD, Veljović A, Ilić SS, Papić Z, Tomović M (2017) Evaluation of classification models in machine learning. Theory Appl Math Comput Sci

Fuzzy-Logic Approach for Traffic Light Control Based on IoT Technology Guan Hewei1(B) , Ali Safaa Sadiq2 , and Mohammed Adam Tahir3 1 School of Information Technology, Monash University, Monash, Malaysia

[email protected]

2 School of Mathematics and Computer Science, University of Wolverhampton, Wulfruna

Street, Wolverhampton, UK [email protected] 3 Technology Sciences, Zalingei University, Zalingei, Sudan

Abstract. Traffic congestion is an extremely common issue; it occurs in many cities around the world, especially those with high car ownership. Traffic congestion not only causes air pollution and fuel wastage, but also increases commuting time and reduces the time available for work. For these reasons, traffic congestion needs to be controlled and reduced. The traffic light is the most widely adopted method of controlling traffic; however, most traffic lights in use are designed with predefined intervals, which cannot cope well with changes in traffic volume. Therefore, Internet of Things (IoT) based, or adaptive, traffic lights have been developed in recent years to complement traditional traffic lights. An adaptive traffic light can be built based on monitoring the current traffic situation or on Vehicle-to-Vehicle and Vehicle-to-Infrastructure communication. In this paper, a new design of adaptive traffic light is proposed; this traffic light system is based on fuzzy logic and introduces a volunteer IoT agent mechanism, which yields more accurate results.

1 Introduction

Traffic congestion refers to the phenomenon of a road traffic bottleneck produced by heavy traffic heading toward congested intersections at slow speed. It usually occurs during rush hours and holidays, in major metropolitan areas around the world, in areas with high automobile usage, and on highways connecting two cities. Frequent traffic congestion increases commuting time and reduces the time available for work, resulting in economic losses from workers arriving late at their workplaces, delayed delivery of goods and many other factors. It also causes drivers to feel irritated and impatient, which increases their stress and further damages their health. To a certain extent, traffic congestion also wastes fuel and causes pollution: engines keep running during congestion, which consumes extra fuel, and drivers repeatedly accelerate and brake, which increases fuel consumption further. Traffic congestion therefore not only wastes energy, but also causes air pollution.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_7

76

G. Hewei et al.

Therefore, in this paper a design of an adaptive traffic light is proposed. This traffic light system is based on fuzzy logic and introduces a volunteer IoT agent mechanism to ensure that the system gives more accurate results.

2 Related Work

Many studies have examined different methods of traffic management. The authors of [1] proposed a traffic congestion detection system consisting of I2I (infrastructure-to-infrastructure), V2I (vehicle-to-infrastructure) and V2V (vehicle-to-vehicle) communication methods and a big data cluster. The system uses these three communication methods to collect vehicular information such as each vehicle's latitude, longitude and speed. The data is then encapsulated in DATA packets using the LORA-CBF algorithm and transmitted to the big data cluster, where it is interpreted and further analyzed using an algorithm based on the Binary Traffic Output (BTO) algorithm proposed by [2] to identify potential traffic events such as congestion and accidents. Similar work has been done in [3]: the proposed system employs V2I communication technology based on the WAVE IEEE 802.11p standard and uses fuzzy logic to coordinate vehicles on the road. The fuzzy control system considers each vehicle's comfortable and safe distance and speed adjustment to prevent collisions in advance and improve traffic flow. This traffic management system can analyze received information and send warning or recommendation commands to the vehicles. In [4], the authors built an IoT-based traffic junction model with several ultrasonic sensors placed at the sides of the road to count the number of cars. The sensors are connected to an Arduino microcontroller, and the data is transmitted via a Wi-Fi module to a Raspberry Pi 3 microcontroller, where the analysis is carried out. Traffic is categorized as heavy or normal, and the traffic signal time is changed according to the traffic density. Azura Che Soh, Lai Guan Rhung and Haslina Md. Sarkan conducted research on an adaptive traffic light system using Matlab simulation; in their paper [5] they developed a fuzzy traffic controller based on vehicle queue length and waiting time at the current green phase. They compared the fuzzy traffic controller (FTC) with a vehicle-actuated controller (VAC) and found that the fuzzy controller distinctly reduces the average waiting time, average queue length and delay time. In another study [6], the status of intersections is divided into four categories; based on this evaluation, a fuzzy reasoning method is adopted to evaluate the traffic operation status of intersections and formulate corresponding traffic management methods for congestion.

3 Proposed Approach

Since traffic congestion has many negative influences on people's daily lives, the objective of this paper is to build a traffic light control system based on fuzzy logic. The system should be able to determine the duration of the traffic light's green time based on the given inputs, which are traffic saturation, queue length and traffic status from the volunteer IoT

Fuzzy-Logic Approach for Traffic Light Control Based on IoT Technology

77

agent. The system monitors the traffic condition during the current traffic light cycle (one cycle contains three phases: green light, yellow light and red light), then adjusts the duration of the green light phase in the next cycle. Eventually, this system can help improve traffic efficiency at a traffic junction and manage traffic more effectively by adjusting the traffic light's green time. Figure 1 shows the system's architecture at a traffic junction.

Fig. 1 Working scenario of the system

In Fig. 1, four sensors are deployed on the side of the road. The vehicle count sensor is placed right next to the traffic light; during the green light phase, every car passing through the intersection is captured by this sensor. The total vehicle count is the actual traffic volume at the intersection, and this value is then used to calculate traffic saturation, as explained in Sect. 3.1.2. The pre-threshold, threshold and post-threshold sensors are used to detect the vehicle queue length during the red-light phase. If the queue length ≤ 40 m, the pre-threshold sensor is triggered; if 40 m < queue length < 60 m, the threshold sensor is triggered; and if the queue length ≥ 60 m, the post-threshold sensor is triggered. In Fig. 1, there are two volunteer IoT agents at the intersection; volunteer agents observe and evaluate the traffic situation and send a traffic situation value (e.g. minor congestion, severe congestion) to the control system. Details of the volunteer agent are discussed in Sect. 3.1.4. In real life, multiple volunteer IoT agents at one intersection may observe and send traffic situation data at the same time; however, this paper assumes only one volunteer IoT agent. Another assumption in this paper is that no turns are involved, only straight-through traffic: vehicles can only move from East to West or from North to South, as shown in Fig. 1. Since the proposed method is built using simulation, all inputs are retrieved from a created input file containing pre-defined possible values for the input variables. The input files for each parameter are discussed in Sect. 3.1.

3.1 Design of Fuzzy Inference System

The fuzzy inference system (FIS) consists of fuzzification, a knowledge base (membership functions and fuzzy rules) and defuzzification [7]. The FIS takes crisp inputs and converts


them into fuzzy inputs using membership functions, then combines the fuzzy inputs to produce a fuzzy output based on the fuzzy rules and output membership functions. Figure 2 [7] illustrates the workflow of the fuzzy inference system in this paper.

Fig. 2 Computation of output Green Time

3.1.1 Fuzzification of Inputs and Outputs

The fuzzy inference system takes three inputs, which are discussed in detail in the following sub-sections.

3.1.2 Traffic Saturation

This input metric indicates the degree of saturation of an intersection under traffic signal control: a measure of how much demand it experiences compared to its total capacity. Road saturation refers to the ratio of the actual traffic volume to the capacity of the road. It can be represented by x and calculated using the equation x = q/Q, where q is the actual traffic volume and Q is the total capacity; x can be any value between 0 and 1. If x is greater than 0.9, the traffic condition at the intersection deteriorates sharply and traffic congestion is very likely to occur [8]. In this paper, the total capacity of the intersection is assumed to be 100 (Q = 100), which means that during one green light phase the maximum number of vehicles that can pass the intersection is 100. The membership function of traffic saturation is shown in Fig. 3; the Y-axis indicates the degree of membership, and the X-axis indicates the traffic saturation during the current green light phase. The input file for traffic saturation is named Green Phase Car Count Input File; it contains all the integers from 0 to 100. In each green light phase, the system will

Fuzzy-Logic Approach for Traffic Light Control Based on IoT Technology

79

Fig. 3 Membership function for traffic saturation

randomly retrieve one number from the file as the actual traffic volume (e.g. q = 50) during that green light phase, then q is substituted into equation x = Qq to calculate road saturation x. 3.1.3 Queue Length The queue length input metric refers to the length of space occupied by the stopped vehicles; it can reflect the traffic flow at the intersection. It is of great significance for evaluating the operation status of the intersection, measuring the severity of traffic congestion, and evaluating the existing signal timing plan. In general, the more severe the congestion, the longer the length of the queue. The membership function of traffic saturation is shown in Fig. 4. Y-axis indicates the degree of membership; X-axis indicates queue length which is measured in meters.

Fig. 4 Membership function for queue length

As mentioned above, the pre-threshold, threshold and post-threshold sensors detect the vehicle queue length. If the pre-threshold sensor is triggered, the condition is considered normal; if the threshold sensor is triggered, minor congestion; if the post-threshold sensor is triggered, severe congestion. The input file for queue length is named the Queue Length Input File. As indicated in Fig. 4, the minimum queue length is 0 m and the maximum is 100 m, so the file contains all integer numbers from 0 to 100; during each red-light phase the system randomly retrieves one number from the file as the queue length for that phase.
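The sensor logic above amounts to a three-way threshold test. A minimal sketch in Python (the exact behaviour at the 40 m and 60 m boundaries is an assumption):

```python
def classify_queue(queue_length_m):
    """Map a red-phase queue length in metres to the sensor that triggers."""
    if queue_length_m <= 40:
        return "pre-threshold"   # normal traffic condition
    if queue_length_m < 60:
        return "threshold"       # minor congestion
    return "post-threshold"      # severe congestion

print([classify_queue(q) for q in (25, 50, 75)])
```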


3.1.4 Traffic Status from Volunteer IoT Agent

Traffic information is also gathered from volunteer agents: any person near a traffic junction who observes the traffic situation and reports it to the control system through a mobile application. With the help of volunteer agents, the system can take nearby traffic junctions into account. For example, if both the current and the next junction are severely congested and the control system gives the current junction a green light, vehicles from the current junction will flood into the next one and make the congestion even worse. One solution is to extend the green time at the next junction to clear out vehicles while the current junction extends its red time to hold vehicles back; once the next junction has been cleared, the current junction can turn green and let vehicles pass. In this paper, however, we assume that all volunteer agents come from the current junction, i.e. the system only considers the traffic situation at the current intersection. A volunteer agent uses a scale of 1 to 5 to indicate the traffic situation, where 1 stands for no congestion and 5 for severe congestion. In addition, a trust level is assigned to each volunteer agent. Trust levels are high, medium and low, and each has its own weight: high = 20, medium = 15 and low = 10. Calculating a volunteer agent's value involves the following steps:

Step 1: the volunteer agent provides a number (e.g. 4) to indicate the current traffic situation.
Step 2: the system identifies the volunteer agent's trust level (e.g. medium).
Step 3: the number from Step 1 (here 4) is multiplied by the weight of the trust level (here 15): 4 * 15 = 60.

The result of Step 3 (60 in this case) is then used to evaluate the traffic situation with the membership function shown in Fig. 5.

Fig. 5 Membership function for traffic status from volunteer agent

Traffic status from the volunteer agent has two input files. The first, the Volunteer Agent Value Input File, contains the integer numbers 1 to 5 used by the agent as a scale to indicate the traffic situation. The second, the Volunteer Agent Trust Level Input File, contains the three numbers 20, 15 and 10, representing the high, medium and low trust levels respectively. During each green-light phase, the system randomly chooses one number from each file to calculate the volunteer agent's value.
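The three steps above reduce to a single multiplication. A small sketch (function and variable names are illustrative):

```python
# Weights for each trust level, as given in Sect. 3.1.4.
TRUST_WEIGHTS = {"high": 20, "medium": 15, "low": 10}

def agent_score(situation, trust_level):
    """Scale the agent's 1-5 situation rating by its trust-level weight."""
    if not 1 <= situation <= 5:
        raise ValueError("situation must be on the 1-5 scale")
    return situation * TRUST_WEIGHTS[trust_level]

# Worked example from the text: rating 4 from a medium-trust agent.
print(agent_score(4, "medium"))  # 60, as in Step 3
```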


The output of the system is a variable called GreenTime, which indicates how long the green traffic light will stay on during the next traffic-light cycle. The membership function of the output is shown in Fig. 6.

Fig. 6 Membership function for green time

In the above figure, the X-axis indicates the duration of the next green-light phase, measured in seconds; the maximum duration is 120 s.

3.2 Fuzzy Inference Engine

This section explains how the fuzzy rules are derived. The fuzzy inference engine consists of a group of rules. There are three input parameters: traffic saturation and traffic status from the volunteer agent each have four membership functions and thus four possible values (normal, minor congestion, medium congestion and severe congestion), while queue length has three membership functions and thus three possible values (normal, minor congestion and severe congestion). Therefore there are 4 * 4 * 3 = 48 rules in total. Since all inputs have the same weight and the order of inputs does not matter, the 48 rules can be classified into 19 distinct patterns, each assigned an output value as shown in Table 1. This table is then used to identify the output for each rule.
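The rule count and the collapse to 19 patterns can be checked mechanically: because the inputs carry equal weight and their order is irrelevant, each rule reduces to the multiset of its three input values.

```python
from itertools import product

# Value sets from Sect. 3.2: saturation and volunteer status take four
# values each, queue length only three (it has no "medium" level).
SATURATION = ("normal", "minor", "medium", "severe")
QUEUE = ("normal", "minor", "severe")
VOLUNTEER = ("normal", "minor", "medium", "severe")

rules = list(product(SATURATION, QUEUE, VOLUNTEER))
# Sorting each rule's values collapses order-equivalent rules to one pattern.
patterns = {tuple(sorted(rule)) for rule in rules}

print(len(rules), len(patterns))  # 48 19
```

The only unreachable multiset is "3 medium", since queue length cannot be medium, which is why Table 1 has 19 rather than 20 patterns.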

4 Implementation and Initial Results

Our implemented adaptive traffic light system consists of a main class, a fuzzy inference system, an interface file, a getInput class and four input files. The main class controls the traffic light cycles and retrieves input data from the input files using the data-retrieving method of the getInput class; it feeds the data (traffic saturation, queue length and traffic status from the volunteer agent) into the fuzzy inference system, which returns the green time as output, and then updates the user interface accordingly. Figure 7 shows the internal design of the whole system. A user interface has also been developed for visualization purposes; with the help of the simulation, people can better understand how the system works and see what is happening at any moment.

Table 1 Classification of fuzzy rules

Pattern number | Input value                  | Output
1              | 3 normal                     | Normal
2              | 3 minor                      | Short extension
3              | 3 severe                     | Long extension
4              | 2 normal, 1 minor            | Normal
5              | 2 normal, 1 medium           | Short extension
6              | 2 normal, 1 severe           | Short extension
7              | 2 minor, 1 normal            | Short extension
8              | 2 minor, 1 medium            | Medium extension
9              | 2 minor, 1 severe            | Medium extension
10             | 2 medium, 1 normal           | Short extension
11             | 2 medium, 1 minor            | Medium extension
12             | 2 medium, 1 severe           | Medium extension
13             | 2 severe, 1 normal           | Long extension
14             | 2 severe, 1 minor            | Long extension
15             | 2 severe, 1 medium           | Long extension
16             | 1 normal, 1 minor, 1 medium  | Short extension
17             | 1 normal, 1 minor, 1 severe  | Medium extension
18             | 1 normal, 1 medium, 1 severe | Medium extension
19             | 1 minor, 1 medium, 1 severe  | Medium extension

Figure 8 is the starting screen, the main entrance of the program; it shows the method's title and some explanations of the proposed technique and its parameter settings. When the Start button is clicked, its callback function is invoked, a new simulation screen appears and the system starts running. Figure 9 shows the simulation screen before the program starts running. There are two traffic directions, North-South and East-West, each with its own traffic light (the North-South Traffic Light and the East-West Traffic Light). The Traffic Saturation NS/EW, Queue Length NS/EW and Volunteer Agent Value NS/EW text boxes display the input data values, and the duration of the current traffic light is displayed at the top right. Figure 10 shows the simulation screen while the program is running and the pre-threshold sensor has been triggered. A green arrow indicates the current green-light direction, East-West, so the duration of the East-West green light is displayed at the top right; this duration is calculated from Traffic Saturation EW, Queue Length EW and Volunteer Agent Value EW. However, since the East-West direction currently has a green light, Queue Length EW is set to zero.


Fig. 7 System internal design

Fig. 8 Starting screen of the external design

In order to quantitatively evaluate the testing results, the Highway Capacity Manual 2010 [9] is used as a comparison. HCM 2010 is a publication of the Transportation Research Board of the National Academies of Sciences in the United States, which categorizes signalized intersections into different Levels of Service (LOS) by control delay. The LOS and control delay defined in the manual are summarized in Table 2 (Highway


Fig. 9 Simulation screen before program running

Fig. 10 Simulation screen with the pre-threshold sensor triggered

Capacity Manual 2010, 2010). Control delay is the delay caused by a control signal forcing vehicles to reduce speed or to stop; the inherent concept of control delay is the same as that of waiting time. Therefore, control delay is compared with the average waiting time.
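The Table 2 mapping from control delay to LOS grade can be written as a small lookup; treating each upper bound as inclusive is an assumption consistent with the table's ">10-20"-style ranges.

```python
def level_of_service(control_delay_s):
    """Return the HCM 2010 LOS grade for a signalized intersection,
    given the average control delay in seconds per vehicle (Table 2)."""
    for upper, grade in ((10, "A"), (20, "B"), (35, "C"), (55, "D"), (80, "E")):
        if control_delay_s <= upper:
            return grade
    return "F"

print(level_of_service(8), level_of_service(18), level_of_service(90))  # A B F
```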

5 Conclusion

In this paper, a fuzzy-logic-based adaptive traffic light system is proposed; the signalized traffic model and the fuzzy traffic controller have been developed in MATLAB. To the best of our knowledge, this is the first time the volunteer agent concept has been integrated into a fuzzy inference system for managing a smart traffic light system over IoT networks. The effectiveness of the proposed method has been tested by running the system over a long period. Overall, the test results show that the system performs well; it is therefore feasible and effective to use such an adaptive traffic light system at traffic junctions in smart cities.


Table 2 LOS criteria for signalized intersections

Level of service (LOS) | Control delay per vehicle (s/veh)
A                      | ≤10
B                      | >10–20
C                      | >20–35
D                      | >35–55
E                      | >55–80
F                      | >80

References

1. Bernard M (2014) Traffic light history abstracts. Academia
2. Matthew Nitch S (2018) The number of cars worldwide is set to double by 2040. Retrieved from https://www.weforum.org/agenda/2016/04/the-number-of-cars-worldwide-is-set-to-double-by-2040
3. Cárdenas-Benítez N, Aquino-Santos R, Magaña-Espinoza P, Aguilar-Velazco J, Edwards-Block A, Cass AM (2016) Traffic congestion detection system through connected vehicles and big data. Sensors 16(5):599. https://doi.org/10.3390/s16050599
4. Gupta A, Choudhary S, Paul S (2013) DTC: a framework to detect traffic congestion by mining versatile GPS data. In: 2013 1st International conference on emerging trends and applications in computer science. https://doi.org/10.1109/icetacs.2013.6691403
5. Milanes V, Villagra J, Godoy J, Simo J, Perez J, Onieva E (2012) An intelligent V2I-based traffic management system. IEEE Trans Intell Transport Syst 13(1):49–58. https://doi.org/10.1109/tits.2011.2178839
6. Ashok P, SivaSankari S, Vignesh M, Suresh S (2017) IoT based traffic signalling system. Int J Appl Eng Res 12(19):8264–8269
7. Azura C, Lai G, Haslinas MD (2010) MATLAB simulation of fuzzy traffic controller for multilane isolated intersection. Int J Comput Sci Eng 02(04):924–933
8. Ghafoor K, Sadiq A, Abu Bakar K (2011) A fuzzy logic approach for reducing handover latency in wireless networks. Netw Protoc Algorithms 2(4). https://doi.org/10.5296/npa.v2i4.527
9. Transportation Research Board of the National Academies (2010) Highway capacity manual 2010. Washington, D.C., pp 18–26

Data Clustering Algorithms: Experimentation and Comparison Anand Khandare

and Rutika Pawar

Department of Computer Engineering, Thakur College of Engineering and Technology, Mumbai, India [email protected]

Abstract. Due to increasing databases of all kinds, clustering has become one of the most essential tasks for classifying data. Clustering means grouping or dividing data points based on their similarity to each other. Clustering can be stated as an unsupervised data mining technique that describes the nature of datasets, and its main objective is to obtain groups of similar entities. There are various methods of clustering, such as hierarchical, partition-based, density-based, grid-based and model-based. This paper provides a detailed study of clustering and its working process. Along with basic information, the validity measures required to evaluate algorithms are discussed in detail. The paper reviews clustering algorithms such as K-Means, Agglomerative and DBSCAN, and a tabular comparison of the algorithms is presented to provide in-depth knowledge. The results obtained after experimenting with the algorithms are discussed at the end.

Keywords: Clustering · Data mining · KDD · K-Means clustering · DBSCAN · Agglomerative · Clusters

1 Introduction

With the advancement of technology and systems growing more complex, databases are expanding at an exponential rate. While developing systems, there is often a need to understand the data well and categorize it. Acknowledging the need to clear the clutter of data, data mining is a practice widely adopted by researchers and developers. Data mining is the process of forming clusters of similar data and discovering patterns and trends that go beyond simpler analysis [1]. With the help of data mining, the relations within large datasets are investigated, which helps to predict the most probable outcomes. Data mining is also used to detect anomalies in a dataset. Raw data is converted into useful information required for further analysis. A data mining algorithm first searches for patterns in the data, and these patterns help developers across the globe learn more about clusters, thereby developing market strategies and product manifestations. Data mining is also the analytical step of KDD, i.e., Knowledge Discovery in Databases.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_8


Despite the similarities between data mining and statistical data analysis, most methods used in data mining originated in fields other than statistics. The performance of an algorithm depends on effective data collection, data warehousing, and preprocessing.

1.1 Working of a Data Mining Algorithm

The algorithm first explores the given dataset and then analyzes it in large blocks of information to acquire the meaningful patterns present in the data. The process of data mining can be divided into the following steps:

Step 1: Collect the data, organize it, and load it into the warehouse.
Step 2: After organizing, load the data onto in-house servers or a cloud server.
Step 3: Professionals of the company, such as analysts, members of management teams and IT teams, work out ways to organize and store the processed data.
Step 4: The software application sorts the data based on various factors such as the target variable, user results, etc.

Data mining is divided into three categories:

• Classification
• Clustering
• Association Mining

2 What is Clustering?

Clustering is the data mining task of grouping the data objects given as input to the algorithm into a number of clusters [2]. The main aim of clustering is to identify groups with similar features and assign them to one cluster. Clustering is one of the most popular techniques in data science. Data points are assigned to clusters according to their similarity with neighboring points; the similarity between sample points is measured over their features by a metric called the similarity measure. As the number of features increases, however, it becomes more complex to find their similarity and group them accordingly.

Fig. 1 Clustering process

Figure 1 [3] shows the process by which clusters are formed and the data is subsequently interpreted through the knowledge obtained. First, the raw, untreated data undergoes feature selection: the process of selecting the features that contribute most to predicting the possible outcome. Irrelevant features in the data can decrease the accuracy of the result, and the choice of feature selection method largely impacts the performance of the model. Feature selection reduces overfitting, increases the accuracy of the model, and shortens training time. The next step after feature selection is applying the initial clustering algorithm. As the task of clustering is subjective [4], there are many paths to the goal; after running the algorithm on the data, initial tentative clusters are formed based on the similarity indexes. The results obtained from the initial clusters are then validated to form the final clusters of the data. Once the final clusters are formed, the data is ready for developers to interpret, operate on, and draw outcomes from.
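The feature selection → initial clustering → validation pipeline described above can be sketched with scikit-learn. The toy data and the variance threshold are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.feature_selection import VarianceThreshold
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two informative features plus one constant column that
# feature selection should discard.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
X = np.hstack([X, np.full((200, 1), 5.0)])

# Feature selection: drop the (near-)zero-variance column.
selected = VarianceThreshold(threshold=0.1).fit_transform(X)

# Initial clustering.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(selected)

# Cluster validation before accepting the final clusters.
print(selected.shape, round(silhouette_score(selected, labels), 2))
```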

3 Clustering Algorithms

3.1 Partition-Based Method

The partitioning method classifies a group of data based on the characteristics and similarities present in the data. The number of clusters needs to be specified before generating the clusters [5]. When a database D contains N objects, the partitioning method constructs K partitions of the data, each representing a cluster and a particular region.

3.2 Hierarchical-Based Method

In the hierarchical method, the data objects are arranged hierarchically using proximity measures. In hierarchical clustering, every data point is initially treated as a separate cluster [4]. Once clusters have been formed, the following steps are performed repeatedly:

Step 1: Identify the two clusters that are closest to each other.
Step 2: Merge the two most comparable clusters.

3.3 Density-Based Method

Density is defined as the number of objects in a given region. This method uses the concept of density-connected objects, searching the data space for dense regions [6].


Density-based algorithms play a vital role in finding non-linear cluster shapes based on density [4]. A cluster of similar data points grows in the direction of increasing density. The method also handles outliers present in the database. The main concept is to form clusters of data points together with their nearest neighbors.

3.4 Grid-Based Method

In this method, the space is divided into multi-resolution grids for quick processing of the data. The great advantage of the grid-based method is a significant reduction in computational complexity [7]. The performance of the clustering depends on the size of the grid, which is smaller than the original dataset. The disadvantage of this method is that a single uniform grid may not be sufficient to obtain high-quality clusters from irregular data.

3.5 Model-Based Method

In model-based clustering, one of several mathematical models is used to form clusters from the given dataset. The basic principle is to optimize the fit between the mathematical model and the dataset. This method uses probability distributions to generate the clusters [8].

4 Algorithms

4.1 K-Means Algorithm

The K-Means algorithm is a centroid-based technique that takes an input parameter K from the user and groups the most similar data into K clusters. Similarity has two aspects: intracluster, the high similarity among data objects within a particular cluster, and intercluster, the low similarity between data points of different clusters.

Fig. 2 K-means algorithm illustration

As shown in Fig. 2, the input data consist of a dataset 'D', a collection of data points, and the number of clusters 'K'. The K-Means algorithm is applied to this input, and its output is a set of clusters according to the value of K.


Algorithm:

1. Select k objects as initial centroids.
2. Repeat until no changes are observed:
3. Find the distance between data objects and centroids.
4. Form clusters by assigning data objects to the closest centroids.
5. Update the centroids.

The K-Means algorithm starts by determining an appropriate value of 'k'. In the first step, the centroids of the given data points are computed, and the distance between each centroid and the object data points is then calculated. Clusters are formed on the basis of minimum distance: the data points with the least distance to a centroid are grouped into its cluster. The algorithm then checks whether unassigned objects remain; if so, it re-runs the procedure above, updating the centroids, distance values and clusters. The loop terminates when no object remains as a single entity in the space and all of them have been classified into one cluster or another. The whole procedure is represented as a flowchart in Fig. 3.

4.2 Agglomerative Clustering Algorithm

The agglomerative algorithm is a bottom-up approach that builds a hierarchy of clusters. In this method, pairs of clusters are merged as one moves up the hierarchy [9], and the results are usually presented as a dendrogram. Agglomerative clustering is one of the most widely used algorithms; its basic idea is to merge and split in a greedy manner. As shown in the flowchart of the agglomerative algorithm (Fig. 3), initially all the data points are loaded and each single data point is considered a distinct cluster. The pair of clusters with the minimum dissimilarity is then merged, and this process repeats until one big cluster containing all the data points is formed.

Algorithm:

1. Compute the proximity matrix.
2. Consider each data object as a separate cluster.
3. Repeat until one cluster is formed:
4. Merge the two closest clusters.
5. Update the proximity matrix after every iteration.
6. When the criteria are met, stop.
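These steps can be sketched with SciPy's hierarchical clustering, which maintains the proximity matrix and merge history internally (the six sample points are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points in two tight groups.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

# Bottom-up merging: each point starts as its own cluster, and the two
# closest clusters are merged at every step (single linkage).
Z = linkage(X, method="single")

# Cut the resulting dendrogram into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```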


Fig. 3 Agglomerative Clustering Flowchart

Figure 4 is the output of the agglomerative clustering algorithm: after the algorithm is applied to the dataset, a dendrogram is produced, as shown in the figure. The dendrogram represents clusters formed from data on the heights of people in different countries. Initially, at the bottom, each country is considered a single cluster; then, according to the similarity and dissimilarity between them, clusters keep merging until one big cluster of all countries is formed.

5 About the Dataset

For the implementation of the clustering algorithms, two datasets were considered: one synthetic and one real. A dataset is a collection of information, usually arranged in columns. The first dataset was a synthetic one created using the 'make_blobs' feature of the Sklearn library, whereas the second was a real dataset containing information about universities.

5.1 First Dataset

This dataset was created with the 'make_blobs' feature of the Sklearn library. It consists of 200 samples around 4 cluster centers, with a cluster standard deviation of 1.8. The data is an array of randomly placed coordinates with shape (200, 2).
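This synthetic dataset can be regenerated with one call; the random_state is an assumption for reproducibility, since the paper does not state a seed:

```python
from sklearn.datasets import make_blobs

# Parameters from Sect. 5.1: 200 samples, 4 cluster centers, std 1.8.
X, y = make_blobs(n_samples=200, centers=4, cluster_std=1.8, random_state=42)
print(X.shape)  # (200, 2)
```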


Fig. 4 Sample Dendrogram

As stated and visualized in Fig. 5, the dataset is formed using 4 cluster centers, which makes it prone to forming 4 distinct groups.

Fig. 5 Synthetic dataset

5.2 Second Dataset

The second dataset used for the analysis was 'The University Dataset'. The main aim was to classify the dataset into two groups, 'Private' and 'Public'. The dataset consists of 18 variables and 777 observations; since an unsupervised learning algorithm is used, it has no labels. As the dataset concerns universities, the variables could be the number of students enrolled, application count, faculty count, types of streams, course enrollment, undergraduate and postgraduate student counts, fee structure, study material, alumni contacts, and more.


The variables stated above are merely obvious guesses, however, and they do not affect the working of the algorithm.

5.2.1 Visualizations

Figure 6 is a heatmap of the dataset. A heatmap shows relations between the variables/columns of a dataset. This heatmap represents the correlation between variables, with an index ranging from −0.8 to 1.0: an index of −0.8 exhibits very little correlation, while 1.0 means perfectly correlated.

Fig. 6 Heatmap of variables of dataset

Initial inferences from the dataset:

• No null values are present in the data.
• Skewness in the data ranges from −0.8 to 3.7.
• Out of the 18 variables, 6 had balanced data.
• Of the remaining 12 variables, 20% were left-skewed and the other 80% right-skewed.

5.2.2 Data Imbalance

Data imbalance refers to a state in which data points are not equally distributed among classes: some classes have fewer data points than others. The classes with fewer data points are called minority classes, whereas the ones with more data points


are termed majority classes. Training a model on imbalanced data results in poor predictive performance, especially for the minority class. Data often becomes imbalanced due to measurement errors or biased sampling. Some of the imbalanced data is shown below. From Fig. 7 it can be inferred that the data is unevenly distributed among the classes: most of the data tends towards either the maximum or the minimum value.

Fig. 7 Data imbalance visualization

The imbalance present in the data can be corrected by sampling methods or by correcting the errors raised during measurement. The imbalance arises chiefly because the training dataset is not representative of the problem.
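One such sampling method is random oversampling: duplicating minority-class rows until the classes are balanced. A minimal numpy sketch (the toy labels are illustrative, not drawn from the University dataset):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows until every class matches the
    majority class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    Xs, ys = [X], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = rng.choice(np.flatnonzero(y == cls), size=target - count)
            Xs.append(X[idx])
            ys.append(y[idx])
    return np.vstack(Xs), np.concatenate(ys)

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)   # 8 vs 2: class 1 is the minority
Xb, yb = random_oversample(X, y)
print(np.bincount(yb))  # [8 8]
```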


5.2.3 Outliers

Outliers are data points placed at an abnormal distance from the other data points in the dataset. An outlier may be present because of variability in the measurement of the data or some experimental error; outliers can also arise from changes in the behavior of the system, from fraud, or from human error. Outliers can cause major statistical errors. Some outliers were detected in the dataset, as shown below. Figure 8 shows the outliers in all the columns of the data. On average, 4.41% of the values across the 17 features are outliers, with the highest proportion in Column C (9.39%) and the lowest in Column F (0%).

Fig. 8 Outliers present in the dataset

Outliers in a dataset can increase variability and decrease statistical power, and hence it is necessary to treat them. If the outliers do not affect the results much, there is no need to delete them; in our case, however, the results are altered.

5.2.4 Outlier Truncation

The outliers in this dataset were removed using the interquartile range method. The interquartile range (IQR) is calculated as the difference between the 75th and 25th percentiles of the data; it defines the box in a box-and-whisker plot. After the truncation process using the IQR method, all the outliers were successfully removed: according to Fig. 9, all columns have 0% outliers.
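The IQR rule can be sketched as follows; the conventional 1.5 x IQR whisker bounds are an assumption, since the text only defines the IQR itself:

```python
import numpy as np

def iqr_truncate(col):
    """Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR],
    i.e. inside the whiskers of a box-and-whisker plot."""
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return col[(col >= lo) & (col <= hi)]

col = np.array([10.0, 11, 12, 12, 13, 14, 95])  # 95 is an obvious outlier
print(iqr_truncate(col))
```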

Fig. 9 Outliers after truncation process

6 Validity Measures

Along with the outcomes obtained by executing the algorithms, there is also a need to measure their performance and validity. Validity measures [10] are usually used as fitness functions that help evaluate the quality of the clusters formed after every iteration. All the measures depend on the data. Not all validity measures are time-efficient, but they may affect the quality of the cluster. Cluster validation and evaluation techniques serve as procedures for assessing the goodness of the clusters formed; by validating clusters, finding spurious patterns in random data can be avoided. The main aim is to evaluate the quality of the clusters formed. The evaluation and validation parameters include:

• Precision: the percentage of obtained results that are relevant; the ratio of correctly predicted positive observations to all predicted positive observations.
• Recall: the percentage of the total relevant results correctly classified by the algorithm; the ratio of correctly predicted positive observations to all observations in the actual class.
• F1 score: the weighted average of precision and recall; it therefore accounts for both false positives and false negatives.
• Accuracy: the most intuitive performance measure; the ratio of correctly predicted observations to the total number of observations.
• Calinski Harabasz score: also known as the Variance Ratio Criterion, it relates between-cluster dispersion to within-cluster dispersion; the higher the score, the better the performance.
• Silhouette score: determines how close an object is to its own cluster compared with other clusters. The score ranges from −1 to 1; a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
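The internal validity measures used later (Calinski Harabasz and silhouette) are available directly in scikit-learn; the data and model here are illustrative:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

X, _ = make_blobs(n_samples=200, centers=4, cluster_std=1.8, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

ch = calinski_harabasz_score(X, labels)   # higher is better
sil = silhouette_score(X, labels)         # in [-1, 1], higher is better
print(round(ch, 1), round(sil, 3))
```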


7 Results and Discussion

7.1 K-Means Algorithm

This clustering algorithm is based on a partitioning approach; it was applied to the datasets, and its results are shown below. The clusters were formed as shown when the value of k was set to 4. Figure 10 consists of two sections: the right side shows the original dataset, while the left side shows how the data is grouped after the model is trained and tested with k = 4.

Fig. 10 Clustering results for k = 4 (K-means)

That means the model has more or less correctly classified the clusters; minor differences are observed when clustering some data points, but a good level of accuracy is obtained.

Table 1 Results of K-Means (k = 4)

Precision | Recall | F1 score | Accuracy | Calinski Harabasz score | Silhouette score
0.9625    | 0.94   | 0.96     | 0.96     | 560.7427                | 0.55197

As the data initially contains 4 types of data points, the value k = 4 suited it best. After training the model with k = 4, an accuracy of 96% was achieved, with precision, recall and F1 score values of 0.96, 0.94 and 0.96 respectively, and a silhouette score of around 0.551. These are notable results for a first try. The value of K was then reduced to 2, and the following changes were observed:


A. Khandare and R. Pawar

Fig. 11 Clustering results for k = 2 (K-means)

According to Fig. 11, the model assigned the data points to the most plausible clusters given k = 2: the points lying above the zero level were grouped into one cluster, while the points below it were grouped into the other.

Precision   Recall   F1 Score   Accuracy   Calinski Harabasz Score   Silhouette Score
0.3325      0.5      0.375      0.5        460.44261                 0.6490

Fig. 12 Results of K-means (k = 2)

Although points were assigned according to nearest distance, the model yielded only 50% accuracy, with precision, recall and F1 score of 0.33, 0.5 and 0.37 respectively. No prominent change was observed in the Calinski Harabasz score, but the silhouette score increased by approximately 25%, because the points were grouped cleanly by their positional distances. The accuracy of the model, however, dropped by nearly 50%, since the data actually consists of 4 distinct sets that should be visualized as 4 different clusters. From these observations, one can conclude that proper selection of the value of k largely affects the results and the overall accuracy of the trained model.
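This conclusion — that the choice of k dominates the result — can be sketched by scanning candidate k values with a validity measure. Synthetic 4-group data stands in for the paper's dataset; note that, as the k = 2 numbers above show, a single internal measure such as silhouette can still look good for the wrong k, so several measures should be consulted.

```python
# Compare silhouette scores across candidate values of k.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

centers = [(-6, -6), (-6, 6), (6, -6), (6, 6)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0,
                  random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On clearly separated groups like these, the silhouette peaks at the true number of clusters.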


Design and Development of Clustering Algorithm for Wireless Sensor Network

Pooja Ravindrakumar Sharma(B) and Anand Khandare

Department of Computer Engineering, Thakur College of Engineering & Technology, Mumbai University, Mumbai, Maharashtra, India

Abstract. In AI and machine learning, the process of extracting and discovering patterns in large datasets by means of clustering has been extensively studied. Clustering methods are widely used to partition data, and K-means is a very powerful technique in unsupervised machine learning. The K-means computation is an iterative process that tries to minimize the distance of data points from the centroids of their clusters. A power-efficient K-means-based algorithm is proposed for wireless sensor networks to enhance their performance. K-means has several drawbacks that hamper its effectiveness; this paper discusses those limitations and offers suggestions, since most existing K-means variants do not deliver better performance, which depends on proper selection of the initialization method. We show how a modified K-means algorithm improves the nature of the clusters, focusing mainly on the centroid selection used in cluster assignment so as to improve clustering performance. Several initialization methods exist through which the algorithm can be tuned. In this paper, an enhanced clustering algorithm based on a modified K-means approach is proposed to improve performance and accuracy, so that it can be applied to WSN applications.

Keywords: Clustering · K-means algorithm · WSNs · Improved centroid selection · Accuracy · SSE · Modified method in algorithm

1 Introduction

1.1 Enhancement of an Algorithm for WSN

New technologies and the use of semiconductor components have driven rapid development in the electronics world. The integration of systems onto chips has brought a significant change in the way we see our surroundings. The development of microprocessors has led to the production of small-sized devices, called sensors, that are used in many applications. These nodes are still at a development stage, but many applications already use them for their purposes. Sensor elements are capable not only of measuring the physical factor for which they are designed, but also of processing and storing the collected data [1]. A WSN is built of nodes — from a couple to a few hundred or even thousands — where each node is connected to one

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_9


or more other sensors. In order to gather data more efficiently, an unsupervised clustering algorithm is used for data communication in the wireless sensor network (WSN). The concept of clustering has been around for a long time; real-life examples in our day-to-day lives show how the approach is used to group data. Cluster analysis discovers clusters of data that are similar to one another [2], and its principal objective is to find high-quality clusters so that similarity can be identified and the data usefully explored. Clustering algorithms can be used to find natural groupings where no obvious groupings exist [3]. A WSN can be characterized as a network of wireless devices that can gather data and convey it over wireless links [4]. Clustering of sensor nodes is very important for solving problems such as scalability, energy and lifetime issues of sensor networks. It involves grouping sensor nodes into clusters and choosing cluster heads (CHs) for all the clusters. Clustering helps organize voluminous data into clusters that reveal the internal structure of the data — the clustering of genes, for example. A WSN is a self-organized network made up of an enormous number of miniature sensors randomly deployed to monitor a region through wireless communication, and clustering is one of the critical procedures for prolonging network lifetime in WSNs: it involves gathering sensor nodes into clusters and picking cluster heads (CHs) for each of them. After clustering, the data is ready to be used by other AI techniques — for example news summarization, where data is grouped into clusters and a centroid is then found for each.
Data clustering techniques are very useful for extracting relevant information. A few years back there was an increase in the potential use of wireless sensor networks in applications such as environmental management and various kinds of surveillance [5]. In practice it is often found that estimating the exact values of all the criteria is difficult; in a WSN this involves grouping sensor nodes into clusters and choosing cluster heads (CHs) for each of the groups. The job of a CH is to collect the data from its cluster's nodes and forward the aggregated data to the base station, in order to be energy-efficient. Clustering is a well-known problem that has been widely studied in WSN applications. This paper defines the clustering problem in WSNs and proposes K-means, a distance-based clustering method, for grouping sensor nodes [6]. In the K-means algorithm, cluster heads are selected via Euclidean distances; all nodes transmit their data to a central node, which stores this information. Once the data has been collected from all the nodes, the k-means clustering algorithm is performed [7]. To overcome the disadvantages of K-means, we use a smarter initialization [8] of the centroids to improve the behavior of the clustering. Aside from initialization, the rest of the computation is the same as standard K-means; that is, K-means++ is the standard K-means algorithm combined with a smarter way of selecting the centroids. The elbow technique is also used to discover


the right number of clusters. Different parameters are used: initialization method, accuracy, homogeneity score, Davies Bouldin score and silhouette score [9]. We then run experiments implementing K-means with these different measurement techniques. Based on the results of the experiments, we finally draw conclusions about the performance of the modified algorithm under the different parameters [10].

Fig. 1 Cluster based mechanism of algorithm in WSN

2 Ease of Use

2.1 K-means Generalization Benefits

We are given a dataset of items with certain features, and values for these features (like a vector). The task is to categorize those items into groups [11]. To accomplish this, we use the K-means algorithm, an unsupervised learning algorithm. The algorithm works as follows:

• First we initialize k points, called means, randomly.
• We categorize each item to its nearest mean and update that mean's coordinates, which are the averages of the items categorized in that mean so far.
• We repeat the process for a given number of iterations; at the end, we have our clusters [12].

K-means advantages:

• It is simple to implement.
• It scales to large datasets.
• It guarantees convergence (to a local optimum).
• It can start with warm positions of centroids.
• It is easily adaptable.
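The three steps listed above can be sketched as a minimal NumPy implementation (random means, nearest-mean assignment, mean update, repeated for a fixed number of iterations; not an optimized version):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: initialize k means by picking random distinct data points.
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Step 2: assign each item to its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each mean to the average of its assigned items.
        for c in range(k):
            if np.any(labels == c):
                means[c] = X[labels == c].mean(axis=0)
    return labels, means

# Hypothetical usage on two synthetic groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, means = kmeans(X, 2)
```

Each loop iteration implements exactly the assign-then-update cycle described in the bullets.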


• It forms clusters of different sizes and shapes, such as elliptical clusters [13].

Drawback: one downside of this algorithm is its sensitivity to the initialization of the centroids (the mean points); a random initialization of centroids can result in poor clustering, which is not suitable [14]. Clustering outliers: as in all clustering algorithms, outliers are a main issue — they might get their own cluster instead of being ignored. Consider removing outliers before clustering [15].

2.2 Modified K-means (Initialization Method)

A smarter initialization improves the clustering; the rest remains the same as the standard k-means algorithm, i.e. k-means++ is used in the first phase of standard K-means. The initialization procedure picks centroids that are far away from one another, which increases the chances of initially picking centroids that lie in different clusters [16].

Applications of the K-means algorithm:

• Market segmentation
• Document clustering
• Image segmentation and compression
• Vector quantization
• Cluster analysis
• Feature learning or dictionary learning
• Identifying crime-prone areas
• Insurance fraud detection
• Public transport data analysis
• Clustering of IT assets
• Customer segmentation
• WSN applications

2.3 Clustering in WSN

Wireless sensor networks (WSNs) are used in various applications, from healthcare to the military. Because of their limited, small power sources, energy becomes the most important resource for sensor nodes in such networks [17]. To improve the use of energy resources, researchers have proposed several ideas from different angles. Clustering of nodes plays a significant role in preserving the energy of WSNs; clustering approaches focus on settling the conflicts that arise in effective data transmission [18]. In this part, we outline a few current energy-efficient clustering approaches for improving the lifespan of wireless sensor networks. In order to collect data more efficiently, an unsupervised machine-learning clustering algorithm is used for data communication in the WSN. For that, cluster analysis is a significant procedure for organizing a "mountain" of data into sensible, meaningful groups [19]. It is one of the clustering techniques for prolonging network lifetime in a WSN, involving the grouping of sensor nodes into clusters and the selection of CHs for all the clusters [20].
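As a hypothetical illustration of the CH idea described above (the node positions and counts are invented for this sketch): cluster the node coordinates, then appoint as cluster head the member node nearest each centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
nodes = rng.uniform(0, 100, size=(60, 2))  # 60 sensor nodes in a 100x100 field

km = KMeans(n_clusters=4, n_init=10, random_state=7).fit(nodes)
labels = km.labels_

# Cluster head (CH) = the member node closest to its cluster's centroid;
# members send readings to their CH, which forwards them to the base station.
heads = []
for c, centroid in enumerate(km.cluster_centers_):
    members = np.where(labels == c)[0]
    nearest = members[np.argmin(np.linalg.norm(nodes[members] - centroid,
                                               axis=1))]
    heads.append(nearest)
```

Choosing the node nearest the centroid keeps average member-to-CH transmission distances short, which is the energy argument made above.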


3 Abbreviations and Acronyms

WSN: Wireless Sensor Network; WCSS: Within-Cluster Sum of Squares; CH: Cluster Head; ECH: Enhanced Clustering Hierarchy; BS: Base Station [21]; Ed: Euclidean distance; K-means: clustering algorithm; K-means++: an initialization method for K-means.

3.1 Equations

In the first phase of standard K-means the initial centroids are chosen at random, but here K-means++ is used for better clustering: K-means++ computes the distance D(x) of every data point x from the cluster centres that have already been chosen [22]. The next cluster centre is then picked from the data points, with the probability of choosing x proportional to

P(x) ∝ D(x)²    (1)

To choose the right number of clusters, an elbow curve (the elbow method) is used to determine the ideal number, plotting WCSS versus the number of clusters [23]:

WCSS = Σ_{c_k} Σ_{d_i ∈ c_k} distance(d_i, c_k)²    (2)

To find the distance between two points, the Euclidean distance is used:

D(p, q) = √( Σ_{i=1}^{n} (q_i − p_i)² )    (3)
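The k-means++ seeding rule of Eq. (1) can be sketched as follows (the function name and synthetic data are ours): after a uniform first centre, each further centre is a data point x drawn with probability proportional to D(x)², its squared distance to the nearest already-chosen centre.

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first centre: uniform
    for _ in range(k - 1):
        # D(x)^2: squared distance to the nearest chosen centre.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                    # P(x) proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(10, 0.3, (30, 2))])
centers = kmeanspp_init(X, 2)
```

With two far-apart groups as here, the second centre almost surely lands in the group the first one missed, which is exactly the "centroids far from one another" property described in Sect. 2.2.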

3.2 Complexity

Standard k-means runs for I iterations; each iteration computes, for every one of the m data points, its distance to each of the k centroids, and then recomputes the means. The space requirements of K-means are modest, since only the data points and centroids are stored: the storage required is O((m + k)n), where m is the number of points and n the number of attributes. The time required is O(I·k·m·n) [24].

4 Proposed Methodology

In Fig. 2 we can see the procedure used to overcome the previously mentioned drawback: we use K-means++. This procedure guarantees a smarter initialization of the centroids and improves the behavior of the clustering. Apart from initialization, the remaining part of the computation is the same as the standard K-means algorithm. That


Fig. 2 Modified clustering algorithm for WSN

is, K-means++ is the standard K-means algorithm combined with smarter initialization of the centroids:

• Pick one centroid uniformly at random; for every data point x, compute D(x), the distance between x and the nearest centroid that has already been chosen.
• Pick one new data point at random as another centroid, using a weighted probability distribution in which a point x is picked with probability proportional to D(x)².
• Repeat steps 2 and 3 until K centres have been chosen.

4.1 Dataset

• The dataset consists of several attributes and relates to the beach environment; it is a weather-sensor dataset taken from data.world.
• Beach Weather Stations — Automated Sensors — dataset by cityofchicago | data.world.
• The Chicago Park District maintains weather sensors along beaches on Chicago's Lake Michigan lakefront. These sensors by and large capture the indicated measurements hourly while the sensors are in operation during the summer [25].

Figure 3 shows our beach weather automated-sensor dataset with its different attributes.

4.2 Implementation

Our beach weather-sensor dataset consists of several attributes: Beach name, Measurement timestamp, Water temperature, Turbidity, Transducer depth, Wave height, Battery life, Measurement ID and Wave period. First, we experimented on a mall-customer dataset taken from Kaggle, from which we obtain a segmentation of customers and find the optimal number of clusters through the elbow method; the figures show the customer segmentation compared across several attributes such as age and gender.


Fig. 3 Dataset of the beach weather automated sensor

Fig. 4 Number of clusters versus WCSS

Figure 4 shows how the right number of clusters is obtained through the elbow method. We also find 2–3 outliers, which we can ignore. In Fig. 5, the clusters on the X-axis are formed in the range [−2, 2], which is its amplitude; few outliers can be seen here.

4.3 Result and Discussion

Figure 6 is a graph showing the accuracy and SSE after the experimentation was done. We find fluctuation, so these alone are not sufficient parameters for assessing our algorithm; we have therefore taken more parameters, shown in the figures below.
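The WCSS-versus-k computation behind the elbow curve of Fig. 4 can be sketched as follows; scikit-learn exposes WCSS as the fitted model's `inertia_`. The synthetic blobs are a stand-in for the sensor dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=3)

# WCSS (inertia) for each candidate k; the "elbow" is where the drop
# flattens, suggesting the natural number of clusters.
wcss = [KMeans(n_clusters=k, n_init=10, random_state=3).fit(X).inertia_
        for k in range(1, 8)]
```

Plotting `wcss` against k reproduces the shape of Fig. 4: a steep fall up to the true k, then a flattening tail.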


Fig. 5 Clusters formed through dataset

Fig. 6 Modifications of the K-means algorithm (bar chart of Accuracy and SSE across Phase I–Phase IV; SSE values 0.6, 0.7, 0.5 and 0.5)

Figure 7 shows the silhouette analysis, and Fig. 8 reports some more parameters computed for the algorithm to obtain a better result. Table 1 shows the different parameters as follows: k data points from the given dataset are randomly picked as cluster centres, or centroids, and all training examples are plotted and added to the nearest cluster. After all instances have been added to clusters, the centroids — representing the mean of the instances of each cluster — are recalculated, and these recalculated centroids become the new centres of their respective clusters. Initialization method 'k-means++': it selects the centroids in a smart way and speeds up convergence.


Fig. 7 Silhouette analysis

Fig. 8 Computing the parameters (bar chart: Davies Bouldin score 2.24; homogeneity score 0.95–0.98; silhouette analysis 0.04)

Table 1 Modified K-means algorithm result analysis using different parameters

Parameter name          Phase-I   Phase-II    Phase-III
Initialization method   Random    K-means++   K-means++
K-value                 K = 4     K = 4       K = 5
Accuracy                -         51.71       65.27
SSE                     0.6       0.7         0.5
Davies Bouldin score    -         -           2.24
Homogeneity score       -         -           0.98
Silhouette analysis     -         -           0.04


K-value — the optimal value of K for the unsupervised ML clustering algorithm; as the value of K increases, the distortion decreases. Accuracy — measures how close the predictions are to the true values. SSE — the sum of squared errors, calculated from the distance between the centroid and each member of the cluster; as K increases, the distortion becomes smaller. Davies Bouldin score — defined as the average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances; clusters that are farther apart and less dispersed give a better result. Homogeneity score — a metric independent of the absolute values of the labels: permuting the cluster labels does not change it in any way; perfectly homogeneous labelings score 1, with values lying between 0 and 1. Silhouette analysis — used to determine the degree of separation between clusters, ranging over the interval [−1, 1]:

• A value near 0 → the sample is very near to neighboring clusters.
• A value near 1 → the sample's neighboring clusters are far away.
• A value near −1 → the sample has been assigned to the wrong cluster.
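The three scores just defined can be computed with scikit-learn as follows (synthetic data and labels stand in for the experiment's dataset):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (davies_bouldin_score, homogeneity_score,
                             silhouette_score)

X, y_true = make_blobs(n_samples=300, centers=4, random_state=5)
labels = KMeans(n_clusters=4, n_init=10, random_state=5).fit_predict(X)

db = davies_bouldin_score(X, labels)     # lower is better
hom = homogeneity_score(y_true, labels)  # external: needs true labels
sil = silhouette_score(X, labels)        # in [-1, 1]
```

Note that the homogeneity score is the only one of the three that requires ground-truth labels, which is why Table 1 can report it only where labeled data is available.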

5 Conclusion

To assess an algorithm's performance we have to consider multiple parameters, because in unsupervised learning there is no labelling of the data. The K-means algorithm has been widely used for clustering large sets of data, but the existing standard k-means algorithm does not always guarantee good results, since the accuracy of the final clusters depends on the selection of the initial centroids. Moreover, the computational complexity of the standard algorithm is questionably high, owing to the need to reassign the data points a number of times during each and every iteration of the loop. This work presents an enhanced algorithm that incorporates a systematic strategy for finding the initial centroids and is an efficient approach to use in WSN applications. Research toward better results continues; the results shown above come from experimenting on different datasets.

Acknowledgements. We truly thank data.world, which makes it easy for everyone — not just the "data people" — to get clear, accurate, fast answers to any business question. It is home to the world's biggest collaborative data community, which is available free and open to the public.

References

1. A tutorial on clustering algorithms. https://matteucci.faculty.polimi.it/Clustering/tutorial_html/index.html
2. An introduction to clustering. https://medium.com/datadriveninvestor/an-introduction-to-clustering-61f6930e3e0b


3. Raouf MM (2019) Clustering in wireless sensor network (WSNs). ResearchGate paper. https://doi.org/10.13140/RG.2.2.34342.98887
4. Oracle database online documentation 11g release 1 (11.1) / data warehousing and business intelligence data mining concepts. https://docs.oracle.com/cd/B28359_01/datamine.111/b28129/clustering.htm#CHDCHHJFs
5. Partitioning Method (K-Mean) in data mining, last updated: 05 Feb 2020. https://www.geeksforgeeks.org/partitioning-method-k-mean-in-data-mining/
6. Almajidi AM, Pawar VP, Alammari A () K-means based method for clustering and validating wireless sensor network. In: International conference on innovative computing and communications, vol 55, pp 251–25
7. Kodinariya TM (2013) Review paper on determining number of cluster in K-means clustering 1(6):90–95
8. Harrington P (2012) The k-means clustering algorithm. Machine learning in action
9. El Alami H, Nahid A (2019) ECH: an enhanced clustering hierarchy approach to maximize lifetime of wireless sensor networks. https://doi.org/10.1109/ACCESS.2019.2933052
10. Hassan AA, Shah W, Husien AM, Talib MS, Mohammed AAJ, Iskandar M (2019) Clustering approach in wireless sensor network based on k-means: limitations and recommendations. Int J Recent Technol Eng (IJRTE) 7(65), ISSN: 2277-3878
11. Ray A, De D (2016) Energy efficient clustering protocol based on K-means (EECPK-means)-midpoint algorithm for enhanced network lifetime wireless sensor network. IET Wirel Sens Syst 6(6):181–191
12. Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2002) An efficient k-means clustering: analysis and implementation. IEEE Trans Pattern Anal Mach Intell 24(7):881–892
13. Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
14. Wilkin GA, Xiuzhen H (2007) K-means clustering algorithms: implementations and comparison. In: Proceedings of the 2nd international multi-symposiums on computer and computational sciences. IMSCCS'07, pp 133–136
15. GeeksforGeeks. https://www.geeksforgeeks.org/ml-k-means-algorithm/
16. Euclidean distance. https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm
17. Bangoria BM (2014) Enhanced K-means clustering algorithm to reduce time complexity for numeric value. IEEE, 2014
18. Raouf MM (2019) Clustering in wireless sensor network (WSNs). ResearchGate paper, March 2019
19. Yadav A, Singh SK (2016) An improved K-means clustering algorithm. IEEE, Nov 2016
20. Sujatha S, Sona AS (2013) New fast K-means clustering algorithm using modified centroid selection method. IJERT
21. Sindhu D, Singh S (2014) Clustering algorithms: mean shift and K-means algorithm. IEEE
22. Dataset taken from: dataset: About us | data.world
23. Khandare AD (2015) Modified K-means algorithm for emotional intelligence mining. In: International conference communication and informatics (ICCCI), 24 Aug 2015. https://doi.org/10.1109/ICCCI.2015.7218088
24. Barai A (Deb), Dey L (2017) Outlier detection and removal algorithm in K-means and hierarchical clustering. World J Comput Appl Technol 5(2):24–29. https://doi.org/10.13189/wjcat.2017.050202
25. Dataset available on: https://data.world/cityofchicago/beach-water-quality-automated-sensors/workspace/file?filename=beach-water-quality-automated-sensors-1.csv

Mitigate the Side Channel Attack Using Random Generation with Reconfigurable Architecture

A. E. Sathis Kumar(B) and Babu Illuri

Department of Electronics and Communication Engineering, Vardhaman College of Engineering, Hyderabad, Telangana, India
[email protected]

Abstract. Side channel analysis is a major threat to secure communication devices. Our objective is clear: to protect sensitive information by adding DNA cryptography with a chaotic countermeasure method, in order to overcome side-channel issues. A side-channel attack can easily retrieve sensitive information from embedded crypto devices, which is why the sensitive information (the encryption key) in such devices must be protected by applying a random number generation method. Random number generation is mostly used in medical data transmission, information security, and other sensitive areas. In this research work we implement random generation using a chaotic circuit, and the paper also concentrates on realizing the chaotic circuit in hardware on a partially reconfigurable FPGA. Lorenz's chaotic circuit is used to protect the sensitive information. Simulation and synthesis verification are done using the VIVADO tool, and security analysis is done by hierarchical UACI, NPCR, and correlation.

Keywords: Side channel attack · FPGA · Chaotic

1 Introduction

Electronic trading, digital signatures, multimedia communications and other areas have grown quickly with the popularity of personal computers and networks [1]. At about the same time, more and more focus is being paid to information security issues in these areas. Most hackers are excellent at analyzing an encryption algorithm mathematically, so that they can quickly decrypt sensitive information [2]; in recent years hackers have retrieved sensitive information through side-channel power-analysis attacks. To overcome this problem we need to develop random number generation to enhance the security of devices. The sensitive values are passed through the initial condition of a chaotic equation; based on the chaotic initial condition it generates random values, and it has been shown that chaos exists in almost all engineering fields [3]. Chaos theory has further been used in many areas such as secure communication, image processing, and the field of cryptography [4–9]. Our contribution is two-fold. First, we suggest chaotic countermeasures with groups of initial conditions as an efficient countermeasure technique included in the DNA crypto

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_10


process. Second, we suggest an area- and power-centric hardware implementation of the proposed chaos solution that can be deemed sufficient for the effective prevention of SCA (Side Channel Attack).

2 Cryptography

Nowadays cryptographic devices are exposed to hackers who try to retrieve sensitive information — for example military data, medical data, and banking transactions. In the future most people will use NFC (Near Field Communication) devices for money transactions, and for this kind of new technology we need a countermeasure against side-channel attacks, especially side-channel attacks that retrieve information from a secure device via power traces [10].

a. Related work

Nowadays reverse engineering is used to take the sensitive information of crypto devices [6]. Side-channel attack is one branch of reverse engineering, and it falls into three categories: 1. Correlation Analysis, 2. Differential Power Analysis, 3. Template and Non-Template Attacks [11–16]. Based on these three methods, hackers can analyze the sensitive information of crypto devices from power traces [17]. They move to a research stage on the power traces to collect secure information, such as a hidden PIN number; once they obtain the data, they can move to a cloning process to produce precisely the same output as the system. As a result, the majority of papers address only the extraction of the key from the secured system; only a few authors focus on including a countermeasure using an FPGA method, and some authors include reconfigurable architectures with machine-learning algorithms for classifying the power traces of protected devices as a countermeasure [18–25]. Based on this survey, in this paper we include DNA cryptography with a chaotic countermeasure method.

b. DNA Cryptography

We need a cryptographic device to encrypt confidential information, but in recent days reverse-engineering techniques have been used to extract confidential information from crypto devices.
Most papers say that the DNA crypto process is more secure [26], but due to side-channel attacks hackers can retrieve the sensitive information via the power traces of a secure DNA device. In general, most cryptography algorithms are mathematically strong, but if the key structure is known, sensitive information can easily be retrieved through a side-channel attack [27]. A few works have been implemented to mitigate the side-channel attack; the general structure of the DNA process is as follows. The DNA sequence used in this study is encoded as:

A = 00


C = 01
G = 10
T = 11

With this encoding, the 8-bit pattern 01100010 is equivalent to CGAG in DNA computing. The value of DNA lies in its storage density: one gram of DNA contains about 10^21 bases ≈ 10^8 terabytes of data, so data can be stored in a very compact form.

3 Chaotic Circuit

dx/dt = k(y − x)    (1)

dy/dt = −xz + ly    (2)

dz/dt = −lx + ym    (3)

Here k = 10, l = 20 and m = 35 are the parameter values of the above equations; k, l and m are the chaotic parameters, and the initial conditions carry the sensitive values. Figure 1 shows the proposed work. In this article, we use the pseudorandom chaotic series to produce the DNA encoding-rule and decoding matrices for the medical image.

a. Algorithm Procedure

Step One: Plain text — the original medical message that needs to be transmitted.
Step Two: The original medical image is converted to cipher data, which cannot be read.
Step Three: Decryption — the cipher text is converted back to plain text.
Step Four: The key is combined with a random number from the chaotic circuit using the DNA crypto method.

b. Methodology

FPGA design flow: the first stage of the FPGA design flow is the design specification. The hardware configuration is implemented and realized from the specification using the Verilog HDL language. The behavior of the model must then be analyzed through the simulation process; simulation here is nothing but testing the functionality of

A. E. Sathis Kumar and B. Illuri

Proposed Work: DNA Encryption Process

Fig. 1 Proposed encryption and decryption process (block diagram: in the encryption process, the medical data is XORed with the secret key derived from the chaotic circuit to produce cipher data; in the decryption process, the cipher data is XORed with the same secret key and DNA-decrypted to recover the original medical data)

the design, i.e. whether it functions properly according to the specification, using the VIVADO tool in the form of waveforms as shown in Fig. 4. To verify the Verilog HDL code, we used the HDL code along with a test bench; test patterns are applied to stimulate the design on the test bench. The next step is to realise the hardware mentioned in the specification: through the synthesis process, the RTL architecture is transformed into a net-list describing the hardware relationships between the modules (Fig. 2).

c. Architecture of a PYNQ-Z2 Board
The most important feature of the PYNQ board is communication between the PS (Processing System) and PL (Programmable Logic) in a Python environment, based on the system-on-chip process. PYNQ is an abbreviation of Python productivity for ZYNQ. The internal architecture of the PYNQ board is shown in Fig. 3.
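The encryption and decryption processes of Fig. 1 reduce to a symmetric XOR with a chaotic keystream. A minimal software sketch is below; the logistic map stands in for the hardware chaotic circuit, and the byte-quantisation rule is an illustrative assumption.

```python
# Sketch of the Fig. 1 pipeline: medical data XOR chaotic keystream -> cipher,
# and the symmetric reverse for decryption. The logistic map is a stand-in
# for the chaotic circuit; the quantisation to key bytes is an assumption.
def keystream(length, seed=0.1):
    x, key = seed, []
    for _ in range(length):
        x = 3.99 * x * (1.0 - x)       # stand-in chaotic map iteration
        key.append(int(x * 255) & 0xFF)  # quantise state to one key byte
    return bytes(key)

def xor_bytes(data, key):
    return bytes(d ^ s for d, s in zip(data, key))

medical_data = b"patient record 42"
key = keystream(len(medical_data))
cipher = xor_bytes(medical_data, key)     # encryption process
recovered = xor_bytes(cipher, key)        # decryption process
assert recovered == medical_data
```

Because XOR is its own inverse, the same keystream generator on the receiving side suffices for decryption, which is exactly the symmetry the block diagram shows.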


Fig. 2 FPGA design flow (specification → HDL .v file and test bench .v file → simulation → synthesis)

Fig. 3 PYNQ board modules [7]

4 Results and Comparison Analysis

Figure 9 shows the effects of the chaotic circuit simulation using Matlab. The simulation results make it explicit that hackers cannot figure out the initial and final values of the random behaviour. Because an intruder cannot reproduce this random process, the confidential information cannot be traced back by side-channel power analysis (Figs. 5, 6, 7, 8 and 9).


Fig. 4 Board selection using VIVADO tool

Fig. 5 Create project

VIVADO Design Process: Figure 10 shows the HDL simulation of the chaotic circuit using Verilog HDL; based on this method we can analyse the functionality of the chaotic circuit and further improve the hardware using the FPGA VIVADO software. Figure 11 shows the DNA hardware partitioning representation, from which it is evident that partitioning reduces the area and increases the speed of the chaotic mechanism in the DNA crypto process (Fig. 10). Figure 11 shows the partial reconfiguration of the proposed DNA-with-chaotic-circuit design; based on the partial reconfiguration process we can reduce the hardware burden. Figure 12 shows the RTL view representation of the DNA-with-chaotic design process; it is exactly the hardware realisation of the proposed work.


Fig. 6 Selection of the hardware board PYNQ-Z2

Fig. 7 Project file creation

The integration of the co-design of hardware and software is shown in Fig. 13. It also helps to optimise the area and increase the speed. Table 2 shows the utilization of resources.


Fig. 8 Project file: giving constraints

Fig. 9 Chaotic circuit simulation using Matlab 2018

NPCR(%) = ( Σ_{i,j} D(i, j) / (M × N) ) × 100

UACI(%) = ( 1 / (M × N) ) Σ_{i,j} ( |C1(i, j) − C2(i, j)| / 255 ) × 100

CP = Σ_{x=1}^{S} Σ_{y=1}^{S} (I(x, y) − μ)(I(x, y + 1) − μ) / √( Σ_{x=1}^{S} Σ_{y=1}^{S} (I(x, y) − μ)² × Σ_{x=1}^{S} Σ_{y=1}^{S} (I(x, y + 1) − μ)² )

where i, j index the pixels of the M × N cipher images C1 and C2, D(i, j) = 1 if C1(i, j) ≠ C2(i, j) and 0 otherwise, I(x, y) is the pixel intensity, and μ is its mean.


Fig. 10 Simulation of chaotic circuit

Fig. 11 Partial reconfigurable DNA and chaotic circuit

*NPCR: number of pixels change rate
*UACI: unified average changing intensity
*CP: correlation coefficient

NPCR: The NPCR is the rate of change in the number of pixels, primarily used to verify the protection strength of the encryption process; the application here is medical information.


Fig. 12 RTL view DNA with chaotic design

Fig. 13 Hardware software co-design

UACI: The unified average changing intensity is used to show the strength of the image encryption process, so we demonstrate the power of the encryption architecture based on the UACI scheme. The scheme is considered safe if the metric is at least 96%, and not safe otherwise (Table 1).
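The NPCR and UACI checks can be sketched numerically as follows; this is a minimal NumPy rendering of the formulas given earlier, and the random stand-in cipher images are illustrative only.

```python
import numpy as np

def npcr(c1, c2):
    """NPCR: percentage of pixel positions whose values differ."""
    return 100.0 * np.mean(c1 != c2)

def uaci(c1, c2):
    """UACI: mean absolute intensity change normalised to 255, in percent."""
    return 100.0 * np.mean(np.abs(c1.astype(int) - c2.astype(int)) / 255.0)

rng = np.random.default_rng(0)
c1 = rng.integers(0, 256, size=(64, 64))  # stand-in cipher image 1
c2 = rng.integers(0, 256, size=(64, 64))  # stand-in cipher image 2
print(round(float(npcr(c1, c2)), 2), round(float(uaci(c1, c2)), 2))
```

For two independent uniformly random images, NPCR lands near 99.6% and UACI near 33%, which is why values in that neighbourhood (Table 1) indicate strong diffusion.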


Table 1 Comparison of different algorithms on medical images

Algorithm     | NPCR (%) | UACI (%) | CP (%)
Proposed work | 99.80    | 33.49    | 99.45
Ref. [2]      | 99.72    | 33.53    | 99.65
Ref. [3]      | 99.70    | 33.55    | 99.50
Ref. [4]      | 99.21    | 33.17    | 99.11

Utilization resources of the FPGA are given in Table 2.

Table 2 Resource utilization percentage

FPGA board resource | Available | Consumed | Utilization (%)
DSP                 | 220       | 200      | 90.90
LUTs                | 53200     | 37825    | 71.09
Flip Flops          | 106400    | 52005    | 48.87
BRAM                | 140       | 65.20    | 46.57

NIST standard analysis: The randomness of the chaotic circuit implementation with the DNA architecture is checked against the US NIST statistical security test suite [28–30]. As a result, a hybrid encryption process is created by combining chaotic behaviour with the DNA structure.

5 Conclusion

In this paper we focus on mitigating the side-channel attack on DNA cryptography using chaotic random number generation. We analysed the simulation results of the chaotic circuit's behaviour and realised the circuit in hardware using an FPGA. The future scope of the paper is to include a machine learning algorithm to detect the side-channel attack and to build stronger countermeasure techniques.


References
1. Author F (2016) Article title. Journal 2(5):99–110
2. Le T-H, Servière C, Cledière J, Lacoume J-L (2007) Noise reduction in the side channel attack using fourth-order cumulants. IEEE Trans Inf Forensics Secur 2(4):710–720
3. Ryoo J, Han DG, Kim SK, Lee S (2008) Performance enhancement of differential power analysis attacks with signal companding methods. IEEE Signal Process Lett
4. Research Center for Information Security (RCIS) of AIST, Side-channel Attack Standard Evaluation Board (SASEBO). http://www.rcis.aist.go.jp/special/SASEBO/index-en.html
5. Author F, Author S (2016) Title of a proceedings paper. In: Editor F, Editor S (eds) Conference 2016, LNCS, vol 9999. Springer, Heidelberg, pp 1–13
6. Author F, Author S, Author T (1999) Book title, 2nd edn. Publisher, Location
7. Author F (2010) Contribution title. In: 9th international proceedings on proceedings. Publisher, Location, pp 1–2
8. LNCS Homepage. http://www.springer.com/lncs. Accessed 21 Nov 2016
9. Güneysu T, Moradi A (2011) Generic side-channel countermeasures for reconfigurable devices. In: Cryptographic hardware and embedded systems, CHES 2011, Nara, Japan, LNCS, vol 6917, pp 33–48
10. Illuri B, Jose D (2021) Highly protective framework for medical identity theft by combining data hiding with cryptography. In: Chen JZ, Tavares J, Shakya S, Iliyasu A (eds) Image processing and capsule networks. ICIPCN 2020. Advances in intelligent systems and computing, vol 1200. Springer, Cham. https://doi.org/10.1007/978-3-030-51859-2_60
11. Kadir SA, Sasongko A, Zulkifli M (2011) Simple power analysis attack against elliptic curve cryptography processor on FPGA implementation. In: Proceedings of the 2011 international conference on electrical engineering and informatics, Bandung, Indonesia, 17–19 July 2011, pp 1–4
12. Saeedi E, Kong Y, Hossain MS (2017) Side-channel attacks and learning-vector quantization. Front Inform Technol Electron Eng 18(4):511–518
13. Singh A, Chawla N, Ko J-H (2019) Energy efficient and side-channel secure cryptographic hardware for IoT-edge nodes. IEEE Internet Things J. https://doi.org/10.1109/JIOT.2018.2861324
14. Zhao M, Suh GE (2018) FPGA-based remote power side-channel attacks. In: 2018 IEEE symposium on security and privacy
15. Illuri B, Jose D (2020) Design and implementation of hybrid integration of cognitive learning and chaotic countermeasures for side channel attacks. J Ambient Intell Human Comput. https://doi.org/10.1007/s12652-020-02030-x
16. Zhao M, Suh G (2018) FPGA-based remote power side-channel attacks. In: 2018 IEEE symposium on security and privacy (SP). IEEE, pp 229–244
17. Schellenberg F, Gnad D, Moradi A, Tahoori M (2018) An inside job: remote power analysis attacks on FPGAs. In: 2018 design, automation & test in Europe conference & exhibition (DATE). IEEE, pp 1111–1116
18. Shan W, Zhang S, He Y (2017) Machine learning-based side-channel-attack countermeasure with hamming-distance redistribution and its application on advanced encryption standard. Electron Lett 53(14)
19. Pande A, Zambreno J (2018) Design and hardware implementation of a chaotic encryption scheme for real-time embedded systems. In: An effective framework for chaotic image encryption based on 3D logistic map. Security and communication networks, vol 2018
20. Singh A, Chawla N, Ko J-H (2019) Energy efficient and side-channel secure cryptographic hardware for IoT-edge nodes. IEEE Internet Things J 6(1)


21. Özkaynak F (2017) Construction of robust substitution boxes based on chaotic systems. Neural Comput Appl 1–10. https://doi.org/10.1007/s00521-017-3287-y
22. Diab H (2018) An efficient chaotic image cryptosystem based on simultaneous permutation and diffusion operations. IEEE Access 6:42227–42244
23. Liu L, Zhang Y, Wang X (2018) A novel method for constructing the S-box based on spatiotemporal chaotic dynamics. Appl Sci 8(12):2650. https://doi.org/10.3390/app8122650
24. Gierlichs B, Batina L, Tuyls P, Preneel B (2008) Mutual information analysis. In: International workshop on cryptographic hardware and embedded systems. Springer, pp 426–442
25. Moradi A, Schneider T (2016) Improved side-channel analysis attacks on Xilinx bitstream encryption of 5, 6, and 7 series. In: International workshop on constructive side-channel analysis and secure design. Springer, pp 71–87
26. Batina L, Gierlichs B, Prouff E, Rivain M, Standaert F-X, Veyrat-Charvillon N (2011) Mutual information analysis: a comprehensive study. J Cryptol 24(2):269–291
27. Zhang Y, Wang X (2014) Analysis and improvement of a chaos-based symmetric image encryption scheme using a bit-level permutation. Nonlinear Dyn 77(3):687–698
28. Tang G, Liao X, Chen Y (2005) A novel method for designing S-boxes based on chaotic maps. Chaos Solitons Fractals 23:413–419
29. Jamal S, Khan M, Shah T (2016) A watermarking technique with chaotic fractional S-box transformation. Wirel Pers Commun 90(4):2033–2049
30. https://www.nist.gov/fusion-search?s=side+channel+attack+

A Statistical Review on Covid-19 Pandemic and Outbreak

Sowbhagya Hepsiba Kanaparthi and M. Swapna

Vardhaman College of Engineering, R.R. District, Kacharam, Shamshabad, India

Abstract. Throughout history, various pandemics have invaded the human race, reducing populations by half or more; a brief study of past pandemics holds this statement true. A similar situation is being experienced in the present day: a war against an invisible enemy, the novel COVID-19 coronavirus. A statistical study of the current outbreak helps to analyze the impact of the pandemic and take the required measures, and is preserved for future use in case of a similar situation. The world data on COVID-19 is compared and studied graphically from the very start of the pandemic till date. Keywords: COVID-19 · SARS-CoV · MERS-CoV · H2N2

1 Introduction

Viruses are microscopic parasites, usually much smaller than bacteria. Like every other parasite, viruses cannot multiply outside a host body. Viruses can infect animals, plants and microorganisms such as bacteria and archaea. Viral diseases or infections usually occur when virus particles enter cells of an organism that are vulnerable enough for the virus to reside and reproduce in [1].

Coronavirus family: Coronaviruses form the subfamily Orthocoronavirinae in the family Coronaviridae, within the order Nidovirales; they are enveloped, single-stranded, positive-sense RNA viruses. The coronavirus family contains pathogens of many animal species and of humans, including the recently isolated SARS-CoV. Almost 10 years after SARS wreaked havoc, in 2012, MERS-CoV began causing illness mainly in the Arabian Peninsula and was later found to have infected dromedary camels; it evolved from a different bat coronavirus. MERS had a relatively higher death rate than SARS but did not spread as widely [2].

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_11


1.1 The Novel Coronavirus or COVID-19

On February 11, 2020, the WHO named the novel-coronavirus-induced pneumonia coronavirus disease 2019 (COVID-19). The virus has escalated in scale since it first appeared in December 2019 in Wuhan, China. On the same day, the novel coronavirus was named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) by the International Virus Classification Commission. COVID-19 is not the first grievous respiratory disease outbreak caused by a coronavirus: coronaviruses have caused three epidemic diseases in the past two decades, namely COVID-19, SARS and MERS. Many countries around the world have reported cases of COVID-19. According to the data up to March 1, 2020, the number of confirmed cases in China reached over 79 thousand, of which 2,873 had died and 41,681 had been cured; the number of confirmed cases in other countries reached 7,041, of which 105 were declared dead and 459 had recovered. The World Health Organization (WHO) declared COVID-19 a Public Health Emergency of International Concern (PHEIC) on January 31, 2020, i.e. a risk to the whole world for which coordinated international action is needed. Based on the research being carried out on COVID-19 and the previous studies on SARS-CoV and MERS-CoV, a vaccine or medicine is yet to be finalized, as the newly found vaccines or medicines are not effective enough [3].

1.2 Formation of COVID-19

Coronaviruses are enveloped viruses with a positive-sense single-stranded RNA genome (26–32 kb). Four coronavirus genera (α, β, γ, δ) have been identified; human coronaviruses (HCoVs) are detected within the α genus (HCoV-229E and NL63) and the β genus (MERS-CoV, SARS-CoV, HCoV-OC43 and HCoV-HKU1). In December 2019, patients with symptoms such as cough, fever and dyspnea with acute respiratory distress syndrome (ARDS), caused by an unidentified microbial infection, were reported in Wuhan, China.
Virus genome sequencing of 5 pneumonia patients hospitalized from December 18 to December 29, 2019 led to the discovery of an altogether new β-CoV strain. The newly isolated novel β-CoV is 88% identical to the sequences of two bat-derived SARS-like coronaviruses, bat-SL-CoVZC45 and bat-SL-CoVZXC21, and about 50% identical to the sequence of MERS-CoV. The worldwide Virus Classification Commission then renamed the novel β-CoV "SARS-CoV-2". The phylogenetic tree of complete genome sequences of SARS-like coronaviruses is clearly depicted in Fig. 1. Several researchers and scientists in China discovered that SARS-CoV-2 necessarily needs ACE-2 (angiotensin-converting enzyme 2) as a receptor to enter cells, just like SARS-CoV. The most significant factor for viral infections and diseases is the attachment of the virus to the receptors of the host cells; SARS-CoV plausibly emerged from bats and spread via ACE-2 in other species and then to humans. DPP4 (dipeptidyl peptidase 4, also referred to as CD26) was detected as a functional receptor for MERS-CoV: owing to receptor binding, the S1 domain of the MERS-CoV spike protein was co-purified with DPP4 explicitly from lysates of susceptible Huh-7 cells. MERS-CoV can bind DPP4 from different species, which promotes the


Fig. 1 Phylogenetic tree structure—SARS-likely corona viruses


transmission to humans and other species, and it infected cells from a large number of species [4].

1.3 Symptoms

Symptoms of COVID-19 become visible two to fourteen days after direct or indirect exposure to the virus. Fever, sore throat, and shortness of breath or difficulty breathing are the symptoms visible in a person infected by the virus; other symptoms may be tiredness, aches and a runny nose. Some people show no signs of symptoms at all. The symptoms of COVID-19 can range in severity from very mild to severe. The elderly and those suffering from lung disease, diabetes, heart disease or a weak immune system have a higher chance of being affected by the virus [5].

1.4 Treatment for the COVID-19 Outbreak

As with SARS-CoV and MERS-CoV earlier, no vaccine is currently available for SARS-CoV-2 infection. The treatments health care departments use for the affected include oxygen therapy, and a wide range of antibiotics are used to cover secondary bacterial infections. Plasma therapy has been effective in some cases and showed satisfactory results, but it is not considered an ideal way to treat COVID-19 worldwide. New medicines or vaccines being tested should strictly follow the specifications set by the health organization of each respective country and should pass all the previous phases of testing [6].

2 Literature Survey

Sixth cholera pandemic (1899–1923): Cholera is a waterborne disease caused by the bacterium Vibrio cholerae. India was the first to experience an outbreak of the disease; later the Middle East, North Africa, Eastern Europe and Russia observed similar outbreaks. An estimated 800,000 people were killed by this pandemic. Someone in close contact with the infected may experience symptoms such as extreme dehydration, diarrhea and vomiting [7].

Hong Kong flu: The global outbreak of this influenza virus first started in China, the origin of the flu. The pandemic is suspected to have evolved from the virus's H3N2 subtype, the cause of a previous influenza outbreak. One million people were killed by this virus. The factor that facilitated its rapid spread around the globe is that the virus was highly contagious: five hundred thousand cases were reported within two weeks in Hong Kong, and the virus later escalated rapidly throughout Southeast Asia. The infection caused respiratory symptoms typical of influenza, such as muscle pain, chills, fever and weakness; people infected with this virus suffered from these symptoms for around a week. Infants and the elderly formed the most highly susceptible group and suffered the highest levels of mortality. A vaccine was developed and became available only after the pandemic had peaked in many countries [8].


Asian flu (1957): The avian influenza (Asian flu) outbreak of the late 1950s spread widely and later became non-existent after a vaccine was introduced. Two million people died from this pandemic at its peak. A virus of influenza A subtype H2N2 is recorded as the cause of this widespread outbreak. Research has indicated that the virus was a reassortment, mixing genetic material from strains of human and avian influenza viruses into new combinations. The virus spread throughout China and its surrounding regions in the initial months of the 1957 flu pandemic. Some infected individuals experienced only minor symptoms, such as mild fever and cough, while others experienced fatal complications such as pneumonia. Protective antibodies are believed to have been present in the ones who were unaffected by the virus. The worldwide mortality and spread of the Asian flu epidemic was limited by the rapid development of a vaccine and the availability of antibiotics [9].

Spanish flu (1918): This flu was one of the deadliest outbreaks in history: the Spanish flu killed over 50,000,000 people and infected about 500,000,000. The widespread outbreak was caused by the H1N1 virus. It is reported that the spread of the disease was mainly driven by the excessive intake of patients and the poor hygiene of hospitals. Europe was the first to report cases of the Spanish flu; the United States and Asia reported cases of the virus before it escalated to other parts of the world. During the pandemic there were no vaccines or drugs to treat this fatal flu. People were ordered to cover their faces with masks; schools, other educational institutions, entertainment centers and other public places were shut down. The deceased were either buried or cremated. The flu mainly attacked the respiratory system.
This strain of virus was highly transmissible: when a person affected by the flu sneezes, talks or coughs, droplets of saliva are carried through the air, and when inhaled by anyone nearby the flu is transmitted. A person can also be infected by touching virus-contaminated objects and then touching the nose, eyes or mouth. As reported by the WHO, around five hundred thousand people worldwide are estimated to die from the illness during a typical flu season. Up to 40% of the world's population was sickened by this deadly pandemic, which began in 1918 and killed an estimated 50,000,000 people [11, 18].

Ebola virus: The Republic of Sudan and the Democratic Republic of Congo were the first to observe an outbreak of Ebola, in 1976. In early 2014, an outbreak began in West Africa. Contact with blood, tissue or other body fluids from infected people or mammals spreads Ebola. As reported by the WHO, this outbreak was the most complex wave of the disease [12].

HIV: HIV may be the deadliest virus of the modern world; it is still the biggest killer. An estimated 32 million people have been killed by HIV since it was first recognized in the early 1980s. Thanks to powerful antiviral drugs, people infected with HIV can now live for years. Many underdeveloped and developing countries are still struggling because of the disease: 95% of new cases are reported in such countries. In the WHO African region nearly 1 in every 25 adults is reported HIV-positive, which accounts for nearly two-thirds of the people living with HIV in the world [15].


Smallpox: In 1980 the World Health Assembly declared that there were no new cases of smallpox. Prior to the development of the vaccine, about 1 in 3 of those who were infected died from the disease; lasting scars and decline in eyesight were its after-effects. A higher death rate was observed outside Europe: from research it is known that the virus was contained within Europe before European travellers carried it to other regions. Historians estimate that 90% of the native Americans died from smallpox brought in by travellers from Europe. A total of 300 million people were killed by this once-fatal disease in the twentieth century [16].

3 Statistical Analysis

Datasets from the WHO and Worldometer are taken for the analysis of the current COVID-19 status around the world. The dataset gives the following attributes: Country: countries affected by COVID-19; Cases, cumulative total: the total number of cases in each country; Cases, cumulative total per 1 million population: the total number of cases per one million population for each country; Cases newly reported in the last 7 days and in the last 24 h; Deaths, cumulative total: the total number of deaths reported to date; Deaths, cumulative total per 1 million population; Deaths newly reported in the last 7 days; Deaths newly reported in the last 24 h (Table 1). The datasets used are: https://covid19.who.int/table and https://www.worldometers.info/coronavirus/. Graphs and statistical analysis: https://github.com/HepsibaKanaparth/COVID-19Statistical-analysis.
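The country rankings behind the analysis can be sketched with pandas; the frame below carries only the figures quoted in Sect. 4 of this paper, and entries not stated in the text are deliberately left as None rather than invented.

```python
import pandas as pd

# Minimal sketch of the top-N ranking behind the case/death figures,
# using only values quoted in Sect. 4; unquoted values are left as None.
df = pd.DataFrame({
    "Country": ["United States of America", "India", "Brazil"],
    "Cases_total": [16_245_376, 9_932_547, None],
    "Deaths_total": [298_594, None, 181_835],
})

top_cases = df.nlargest(2, "Cases_total")    # USA first, India second
top_deaths = df.nlargest(2, "Deaths_total")  # USA first, Brazil second
```

`nlargest` silently drops the None/NaN rows, which mirrors how a country with no reported value cannot appear in a ranking.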

4 Results and Discussion

From the statistical analysis of the above-mentioned dataset, the following results are observed. The United States of America is at the top of the list with the most reported cases in the world, i.e. 16,245,376; it also stands first for the total number of deaths recorded, with 298,594 (Figs. 2 and 3). India takes second place for the total number of cases recorded, i.e. 9,932,547 (Fig. 2), whereas Brazil takes second place for the most deaths recorded, with 181,835 (Fig. 3). The USA has the most cases reported in the past week and the past 24 h, with 1,489,380 and 204,281 respectively, and also the most deaths reported in the past week (17,152) and the past 24 h (1,754) (Fig. 7). After the USA, Italy has the most deaths reported in both the past week and the past 24 h: 4,617 and 846 respectively (Figs. 6 and 7). From studying the dataset it can be said that most transmissions occurred through community transmission (Fig. 12). Afghanistan has the least numbers in all fields of the analyzed dataset, with zero cases and deaths reported. Surprisingly, China, the origin of COVID-19, is listed at 74th place in the world dataset, with only 779 recently reported cases and 11 reported deaths (Figs. 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12).

Table 1 Statistical analysis on covid-19 outbreak

S.No | Attributes used for assessment | Metrics/parameters observed
1  | No of cases registered over a period of time | Min and max no of cases registered
2  | No of cases registered in effected countries | Total no of cases
3  | No of cases registered per 1 million population | Total no of cases registered
4  | No of cases registered in the past week | Total no of cases registered in the past week
5  | No of cases recorded in the previous day | Total no of cases recorded in the past 24 h
6  | No of deaths recorded in effected countries | Total no of deaths
7  | No of deaths registered per 1 million population | Total no of deaths registered
8  | No of deaths registered in the past week | Total no of deaths registered in the past week
9  | No of deaths registered in the past 24 h | Total no of deaths registered in the past 24 h
10 | No of cases and deaths registered in the top 15 effected countries | Total no of cases versus total no of deaths for the top 15 effected countries
11 | No of cases and deaths registered in the past week | Total no of cases versus deaths reported in the recent week for the top 15 effected countries
12 | Ratio of cases registered to deaths in the past 24 h | Total no of cases versus deaths reported in the past 24 h for the top 15 effected countries
13 | Transmission type in effected countries | Rate of different types of transmission
14 | No of critical cases | Total no of critical cases in the effected countries


Fig. 2 Pictorial Representation of Deaths in top 15 effected countries

Fig. 3 Pictorial Representation of total no cases registered in top 15 countries

Fig. 4 Total no of cases per 1 M population



Fig. 5 Total no of Deaths

Fig. 6 New cases registered in the past week for top 15 countries

Fig. 7 Reported deaths in past week for top 15 countries


Fig. 8 Total Cases versus Total deaths in top 15 countries

Fig. 9 Cases reported in past 7 days versus Deaths reported in the past week

Fig. 10 Cases reported in 24 h versus Deaths reported 24 h in top 15 countries



Fig. 11 Total cases and deaths in top 15 countries

Fig. 12 Rates of different types of transmissions

References
1. Lwoff A (1957) The concept of virus. Microbiology 17(2):239–253
2. Yang L, Liu S, Liu J, Zhang Z, Wan X, Huang B, Chen Y, Zhang Y (2020) COVID-19: immunopathogenesis and immunotherapeutics. Signal Transduct Target Ther 5(1):1–8
3. Guan WJ, Ni ZY, Hu Y, Liang WH, Ou CQ, He JX, Liu L, Shan H, Lei CL, Hui DS, Du B (2020) Clinical characteristics of coronavirus disease 2019 in China. N Engl J Med 382(18):1708–1720
4. Feng W, Newbigging AM, Le C, Pang B, Peng H, Cao Y, Wu J, Abbas G, Song J, Wang DB, Cui M (2020) Molecular diagnosis of COVID-19: challenges and research needs. Anal Chem 92(15):10196–10209


5. Saire JEC, Navarro RC (2020) What is the people posting about symptoms related to Coronavirus in Bogota, Colombia? arXiv:2003.11159
6. Cunningham AC, Goh HP, Koh D (2020) Treatment of COVID-19: old tricks for new challenges
7. Dorman MJ, Kane L, Domman D, Turnbull JD, Cormie C, Fazal MA, Goulding DA, Russell JE, Alexander S, Thomson NR (2019) The history, genome and biology of NCTC 30: a non-pandemic Vibrio cholerae isolate from World War One. Proc R Soc B 286(1900):20182025
8. Peckham R (2020) Viral surveillance and the 1968 Hong Kong flu pandemic. J Glob Hist 15(3):444–458
9. Gagnon A, Acosta E, Hallman S, Bourbeau R, Dillon LY, Ouellette N, Earn DJ, Herring DA, Inwood K, Madrenas J, Miller MS (2018) Pandemic paradox: early life H2N2 pandemic influenza infection enhanced susceptibility to death during the 2009 H1N1 pandemic. MBio 9(1)
10. Centers for Disease Control and Prevention (CDC) (2006) The global HIV/AIDS pandemic, 2006. MMWR Morb Mortal Wkly Rep 55(31):841
11. Watanabe T, Kawaoka Y (2011) Pathogenesis of the 1918 pandemic influenza virus. PLoS Pathog 7(1):e1001218
12. Pigott DM, Deshpande A, Letourneau I, Morozoff C, Reiner RC Jr, Kraemer MU, Brent SE, Bogoch II, Khan K, Biehl MH, Burstein R (2017) Local, national, and regional viral haemorrhagic fever pandemic potential in Africa: a multistage analysis. The Lancet 390(10113):2662–2672
13. Meyers L, Frawley T, Goss S, Kang C (2015) Ebola virus outbreak 2014: clinical review for emergency physicians. Ann Emerg Med 65(1):101–108
14. John TJ (1997) An ethical dilemma in rabies immunisation. Vaccine 15:S12–S15
15. Willis NJ (1997) Edward Jenner and the eradication of smallpox. Scott Med J 42(4):118–121
16. Duchin JS, Koster FT, Peters CJ, Simpson GL, Tempest B, Zaki SR, Ksiazek TG, Rollin PE, Nichol S, Umland ET, Moolenaar RL (1994) Hantavirus pulmonary syndrome: a clinical description of 17 patients with a newly recognized disease. New Engl J Med 330(14):949–955
17. Johnson NP, Mueller J (2002) Updating the accounts: global mortality of the 1918–1920 "Spanish" influenza pandemic. Bull Hist Med 105–115

Performance Evaluation of Secure Web Usage Mining Technique to Predict Consumer Behaviour (SWUM-PCB)

Sonia Sharma and Dalip

Department of MMICT&BM, Maharishi Markandeshwar Deemed to be University, Mullana (Ambala), Ambala, Haryana, India
[email protected]

Abstract. In the changed scenario of e-commerce, not only the number of visitors but also the frequency of their visits and the information they browse is increasing day by day. Investigating clients' conduct is therefore a significant piece of web architecture, helpful for decision making and for planning sites according to clients' needs. To understand and handle consumers' web behaviour, web usage mining (WUM) is used, which includes data preprocessing and pattern discovery. For pattern discovery, the Apriori algorithm is selected as the standard of the study: it is not just the first association rule mining procedure to be utilized but also the most well-known one. After threadbare analysis, it came to the fore that the conventional Apriori algorithm suffers from two serious flaws. Firstly, it keeps scanning the database again and again, thereby generating an unnecessarily enormous number of rows. Secondly, it lacks a foolproof system for the security of data. The proposed algorithm (SWUM-PCB), which has been evaluated on genuine website data by utilizing a hash table, hashing encryption (Message Digest algorithm MD5) and association rule mining, has a marked edge over the conventional Apriori algorithm: not only are its effectiveness and security of information unquestionable, but its accuracy in predicting the consumer's future conduct is found to be unparalleled. Experimental results for the performance evaluation of the proposed approach present better results on real weblogs (raw data) of a website, with less execution time, lower memory usage, fewer rows, lower computation cost and less searching time to predict the result.

Keywords: Consumer behaviour · Performance evaluation · Weblogs · Security

1 Introduction In the present day, the World Wide Web has proved a benchmark for consumers all over the world for accessing meaningful, purposeful information of interest, and it provides all types of services such as online education, travel guides, e-government, etc. It also provides various sources of data for mining. Web mining [1–3] acts as an umbrella for extracting useful information and discovering knowledge from the web. It consists of three methods: web structure mining (WSM), web content mining (WCM) and © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_12


web usage mining (WUM) [4–7]. Predicting and identifying user behaviour is an important activity in web usage mining, also known as weblog mining; this method is mainly used by organizations to predict consumer browsing behaviour [8–11] and for website personalization. Consumer behaviour is analyzed from patterns generated from the weblogs [12, 13]. Association rule mining is used to uncover relationships among data: it identifies the exact associations between items, searching the data according to a support count. In web usage mining, an association rule means a set of website pages accessed together with a minimum support count [14]. For association rule mining, the Apriori algorithm is the one most widely used by organizations. After thorough scrutiny, it came to the fore that the Apriori algorithm suffers from two grim problems. Firstly, it keeps scanning the database again and again, thereby generating an unnecessarily enormous number of rows. Secondly, it has no concept of data security. In the field of web usage mining, researchers have applied many algorithms and checked their performance against the existing (Apriori) algorithm, but only on freely available data or small itemsets, for example the mushroom/pumsb datasets, and some apply security but only for secure data transmission [15]. The proposed algorithm (SWUM-PCB), which has been applied to genuine data of the website www.viralsach.xyz by utilizing a hash table [16, 17], hashing encryption (Message Digest algorithm MD5), encoding and decoding, and the Apriori property with anti-monotonicity, has a marked edge over the conventional Apriori algorithm. The proposed algorithm not only gives assurance of data security [18–20] but also predicts consumer behaviour in an efficient way.
The Message Digest (MD5) algorithm [21, 22] is utilized to improve the quality of the whole framework through the generation of keys for encoding and decoding the secured data; the actual data is thereby secured more efficiently, maintaining the integrity of the data and satisfying the properties of confidentiality, non-repudiation and authorization. In this paper, Sect. 2 reviews previous work by researchers in the arena of web usage mining and frequent pattern generation. Section 3 describes the existing Apriori algorithm of data mining, which is used to find associations between data. Sections 4 and 5 present the proposed architecture and algorithm (SWUM-PCB), designed to predict consumer behaviour on real data (weblogs of a website). Section 6 presents the experimental results and the performance evaluation of the proposed algorithm against the existing algorithm.
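As an illustration of the role MD5 plays here, Python's standard hashlib module can produce the fixed-length digest used for encoding page records and for later integrity checks. This is a generic sketch, not the authors' exact code; the sample input string is hypothetical.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Return the 32-character hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# A fixed input always yields the same digest, which is what allows
# encoded records in a resultant file to be matched against the main file.
digest = md5_hex("192.168.0.1,/index.html")
print(len(digest))  # 32 hex characters
```

Because the digest is deterministic, two encoded copies of the same record can be compared without decoding, which is the integrity property the proposed framework relies on.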

2 Literature Review The discovery of hidden patterns is the most important activity of web usage mining. Many researchers have worked in this field and applied numerous techniques to processed data for pattern discovery. In [23], the authors used a clustering technique to find user interest among website pages [24]. The authors of [15] deliberately present the usage of access patterns and show how the technology boom has changed the field of web usage mining. To find the best hit ratio, the authors of [25] proposed a fuzzy probabilistic algorithm and designed a two-level prediction model. In the same work [25], the authors introduced a new technique for horizontal and vertical segmentation of data and applied the Diffie-Hellman algorithm. The authors of [26] emphasized the technique of web usage mining and explained the benefits of data collected in the form of weblogs. A custom-built Apriori algorithm was proposed by the authors to find patterns that reveal consumer browsing behaviour.

1 The domain www.viralsach.xyz is not active at this time.

3 Apriori Algorithm This algorithm is known as an influential and classical algorithm [27, 28] for generating association rules in data mining. It is used for mining frequent itemsets. Apriori uses two pruning techniques: the first is based on the support count, and the second on the frequent-itemset (downward closure) property, i.e. if A is a frequent itemset then every non-empty subset of A is also frequent.
Steps of Algorithm
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do
    generate candidate set Ck+1 from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with count >= min_support;
return the union of all Lk;
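The steps above can be sketched as a small runnable Python implementation. This is a minimal illustration of the level-wise algorithm, not the paper's code; candidate generation is done in the simple join-and-prune form.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal level-wise Apriori: returns {frozenset itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep those meeting min_support
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation with downward-closure pruning:
        # keep only k-sets whose every (k-1)-subset is frequent
        items = sorted({i for s in current for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))]
        # One database scan per level to count the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent
```

For example, on the sessions [{"A","B"}, {"A","B","C"}, {"A","C"}, {"B","C"}] with min_support = 2, every single page and every pair is frequent, but the triple {A, B, C} occurs only once and is pruned.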

4 Proposed Architecture The proposed architecture is shown in Fig. 1. The proposed work is divided into four main steps. A lot of raw data is stored on the web server. First, the data is collected and converted into .csv format for the study. Then preprocessing of the data is done; once processed data is obtained, the proposed secure pattern discovery algorithm is applied to generate patterns. In the last step, analysis of the patterns is done for the prediction of consumer behaviour.
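The collection and preprocessing steps could look like the following sketch. It assumes the server logs follow the Common Log Format; the real field layout of the site's logs may differ, and the function names and filter rules are illustrative only.

```python
import csv
import re

# Assumed Common Log Format: ip, ident, user, [timestamp], "METHOD url proto", status
LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3})')

STATIC = (".css", ".js", ".png", ".jpg", ".gif", ".ico")

def clean_logs(raw_lines):
    """Data cleaning: keep successful page requests, drop static resources."""
    rows = []
    for line in raw_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip malformed entries
        ip, timestamp, method, url, status = m.groups()
        if status != "200" or url.lower().endswith(STATIC):
            continue  # errors and images/scripts carry no browsing intent
        rows.append((ip, timestamp, url))
    return rows

def write_csv(rows, path):
    """Store the cleaned records in .csv format for later pattern discovery."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ip", "timestamp", "url"])
        writer.writerows(rows)
```

Sessions for the pattern-discovery step can then be formed by grouping the cleaned rows by IP address and time window.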

Fig. 1 Proposed framework

5 Proposed Algorithm This study proposes a Secure Web Usage Mining technique (SWUM) for the prediction of consumer behaviour. In the SWUM algorithm, a hash-based approach, the Message Digest algorithm (MD5) and association rule mining are used together. The hash-based approach decreases the size of the candidate set by filtering out any k-itemset whose corresponding hash count falls below the threshold, so that instead of scanning the whole database only frequently occurring items are scanned. The use of MD5 secures the data (data integrity) and also compresses it; the encoding and decoding of the data in the proposed work improve execution time, space utilization and the number of rows generated for analysis.
Steps of Algorithm
Step I: Read the raw data
Step II: Apply data cleaning and session identification
Step III: Apply the secure pattern discovery algorithm
• Activate and initialize the variables and functions
• Read the log file stored in the dataset
• Apply hashing encryption (MD5) and encode the data
• Create the initial itemset from a hash table
• Apply association rule mining
• Set the support count of the itemset
• Select itemsets according to the minimum support count
• Create a result containing itemsets and their support counts
• Decode all data
• Check all matching data against the main file; if it matches, create a resultant file
Step IV: Pattern analysis
Step V: Prediction of consumer behaviour
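The hash-based filtering in Step III can be sketched as follows. This is a simplified illustration in the style of hash-based candidate pruning, not the authors' exact implementation; the bucket count, function names and session format are assumptions.

```python
import hashlib
from itertools import combinations

def md5_code(page: str) -> str:
    """Encode a page identifier as its MD5 digest (integrity-checkable)."""
    return hashlib.md5(page.encode("utf-8")).hexdigest()

def hash_filtered_pairs(sessions, min_support, n_buckets=101):
    """Hash-based pruning of candidate 2-itemsets: pairs are first counted
    into hash-table buckets, and only pairs whose bucket total reaches the
    threshold are counted exactly on the second pass."""
    def bucket_of(pair):
        return hash(frozenset(pair)) % n_buckets

    # Pass 1: cheap bucket counts instead of per-candidate counts
    buckets = [0] * n_buckets
    for s in sessions:
        for pair in combinations(sorted(s), 2):
            buckets[bucket_of(pair)] += 1

    # Pass 2: exact counts only for pairs that survive the hash filter
    counts = {}
    for s in sessions:
        for pair in combinations(sorted(s), 2):
            if buckets[bucket_of(pair)] >= min_support:
                counts[pair] = counts.get(pair, 0) + 1
    return {p: c for p, c in counts.items() if c >= min_support}
```

In the full technique the session entries would hold MD5-coded page identifiers (md5_code applied per page), so that the resultant file can later be matched against the main file by digest before decoding.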


6 Experimental Results and Discussion The proposed work is implemented on the real raw data (weblogs) of the website www.viralsach.xyz. The description of the raw data collected, and its division after preprocessing into a number of data sets, is shown in Table 1. For the purpose of performance evaluation, both the proposed (SWUM-PCB) algorithm and basic Apriori have been run on the same platform. Programming was done in Python (version 3.0 or later) using an IDE such as IDLE or Spyder (preferred).

Table 1 Description of data for the experiment

S.No | Description              | Size (rows)
1    | Raw data collected       | 33,172
2    | Data after preprocessing | 18,216
3    | Data set-I               | 18,216
4    | Data set-II              | 10,567
5    | Data set-III             | 6,567
6    | Data set-IV              | 2,567

The experimental runs were conducted on the different data sets, i.e. data sets I, II, III and IV (shown in Table 1), at various minimum support counts, taking into consideration parameters such as data security (data integrity, confidentiality, non-repudiation, authorization, availability, data freshness), execution time, memory usage, and prediction of consumer behaviour in terms of the number of rows generated. The results obtained are very large, so this paper shows only some of the results, for two differently sized datasets. It was found that the proposed algorithm always takes less time than basic Apriori, so the interesting information can be mined in a shorter time. The differences in execution time for data set-I and data set-II are shown in Figs. 3 and 2 respectively. As Fig. 3 shows, the time taken by the Apriori algorithm is very high compared to the proposed algorithm: at minimum support level 50, Apriori takes 3998.453 s whereas the proposed algorithm takes only 52.29 s, and at minimum support 80, SWUM-PCB takes only 20 s, which makes the proposed method an optimal solution. SWUM-PCB also takes less execution time for data set-II, consisting of 10,567 rows, as shown in Fig. 2: it takes 10.19 s at minimum support count 50, while the Apriori algorithm takes 558.39 s. Figures 4 and 5 depict the difference in memory usage of the two algorithms for data sets III and IV. The hash table and MD5, which compress the data while maintaining its security, increase the efficiency of the proposed system, giving results in less time and thus also with less memory usage. To check memory usage, a memory-measurement function was applied to both algorithms, which showed less processing time and memory usage for the proposed one. Figures 6 and 7 show the difference in time and in the number of rows generated for the different data sets at the best minimum support count. If the minimum support threshold is set


Fig. 2 Difference in execution time for data set-II (time in seconds)

Minimum support count          | 30      | 40     | 45      | 50
Existing algorithm (Apriori)   | 973.197 | 946.99 | 844.268 | 558.3929
Proposed algorithm (SWUM-PCB)  | 34.15   | 34.07  | 13.19   | 10.19

Fig. 3 Difference in execution time for data set-I (time in seconds)

Minimum support count          | 50       | 60       | 70        | 80
Existing algorithm (Apriori)   | 3998.453 | 3850.514 | 2052.0989 | 1987.396
Proposed algorithm (SWUM-PCB)  | 52.29    | 25.79    | 24        | 20

too high, less data is covered and fewer associations can be formed; if it is set too low, more data is covered and many associations can be formed. As the number of records increases, the Apriori algorithm takes longer to generate frequent itemsets compared with the proposed method. After selecting the best minimum support count, the differences in time, memory and number of rows generated for the prediction of consumer behaviour were obtained. Prediction of consumer behaviour and generation of frequent patterns by the proposed algorithm (SWUM-PCB) is very efficient and gives a high

Fig. 4 Memory usage for data set-III (memory usage in KB)

Minimum support count          | 10    | 20   | 30
Existing algorithm (Apriori)   | 50.19 | 39.8 | 34.8
Proposed algorithm (SWUM-PCB)  | 34.5  | 24.7 | 24.7

Fig. 5 Memory usage for data set-IV (memory usage in KB)

Minimum support count          | 40   | 35   | 30   | 25
Existing algorithm (Apriori)   | 79.6 | 81.2 | 94.8 | 94.8
Proposed algorithm (SWUM-PCB)  | 28.8 | 33.2 | 33.2 | 34.9

searching accuracy rate in less time when finding the associations for a particular IP address. The exact association between a consumer's IP address and the specific pages visited is obtained in a better way. A comparison between the two algorithms is shown in Table 2.

7 Conclusion Web usage mining is an essential tool for website-based organizations. The conventional Apriori algorithm is used by many researchers, but it takes more execution time and more memory. This paper presents a methodology that provides results in less execution time and with less memory consumption, owing to the utilization of MD5 and a hash-based approach with an encoding scheme


Fig. 6 Difference in execution time at the best minimum support count for the four data sizes (18,216; 10,567; 6,567; 2,567 rows): existing algorithm (Apriori) vs. proposed algorithm (SWUM-PCB)

Fig. 7 Difference in the number of resultant rows generated for the four data sizes (18,216; 10,567; 6,567; 2,567 rows): existing algorithm (Apriori) vs. proposed algorithm (SWUM-PCB)

together. The proposed algorithm (SWUM-PCB), a secure web usage mining technique to predict consumer behaviour, produces secure, efficient, accurate and fast results in comparison with the existing algorithm. The existing algorithm relies on repeated mathematical scans of the database, so it is less robust. In the proposed algorithm, there is no need to maintain a separate database or to mine the data separately. As technology changes day by day, the proposed system offers high robustness and can cope with the advancement of technology.


Table 2 Comparison between the existing algorithm and the proposed algorithm (SWUM-PCB)

Parameters                            | Existing Apriori algorithm                | Proposed algorithm (SWUM-PCB)
Methodology                           | Association rule mining                   | Hash table, Message Digest algorithm (MD5), association rule mining
Memory consumption                    | More space                                | Less space
Size of itemsets                      | Not useful for large itemsets             | Useful for a large number of itemsets
Time                                  | More time to scan the database            | Less time
Number of iterations                  | More iterations                           | Fewer iterations
Database scans                        | Does not reduce unnecessary database scans| Reduces unnecessary database scans
Data security                         | No                                        | Yes (data integrity), providing encoding and decoding
Overhead                              | More                                      | Less
Searching time for predicting results | More time                                 | Less time
Applications                          | Market basket analysis                    | Web intelligence, predicting consumer browsing behaviour, market basket analysis, decision making
Computation cost                      | More                                      | Less
Resultant data                        | More                                      | Less
Efficiency                            | Less                                      | More

References
1. Velasquez JD, Jain LC (2010) Advanced techniques in web intelligence. Springer, pp 143–165
2. Dujovne LE, Velasquez JD (2009) Design and implementation of a methodology for identifying website key objects. In: Knowledge-based and intelligent information and engineering systems, vol 57, no 11. Springer, pp 301–308
3. Neelima G, Rooda S (2015) An overview on web usage mining. In: Advances in intelligent systems and computing, vol 338. Springer, Cham. https://doi.org/10.1007/978-3-319-13731-5_70
4. Upadhyay A, Purswani B (2013) Web usage mining has pattern discovery. Int J Sci Res Publ 3(2). ISSN 2250-3153
5. Gauch S (2007) User profiles for personalized information access. In: The adaptive web. Springer, pp 54–89
6. Jespersean SE, Throhauge J, Bach T (2002) A hybrid approach to web usage mining. In: Data warehousing and knowledge discovery. Springer, pp 73–82
7. Thomson L (2005) A standard framework for web personalization
8. Sharma S, Dalip (2019) Comparative analysis of various tools to predict consumer behavior. J Comput Theor Nanosci 16:3860–3866
9. Belch G, Belch M (2009) Advertising and promotion: an integrated marketing communications perspective. McGraw-Hill/Irwin, New York, p 775
10. Agwan AA (2014) E-commerce: a concept to digital marketing. Asia Pac J Mark Manag Rev 3(12):18–35
11. Keikha Z, Sadeq MO (2014) The e-readiness assessment pattern designing with an approach to e-commerce: a case study. Int J Eng Res 4(2):85–92
12. Umamaheswari S, Srivatsa SK (2014) Algorithm for tracing visitors' on-line behaviors for effective web usage mining. Int J Comput Appl 87(3). ISSN 0975-8887
13. Adhiya KP, Kolhe SR (2015) An efficient and novel approach for web search personalization using web usage mining. J Theor Appl Inf Technol
14. Shakya SG, Patidar G (2017) Improved Apriori algorithm for web log data. In: IDES joint international conferences on IPC and ARTEE 2017
15. NusratJabeen T, Chidambaram M (2018) Security and privacy concerned association rule mining technique for the accurate frequent pattern identification. Int J Eng Technol 7(1.1):19–24
16. Altameem A, Ykhlef M (2018) Hybrid approach for improving efficiency of Apriori algorithm on frequent itemset. IJCSNS Int J Comput Sci Netw Secur 18(5):151–156
17. Vanitha K, Santhi R (2011) Using hash-based Apriori algorithm to reduce the candidate 2-itemsets for mining association rule. J Global Res Comput Sci 2(5)
18. Sharma S, Dalip (2020) A novel secure web usage mining technique to predict consumer behavior. Int J Adv Sci Technol 29(5):5633–5640. ISSN 2005-4238
19. Log analysis for web attacks: a beginner's guide (2018)
20. Stallings W (2011) Network security essentials: applications and standards
21. Kanickam SHL, Jayasimman L (2019) Comparative analysis of hash authentication algorithms and ECC based security algorithms in cloud data. Asian J Comput Sci Technol 8(1):53–61. ISSN 2249-0701
22. https://en.wikipedia.org/wiki/MD5
23. Parekh M, Patel AS, Parmar SJ, Patel VR (2015) Web usage mining: frequent pattern generation using association rule mining and clustering. Int J Eng Res Technol 4(04):1243–1246. ISSN 2278-0181
24. Kannan PA, Ramaraj E (2013) Usage and research challenges in the area of frequent pattern in data mining. IOSR J Comput Eng 13(2):08–13. e-ISSN 2278-0661, p-ISSN 2278-8727
25. Anitha V, Isakki Devi P (2016) A survey on predicting user behavior based on web server log files in a web usage mining. In: International conference on computing technologies and intelligent data engineering (ICCTIDE'16), pp 1–4
26. Rawat SS, Rajamani L (2010) Discovering potential user browsing behaviour using custom built algorithm. Int J Comput Sci Inf Technol (IJCSIT) 2(4)
27. Kon YS, Rounteren N (2010) Rare association rule mining and knowledge discovery: technologies for frequent and critical event detection. Information Science Reference, Hershey, PA
28. Sun W, Pan M, Qiang Y (2011) Improved association rule mining method based on t statistical. Appl Res Comput 28(6):2073–2076

Quantum Computing and Machine Learning: In Future to Dominate Classical Machine Learning Methods with Enhanced Feature Space for Better Accuracy on Results Mukta Nivelkar(B) and S. G. Bhirud Veermata Jijabai Technological Institute, Mumbai, India

Abstract. Quantum computing is a new standard that will contribute computational efficiency to many operational methods of classical computing. It is motivated by the use of quantum mechanics, such as superposition and entanglement, to create a standard of computation far different from the classical computer. Understanding quantum computing requires understanding the qubit, or quantum bit, which differentiates quantum from classical computing: a classical bit can be in only a single state, zero (0) or one (1), at a given moment, while a qubit can be 0 and 1 at the same time, a condition called the superposition state. Quantum computers will use quantum superposition and quantum entanglement, the two basic laws of quantum physics. Computational tasks that are non-computable by a classical machine, i.e. heavy computations that require large-scale data processing, can be solved by a quantum computer. Machine learning on classical machines is very well established, but it has high computational requirements for complex and high-volume data processing. This paper surveys and proposes a model integrating quantum computation and machine learning, which makes sense of the quantum machine learning concept. Quantum machine learning helps to enhance the various classical binary machine learning methods for better analysis and prediction in big data and information processing. Keywords: Qubit · QML · Supervised ML · Bloch Sphere · Superposition · Entanglement

1 Introduction 1.1 Quantum Computing and Concepts Machine learning is well established on classical computers, and researchers are still working on the improvement of ML algorithms. The first section of this paper covers the quantum computing concept of the qubit and quantum mechanisms such as superposition and entanglement. The next section surveys and compares quantum machine learning and its need. Classical machine learning © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_13


has some computational difficulties arising from the operational methods of classical computers. Techniques proposed in the quantum computing domain will overcome these computational difficulties and represent data in a very high-dimensional feature space by making use of quantum mechanisms. Quantum mechanisms such as superposition and entanglement help to represent and process data in a higher-dimensional space, which explains how quantum machine learning can reproduce better results than classical ML methodologies. To understand the quantum concept, we should be brief on the following points: the quantum qubit, superposition and entanglement.

The benefits of quantum computers motivate the new standard by introducing the qubit, or quantum bit, the unit of quantum information. A single number, 0 or 1, represents the state of a bit on a classical computer, so a classical bit holds either state 0 or state 1 at any one moment, and classical binary data and information is a string of zeros and ones. An atom might represent a quantum bit, with electrons and photons used for the state formation of the qubit. A qubit is a powerful bit holding multiple states within it based on the superposition and entanglement principles: 1 qubit holds two different states, denoted 0 and 1, at the same moment; 2 qubits can hold 4 states (00, 01, 10 and 11) at the same moment; 3 qubits hold 8 states (000, 001, 010, 011, 100, 101, 110, 111) at the same time; and in general n qubits hold 2^n states. This quantum mechanism is the motivation behind the proposed quantum machine learning, offering benefits over the operational limits of classical computers. The amplitudes of the two states of a qubit are α and β; a two-qubit quantum computer thus requires four amplitudes.

On classical computers, data and information are in terms of classical 0s and 1s, and the concept that carries such information is called a "bit": ON as 1 and OFF as 0. A classical bit can be in either the 0 or the 1 state at any one moment. A quantum computer proposes a new mechanism, the quantum bit or "qubit", instead of the classical bit. A qubit makes use of the two states 0 and 1 to represent data and information, but whereas a bit holds a single state at a time, a qubit can take the states 0 and 1 simultaneously at any one moment, which is called superposition. In a similar way, two qubits in this state can provide the four values 00, 01, 10 and 11 all at one time, and 3 qubits can hold eight states (000, 001, …, 111). The Bloch sphere is a representation of a single qubit: it visualizes the state of one qubit, in which an electron spinning in the up position represents the activated state '1' and an electron spinning in the down position represents '0'. Figure 1 shows a 1-qubit Bloch sphere representation with real-valued amplitudes [1].

Superposition
An atom is a tiny chemical particle that changes the whole definition from the classical to the quantum mechanism. Data on a classical machine represented by a single atom takes states denoted |0> and |1>; a bit with such a composite state is known as a qubit, or quantum bit. A qubit implementation that utilizes the atom's energy levels is called the superposition mechanism, which clearly distinguishes classical and quantum machines. An electron spinning in the up direction is the excited state, represented as |1>, and the ground (null) state represents |0>. A single (1-) qubit is a


Fig. 1 Qubit state representation with real value as an amplitude

superposition of the qubit's two states together, 0 and 1, denoted by the addition of the state vectors of 0 and 1:

ψ = α|0> + β|1>   (1)

ψ is the superposition of the two states 0 and 1 for 1 qubit; the ψ state for multiple qubits is measured accordingly. Equations (1), (2) and (3) represent the 1-qubit, 2-qubit and 3-qubit cases. A 2-qubit state is represented by four basis states, as shown in Fig. 2:

ψ = α1|00> + α2|01> + α3|10> + α4|11>   (2)

Fig. 2 Qubit quantum superposition and entanglement

where the amplitudes α1, …, α4 (and α, β above) are complex numbers. An n-qubit state is a superposition of 2^n states at one moment of time, shown below for n = 3:

ψ = α1|000> + α2|001> + … + α7|110> + α8|111>   (3)
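The growth of the state space in Eqs. (1)-(3) can be illustrated with a short plain-Python sketch; the amplitude values chosen here (a uniform superposition) are hypothetical.

```python
from math import sqrt, isclose

# Eq. (1): a 1-qubit state psi = alpha|0> + beta|1> with |alpha|^2 + |beta|^2 = 1.
alpha = beta = 1 / sqrt(2)
psi_1 = [alpha, beta]                  # 2 amplitudes

# Eq. (3): an n-qubit register needs 2**n amplitudes; n = 3 gives the
# 8 basis states |000>, |001>, ..., |111>.
n = 3
psi_n = [1 / sqrt(2 ** n)] * (2 ** n)  # uniform superposition over 8 states

norm = sum(a * a for a in psi_n)       # normalization: must equal 1
```

Doubling the number of qubits squares the number of amplitudes (2^n growth), which is exactly the high-dimensional feature space the later sections appeal to.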


Entanglement
Quantum entanglement is one of the fundamental and strong quantum physics mechanisms of quantum systems, defining correlations between states within a superposition. Once qubits are entangled with each other they cannot be separated; after entanglement, multiple particles behave and react with each other according to the entangling rule that was applied. Figure 2 shows quantum entanglement for 2 qubits, which is a superposition of four states. Quantum entanglement is one of the principles of quantum physics, and it is genuinely complex to understand. Entanglement links multiple qubits together to exhibit superposition states in a higher dimension. Qubits are entangled with each other through the action of a laser that connects them; after entanglement they share an intermediate state of behaviour. The qubits can then be separated by any distance but will remain entangled with each other.

Quantum gates
Quantum gates are used to manipulate states and superpositions of qubits. Quantum computing uses logic gates, and these gates have different purposes: the Hadamard gate produces a superposition of 1 qubit, Pauli-X performs a rotation in the x-direction, and Pauli-Y rotates in the y-direction. Controlled gates work on 2 or more qubits, using one qubit to control the operation on another. Table 1 gives an overview of these quantum gates and the input provided to each [1].

Table 1 Quantum gates

Gate          | Input   | Classical gate | Description
Hadamard gate | 1 qubit | None           | Superposition
Pauli-X gate  | 1 qubit | NOT gate       | X-rotation
Pauli-Y gate  | 1 qubit | None           | Y-rotation
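The gates in Table 1 can be illustrated as small matrices acting on state vectors; the helper names (matvec, kron) are illustrative. The final line additionally shows a controlled gate producing the entangled Bell state described above.

```python
from math import sqrt, isclose

def matvec(m, v):
    """Apply a gate matrix m to a state vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def kron(a, b):
    """Kronecker product: combine two qubit state vectors into one register."""
    return [x * y for x in a for y in b]

H = [[1 / sqrt(2), 1 / sqrt(2)],
     [1 / sqrt(2), -1 / sqrt(2)]]       # Hadamard: creates superposition
X = [[0, 1], [1, 0]]                    # Pauli-X: the quantum NOT gate
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0],
        [0, 0, 0, 1], [0, 0, 1, 0]]     # flips the target if control is |1>

ket0 = [1, 0]                           # |0>
one = matvec(X, ket0)                   # Pauli-X: |0> -> |1>
plus = matvec(H, ket0)                  # (|0> + |1>) / sqrt(2)
bell = matvec(CNOT, kron(plus, ket0))   # (|00> + |11>) / sqrt(2): entangled
```

The Bell state's amplitudes are nonzero only on |00> and |11>, so measuring either qubit fixes the other, which is the correlation entanglement refers to.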

2 Quantum Enhanced ML Quantum computers will supercharge classical machine learning algorithms by performing tasks that exceed the operational ability of classical machines. Quantum machine learning is capable of working in higher-dimensional, even n-dimensional, feature spaces. Classical machine learning can plot data in multiple dimensions, but doing so takes more time and can fail against the operational power of the system. Quantum machine learning will perform complex analysis and prediction in terms of intricate pattern generation; generating complex, high-dimensional patterns is, to some extent, not possible on a classical machine. Quantum ML


will speed up the task of big data processing to achieve better accuracy than the traditional approach of ML. Proposing a quantum machine learning algorithm marks a new era of computing. A quantum computer can analyze vast amounts of data and predict results in much less time than traditional computers such as digital and high-performance computers, which still take longer to produce accurate analysis. The proposed quantum machine learning model will predict patterns in a very short time with excellent accuracy; quantum machine learning will help to generate and model systems that give faster results than a classical machine.

3 Quantum ML Research Domains
Quantum weather forecasting: Here dataset formation happens in real time, and the forecasting data requires real-time processing to predict results quickly. This application currently uses GPU processing to analyse rain, temperature and other climate-related data. Such real-time data is very large, so processing it is time-consuming; quantum computing in weather forecasting would overcome this computational difficulty.
Drug discovery: Quantum computing and machine learning will also make a very important contribution in this field.
Quantum satellite image processing: Image processing is a well-established classical application that also performs well on supercomputers, but certain image types, such as satellite images, are very large; quantum satellite image processing would be more time-efficient.
Quantum social network data analytics: Data gathered on social networking sites such as Facebook, Twitter and Instagram has vast volume. Not only text data analytics but also multimedia data analytics can be performed on a real-time basis; a quantum computer can take on processing this type of data.
Disaster management: Disaster data such as earthquakes, floods and landslides require ahead-of-time predictions, where quantum computers can do real good.
Cryptography: A lot of research has happened in cryptography in the past two decades; CPUs and GPUs are currently involved in, and have contributed to, various cryptography techniques [2].
3.1 Survey on Quantum Literature
Ablayev and Ablayev [3] in 2019 proposed quantum methods for ML and highlighted the available quantum tools in the initial part of their paper. Supervised ML methods in the classical model are described, and the last section of the paper describes classification and clustering algorithms. Quantum representations of ML standards are proposed on quantum tools such as the Grover search algorithm and its variants. ProjectQ, Qiskit, Rigetti and Quipper are the technologies covered in the literature survey. Decision trees and branching programs are considered as models of computation, as deterministic versions of machine learning classification algorithms. A classical-versus-quantum complexity analysis and comparison is shown in the last section of the paper.


Ablayev et al. [4] proposed in as continuous research on [3] on quantum ML tools. F. Ablayev and M. Ablayev surveys QML for classification and discusses the quantum nearest neighbor algorithm (NN) algorithms. Next section highlights the QNN algorithms gives quadratic time speed-up over classical algorithms. Quantum neural network is beneficial on time scale compare to classical NN. Literature review in this paper consist of two parts, quantum tools”, here it presents some fundamentals and several quantum tools based on Grover search quantum algorithm. Grover search is well known quantum search algorithm. Second section of paper discusses classification in ML that that can be enhanced with quantum technology. This paper enlightens supervised learning algorithms as classification to elaborate and speed up by the use of quantum tools. Yangyang et al. [5] Quantum information processing, Molecular science, and atomic physics will require robust control design and which is key task in quantum technology. In this paper, multiple samples and mixed-strategy DE (msMSDE) algorithm is proposed. This algorithm is also called as improved dimensional algorithm. Proposed problem will search fields for quantum control parameters. In this algorithm (msMSDE), is highlighting on mutation operations. Fitness evaluation method for mutation operation requires to use multiple samples and its mixed strategy for performance measures. Author wants to convey that the msMSDE algorithm is best fit for control problems. This work discussed the Optimal universal quantum machine’s performance. Proposed research is using quantum qudit which quantum partical on n-number of states. Qudit is more powerful than qubit. The state discriminating of a qudit is begins from a set of templates states collected. Training set which is used is classically available and requires some global information for processing data. Realistic modelisation approach is quite hard to make up. 
Complete information about a quantum state is not accessible because of the random nature of qubit states, although structural properties of quantum states are available. The scenario can be explained with quantum communication: the receiver is unaware of the transmitted state but has classical knowledge of the code used by the sender. A quantum learning machine must deal with quantum data and a quantum processor, whereas classical machine learning deals with classical data and a classical processor. Quantum reinforcement learning (QRL) concepts, tools, and methods are proposed in [6], which gives an idea of how reinforcement learning can perform better on a quantum platform. Daoyi Dong and Chunlin Chen propose QRL based on the concepts and theories of quantum computation, comparing it with classical reinforcement learning algorithms in terms of exploration and exploitation and low learning speed. The quantum superposition principle motivates the proposed updating algorithm: the state in traditional reinforcement learning is regarded as an eigenstate (eigenaction) in QRL. Quantum tools and techniques for machine learning and optimization are discussed in [7, 8], and further quantum algorithms and applications in [9]. The latter continues the research on the msMSDE algorithm, focusing on solving three classes of quantum robust control problems; mutation fitness is again evaluated using multiple samples. For problems such as open inhomogeneous quantum ensembles and the consensus problem of quantum networks with uncertainties, msMSDE shows excellent performance.


M. Nivelkar and S. G. Bhirud

The work also demonstrates a practical implementation of msMSDE on a femtosecond (fs) laser: a good TPA signal is generated using the proposed technique, which can thus manage and control CH2BrI fragmentation. A comprehensive comparative investigation of optimal molecular control is presented, comparing different computational algorithms and their variants. Nguyen et al. [11] propose a quantum-based implementation of artificial neural networks; such a QNN exponentially accelerates the QML algorithm. Shaikh and Ali [12] survey the literature on big data analytics in quantum technology and quantum machine learning and the current research in this field. The paper highlights various machine learning techniques according to the logic of their learning methods, then discusses quantum supervised and unsupervised machine learning, compared with the classical supervised and unsupervised techniques. It further discusses the disadvantages of existing machine learning techniques and tools, and models the benefits of quantum computing in big data analytics. Quantum machine learning faces great challenges in information processing and data science, and quantum computing is still restricted in implementation because of limited availability of and access to quantum hardware for running and simulating algorithms, and the unavailability of necessary tools; a lot of research is currently going on in this field. The proposed algorithm uses big data from the healthcare sector, where datasets come in various formats such as text, complex images, live sensor data, and live video streams. The photon is a basic unit of quantum architecture, and big data can be processed while accounting for heterogeneity in the data. When that paper was published, quantum machines were unavailable, so only theoretical concept modelling was done. Quantum ridge regression is proposed in [13]; the work shows how it achieves an exponential speedup over the classical ridge regression standard. The authors of [14] discuss various machine learning methods and techniques for evolving them with quantum tools, and also the available platforms on which simulation can be done. Imran et al. [15], with Shabir Ahmad, propose a descriptive data analysis approach, together with predictive analysis, exploring the collection and analysis of waste management data and the management of in-time waste information. The performance of the approach is tested on a waste dataset from Jeju Island, South Korea, where Quantum Geographic Information Systems (QGIS) software collects data on a geographical map by placing virtualized waste bins. The paper in [16] reviews advanced technologies including machine learning, quantum computing, and quantum machine learning, proposes their use in 6G communication networks, and discusses 6G services and open research challenges. Cao et al. [17] discuss the potential of quantum computing in drug discovery, while [18] covers quantum optics and quantum computing approaches in many-body systems. Manuscript [19] highlights quantum blockchain and its applications, and [20, 21] give a brief discussion of quantum architecture, hardware, and simulation methodologies.
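The ridge regression that [13] accelerates has, classically, a simple closed form. As a baseline for the quantum speed-up claim, a one-feature sketch (illustrative only, not the algorithm of [13]) is:

```python
def ridge_1d(xs, ys, alpha):
    """Closed-form ridge regression for one feature, no intercept:
    w = sum(x*y) / (sum(x^2) + alpha)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Data follows y = 2x exactly; a small alpha shrinks the
# coefficient slightly below 2 (the regularization effect).
w = ridge_1d([1, 2, 3, 4], [2, 4, 6, 8], alpha=0.1)
```

The general case solves the linear system (XᵀX + αI)w = Xᵀy, whose cost in the feature dimension is what the quantum algorithm of [13] attacks.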


Quantum support vector machines for big data classification are studied and proposed in [22], which highlights the research aspects of support vector machines for voluminous datasets and compares them with the classical support vector machine. Li et al. [23] report an experimental realization of the support vector machine in quantum space using quantum computing tools.

3.2 Comparison of Classical and Quantum Models

Quantum technology has evolved recently, and many areas of the classical domain can be regenerated with computational speed-up. Survey papers are digging into research in quantum technology and quantum machine learning. Ablayev et al. [3, 4] show how quantum technology works in supervised and unsupervised learning, and give mathematical models and tools for the methods. Most of the papers discuss quantum implementation on Qiskit, which requires the following: quantum tools, required for model formation; a mathematical model, which defines the quantum implementation of machine learning methods; and a descriptive model, proposed by a few papers for quantum methods. The overall discussion focuses on how quantum learning can be time-efficient and computationally better than classical ML models. Applications such as machine learning on molecular science data and networking standards are covered, and quantum computing has been proposed in these specific domains. This rigorous survey helps, to some extent, to find the research gap between quantum and classical machine learning standards.

3.3 Motivation of Quantum Computing in ML

Classical ML faces computational complexity on classical machines when complex patterns must be generated in reduced computational time. Motivations include:
• Generation of intricate patterns that classical computing cannot produce.
• Quantum classification and clustering for enhanced prediction on higher-dimensional data.
• Quantum efficiency and scalability on a more powerful computational standard.
• Data and information processing at greater computational speed.
• Quantum data analytics for real-time, faster result generation with accurate analysis and prediction for big data processing.

3.4 Platform Used for Proposed Implementation

Quantum hardware is available for simulation to all public users from various vendors. The IBM Quantum cloud experience gives good services and access to quantum hardware. RIGETTI quantum machines are available online and can be accessed from a classical machine. These services allow us to work on a quantum Platform-as-a-Service (PaaS). IBM Q, a branch of the IBM Quantum Platform, is a well-known platform giving quantum computer access on a cloud basis as PaaS; it is available free of cost, through the cloud, to the general public as well as to IBM clients.


In July 2020, IBM organized a Global Summer School for knowledge seekers to explore the practical implementation of quantum algorithms on its quantum machines. Any user can access the circuit composer, which allows one to design quantum circuits. IBM has made a 14-qubit machine available to the general public, while a 20-qubit machine is reserved for IBM Q clients. Development can take place through Qiskit, an open-source quantum programming framework whose packages are available and compatible with the Anaconda platform; it can be used both to write programs and to code against the circuit composer [24].
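As an illustration of the kind of circuit one builds in the circuit composer, the Bell-state circuit (a Hadamard followed by a CNOT) can be checked with plain statevector arithmetic. Qiskit's `QuantumCircuit` constructs the same circuit; the sketch below avoids that dependency and is illustrative only.

```python
import math

# Two-qubit statevector, basis order |00>, |01>, |10>, |11>.
state = [1.0, 0.0, 0.0, 0.0]

def hadamard_on_q0(s):
    """Apply H to the first qubit (the left bit of |q0 q1>)."""
    h = 1 / math.sqrt(2)
    return [h * (s[0] + s[2]), h * (s[1] + s[3]),
            h * (s[0] - s[2]), h * (s[1] - s[3])]

def cnot_q0_q1(s):
    """CNOT with q0 as control: swap the |10> and |11> amplitudes."""
    return [s[0], s[1], s[3], s[2]]

# H then CNOT on |00> yields the Bell state (|00> + |11>) / sqrt(2).
bell = cnot_q0_q1(hadamard_on_q0(state))
```

Measuring this state gives 00 or 11 with equal probability and never 01 or 10, which is the standard first experiment run on the IBM machines.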

4 Dataset Selection and Implementation

The proposed work will consider big data for processing on a quantum machine. The following tasks are planned: select the big data for processing; transform the data from classical to quantum space; and apply a machine learning algorithm for classification of the data in an n-dimensional feature space. If the data has n features, it is difficult for a classical model to plot and analyse big data. The proposed problem will select a dataset from weather forecasting, climate science, or disaster management. Figure 3 shows data represented in a higher-dimensional feature space and how the concept is adopted by the machine learning algorithm. The proposed system has the following features:

Fig. 3 Quantum space data representation and comparison with classical machine learning

• Data representation in a higher-dimensional feature space.
• Classification of data into n classes.
• Reduced time for training on the dataset.
• Faster and more accurate vector search.
• Accuracy in component search and analysis.


5 Conclusion

Quantum mechanics is a fundamental theory of physics that is more powerful than the classical physics mechanism. Quantum mechanics can be used to build analyses of intricate, complex patterns in data: quantum systems can generate complex patterns that are very hard to produce with classical machines, and they can also learn and recognize patterns that cannot be recognized classically. Quantum ML can be built on many other parameters, such as hyperplane generation in a high-dimensional feature space, multiclass classification, and faster vector generation. Not only supervised learning but also unsupervised and reinforcement learning should show good quantum performance.

References
1. Schuld M, Petruccione F Supervised learning with quantum computers. Springer
2. Gupta S, Sahu K Quantum computation of perfect time-eavesdropping in position-based quantum cryptography. Quantum computing and eavesdropping over perfect key distribution
3. Ablayev F, Ablayev M, Huang JZ, Khadiev K, Salikhova N, Wu D (2020) On quantum methods for machine learning problems part I: quantum tools. Big Data Mining and Analytics 3(1):41–55. https://doi.org/10.26599/BDMA.2019.9020016
4. Ablayev F, Ablayev M, Huang JZ, Khadiev K, Salikhova N, Wu D (2020) On quantum methods for machine learning problems part II: quantum classification algorithms. Big Data Mining and Analytics 3(1)
5. Li Y, Tian M, Liu G, Peng C, Jiao L Learning-based quantum robust control: algorithm, applications, and experiments. https://doi.org/10.1109/ACCESS.2020.2970105
6. Dong D, Chen C, Li H, Tarn TJ Quantum reinforcement learning. IEEE Trans Syst Man Cybern
7. Li Y, Tian M, Liu G, Peng C, Jiao L (2020) Quantum optimization and quantum learning: a survey. IEEE Access. https://doi.org/10.1109/ACCESS.2020.2970105
8. Neukart F et al (2017) Traffic flow optimization using a quantum annealer. https://arxiv.org/abs/1708.01625
9. Dong D, Xing X, Ma H, Chen C, Liu Z, Rabitz H Learning-based quantum robust control: algorithm, applications, and experiments. IEEE Trans Cybern
10. Fanizza M, Mari A, Giovannetti V (2019) Optimal universal learning machines for quantum state discrimination. IEEE Trans Inf Theory 65(9)
11. Nguyen NH, Behrman EC, Moustafa MA, Steck JE Benchmarking neural networks for quantum computations
12. Shaikh TA, Ali R Quantum computing in big data analytics: a survey. In: 2016 IEEE international conference on computer and information technology
13. Yu CH, Gao F, Wen QY An improved quantum algorithm for ridge regression. IEEE Trans Knowl Data Eng
14. Dunjko V, Taylor JM, Briegel HJ (2016) Quantum-enhanced machine learning. Phys Rev Lett 117:130501. https://doi.org/10.1103/PhysRevLett.117.130501


15. Imran, Ahmad S, Kim DH (2020) Quantum GIS based descriptive and predictive data analysis for effective planning of waste management. IEEE Access. https://doi.org/10.1109/ACCESS.2020.2979015
16. Nawaz SJ, Sharma SK, Wyne S, Patwary MN, Asaduzzaman MD Quantum machine learning for 6G communication networks: state-of-the-art and vision for the future. IEEE Access. https://doi.org/10.1109/ACCESS.2019.2909490
17. Cao Y, Romero J, Aspuru-Guzik A Potential of quantum computing for drug discovery. IBM J Res Dev. https://doi.org/10.1147/JRD.2018.2888987
18. Daley (2012) Quantum optics and quantum many-body systems: quantum computing. http://qoqms.phys.strath.ac.uk/researchqc.html
19. Fernández-Caramés TM, Fraga-Lamas P Towards post-quantum blockchain: a review on blockchain cryptography resistant to quantum computing attacks. IEEE Access. https://doi.org/10.1109/ACCESS.2020.2968985
20. Murali P et al (2019) Full-stack, real-system quantum computer studies: architectural comparisons and design insights. In: Proc ISCA'19, Phoenix, AZ, June 2019, pp 1–14
21. Tacchino F et al (2019) An artificial neuron implemented on an actual quantum processor. npj Quantum Inf 5(26)
22. Rebentrost P, Mohseni M, Lloyd S (2014) Quantum support vector machine for big data classification. Phys Rev Lett 113(13):130503
23. Li Z et al (2015) Experimental realization of a quantum support vector machine. Phys Rev Lett 114(14):140504
24. IBM Quantum cloud access website: https://quantum-computing.ibm.com/login

Design and Develop Data Analysis and Forecasting of the Sales Using Machine Learning

Vinod Kadam and Sangeeta Vhatkar

Thakur College of Engineering and Technology, Mumbai 400101, Kandivali (E), India

Abstract. Data Analysis and Forecasting on Supermarket Sales Transactions is a proposed system that focuses on the betterment of sales in a business. The proposed system comprises four main sections. Exploratory Data Analysis: in statistics, exploratory data analysis is an approach to analyzing datasets to summarize their main characteristics, often with visual methods; it refers to performing initial evaluations on data to find patterns, to spot anomalies, to test hypotheses, and to check assumptions with the help of summary statistics and graphical representations. Customer Segmentation: theoretically we will have segments such as Low Value (customers who are less active than others, infrequent buyers/visitors, generating very low, zero, or possibly negative revenue), Mid Value (in between: often using our platform, though not as much as the High Values, fairly frequent, and generating moderate revenue), and High Value (the group we do not want to lose: high revenue, high frequency, and low inactivity). Market Basket Analysis: a technique that recognizes the strength of association between pairs of items purchased together and identifies patterns of co-occurrence, a co-occurrence being when two or more things occur together. Time-series techniques for forecasting: forecasting is a method or framework for estimating future aspects of a business or activity; it is a technique for translating past data or experience into estimates of what lies ahead.

Keywords: Time series · RFM model · Market basket analysis · Apriori algorithm · ARIMA · SARIMA

1 Introduction

Data Analysis and Forecasting on Supermarket Sales Transactions focuses on the betterment of sales in a business. We are presently observing strong outcomes from organizations that utilize Machine Learning (ML) as well as Artificial Intelligence (AI) to surpass their competition and close more deals. In fact, sales teams that adopt these tools are seeing increases in leads and deals of over 50% and cost reductions of up to 60%, according to the Harvard Business Review.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_14


A few of the possibilities include interpreting customer data, improving sales forecasting, predicting customer needs, and efficient transaction sales. For data analysis, we first handle the information: in statistics, EDA (exploratory data analysis) is used to better understand the data, build an intuition about the data, generate hypotheses, find insights, and support visualization [1]. After visualizing the data we use RFM (recency, frequency, monetary) analysis to segment the customers. The RFM model estimates when people buy (Recency), how frequently they buy (Frequency), and how much they buy (Monetary). Since a customer's past purchases can effectively predict their future purchase behaviour, an organization can recognize which customers are worthwhile. To compute the RFM model score we apply K-means clustering [2], but we must tell the K-means algorithm how many clusters we need. To find this out, we apply the Elbow Method, which gives the ideal cluster number for optimal inertia. For a better understanding of the result we can inspect the mean values of Recency, Frequency, and Revenue, and with these we can segment the data into low-value, mid-value, and high-value groups. After the RFM score we need to recognize the strength of association between sets of items bought together and distinguish patterns of co-occurrence, for which we use Market Basket Analysis (MBA); a co-occurrence is when at least two things happen together. For useful results in MBA we measure the strength of a rule by calculating metrics such as Support, Confidence, and Lift [3]. After MBA we investigate the sales transaction data for item shelving with the Apriori algorithm [4]. Time-series forecasting has proven practical for decision-making in various domains [5]; it translates past facts or experience into forecasts of what is to come. Due to the seasonal fashion of the time series used, the Seasonal ARIMA (SARIMA) model is selected for model development [5].

2 Problem Statement

In the present uncertain economy, organizations are attempting to adopt alternative approaches to remain competitive. Ineffective forecasting strategies lead to various product stock-outs. The research therefore revolves around (company size, stakeholders, solutions they want) different forecasting techniques for demand prediction through machine learning, with the ability to compare them against historical sales efforts. The proposed framework can connect the dots and better foresee which deals will be viable, the probability of a deal closing, and how long it will take. This understanding enables sales management to better allocate resources and anticipate sales projections.

3 Proposed Methodology

See Fig. 1.


Fig. 1 Data analysis and forecasting architecture

3.1 Exploratory Data Analysis (EDA)

In EDA (exploratory data analysis), the first sign that a visualization is good is that it shows you a problem in your data, detects outliers or anomalous events, and finds interesting relations among the variables [6]. Dataset understanding is illustrated in Fig. 2. We added a column Invoice Year Month to obtain a month-wise view of the data (Figs. 3 and 4). A formal model may or may not be used, but essentially EDA is for seeing what the data can tell us beyond the usual modelling or hypothesis-testing task [7].
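The month-wise aggregation described here can be reproduced with a few lines of pandas; the column names (`InvoiceDate`, `Quantity`, `InvoiceYearMonth`) are assumptions matching the figures, not the authors' exact code.

```python
import pandas as pd

# Tiny stand-in for the retail transactions of Figs. 2-4.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01", "2010-12-15", "2011-01-05", "2011-01-20"]),
    "Quantity": [10, 5, 7, 3],
})

# Derive the InvoiceYearMonth column and count monthly order volume.
df["InvoiceYearMonth"] = df["InvoiceDate"].dt.strftime("%Y%m")
monthly_qty = df.groupby("InvoiceYearMonth")["Quantity"].sum()
```

Plotting `monthly_qty` as a bar chart gives the month-wise view of Figs. 3 and 4.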


Fig. 2 Sales transaction data for the period December 2010 to December 2011

Fig. 3 Month-wise view of the data

3.2 RFM (Recency, Frequency, Monetary) Model

For a fruitful business, running an effective campaign is a key task for marketers. Traditionally, marketers must first identify market segments using a mathematical model and then execute an efficient campaign plan to target profitable customers [8]. The proposed system uses the RFM (recency, frequency, monetary) concept to segment customers [9]. The RFM method is used for analyzing customer value; it is widely utilized in database marketing and direct marketing and has received particular attention in the retail and professional services industries [10]. This investigation uses the following RFM factors:
• Recency (R): when people purchase.
• Frequency (F): how often they purchase.
• Monetary (M): how much they purchase.
In the RFM model we apply K-means clustering to assign a score to each customer [11]. Nevertheless, we must tell the K-means algorithm how many clusters we need. To find this out, we apply the Elbow Method.


Fig. 4 Monthly order count using the Quantity field

The Elbow Method runs k-means clustering on the dataset for a range of values of k (say from 1 to 10) and then, for each value of k, computes an average score (the inertia) over all clusters [12] (Figs. 5, 6, 7, 8, 9 and 10).
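The elbow procedure can be sketched without any ML library by running k-means for each k and recording the inertia (sum of squared distances to the nearest centroid). The one-dimensional implementation below is a simplified, illustrative stand-in for the clustering applied to the recency/frequency/revenue columns.

```python
def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means; returns (centroids, inertia)."""
    # Deterministic init: spread starting centroids across sorted values.
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: (v - centroids[i]) ** 2)
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, inertia

# Recency-like values with three obvious groups.
recency = [1, 2, 3, 50, 52, 55, 200, 210, 205]
inertias = [kmeans_1d(recency, k)[1] for k in range(1, 5)]
# Inertia drops sharply up to k = 3, then flattens: the "elbow".
```

Plotting `inertias` against k makes the elbow visible; the bend marks the cluster count chosen for the RFM scores.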

Fig. 5 Calculating Recency for each customer

3.3 Market Basket Analysis (MBA)

Market Basket Analysis (MBA) is a standard instance of association rule mining. It produces if–then situation rules, for example: if item A is purchased, then item B is likely to be purchased. The rules are probabilistic in nature; in other words, they are derived from the frequencies of co-occurrence in the observations [13]. Among all techniques of data mining, the Apriori algorithm is considered better for association rule mining [14].
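The Apriori principle that makes this search tractable — grow candidate itemsets only from itemsets already found frequent — can be sketched in a few lines of Python. This is an illustrative sketch, not the proposed system's implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets (frozensets) with support >= min_support,
    growing candidates only from smaller frequent itemsets."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    frequent, current = {}, [frozenset([i]) for i in sorted(items)]
    while current:
        # Count support of each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = [c for c, k in counts.items() if k / n >= min_support]
        frequent.update({c: counts[c] / n for c in survivors})
        # Join step: build (k+1)-itemsets from surviving k-itemsets.
        current = list({a | b for a, b in combinations(survivors, 2)
                        if len(a | b) == len(a) + 1})
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk", "bread"}]
freq = apriori(baskets, min_support=0.5)
```

Here {milk, bread} survives with support 0.75, while {milk, butter} is pruned, and so no superset of it is ever counted.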


Fig. 6 Computed clusters and assigned them to each customer

Fig. 7 To make frequency clusters, we find the total number of orders for each customer

The key idea in the Apriori algorithm is that it assumes all subsets of a frequent itemset to be frequent. Similarly, for any infrequent itemset, all its supersets must also be infrequent. In order to select the interesting rules out of the many possible rules for the business, the proposed system uses the following measures [16].

Fig. 8 Computed clusters and assigned them to each customer in our data frame tx_user

Fig. 9 Revenue can be calculated by unit price * quantity

Fig. 10 The mean value of Recency, Frequency and Revenue

Support: The support of a rule is the ratio of the number of transactions that contain all items in both {A} and {B} to the total number of transactions [17].

Support = (A + B) / Total    (1)


Confidence: The confidence of a rule is the ratio of the number of transactions that contain all items in {A} and {B} to the number of transactions that contain all items in {A} [17].

Confidence = (A + B) / A    (2)

Lift: The third measure, called the lift or lift ratio, is the ratio of confidence to expected confidence, where the expected confidence is the frequency of B. The lift tells us how much better a rule is at predicting the outcome than simply assuming the outcome in the first place. Greater lift values indicate stronger associations [18] (Fig. 11).

Fig. 11 Market basket analysis on France from the dataset [19]

Lift = ((A + B) / A) / (B / Total)    (3)
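Following Eqs. (1)–(3), the three measures can be computed directly from transaction counts; the function below is an illustrative sketch of that bookkeeping, not the system's code.

```python
def rule_metrics(transactions, a, b):
    """Support, confidence and lift for the rule {a} -> {b},
    following Eqs. (1)-(3)."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_b = sum(1 for t in transactions if b in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    support = n_ab / n                      # Eq. (1)
    confidence = n_ab / n_a                 # Eq. (2)
    lift = confidence / (n_b / n)           # Eq. (3)
    return support, confidence, lift

baskets = [{"tea", "sugar"}, {"tea", "sugar"}, {"tea"}, {"coffee"}]
s, c, l = rule_metrics(baskets, "tea", "sugar")
```

A lift above 1, as for tea → sugar here, means the pair co-occurs more often than independence would predict.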

3.4 Time Series Forecasting

Forecasting is a strategy or method for estimating future aspects of a business or activity; it is a technique for interpreting past information or experience into estimates of things to come [19]. A time series involves data indexed by equally spaced increments of time (minutes, hours, days, weeks, and so on). Because of the discrete nature of time series data, time series datasets have a seasonal and/or trend component built into the data [20]. The initial phase in time series modelling is to account for existing seasons (a recurring pattern over a fixed period of time) and trends (upward or downward movement in the data). Accounting for these embedded patterns is what we call making the data stationary. A series is said to be stationary if and only if its joint probability distribution does not change over time, that is, the mean and variance of the series remain constant over the long run [21].


With trending data, the mean of the series either increases or decreases with time (think of the steady increase in housing prices over time). For seasonal data, the mean of the series varies according to the season (think of the rise and fall in temperature every twelve months) [22] (Fig. 12).

Fig. 12 Time Series forecasting on the entire data

One step ahead forecast uses the actual value for each subsequent forecast (Figs. 13 and 14).

Fig. 13 One Step ahead forecast on the entire data


Fig. 14 Visualizing Forecasting Predicted mean for entire data

There are two strategies that can be applied to achieve stationarity: difference the data, or use linear regression. To take a difference, you calculate the difference between consecutive observations. To use linear regression, you include binary indicator variables for your seasonal component in the model [23].

3.5 SARIMA (Seasonal Autoregressive Integrated Moving Average)

Autoregressive Integrated Moving Average, or ARIMA, is a forecasting method for univariate time series data. As its name suggests, it supports both autoregressive and moving average components. The integrated component refers to differencing, allowing the method to support time series data with a trend [24]. An issue with ARIMA is that it does not support seasonal data, that is, a time series with a repeating cycle [25]. ARIMA expects data that either are not seasonal or have the seasonal component removed, for example seasonally adjusted through methods such as seasonal differencing [26]. SARIMA is a time series forecasting technique for stochastic model data with a seasonal data pattern [27].
• ARIMA (p, d, q): the non-seasonal part of the model
• (P, D, Q)s: the seasonal part of the model
• s: the seasonal period
Seasonal Autoregressive Integrated Moving Average, SARIMA or Seasonal ARIMA, is an extension of ARIMA that explicitly supports univariate time series data with a seasonal component [28].
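The differencing strategy just described is a one-liner in code; the sketch below (illustrative only) removes a linear trend with a first difference and a seasonal pattern with a lag-s difference.

```python
def difference(series, lag=1):
    """First (or seasonal, with lag=s) difference of a series."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [10, 12, 14, 16, 18, 20]            # mean grows over time
diffed = difference(trend)                  # constant: trend removed
seasonal = [5, 9, 5, 9, 5, 9]               # period-2 seasonality
season_diffed = difference(seasonal, lag=2) # zeros: season removed
```

These are exactly the d and D operations that ARIMA and SARIMA apply internally before fitting the AR and MA terms.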


SARIMA adds three new hyperparameters to specify the autoregression (AR), differencing (I), and moving average (MA) terms for the seasonal component of the series, as well as an additional parameter for the period of the seasonality [29]. Time series forecasting for Germany is shown in Figs. 15 and 16:

Fig. 15 Time Series Forecasting of Germany

Fig. 16 One Step ahead forecast on the Germany data

One-step-ahead forecasting uses the actual value for each subsequent estimate [30]. The Mean Squared Error of our forecasts is 1607357.8. Dynamic forecasting uses the previous forecasted value of the dependent variable to compute the next one [32] (Fig. 17). The Mean Squared Error of our forecasts is 1648114.76 (Fig. 18).


Fig. 17 Dynamic forecast on the Germany data

Fig. 18 Visualizing forecast on the Germany data
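The difference between one-step-ahead and dynamic forecasting lies only in what feeds the next prediction. Using a naive lag-1 forecaster as a stand-in for the fitted SARIMA model (the numbers here are synthetic, not the Germany results), the two modes and the MSE computation look like:

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

series = [100, 110, 125, 120, 140, 150]

# One-step ahead: each forecast uses the actual previous observation.
one_step = series[:-1]                  # predict y[t] = y[t-1]
mse_one_step = mse(series[1:], one_step)

# Dynamic: after the first step, forecasts are fed back in. For a pure
# lag-1 model this freezes at the last observed value.
dynamic = [series[0]] * (len(series) - 1)
mse_dynamic = mse(series[1:], dynamic)
```

As in the Germany experiment, the dynamic MSE comes out larger, since forecast errors compound instead of being corrected by fresh observations.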

4 Proposed Algorithm

The proposed system performs data analysis and forecasting of sales in a business in the following way:
Step 1: Get the data.
Step 2: Understand the dataset.
Step 3: Apply Exploratory Data Analysis to visualize the data.
Step 4: Apply the RFM method for analyzing customer value.


Step 5: Assign a score to each customer using the K-means algorithm.
Step 6: Utilize the Market Basket Analysis technique to recognize the strength of association.
Step 7: Distinguish the relationships between sets of items bought together.
Step 8: Distinguish patterns of co-occurrence.
Step 9: For better association rule mining, apply the Apriori algorithm.
Step 10: Visualize the time series.
Step 11: Make the data stationary.
Step 12: Build a SARIMA model for the seasonal component.
Step 13: Make the prediction on the data.

5 Conclusion

This paper describes the procedure of forecasting in detail. In statistics, EDA is an approach to analyzing datasets to summarize their main characteristics, often with visual methods; it refers to performing initial examinations on data to find patterns, to spot anomalies, and to check assumptions with the help of summary statistics and graphical representations. The analysis gives a better understanding of which customers are priorities and which are not, and of what actions are required on low-priority customers to improve sales. Three appears to be the ideal cluster count based on the business requirements, though we can proceed with fewer or more clusters. The results show that products are mostly bought in pairs, which can enable the business at store level to sell these products side by side to further improve sales. The model was then regularized, and market basket analysis on France from the dataset was used to improve sales by pairing up the frequently bought items: the analysis found that 340 Green Alarm Clocks are sold but only 316 Red Alarm Clocks, so perhaps more Red Alarm Clock sales can be driven through recommendations. Accordingly, it is concluded that SARIMA gives accurate forecast results and can be utilized for predicting store sales on the Germany data: using one-step-ahead forecasting, where the actual value feeds each subsequent estimate, the Mean Squared Error of our forecasts is 1607357.8, while dynamic forecasting, which uses the previous forecasted value of the dependent variable to compute the next one, gives a Mean Squared Error of 1648114.76. Ultimately, the choice of forecasting model will depend on the type of dataset.

References

1. Thankachan K (2017) Automating anomaly detection for exploratory data analytics
2. Boyapati SN, Mummidi R (2020) Predicting sales using machine learning techniques
3. Liu RQ, Lee YC, Mu HL (2018) Customer classification and market basket analysis using K-means clustering and association rules: evidence from distribution big data of Korean retailing company
4. Noureen S, Atique S, Roy V, Bayne S (2019) Analysis and application of seasonal ARIMA model in energy demand forecasting: a case study of small scale agricultural load


V. Kadam and S. Vhatkar

5. Mehrmolaei S, Keyvanpour MR (2016) Time series forecasting using improved ARIMA
6. Yamada S, Yamamoto Y, Umezawa K, Asai S, Miyachi H, Hashimoto M, Inokuchi S (2016) Exploratory analysis for medical data using interactive data visualization
7. Martinez WL, Martinez AR, Solka JL Exploratory data analysis with MATLAB, third edition
8. Wei JT, Lin SY, Wu HH (2010) A review of the application of RFM model
9. Sheshasaayee A, Logeshwari L (2018) Implementation of clustering technique based RFM analysis for customer behaviour in online transactions. In: 2018 2nd international conference on trends in electronics and informatics (ICOEI)
10. Tavakoli M, Hajiagha MM, Masoumi V, Mobini M (2018) Customer segmentation and strategy development based on user behavior analysis, RFM model and data mining techniques: a case study
11. Daoud RA, Amine A, Belaid B, Lbibb R (2015) Combining RFM model and clustering techniques for customer value analysis of a company selling online. In: 2015 IEEE/ACS 12th international conference of computer systems and applications (AICCSA), November 2015
12. Hsuan-Kai C, Wu HH, Wei JT, Lee MC (2013) Customer relationship management in the hairdressing industry: an application of data mining techniques. Expert Syst Appl 40(18):7513–7518
13. Sukhia KN, Khan AA, Bano M (2014) Introducing economic order quantity model for inventory control in web based point of sale applications and comparative analysis of techniques for demand forecasting in inventory management
14. Sekban J (2019) Applying machine learning algorithms in sales prediction. Istanbul, August; Alfiah F, Pandhito BW, Sunarni AT, Muharam D, Matusin PR (2018) Data mining systems to determine sales trends and quantity forecast using association rule and CRISP-DM method. Int J Eng Techn 4(1)
15. Kavitha M, Subbaiah S (2020) Association rule mining using apriori algorithm for extracting product sales patterns in groceries. ICATCT 2020, vol 8, no 03
16. https://towardsdatascience.com/market-basket-analysis-using-associative-data-mining-and-apriori-algorithm-bddd07c6a71a
17. Gurudath S (2020) Market basket analysis & recommendation system using association rules
18. Dhanabhakyam M, Punithavalli M (2011) A survey of data mining algorithm for market basket analysis
19. Lam R (2013) Forecasting trends in the healthcare sector. Procedia Comput Sci 17:789–796
20. Sarpong SA (2013) Modeling and forecasting maternal mortality: an application of ARIMA models. Int J Appl Sci Technol 3(1):19–28
21. Cheng CH, Wei LY (2010) One step-ahead ANFIS time series model for forecasting electricity loads. Optim Eng 11(2):303–317
22. Vhatkar S, Dias J (2016) Oral-care goods sales forecasting using artificial neural network model
23. Hyndman RJ, Athanasopoulos G (2015) 8.9 Seasonal ARIMA models. Forecasting: principles and practice. OTexts. Retrieved 19 May 2015
24. Shah N, Solanki M, Tambe A, Dhangar D (2015) Sales prediction using effective mining techniques
25. Hyndman RJ, Koehler AB (2005) Another look at measures of forecast accuracy. Monash University
26. Sharma SK, Sharma V (2012) Comparative analysis of machine learning techniques in sale forecasting. Int J Comput Appl 53(6)
27. Dudek G (2014) Short-term load forecasting using random forests. In: Intelligent systems. Springer International Publishing, pp 821–828

Design and Develop Data Analysis and Forecasting


28. Meryem O, Ismail J, Mohammed EM (2014) A comparative study of predictive algorithms for time series forecasting. IEEE
29. Pindoriya NM, Singh SN, Singh SK (2009) One-step-ahead hourly load forecasting using artificial neural network. In: 2009 international conference on power systems
30. Milojković J, Litovski V (2011) Dynamic one step ahead prediction of electricity loads at suburban level. In: 2011 IEEE first international workshop on smart grid modeling and simulation (SGMS), 17 Oct 2011

Prediction of Depression Using Machine Learning and NLP Approach

Amrat Mali(B) and R. R. Sedamkar

Thakur College of Engineering and Technology, Mumbai University, Mumbai, India

Abstract. Today, micro-blogging has become a popular networking forum for Internet users. Millions of people exchange views on different aspects of their lives, so micro-blogging websites are a rich source of data for opinion mining or Sentiment Analysis (SA). Because of the recent advent of micro-blogging, only a few research papers are dedicated to this subject. In our paper, we concentrate on Twitter, one of the leading micro-blogging sites, to explore public opinion. We demonstrate how to collect real-time Twitter data and use algorithms such as Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words (BOW) and Multinomial Naive Bayes (MNB) for sentiment analysis or opinion mining. Using the algorithms selected above, we are able to assess positive and negative feelings in the real-time Twitter data. The experimental evaluations show that the algorithms used are accurate and can serve as an application for diagnosing depression in individuals. We worked with English in this paper, but the approach can be applied to any other language.

Keywords: NLP (Natural language processing) · Machine learning · Reddit · Social networks · Depression

1 Introduction

Depression as a common mental health disorder has long been defined as a single disease with a set of diagnostic criteria. It often co-occurs with anxiety or other psychological and physical disorders, and has an impact on the feelings and behaviour of the affected individuals [1]. According to the WHO report, 322 million people are estimated to suffer from depression, equal to 4.4% of the global population. Nearly half of the at-risk individuals live in the South-East Asia (27%) and Western Pacific (27%) regions, including China and India. In many countries depression is still under-diagnosed and left without adequate treatment, which can lead to a distorted self-perception and, at its worst, to suicide [2]. In addition, the social stigma surrounding depression prevents many affected individuals from seeking appropriate professional assistance. As a result, they turn to less formal resources such as social media. With the growth of Internet usage, people have started to share their experiences and challenges with mental health disorders through online forums, micro-blogs or tweets.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_15


Their online activities inspired many researchers to introduce new forms of potential health care solutions and methods for early depression detection systems. They tried to achieve higher performance using various Natural Language Processing (NLP) techniques and text classification approaches. Some studies use a single feature set, such as bag of words (BOW) [2, 3], N-grams [4], LIWC [5] or LDA [6, 7], to identify depression in posts. Other papers compare the performance of individual features with various machine learning classifiers [8–11]. Recent studies examine the power of single features and their combinations, such as N-grams + LIWC [12] or BOW + LDA and TF-IDF + LDA [13], to improve the accuracy results. With almost 326 million active users and 90 million tweets publicly distributed to a wide audience, Twitter is one of the most popular social networking sites [13]. Many researchers have successfully used Twitter data as a source of insight into the epidemiology of emotions, depression and other mental disorders among users. Reddit is a widely used online discussion site organized into multiple communities or "subreddits". It is also used for discussions of stigmatized subjects because it enables users to be totally anonymous. Choudhury studied the posts of Reddit users who wrote about mental health and later shifted to discussing suicidal ideation; features such as self-focus, a poor linguistic style, decreased social participation, and expressions of hopelessness or anxiety predicted this shift.

2 Related Work

To provide new insight into depression detection, different types of research have explored the relationship between mental wellbeing and language use. Dating back to the earliest years of psychology, Sigmund Freud [12] wrote about Freudian slips, linguistic errors that expose the author's inner thoughts and feelings. Various approaches to the relationship between depression and language have been established through the development of sociological and psycholinguistic theories. For example, according to Aaron Beck's cognitive theory of depression [12], affected people tend to view themselves and their environment in mostly negative terms, expressing themselves through derogatory words and first-person pronouns. Self-preoccupation is identified as their typical feature, which can evolve into an intense stage of self-criticism. Other scholars have been inspired by these hypotheses to seek empirical evidence for their validity. For instance, Stirman and Pennebaker [12] compared the word use in 300 poems written by 9 suicidal and 9 non-suicidal authors across three separate periods of their lives. The findings indicate that suicidal poets used more first-person singular pronouns (I, me or we), and that depressed students used more negative words and fewer positive words of feeling. In order to predict the improvement of depressive symptoms, Zinken et al. [13] investigated the psychological importance of syntactic structures, analyzing the roles of cause and insight. They assumed that a written text may barely differ in its word use, yet differ in its syntactic structure, especially in the construction of relationships between events. Studies on depression and other mental health problems have faced new challenges with the advent of social media and the

Internet era. Online domains such as Facebook, Twitter or Reddit have provided a new forum for groundbreaking analysis of user behavioural patterns, with a rich source of text data and social metadata. Reddit is a widely used online discussion site organized into various communities or "subreddits". Since it allows complete privacy of users, it is also used to address stigmatized subjects. Choudhury et al. [13] reviewed the posts of Reddit users who wrote about mental health and later moved on to discussing suicidal ideation. In the recent past, shared tasks potentially applicable to different circumstances have become significantly common in the larger research community. eRisk, the Early Risk Prediction lab of the Conference and Labs of the Evaluation Forum (CLEF), is a public competition that enables researchers from various disciplines to participate and collaborate on producing reusable benchmarks for the assessment of early risk detection technologies used in various fields, such as health and safety.

3 Problem Statement

Patients suffer from mental problems such as Alzheimer's disease, depression, anxiety and other neurodegenerative diseases, and India is reported to be among the most depressed countries in the world. Taking text data as input, the system applies a Natural Language Processing algorithm and a classification algorithm to determine whether a post indicates depression and to predict whether or not it is a suicidal post.

4 Proposed System

Data Pre-Processing: Now that we have received our data, we take a first look at it, checking for missing values and selecting which sections of the data set will be useful for our classifier. We also start pre-processing the text with natural language tools. This portion ends with some exploratory data analysis and visualizations. Before moving to the feature selection and training stage, we use NLP tools to pre-process the dataset. First, we use tokenization to divide the posts into individual tokens. Next, we remove all URLs, punctuation, and stop words that might lead to erroneous results if ignored. Then, we apply stemming to reduce the words to their root form and group related words together (Fig. 1).

Feature extraction: After data pre-processing, we feed our models with features that reflect the language habits of users in Reddit forums. To explore the linguistic usage of users in the posts, we use the LIWC dictionary, LDA topics, and N-gram features. These text encoding methods encode words so they can be used by different classifiers. N-gram modelling is used to analyse the characteristics of the documents. It is widely used in text mining and NLP; here the probability of co-occurrence in each input sentence is calculated as unigram and bigram features for depression detection [8, 40]. As a numerical statistic for n-gram modelling, we use the Term Frequency-Inverse Document Frequency (TF-IDF), where the value of a word is weighted with respect to each document in the corpus (Fig. 2).


Fig. 1 Dataset

Fig. 2 Bar Plot for top words

5 Proposed Methodology Architecture

In computational linguistics, topic modelling is an important method for reducing the feature space of textual input data to a fixed number of topics [19]. Hidden topics, such as subjects connected with anxiety and depression, can be extracted from the selected documents via this unsupervised text mining approach. It is not generated by a predetermined

collection of pre-established terms, in contrast to LIWC; instead, it automatically produces categories of non-labelled terms (Fig. 3).
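A minimal topic-modelling sketch with scikit-learn's LDA on toy posts; the posts, the two-topic setting and the vectorizer options are illustrative assumptions, not the paper's configuration:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy posts standing in for the Reddit corpus
posts = [
    "feeling sad and tired no energy to get out of bed",
    "anxiety keeps me awake every night cant sleep",
    "went running this morning great weather and coffee",
    "new coffee shop opened downtown great pastries",
]
counts = CountVectorizer(stop_words="english").fit_transform(posts)

# Reduce the feature space to a fixed number of hidden topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.shape)  # one topic distribution per post
```

Each row of `doc_topics` is a probability distribution over the hidden topics, which can then be fed to the classifiers as additional features.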

Fig. 3 Block diagram of proposed system

We use classification methods to quantify the probability of depression among the users. The proposed structure is built using Logistic Regression, Support Vector Machine, Random Forest, Adaptive Boosting and Multilayer Perceptron classifiers (Fig. 4).
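The five classifiers can be compared in a few lines with scikit-learn. The synthetic features below stand in for the TF-IDF/LIWC/LDA matrices, and the MLP layer sizes follow the two hidden layers of 4 and 16 perceptrons described in this section; treat this as a sketch, not the authors' code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the extracted feature matrix and labels
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "RandomForest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    # two hidden layers of 4 and 16 perceptrons, as in the paper
    "MLP": MLPClassifier(hidden_layer_sizes=(4, 16), max_iter=2000,
                         random_state=0),
}
# 5-fold cross-validated accuracy for each classifier
results = {name: cross_val_score(clf, X, y, cv=5).mean()
           for name, clf in classifiers.items()}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```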

Fig. 4 Live blog data


Adaptive Boosting (AdaBoost) is an ensemble technique that combines several weak classifiers into one strong classifier [56]. It is commonly used for binary classification problems. A special case of the artificial neural network, the Multilayer Perceptron (MLP) is mostly used for modelling complex relationships between the input and output layers [58]. Due to its many layers and non-linear activation, it is able to discriminate data that is not linearly separable [59]. In our analysis, we used the MLP method with two hidden layers of 4 and 16 perceptrons, fixed across all feature sets, in order to ensure a fair comparison (Fig. 5).

Fig. 5 Flow of proposed system

Since depression also affects psychomotor functions [60], we can find terms that represent the symptoms of low energy and exhaustion or, conversely, insomnia and hyperactivity (tired, I'm tired, sleepy). It is also expressed somatically (my brain, discomfort, hurt) via bodily symptoms. Unigrams and bigrams in regular posts, unlike depression-indicative posts, contain terms identifying events that happened well in the past (time, month ago, year ago, last year) (Fig. 6). To evaluate the connection between the textual data and the features themselves, we selected 68 out of 95 characteristics. In view of the psycholinguistic characteristics resulting in association provided in the feature extraction, we transformed every depressive and non-depressive post into numerical values. The Psychological Mechanisms (0.19) show the greatest correlation, followed by the Linguistic Aspects (0.17) and Personal Concerns (0.16) (Fig. 7).


Fig. 6 Most top words used

Fig. 7 Occurrence of top words used

The findings indicate that, with respect to the mental focus of depressed and non-depressed users, depressed individuals use more self-oriented references and prefer to shift attention to themselves (I, me, and mine) (0.17). These findings confirm the work of [12, 13]. Their posts contain more negative feelings, depression and anxiety, with a stronger focus on the present and future. Based on our results, LIWC may play an effective role in depression detection models (Fig. 8). We developed a topic model to quantify the hidden topics extracted from the posts, which act as depression triggering points. LDA requires that the number of topics be specified, and any parameter change can trigger a change in classification accuracy, so an acceptable value needs to be identified.


Fig. 8 Word comparison

6 Results

Deployment was done using Flask: when the user writes in the text box, the backend API is called and returns the result, as shown in Fig. 9. Prediction on the depression text data achieves 90% accuracy. The data was extracted from Reddit and categorised by subreddit, with VADER sentiment analysis performed on each submission. A training set was developed separately from the test set. The training set was pruned to exclude submissions that were overly positive for r/SuicideWatch and excessively negative for r/CasualConversation; this was done to increase the divergence of the training data. The data was fitted and transformed using a Spark Machine Learning pipeline that generated features based on TF-IDF analysis and the negative sentiment.

Fig. 9 Result of algorithm

The features incorporated VADER sentiment analysis, and the data was vectorized after the pruning process was completed. A Naive Bayes multinomial classifier and a KNN model were trained using the

newly transformed results. This model was used to predict whether a particular test-set submission would be posted in r/SuicideWatch or r/CasualConversation. The accuracy was estimated to be 90% plus or minus 2% (Fig. 10).
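A condensed scikit-learn stand-in for this classification stage (the real system used a Spark ML pipeline with VADER sentiment features; the posts below are invented examples, not Reddit data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins: label 1 ~ r/SuicideWatch, label 0 ~ r/CasualConversation
train_posts = [
    "i cant go on anymore everything feels hopeless",
    "i want to end it all tonight",
    "nobody would miss me if i was gone",
    "what is everyone having for dinner tonight",
    "just finished a great book any recommendations",
    "my dog did the funniest thing today",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF vectorization followed by a multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_posts, train_labels)
print(model.predict(["everything is hopeless and i cant go on"]))
```

A KNN model can be swapped in by replacing `MultinomialNB()` with `KNeighborsClassifier()` from `sklearn.neighbors`.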

Fig. 10 Deployment using flask

7 Conclusion

Naïve Bayes, SVM and KNN classifiers are used for the final classification of the input text, determining whether or not a post is a suicidal note. After deployment of the model, users can input text and receive predictions about their mental state. The prediction accuracy is 90%.

References

1. Greene D, Cunningham P (2006) Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of the ICML
2. Grimmer J, Stewart BM (2013) Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Polit Anal 1–31. https://doi.org/10.1093/pan/mps028
3. Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47
4. Guo GD, Wang H, Bell D, Bi YX, Greer K (2006) Using kNN model for automatic text categorization. Soft Comput 10(5):423–430
5. Patil AS, Pawar BV (2012) Automated classification of web sites using naive Bayesian algorithm. In: Proceedings of the international multiconference of engineers and computer scientists, vol I. Hong Kong, pp 14–16
6. Jiang L, Li C, Wang S, Zhang L (2016) Deep feature weighting for naïve Bayes and its application to text classification. Eng Appl Artif Intell 52:26–39; Lasisi H, Ajisafe AA (2012) Development of stripe biometric based fingerprint authentication systems in automated teller machines. IEEE, pp 172–175


7. Haqani H, Saleem M, Banday SA, RoufKhan AB (2014) Biometric verified access control of critical data on a cloud. In: International conference on communication and signal processing, India
8. Yuan Q, Cong G, Thalmann NM (2012) Enhancing naive Bayes with various smoothing methods for short text classification. In: WWW 2012 companion, Lyon, France. ACM
9. Lertnattee V, Theeramunkongt T (2014) Analysis of inverse class frequency in centroid-based text classification. In: International symposium on communication and information technologies (ISCIT 2014), Sapporo, Japan, pp 1171–1176
10. Powers DMW (2007) Evaluation: from precision, recall and F-factor to ROC, informedness, markedness & correlation. School of Informatics and Engineering, Flinders University of South Australia, Technical Report SIE-07-001
11. Razaque A, Amsaad FH, Nerella CH, Abdulgader M, Saranu H (2016) Multi-biometric system using fuzzy vault. IEEE
12. Sadhya D, Singh SK, Chakraborty B (2016) Review of key-binding-based biometric data protection schemes. IET Biom
13. Xu J (2015) An online biometric identification system based on two dimensional Fisher linear discriminant. IEEE
14. Nagaraju S, Parthiban L (2015) Trusted framework for online banking in public cloud using multi-factor authentication and privacy protection gateway. J Cloud Comput Adv Syst Appl 4:22
15. Dilsizian SE, Siegel EL (2014) Artificial intelligence in medicine and cardiac imaging: harnessing big data and advanced computing to provide personalized medical diagnosis and treatment. Curr Cardiol Rep 16(1):1–8
16. Markonis D, Schaer R, Eggel I et al (2012) Using MapReduce for large-scale medical image analysis. In: 2012 IEEE second international conference on healthcare informatics, imaging and systems biology (HISB), La Jolla, California. IEEE, p 1
17. Shortliffe EH, Cimino JJ (2014) Biomedical informatics. Springer, Berlin
18. Hay SI, George DB, Moyes CL et al (2013) Big data opportunities for global infectious disease surveillance. PLoS Med 10(4):e1001413
19. Kupersmith J, Francis J, Kerr E et al (2007) Advancing evidence-based care for diabetes: lessons from the Veterans Health Administration. Health Aff 26(2):w156–w168

Detection and Performance Evaluation of Online-Fraud Using Deep Learning Algorithms

Anam Khan(B) and Megharani Patil

Thakur College of Engineering and Technology, Mumbai University, Mumbai, India
[email protected]

Abstract. Online news portals are currently one of the primary sources of news used by people, though their credibility is under serious question because of the problem of Click-bait. Click-baiting, a growing phenomenon on the internet, has the potential to intentionally mislead and attract online viewership, thereby earning considerable revenue for the agencies providing such false information. There is a need to accurately detect such events on online platforms before the user becomes a victim. The solution incorporates a novel neural network approach based on FastText Word2Vec embeddings provided by Facebook and Natural Language Processing, where headlines are specifically taken into consideration. The proposed system consists of a hybrid Bi-Directional LSTM-CNN model and an MLP model. Promising results have been achieved in terms of accuracy, precision and recall when tested on a dataset of 32,000 headlines equally distributed between Click-bait and non-Click-bait. The graphs achieved are also self-explanatory in terms of the reliability of the system. A comparative analysis has also been done to show the effectiveness of our design in detecting Click-bait, which is heavily present online.

Keywords: Click-bait · FastText · Convolutional Neural Network (CNN) · Long Short-Term Memory (LSTM) · Multi-layer Perceptron (MLP)

1 Introduction

Currently, print media is being replaced by digital media, resulting in an increasing number of news portals that provide a wealth of information. The growth of online media led to Click-bait, a negative side of online journalism referring to the use of unrestrained or appalling headlines with the sole intention of attracting traffic and mouse clicks to gain more and more revenue for the site. Such headlines are often written in misleading language with provocative sentences. A title gives the first impression, dominates the user's approach, and is a key element of a news item. As Loewenstein's theory states, Click-bait is the result of a gap in understanding created by one's interest in certain matters, and this gap is capable of affecting one's emotions. In the time of instant access to the internet, people increasingly consume more and more content in the name of news, images and videos, for example from YouTube or the wider internet, rather than from trusted cable networks and news agencies (Fig. 1).

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_16


Fig. 1 Overview of click-bait

The time spent by people on online media in 2019 was anticipated to be much more than the time spent on traditional TV worldwide. YouTube alone has more than a billion users, nearly one-third of the Internet population, and reaches billions of views per day. An online news item typically consists of a title, a thumbnail, and the content. The title and thumbnail (in the case of video) are visible to viewers before they click and actually see the content; hence, headlines are found to be the crucial factor that attracts users to click and watch a video. The content is often clearly different from its title or thumbnail, which is specially crafted to attract viewers to click the video and increase its viewership. However, the spread of Click-bait videos wastes viewers' time and decreases trust in journalism. Listed below are a few examples of Click-bait headlines:

• You won't believe……
• These 12 tricks ……will change your life……
• Omg!!! Click to see what happens next….

Using intentionally deceiving links, tweets, or social media posts to attract online viewership are all strategies of Click-baiting, and it has been one method of flooding the internet with misinformation. A lot of attention has come towards Click-bait, even though research in the field of Click-bait detection is still in an early phase. Because of its extensive use in online media and news, significant fallout has started against social media platforms where such content is found. Platforms such as Facebook have decided to take action against these activities; however, they continue to be flooded with such articles. To fight this, a huge number of Twitter handles have emerged and gained large followings with the sole purpose of identifying Click-bait. Handles such as @SavedYouAClick and @HuffPoSpoilers consistently update their feeds with such posts to create awareness about them.
This method is somewhat time-consuming because of its manual nature: the users running those accounts themselves read and classify each tweet as Click-bait or not for the benefit of others. According to sources, sentimental headlines create more curiosity among people and lead to Click-bait. Around 69,000 headlines from four international media houses in 2014 were analyzed based on sentiment polarity, and extremities in sentiment were found to result in increased popularity. Headlines form the first impression and can affect how news articles are perceived by users. A headline strongly affects which existing knowledge is triggered in one's brain. By its phrasing, a headline can dominate one's mindset so that readers later recall details that coincide with what they were expecting, leading individuals to perceive the same content differently according to the headline. Another explanation is the frequently cited Loewenstein information gap theory. In simple words, the theory states that whenever we perceive a gap between what we

already know and what is unknown to us, that gap has emotional consequences. Such information gaps lead us towards false content provided online (Fig. 2).

Fig. 2 Effect of click-bait on mental state

2 Literature Survey

The Click-bait detection system proposed in [1] is primarily built upon feature extraction, where in total 60 features are taken into consideration. The baseline experimental setup comprises:

(a) Logistic Regression
(b) Support Vector Machine
(c) Convolutional Neural Network
(d) Parallel Convolutional Highway Network

In the following model, word embeddings learned from a large corpus are used. The corpus consists of data collected from Reddit, Facebook and Twitter; keeping the hyperparameters constant, the word embeddings are then fed to a convolutional neural network [2]. The model achieves higher accuracy without any feature extraction or hyperparameter tuning, using a simple CNN with one layer of convolution. The following paper demonstrates a machine learning approach for detecting Thai Click-bait. Such texts mostly consist of attractive words and low-quality information about the content, designed to gain visitor attention. The corpus of 30,000 headlines was generated by crowdsourcing to draw up the dataset [3]. The work specifically shows how to develop a Click-bait detection model using two types of features in the embedding layer and three different networks in the hidden layer. Click-bait has spread wider and wider with the evolution of online advertisements. It disappoints users because the article content does not match their expectations; thus, detecting Click-bait has recently drawn quite a lot of attention. Because of the limited information in headlines [4], traditional Click-bait detection methods are built upon heavy feature engineering and cannot appropriately distinguish Click-bait from normal headlines. A Convolutional Network can be useful for its detection, since it uses pretrained Word2Vec embeddings to understand the semantics of the headline and uses different kernels to find various characteristic features of the headlines.


However, different articles use different ways to gain users' attention, which cannot be distinguished easily by pre-trained Word2Vec. To address these issues, a Click-bait convolutional neural network (CBCNN) is built to consider not only the specific characteristics but also the overall characteristic features of different types of article. The results show that this method outperforms all the traditional methods and the Text-CNN model in terms of precision, recall and accuracy. The use of misleading techniques in user-generated news portals is ubiquitous: unscrupulous uploaders intentionally mislabel video descriptions, aiming to increase their views and thereby their ad revenue [5].

3 Problem Statement

The work stated in this paper solves the following issues:

• To identify Click-bait and non-Click-bait headlines and classify them successfully.
• To obtain word embeddings for the words present in the dataset, and specifically for rare words, which was found to be a common drawback in almost all solutions proposed so far.
• To allow the system to run on a CPU rather than a GPU.

Thus, it is considered to be a binary classification problem where the headline is taken into consideration.

4 Data Collection and Visualization

The data is collected from Kaggle. The spam headlines are collected from sites such as 'BuzzFeed', 'Upworthy', 'Viral Nova', 'That scoop', 'Scoop whoop' and 'Viral Stories'. The relevant, non-spam headlines are collected from trustworthy news sites such as 'Wiki News', 'New York Times', 'The Guardian' and 'The Hindu'. The dataset has two columns: the first contains the headlines and the second holds binary labels, where 1 represents a click-bait headline and 0 a non-click-bait headline. The dataset contains 32,000 rows in total, of which 50% are click-bait and 50% are non-click-bait, showing that the dataset is evenly balanced (Table 1).

Table 1 Statistics of the dataset used
Total headlines:           32,000
Click-bait headlines:      15,999
Non-click-bait headlines:  16,001
Vocabulary length:         18,966
Year of creation:          2015

The dataset is taken from [6]. It is divided 80:20 into training and testing data, respectively (Table 2).
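The 80:20 split described above can be sketched in plain Python. The rows below are hypothetical stand-ins for the real (headline, label) pairs, and the seed is an arbitrary choice for reproducibility.

```python
import random

# Hypothetical stand-in rows; the real dataset has 32,000 (headline, label) pairs.
rows = [(f"headline {i}", i % 2) for i in range(32000)]

random.seed(42)
random.shuffle(rows)                 # shuffle before splitting
cut = int(0.8 * len(rows))           # 80:20 split as in Table 2
train, test = rows[:cut], rows[cut:]

print(len(train), len(test))         # 25600 6400
```

The resulting counts match Table 2: 25,600 training and 6,400 testing rows.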

A. Khan and M. Patil

Table 2 Dataset split into training and testing
Number of training data: 25,600
Number of testing data:   6,400

An additional variable, "document length", which counts the total number of words in a headline, is used here to understand the distribution of headline lengths in the training and testing datasets (Fig. 3).

Fig. 3 Word cloud for click-bait and non-click-bait headlines respectively

The above visualization of the click-bait and non-click-bait datasets clearly shows the dissimilarities between the two. Figure 4 shows the top words and their counts.

5 Proposed System

The development of the proposed system is divided into phases for ease of operation. In the first phase, the textual data is pre-processed using Natural Language Processing techniques: extra white space and punctuation are removed, every headline is converted to lowercase, and stop-word removal and tokenization are performed. While visualizing the dataset it was observed that numbers also play an important role in click-bait headlines, so numbers are kept for detailed analysis of the headlines. Stemming and lemmatization, on the other hand, chop off parts of words, leaving nothing specific to feed to a neural network, so neither technique is used here (Fig. 5).

In the second phase, FastText word embeddings are obtained for the dataset and fed to the neural network. FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It runs on standard, generic hardware, and models can later be reduced in size for use on mobile devices. It is a word embedding method that extends the word2vec model: instead of learning word vectors directly, FastText represents each word as a bag of character n-grams. This helps capture the meaning of shorter words and allows the embeddings to understand suffixes and prefixes. It works on


Fig. 4 Top 10 words in headlines and their count

Fig. 5 Architecture diagram for the proposed model

CPU rather than GPU, which makes our model cost-effective. For this model, the FastText embeddings for text classification released by Facebook are used: 2 million word vectors trained on Common Crawl (600 B tokens). The third phase is model building with the generated embeddings.
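The first-phase cleaning and the FastText idea of character n-grams can be sketched as follows. The stop-word list and the sample headline are invented for illustration; the real system would use a full stop-word list and the pretrained vectors mentioned above.

```python
import re

STOP_WORDS = {"the", "a", "is", "to", "of"}   # illustrative subset only

def preprocess(headline: str) -> list[str]:
    """Lowercase, strip punctuation/extra spaces, tokenize, drop stop words.
    Numbers are deliberately kept, as they are frequent in click-bait."""
    text = re.sub(r"[^\w\s]", " ", headline.lower())
    return [t for t in text.split() if t not in STOP_WORDS]

def char_ngrams(word: str, n: int = 3) -> list[str]:
    """FastText-style character n-grams with < > boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

tokens = preprocess("21 Secrets The Internet Doesn't Want You To Know!")
print(tokens)
print(char_ngrams("secrets"))
```

Because a word's vector is built from its n-grams, even a rare or unseen word still receives an embedding — the drawback of Word2Vec and GloVe that this paper targets.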


Hybrid Bi-Directional LSTM-CNN model: the flowchart for the model is shown in Fig. 6. The FastText word embeddings are first fed to a bi-directional LSTM and then to a CNN, keeping all parameters at their defaults.

Fig. 6 Bi-directional LSTM-CNN model
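The bidirectional pass sketched in Fig. 6 can be illustrated numerically. This is a didactic NumPy sketch of what a bidirectional LSTM layer computes, not the paper's implementation; the sequence length, embedding size and hidden size are invented, and the random weights stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(1)

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: gates computed from input x and previous state h."""
    z = W @ x + U @ h + b                     # stacked pre-activations
    H = h.size
    i, f, o = (1 / (1 + np.exp(-z[k*H:(k+1)*H])) for k in range(3))
    g = np.tanh(z[3*H:])
    c = f * c + i * g                         # gated cell update
    return np.tanh(c) * o, c

def run_lstm(X, W, U, b, hidden):
    h, c = np.zeros(hidden), np.zeros(hidden)
    out = []
    for x in X:
        h, c = lstm_step(x, h, c, W, U, b)
        out.append(h)
    return np.stack(out)

T, emb, hidden = 5, 8, 6
X = rng.normal(size=(T, emb))                 # one embedded headline
params = lambda: (rng.normal(size=(4*hidden, emb)),
                  rng.normal(size=(4*hidden, hidden)),
                  np.zeros(4*hidden))

fwd = run_lstm(X, *params(), hidden)              # left-to-right pass
bwd = run_lstm(X[::-1], *params(), hidden)[::-1]  # right-to-left pass
bi = np.concatenate([fwd, bwd], axis=1)           # (T, 2*hidden) -> fed to the CNN
print(bi.shape)                                   # (5, 12)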


The loss function used here is binary cross-entropy, with the Adam optimizer and a "sigmoid" activation function at the output. Multi-Layer Perceptron model: the MLP is one of the foundational architectures of deep learning. With proper parameters and FastText word embeddings, it can give a considerable result (Fig. 7).

Fig. 7 MLP model

The parameters are kept constant, and the same optimizer, loss and activation functions are used as for the LSTM-CNN model, to achieve the best accuracy.
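The training objective named above — a sigmoid output scored with binary cross-entropy — can be computed by hand. The logit value 2.0 below is an arbitrary example input.

```python
import math

def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))

def binary_cross_entropy(y: int, p: float) -> float:
    """Loss for one headline: y=1 click-bait, p = sigmoid output."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = sigmoid(2.0)                              # confident "click-bait" score
print(round(p, 3))                            # 0.881
print(round(binary_cross_entropy(1, p), 3))   # low loss: correct and confident
print(round(binary_cross_entropy(0, p), 3))   # high loss: wrong and confident
```

The asymmetry is the point of the loss: a confident wrong prediction is penalized far more heavily than a confident correct one, which is what Adam then minimizes over the training set.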

6 Model Evaluation and Result

The last and most crucial phase comprises model evaluation and discussion of the results (Tables 3 and 4).

Table 3 Bi-directional LSTM-CNN model parameters
Total parameters:     7,159,681
Trainable parameters:   218,881

Table 4 MLP model parameters
Total parameters:     6,944,525
Trainable parameters:     3,725

It can be clearly seen that, for both the bi-directional LSTM-CNN and the MLP, the count of trainable parameters is much lower than the total parameter count; the difference corresponds to the pretrained embedding layer, which is kept frozen while the remaining weights are optimized to reduce the model's cost function (Figs. 8, 9 and 10).

Fig. 8 Accuracy graph with FastText embeddings on bi-directional LSTM CNN model

Fig. 9 ROC curve with FastText embeddings on bi-directional LSTM CNN model

The curve indicates an accurate classifier: it shows high precision, denoting a low false-positive rate, and high recall, denoting a low false-negative rate (Figs. 11, 12 and 13 and Tables 5 and 6). From the graphs and performance metrics it is evident that the MLP model performs well and can be relied upon for detecting click-bait. Compared to the


Fig. 10 Precision-recall curve with FastText embeddings on bi-directional LSTM CNN model

Fig. 11 Accuracy graph with FastText embeddings on MLP model

results of the paper [7], where GloVe word embeddings are used and accuracies of 95%, 89% and 94% are obtained on Dataset1, Dataset2 and Dataset3 respectively, and of [8], where methods such as AdaBoost, VGG-16 and SVM obtain 89%, 82% and 88% accuracy and an MLP model achieves 84%, our MLP model with FastText word embeddings achieves 86% accuracy, a 2% improvement. The CNN model in [9] achieves a good 90% accuracy using word2vec embeddings. Comparing against all of these best-performing models, the bi-directional LSTM-CNN model with FastText word embeddings outperforms the other models based on machine-learning algorithms and on embedding techniques such as Word2Vec and GloVe, with a remarkable 99% accuracy.


Fig. 12 ROC curve with FastText embeddings on MLP model

Fig. 13 Precision-recall curve with FastText embeddings on MLP model

Table 5 Performance metrics for the bi-directional LSTM-CNN model (accuracy 99.00)

                 Precision  Recall  F1-score
Click-bait           99.00   98.00     99.00
Non-click-bait       98.00   99.00     99.00


Table 6 Performance metrics for the MLP model (accuracy 86.00)

                 Precision  Recall  F1-score
Click-bait           91.00   80.00     85.00
Non-click-bait       82.00   92.00     87.00
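The per-class metrics in Tables 5 and 6 follow directly from confusion-matrix counts. The counts below are hypothetical, chosen only to roughly reproduce the MLP click-bait row of Table 6; they are not the paper's raw numbers.

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts giving precision ~0.91 and recall ~0.80,
# as in the MLP click-bait row of Table 6.
p, r, f1 = prf(tp=800, fp=80, fn=200)
print(round(p, 2), round(r, 2), round(f1, 2))   # 0.91 0.8 0.85
```

Note that the F1 values in both tables are consistent with their precision/recall pairs (e.g. F1(0.91, 0.80) ≈ 0.85), a quick sanity check on reported results.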

7 Conclusion

To curb the excessive use of click-bait headlines by agencies seeking to earn revenue for themselves while flooding online portals with fake news, novel neural-network architectures have been proposed, and it has been found that word embeddings play a major role in increasing the accuracy and reliability of the system. With FastText word embeddings, the bi-directional LSTM-CNN model performs better than the MLP model and outperforms the other classification algorithms and Word2Vec-based techniques, achieving the highest accuracy so far. However, the job is far from over: future work includes not only detecting click-bait but also blocking it.

Acknowledgements. I sincerely acknowledge the support, guidance and encouragement of my dissertation guide, Associate Professor Dr. Megharani Patil, for this work.

References

1. Adelson P, Arora S, Hara J (2018) Clickbait; didn't read: clickbait detection using parallel neural networks
2. Agrawal A (2016) Clickbait detection using deep learning. In: 2nd international conference on next generation computing technologies
3. Klairith P, Tanachutiwat S (2018) Thai clickbait detection algorithms using natural language processing with machine learning techniques. In: International conference on engineering, applied science and technology
4. Zheng H-T, Chen J-Y, Yao X, Sangaiah AK, Jiang Y, Zhao C-Z (2018) Clickbait convolutional neural network
5. Zannettou S, Chatzis S, Papadamou K, Sirivianos M (2018) The good, the bad and the bait: detecting and characterizing clickbait on YouTube. In: IEEE symposium on security and privacy workshops
6. Pandey S, Kaur G (2018) Curious to click it?—identifying clickbait using deep learning and evolutionary algorithm. In: International conference on advances in computing, communications and informatics (ICACCI)
7. Kaur S, Kumar P, Kumaraguru P (2020) Detecting clickbaits using two-phase hybrid CNN-LSTM biterm model. J Expert Syst Appl
8. Shang L, Zhang D, Wang M, Lai S, Wang D (2019) Towards reliable online clickbait video detection: a content-agnostic approach. J Knowl Based Syst
9. Chakraborty A, Paranjape B, Kakarla S, Ganguly N (2016) Stop clickbait: detecting and preventing clickbaits in online news media. In: IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM)

Data Compression and Transmission Techniques in Wireless Adhoc Networks: A Review

V. Vidhya1(B) and M. Madheswaran2

1 Department of Computer Science, M.E.T Engineering College, Kanyakumari District, India
2 Department of Electrical and Communication Engineering, Muthayammal Engineering College, Namakkal, India

Abstract. A wireless ad hoc network (WANET) is a decentralized network in which all nodes participate in routing by forwarding data for other nodes. Such a network can be deployed in a disaster region to gather patients' data and improve medical services. The health information gathered from these disaster regions is compressed and transmitted through the network for treatment purposes. Several issues arise during this process, including quality of service (QoS), interference between nodes, connectivity, efficient routing, security and authorization, scalability, network topology, network lifetime, battery power consumption and network bandwidth. To overcome these issues, many algorithms have been proposed so far, for instance data compression (DC) algorithms and routing algorithms, yet several issues remain with these existing algorithms. This work reviews some existing algorithms, their advantages and the limitations identified in existing work on DC and transmission, and also provides directions for future study and development.

Keywords: Wireless Adhoc Network (WANET) · Lossless compression · Routing algorithm · Data transmission techniques · Security · Interference · Network connectivity

1 Introduction

A WANET is composed of numerous tiny nodes scattered in the disaster region. The nodes are capable of wirelessly transferring the collected health data to the base stations [1]. A WANET is meant to be deployed in a disaster region to gather patients' data and improve health facilities [2]. In the disaster region model, a disaster situation is divided into different context-based regions (illustrated in Fig. 1), such as the disaster area, the casualty treatment area and the transport zone [3]. Ad hoc network (NET) architectures are also applied in trade and corporate companies to enhance yield and revenue. NETs are divided according to their application into the Mobile Ad hoc NETwork (MANET), a self-managing infrastructure-less network, and the Vehicular Ad hoc NETwork (VANET), which uses moving cars as network nodes [4]. Beyond these, they are also applied for other purposes, for instance in the military arena, at the provincial level, in the industry sector, in Bluetooth, etc. [5]. DC is the procedure of transforming a source data stream into a new data stream, i.e. the compressed stream, that

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_17


Fig. 1 Disaster region model (disaster location with patients, clearing stations, and ambulance parking)

has fewer bits [6]. Common issues identified in NETs include restricted wireless range, hidden terminals, packet losses, route changes, device heterogeneity, battery power constraints, etc. [7]. Further key concerns include energy conservation, unstructured network topology, scalability, low-quality communication, resource-constrained computation, the hidden-node problem, etc. [8]. Finally, a NET has no centralized authority and must contend with QoS, admission control and interference between the nodes [9]. In wireless networks, every node has a direct radio association with other nodes, known as its peers. Before efficient routing, the nodes discover and record the network interface addresses (NIAs) of their neighbours, a step called peer discovery, which is essential in NETs [10]. So far, several techniques have been proposed to resolve these issues, including a dual authentication method that offers a high level of security and prevents unauthorized vehicles from entering the VANET [11]. A dynamic trust prediction model based on the historical behaviour of nodes has been presented and integrated with a source routing method named the Trust-based Source Routing protocol (TSR) [12]. A neighbour-coverage-based probabilistic rebroadcast protocol has been proposed for reducing routing overhead in NETs [13]. A procedure for estimating link availability from signal strength is utilized in Ad-hoc On-demand Distance Vector (AODV) routing: the nodes approximately calculate the link break time and alert the other nodes about impending link discontinuities [14]. A directional routing and scheduling scheme (DRSS) has been presented that enhances energy efficiency while taking congestion, security and delay into consideration [15]. Apart from this, WANETs suffer from DoS attacks, including the black hole attack (BHA), the grayhole attack, the wormhole attack, the byzantine attack, etc. [16].
To overcome the security issues, a hierarchical structure based on risk detection and usage control (UCON) techniques has been presented; the features of UCON tackle ongoing attacks [17]. The paper is structured as follows: Sect. 2 presents the review of recent


works, describing the advantages and limitations of the existing works. Section 3 presents the conclusion of the review, followed by the references.
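The data-compression definition in the introduction (a source stream mapped to a stream with fewer bits) is illustrated by classic LZW, the dictionary coder that the EDLZW scheme reviewed in Sect. 2 extends. The EDLZW details are not given here, so this is plain textbook LZW on an invented sample string.

```python
def lzw_compress(data: str) -> list[int]:
    """Classic LZW: grow a dictionary of seen substrings, emit their codes."""
    table = {chr(i): i for i in range(256)}   # start with single-byte codes
    word, out = "", []
    for ch in data:
        if word + ch in table:
            word += ch                        # extend the current match
        else:
            out.append(table[word])           # emit code for longest match
            table[word + ch] = len(table)     # add new dictionary entry
            word = ch
    if word:
        out.append(table[word])
    return out

sample = "abababab"
codes = lzw_compress(sample)
print(len(sample), "chars ->", len(codes), "codes")   # 8 chars -> 5 codes
```

Repetitive streams (such as periodic biomedical signals) compress well because longer and longer repeated substrings collapse to single codes.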

2 Review on Recent Works for Lossless Compression

A. Medical Data Compression in WANET

Wagh et al. [18] described a WANET-based application deployed in a disaster region. AODV was used as the routing protocol between the nodes; this protocol is independent of any location information. Medical Data of Patient (MDP) was the main entity, consisting of raw patient images such as endoscopic images. These images are large, consume a lot of bandwidth and shorten the lifespan of the network. A Remote Medical Monitoring (RMM) system was introduced for routing the health data in the disaster region; the proposed structure comprised an apparatus that collected, compressed and transmitted the compressed health data to the base station via WANETs.

Dutta [19] discussed WANETs composed of nodes scattered in a disaster region, capable of wirelessly broadcasting the collected health data to the base stations. To tackle this concern, an optimization-based health data compression method that is robust to transmission faults was proposed, together with a fuzzy-logic-based route selection procedure to deliver the compressed data while maximizing WANET lifetime. The technique does not require additional environmental data.

Cho et al. [20] described health information systems that are striving towards monitoring models that care for patients through electrocardiogram (ECG) signals. However, there are limits, for instance data distortion and restricted bandwidth, so this study focused on compression and presented an Encoded Data Lempel–Ziv–Welch (EDLZW) algorithm for ECG data transmission.

B. Security Issues

Nogueira et al. [21] described the rising reliance of citizens on critical applications and wireless networks, which must assure secure and dependable service operation. To maintain the network operations and security needs of critical applications, SAMNAR, a Survivable Ad hoc and Mesh Network Architecture, was presented.
It provides preventive, reactive and tolerant security means for essential services under attack. SAMNAR was used to devise a path selection method for WANETs.

Yao et al. [22] discussed secure routing in a multi-hop WANET in the presence of eavesdroppers. The eavesdroppers' positions were modelled as a homogeneous Poisson point process (PPP), and the source–destination pair was aided by intermediate relays using the decode-and-forward (DF) policy. To assist in finding a secure route, an estimate of the secure connection probability (SCP) was derived, and a Bellman-Ford algorithm was used to discover the best path.

Cai et al. [23] focused on relay transmission for the secure delivery of a private message in the presence of eavesdroppers. The source–destination pair is potentially aided by relays. For an arbitrary relay, exact expressions of the secure connection probability for colluding and non-colluding eavesdroppers were derived. Additionally, lower-bound expressions on


the secure connection probability were obtained. These lower-bound expressions were used to propose a relay selection approach to improve the secure connection probability.

Xu et al. [24] depicted Physical Layer Security (PLS) in wireless communication systems. The combination of PLS and QoS for route selection in multi-hop WANETs remains a technical issue. This work focuses on a multi-hop WANET with two transmission schemes, amplify-and-forward (AF) and decode-and-forward (DF), and explores route selection with consideration of both security and QoS.

Kulkarni et al. [25] discussed knowledge dispatch in dynamic networks, known as the Delay Tolerant Network (DTN). One main concern in this network is security. This work centres on the protection issues associated with DTN routing protocols. Routing in DTN is a concern, resulting in non-operation of routing protocols and an inefficient network; routing is believed to be hard to safeguard against attacks and malicious deeds owing to the absence of central authority in DTN.

Kiskani and Sadjadpour [26] described caching, which intends to store data locally in several nodes inside the network so that contents can be recovered. However, caching in the network has not considered secure storage. Here, a decentralized secure coded caching approach was proposed: nodes broadcast only coded records to evade eavesdroppers' wiretapping and protect the cached contents, and arbitrary vectors are utilized to merge the contents via an XOR operation. The proposed coded caching method was modelled as a Shannon cipher scheme to prove that it attains asymptotic privacy; hence, under the proposed scheme, any content can be retrieved by choosing an arbitrary path.

Xu et al. [27] studied secure optimal QoS routing (SOQR) in WANETs based on PLS techniques.
First, closed-form expressions of the connection outage probability (COP) and secrecy outage probability (SOP) were derived for any given end-to-end (ETE) path. Then, the least COP conditioned on keeping the SOP under a threshold was explored, and the corresponding achievable power allotment was obtained. With the aid of this study of a known path, a further SOQR algorithm was proposed that chooses the safest path between a source and destination node pair.

Zhang et al. [28] addressed the issue of recognizing and isolating malicious nodes that refuse to forward packets in multi-hop NETs. An Audit-based Misbehaviour Detection (AMD) scheme was developed that isolates continuous and selective packet droppers. AMD integrates honest route discovery, reputation management, and misbehaving-node discovery based on behavioural audits. AMD also notices selective dropping attacks when ETE traffic is encrypted, and it is applicable to multi-channel networks.

C. Algorithm for Detecting Attacks in WANET

Shu and Krunz [29] described link faults and malicious packet dropping. In this work an insider attack, whereby malicious nodes (MN) use their knowledge of the communication environment to degrade network performance, was explored. To improve accuracy, the correlation between lost packets was exploited. Moreover, to guarantee honest computation of these correlations, a homomorphic linear authenticator (HLA) based public auditing architecture was developed that permits the detector to verify the honesty of the packet-loss data. A packet-block-based method was also proposed to diminish the computation overhead.
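Several of the secure-routing schemes reviewed above ([22, 27]) reduce path selection to a shortest-path search over per-link security metrics, e.g. with Bellman-Ford. A minimal sketch follows; the topology and edge weights are invented, with lower weight standing in for a better (lower-outage) link.

```python
import math

# Invented topology: in [22] edge weights would be derived from per-link
# secure connection probabilities; these numbers are purely illustrative.
edges = [("S", "A", 0.3), ("S", "B", 0.9), ("A", "D", 0.4), ("B", "D", 0.2)]
nodes = {"S", "A", "B", "D"}

dist = {v: math.inf for v in nodes}
pred = {v: None for v in nodes}
dist["S"] = 0.0
for _ in range(len(nodes) - 1):          # relax all edges |V|-1 times
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            dist[v], pred[v] = dist[u] + w, u

# Reconstruct the best S -> D path from the predecessor map
path, v = [], "D"
while v is not None:
    path.append(v)
    v = pred[v]
print(path[::-1], round(dist["D"], 1))   # ['S', 'A', 'D'] 0.7
```

Swapping in outage-derived weights (e.g. negative log of a link's secure connection probability) turns this generic routine into a secrecy-aware route selector.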


Baadache and Belmehdi [30] described how, in a multi-hop WANET, nodes not in direct range rely on intermediate nodes to communicate. To preserve its restricted resources, an intermediate node may drop packets instead of forwarding them onward. In this work the black hole attack was addressed by proposing an authenticated end-to-end acknowledgement-based method.

Lee et al. [31] depicted the routing misbehaviour of WANET nodes that do not forward messages properly. Once such an attack commences, nodes in the network are unable to dispatch messages. To spot the attack, a watchdog mechanism was previously employed, but it is inefficient and has a high false-positive ratio. Here, a mechanism based on routing misconduct was proposed that detects attacks and solves the issues in the existing watchdog.

Soleimani and Kahvand [32] noted that, since WANETs have no central infrastructure and administration, they are susceptible to numerous security threats. Malicious packet dropping is an attack in these networks in which an adversary node attempts to drop packets instead of forwarding them to the next hop. In this work, a dynamic trust model was proposed to protect the network against this attack: a node initially trusts all immediate neighbours and, on receiving feedback from neighbours, revises the corresponding trust value.

Koh et al. [33] presented a phantom-receiver-based routing scheme to improve the anonymity of every source–destination pair while incurring an adjustable overhead. It combines conventional network policy with opportunistic routing to preserve confidentiality and mitigate vulnerability, and the destination is allowed to secretly return a response to the source.

Chen et al. [34] described how the infrastructure-less character of WANETs exposes them to diverse attacks. User authentication is the first line of defence in a network; mutual trust is accomplished by a protocol that enables the corresponding parties to verify each other and to exchange session keys.
Thus, a user authentication scheme based on the self-certified public key scheme and elliptic curve cryptography was proposed, achieving collaborative user verification and a secure session key agreement.

Sneha and Jose [35] addressed packet loss caused by malicious packet dropping. It is hard to separate packet loss owing to link faults from malicious dropping. Mechanisms exist that spot malicious dropping through the correlation between packets. An auditing scheme based on a homomorphic linear authenticator was employed to guarantee verification of packet acknowledgements, and a reputation method based on indirect reciprocity was employed to encourage the forwarding of packets at each node.

Shrivastava and Verma [36] portrayed the infrastructure-less character of NETs, which raises concerns such as routing, security and the MAC layer, with security a major one. Vampire attacks (VA) alter targeted packets by arranging extended routes or misguiding the packets; malicious nodes employ fake messaging that affects bandwidth and node battery power. To defend routing and network resources against vampire attacks, a method was proposed that employs packet screening to discover malicious routing packets.


D. Algorithm for Detecting Vampire Attacks

Vasserman and Hopper [37] focused on security work that has paid attention to denial of communication at the routing level. The paper explored resource exhaustion attacks at the routing protocol layer, which permanently disable networks by draining nodes' battery power. These vampire attacks are not protocol-specific but rely on properties of many classes of routing protocols. It was found that all examined protocols were vulnerable to vampire attacks, which are devastating, hard to detect, and easy to carry out by as little as a single malicious insider sending only protocol-compliant messages.

Umakanth and Damodhar [38] built on increasingly accepted ideas from wireless communications. Routing protocols, even those intended to be secure, need protection from vampire attacks. An Energy Weighted Monitoring Algorithm (EWMA) was proposed to mitigate the damage caused by a vampire in the packet-forwarding stage.

Vijayanand and Muralidharan [39] discussed ad hoc sensor wireless networks, where security work has focused on denial of communication at the routing and admission-control levels. Attacks centred on the routing protocol layer are recognized as resource exhaustion attacks; they continually stall the network by exhausting the nodes' battery power. These vampire attacks do not target any specific protocol type, and detecting them is not simple: a single vampire in the network increases network-wide energy consumption.

Rajipriyadharshini et al. [40] described energy wastage at sensor nodes. Wireless sensor networks (WSN) require solutions for saving energy. A vampire attack at the network layer leads to resource exhaustion at the sensor nodes by draining battery power; it broadcasts a few protocol-compliant messages to stall an entire network, making it hard to identify and avoid.
Existing protocols are not centred on this vampire behaviour at the routing layer; two such attacks exist, namely the carousel and stretch attacks. A new PLGP protocol, a valuable and secure protocol, was proposed together with a key administration protocol, the elliptic Diffie-Hellman key exchange protocol, to evade vampire attacks.

Vasserman and Hopper [37] depicted WSNs across the globe as a means of communication. They contain nodes that operate as transmitters and receivers and are prone to attacks leading to losses. The resource exhaustion vampire attack drains out the energy. These attacks are protocol-compliant, so they are effortless to apply, and since they are orthogonal to protocol design they can infiltrate any routing protocol, distressing the whole network by causing energy loss. The proposed method detects the occurrence of a vampire attack from the energy consumption [60].

Kwon et al. [41] studied secure communication through the joint lens of game theory and stochastic geometry, where legitimate transmitters and eavesdroppers are dispersed. Two scenarios were considered for the Eve tier's strategy: (I) the Eve tier activates nodes to maximally eavesdrop on the secret messages of the Alice tier, and (II) the Eve tier activates just a few nodes to maximize its energy efficiency (EE) in eavesdropping according to the Alice-tier node deployment. In scenario I, an iterative optimization scheme was proposed that maximizes the secrecy EE of the Alice tier by calculating the node activation probability, the secret message rate, the redundancy rate, and the count of active antennas. In scenario II, an EE node-activation game was set up between the Alice tier and the Eve tier,


where the former and the latter manage their node-activation probabilities to maximize the secrecy EE and the eavesdropping EE respectively.

E. Security Issues in MANET

Turkanović et al. [42] discussed how the distinguishing features of MANETs, with dynamic topology and an open wireless medium, may leave MANETs suffering from security vulnerabilities. A trust management scheme was proposed that enhances security. The trust framework has two components: trust from direct and from indirect observation. With direct observation, the trust value is derived by Bayesian inference; with indirect observation, the trust value is derived via the Dempster-Shafer theory (DST).

Movahedi et al. [43] surveyed trust management for conducting nodes' transactions and establishing management interactions in MANETs. The lack of central administration, strict resource limitations and network dynamics make trust management a risky mission. In this work, diverse trust management frameworks geared towards MANETs were presented that withstand attacks misleading honest trust calculation in order to misdirect trust-based network actions. Moreover, a holistic classification was proposed; for each framework, an approach was employed to explain the trust model, taking all components of trust management as a guideline.

Dhananjayan and Subbiah [44] depicted secure data transport against malicious attacks in MANET. The need for nodes' positional data updates in the AODV protocol implies a low trust level between the nodes. A trust-aware ad-hoc routing (T2AR) protocol was given to improve the trust level among the nodes. This method modifies the usual AODV routing protocol with constraints of trust rate, energy and mobility-based malicious-behaviour forecasting. Matching the packet sequence ID against the record information of neighbour nodes determines the trust rate, preventing the creation of malicious records.

Muthuramalingam and Suba Nachiar [45] described trust-based models that provide security.
Here, two schemes, direct and indirect observation-based trust assessment, were proposed. Initially, the network is created to examine security. In the direct observation scheme, the total probability model with Bayesian inference evaluates the trust from the observer node; in the indirect observation scheme, the neighbour-hop data is employed in deriving the trust value. The Dempster-Shafer theory calculates the trust rate following the observation schemes. Finally, Dijkstra's algorithm launches the routing procedure based on the shortest path.

Ahmed et al. [46] debated secure data distribution, which is a tricky job in MANET. This work proposed a Flooding Factor based Framework for Trust Management (F3TM). True flooding is exploited to recognize attacker nodes based on the computation of trust values. A route discovery algorithm was developed to determine a data path for sending, using an experimental Grey Wolf algorithm for authenticating network nodes, and Improved Multi-Swarm Optimization was employed to optimize the delivery path.

Ullah et al. [47] figured trust recommendations as playing a critical part in the calculation of trust and confidence in peer-to-peer (P2P) environments. Mitigation of dishonest trust recommendations is recognized as a challenging concern in P2P systems (especially in MANET). To satisfy these concerns, "intelligent Selection of Trust Recommendations based on


Dissimilarity factor (iSTRD)" was devised. iSTRD exploits the personal experience of an "evaluating node" in combination with the majority vote of the recommenders.

Wei et al. [48] proposed a trust management scheme that enhances MANET security. The proposed scheme has two mechanisms: trust from direct and from indirect observation. With direct observation from an observer node, the trust value is derived by Bayesian inference; with indirect observation, the trust value is derived by the DST.

Singh et al. [49] debated routing protocol design with energy efficiency and security. To conquer this challenge, an energy-efficient secured routing protocol was proposed. To offer safety for links and messages without relying on an intermediary, security is provided through a secure Optimized Link State Routing protocol. Every node selects multipoint relay nodes among its one-hop neighbour set, so as to reach all two-hop neighbours. The admission-rule unit approves nodes proclaiming node discovery. After selecting the link, when a fresh route is needed, a node checks the power status of nodes in its routing table and accordingly initiates a route, then performs group key allocation using the produced keys. The group key can be changed periodically to exclude non-authorized nodes, providing message privacy for both sender and receiver.

Poongodi and Karthikeyan [50] depicted the black hole attack, which captures the route from source to target by sending a reply with the highest sequence number and smallest hop count. Here, a Localized Secure Architecture for MANET (LSAM) routing protocol was proposed to discover and avoid cooperative black hole attacks. Security Monitoring Nodes (SMNs) are triggered only if the threshold rate is surpassed; if malicious nodes are identified, other SMNs in their proximity act to isolate the malicious nodes.

Kaliappan and Paramasivan [51] discussed the exposure of MANETs to different security attacks.
Because of the lack of central management, secure routing is difficult in MANETs. Game theory was employed as a means to analyze, formulate and solve selfishness concerns in MANETs. This work used a Dynamic Bayesian Signalling game to analyze the strategy profile of regular mobile nodes (MNs); the payoff to nodes was calculated so as to discourage the particular nodes involved in misbehavior. Regular nodes continuously monitor their neighbours through belief evaluation and the trust-update method of Bayes' rule.
F. Security Issues in Ad Hoc Networks
Tan et al. [52] discussed that ad hoc networks suffer from diverse attacks in the data plane. To secure the data plane, a trust management system was proposed. Fuzzy logic (FL) was employed to handle vague observational data and then estimate the path trust value; FL with graph theory was applied to construct a trust model for computing the node trust value. To protect against attacks on the trust management system itself, a filtering algorithm was proposed. A trust-decay scheme was also designed to resolve the question of decaying historical trust values in trust-based routing decisions. In addition, a trust-factor collection approach was proposed to guarantee that the trust management method is compatible with other security primitives. Finally, the proposed trust management method was incorporated into the Optimized Link State Routing protocol.
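The direct-observation trust described in [48] and the Bayes-rule trust update in [51] are commonly realized with a Beta-distribution posterior over a neighbour's forwarding behaviour. The sketch below is illustrative only; the function and variable names are assumptions, not taken from the surveyed papers.

```python
# Sketch: direct-observation trust via Bayesian inference with a Beta prior.
# An observer counts cooperative (s) and misbehaving (f) interactions of a
# neighbour; the posterior Beta(s + 1, f + 1) mean serves as the trust value.
# All names are illustrative, not from the surveyed papers.

def beta_trust(successes: int, failures: int) -> float:
    """Posterior mean of Beta(successes + 1, failures + 1)."""
    return (successes + 1) / (successes + failures + 2)

def update(counts: dict, node: str, cooperative: bool) -> dict:
    """Record one observed interaction for a neighbour node."""
    s, f = counts.get(node, (0, 0))
    counts[node] = (s + 1, f) if cooperative else (s, f + 1)
    return counts

counts = {}
for outcome in [True, True, False, True]:   # observed forwarding behaviour
    update(counts, "n1", outcome)
print(beta_trust(*counts["n1"]))            # 3 successes, 1 failure -> 4/6
```

New neighbours start at the uninformative prior value 0.5, and trust converges toward the observed cooperation rate as evidence accumulates.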


V. Vidhya and M. Madheswaran

Xu et al. [53] addressed physical layer security (PLS) in ad hoc networks. To concentrate on these concerns, the work explores PLS-aware routing and performance tradeoffs in a multi-hop ad hoc network. For a single end-to-end path, its connection outage probability (COP) and secrecy outage probability (SOP) are first obtained in closed form, providing performance metrics of transmission QoS and transmission security. Based on the closed-form expressions, the security-QoS tradeoff of minimizing COP conditioned on a guaranteed SOP was studied. Adams and Bhargava [54] focused on a friendly-jamming PLS method that uses otherwise available nodes to jam any eavesdroppers. The work considers employing extra available nodes as friendly jammers to improve route security, formulating routing in a device-to-device (D2D) network as jointly minimizing the SOP and COP, with friendly jamming used to improve the SOP of each link. The jamming powers were determined so as to put nulls at friendly receivers while maximizing power at the eavesdroppers. The problem was then cast as a convex optimization problem, and an auxiliary variable was introduced to balance the optimization between the two metrics.
G. Security Issues in VANETs
Faghihniya et al. [55] noted that security plays a critical role in VANETs. The use of broadcast packets in the AODV route discovery phase makes it extremely vulnerable to DoS and RREQ flooding attacks. The method proposed here is Balanced AODV (B-AODV), which assumes all network nodes act normally; nodes that depart from regular behaviour are recognized as malicious nodes (MNs). B-AODV was designed with the following characteristics: (1) use of an adaptive threshold depending on network circumstances and node behaviour; (2) no additional routing packets used to verify MNs; (3) detection and avoidance performed autonomously at each node; (4) detection and avoidance performed in real time; (5) no requirement for promiscuous mode. This technique for detecting and preventing the flooding attack utilizes the average and standard deviation. Krundyshev et al.
[56] discussed information security problems in VANETs, examining routing attacks on dynamic networks in which a malicious node supplies fake routing data by advertising itself as holding the shortest path to the source node and then dropping the traffic from the source node. A method for VANET security based on artificial-intelligence swarm algorithms was proposed. Li and Song [57] discussed the security and privacy challenges posed by VANETs. The trustworthiness of VANETs was improved by tackling both data trust and node trust. An attack-resistant trust management scheme (ART) was proposed that detects and copes with malicious attacks and also evaluates the trustworthiness of data and mobile nodes in VANETs. Data trust was assessed from data sensed and gathered from numerous vehicles; node trust was measured in two dimensions. Mokdad et al. [58] noted that VANETs, as ad hoc networks with changing topology, complicate resource management and open several security threads, particularly at the physical and MAC layers, which are more vulnerable. It is therefore challenging to discern, when broadcast data are not delivered to the target, whether this is due to an attack. In this work, an algorithm named DJAVAN (Detecting Jamming Attacks in VANETs) was proposed to identify a jamming attack.
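The B-AODV adaptive threshold above flags a node as a flooder when its RREQ rate exceeds a statistic of observed behaviour. A minimal sketch of one plausible mean-plus-k-standard-deviations rule follows; the threshold factor, counts, and node names are illustrative assumptions, not B-AODV's actual parameters.

```python
# Sketch: adaptive-threshold flooding detection in the spirit of B-AODV.
# A node is flagged as malicious when its RREQ count exceeds
# mean + k * stddev over all observed nodes in the current window.
# The factor k and the sample counts are illustrative assumptions.
import statistics

def flag_flooders(rreq_counts: dict, k: float = 1.5) -> set:
    rates = list(rreq_counts.values())
    mean = statistics.mean(rates)
    std = statistics.pstdev(rates)          # population std of the window
    threshold = mean + k * std
    return {node for node, r in rreq_counts.items() if r > threshold}

observed = {"v1": 4, "v2": 5, "v3": 6, "v4": 5, "attacker": 40}
print(flag_flooders(observed))              # -> {'attacker'}
```

Because the threshold adapts to the observed distribution, no fixed rate limit needs to be configured, matching characteristic (1) of B-AODV.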


He et al. [59] argued that secure communication in VANETs must be guaranteed. The conditional privacy-preserving authentication (CPPA) scheme is appropriate for resolving security and privacy-preserving issues, since it tackles joint verification and privacy protection. To attain improved performance and reduced computational complexity, designing a CPPA scheme without bilinear pairing for the VANET environment becomes a challenge. To tackle this, a CPPA scheme without bilinear pairing was proposed for VANETs (Table 1).

Table 1 Description of existing medical data compression

Technique | Merits | Demerits
Remote Medical Monitoring (RMM) system [18] | Energy efficiency | Security
Fuzzy logic based route selection technique [19] | Robust, high network lifetime | Security, hardware implementation
Survivable ad hoc and mesh network architecture [21] | Security, low cost | Network performance, survivability
Secure connection probability [22] | Secure routing | Benchmark

3 Conclusion
This paper presented a review of the data compression (DC) and transmission techniques in ad hoc networks. Several issues that occur in such networks, such as connectivity, DC and data transmission, security and authorization, energy, lifetime, power, QoS, network bandwidth, and routing issues, are analyzed and reviewed, and the advantages and limitations of the surveyed works are discussed. This review thus aids the research community in tackling the issues identified in WANETs. Finally, a few remarks are given on the open challenges that have not been tackled or where more research is required; future work will focus on the issues identified here.
Acknowledgements. I thank God almighty for giving me the research opportunity. I thank my guide Dr. M. Madheswaran for his constant assistance towards the improvement of my research activity. I thank my family, friends and colleagues for their continuous support.

Conflict of Interest. The authors declare that there is no conflict of interest and that no funding was received.



References
1. Reina DG, Toral SL, Johnson P, Barrero F (2015) A survey on probabilistic broadcast schemes for wireless ad hoc networks. Ad Hoc Netw 25:263–292
2. Reina DG, Askalani M, Toral SL, Barrero F, Asimakopoulou E, Bessis N (2015) A survey on multihop ad hoc networks for disaster response scenarios. Int J Distrib Sens Netw 11(10):647037
3. Mutahara M, Haque A, Shah Alam Khan M, Warner JF, Wester P (2016) Development of a sustainable livelihood security model for storm-surge hazard in the coastal areas of Bangladesh. Stochastic Environ Res Risk Assess 30(5):1301–1315
4. Helen D, Arivazhagan D (2014) Applications, advantages and challenges of ad hoc networks. JAIR 2(8):453–457
5. Qiu T, Chen N, Li K, Qiao D, Fu Z (2017) Heterogeneous ad hoc networks: architectures, advances and challenges. Ad Hoc Netw 55:143–152
6. Razzaque MA, Bleakley C, Dobson S (2013) Compression in wireless sensor networks: a survey and comparative evaluation. ACM Trans Sens Netw (TOSN) 10(1):5
7. Mokhtar B, Azab M (2015) Survey on security issues in vehicular ad hoc networks. Alex Eng J 54(4):1115–1126
8. Al-Sultan S, Al-Doori MM, Al-Bayatti AH, Zedan H (2014) A comprehensive survey on vehicular ad hoc network. J Netw Comput Appl 37:380–392
9. Liang W, Li Z, Zhang H, Wang S, Bie R (2015) Vehicular ad hoc networks: architectures, research issues, methodologies, challenges, and trends. Int J Distrib Sens Netw 11(8):745303
10. Zhang L, Luo J, Guo D (2013) Neighbor discovery for wireless networks via compressed sensing. Perform Eval 70(7–8):457–471
11. Vijayakumar P, Azees M, Kannan A, Deborah LJ (2016) Dual authentication and key management techniques for secure data transmission in vehicular ad hoc networks. IEEE Trans Intell Transp Syst 17(4)
12. Xia H, Jia Z, Li X, Ju L, Sha EH-M (2013) Trust prediction and trust-based source routing in mobile ad hoc networks. Ad Hoc Netw 11(7):2096–2114
13. Zhang XM, Wang EB, Xia JJ, Sung DK (2013) A neighbor coverage-based probabilistic rebroadcast for reducing routing overhead in mobile ad hoc networks. IEEE Trans Mob Comput 12(3):424–433
14. Yadav A, Singh YN, Singh RR (2015) Improving routing performance in AODV with link prediction in mobile adhoc networks. Wirel Pers Commun 83(1):603–618
15. Zeng Y, Xiang K, Li D, Vasilakos AV (2013) Directional routing and scheduling for green vehicular delay tolerant networks. Wireless Netw 19(2):161–173
16. Prabu M, Vijaya Rani S, Santhosh Kumar R, Venkatesh P (2015) DoS attacks and defenses at the network layer in ad-hoc and sensor wireless networks, wireless ad-hoc sensor networks: a short survey. Eur J Appl Sci 7(2):80–85
17. Wu J, Ota K, Dong M, Li C (2016) A hierarchical security framework for defending against sophisticated attacks on wireless sensor networks in smart cities. IEEE Access 4(4):416–424
18. Wagh J, Bhatt A, Wayachal R, Ghate S, Petkar A, Joshi S (2016) MDP: medical data of patients management using wireless ad-hoc network-WANET. Int J Eng Sci 3632
19. Dutta T (2015) Medical data compression and transmission in wireless ad hoc networks. IEEE Sens J 15(2):778–786
20. Cho G-Y, Lee S-J, Lee T-R (2015) An optimized compression algorithm for real-time ECG data transmission in wireless network of medical information systems. J Med Syst 39(1):161
21. Nogueira M, Silva H, Santos A, Pujolle G (2012) A security management architecture for supporting routing services on WANETs. IEEE Trans Netw Serv Manage 9(2):156–168


22. Yao J, Feng S, Zhou X, Liu Y (2016) Secure routing in multihop wireless ad-hoc networks with decode-and-forward relaying. IEEE Trans Commun 64(2):753–764
23. Cai C, Cai Y, Zhou X, Yang W, Yang W (2014) When does relay transmission give a more secure connection in wireless ad hoc networks? IEEE Trans Inf Forensics Secur 9(4):624–632
24. Xu Y, Liu J, Shen Y, Jiang X, Taleb T (2016) Security/QoS-aware route selection in multi-hop wireless ad hoc networks. In: 2016 IEEE international conference on communications (ICC). IEEE, pp 1–6
25. Kulkarni AA, Shinde SM (2018) Attacker and different security scheme in delay tolerant wireless ad hoc network. 5(7):857–859
26. Kiskani MK, Sadjadpour HR (2017) A secure approach for caching contents in wireless ad hoc networks. IEEE Trans Vehic Technol 66(11):10249–10258
27. Xu Y, Liu J, Takahashi O, Shiratori N, Jiang X (2017) SOQR: secure optimal QoS routing in wireless ad hoc networks. In: Wireless communications and networking conference (WCNC). IEEE, pp 1–6
28. Zhang Y, Lazos L, Kozma W (2016) AMD: audit-based misbehavior detection in wireless ad hoc networks. IEEE Trans Mob Comput 15(8):1893–1907
29. Shu T, Krunz M (2015) Privacy-preserving and truthful detection of packet dropping attacks in wireless ad hoc networks. IEEE Trans Mob Comput 14(4):813–828
30. Baadache A, Belmehdi A (2014) Struggling against simple and cooperative black hole attacks in multi-hop wireless ad hoc networks. Comput Netw 73:173–184
31. Lee G, Kim W, Kim K, Oh S, Kim D (2015) An approach to mitigate DoS attack based on routing misbehavior in wireless ad hoc networks. Peer-to-Peer Netw Appl 8(4):684–693
32. Soleimani MT, Kahvand M (2014) Defending packet dropping attacks based on dynamic trust model in wireless ad hoc networks. In: 2014 17th IEEE mediterranean electrotechnical conference (MELECON). IEEE, pp 362–366
33. Koh JY, Teo JCM, Leong D, Wong W-C (2015) Reliable privacy-preserving communications for wireless ad hoc networks. In: 2015 IEEE international conference on communications (ICC). IEEE, pp 6271–6276
34. Chen H, Ge L, Xie L (2015) A user authentication scheme based on elliptic curves cryptography for wireless ad hoc networks. Sensors 15(7):17057–17075
35. Sneha CS, Jose B (2016) Detecting packet dropping attack in wireless ad hoc network. Int J Cybern Inform (IJCI) 5:118–124
36. Shrivastava A, Verma R (2015) Detection of vampire attack in wireless ad-hoc network. Int J Softw Hardw Res Eng 3(01):43–48
37. Vasserman EY, Hopper N (2013) Vampire attacks: draining life from wireless ad hoc sensor networks. IEEE Trans Mob Comput 12(2):318–332
38. Umakanth B, Damodhar J (2013) Detection of energy draining attack using EWMA in wireless ad hoc sensor networks. Int J Eng Trends Technol (IJETT) 4(8)
39. Vijayanand G, Muralidharan R (2014) Overcome vampire attacks problem in wireless ad-hoc sensor network by using distance vector protocols. Int J Comput Sci Mob Appl 2(1):115–120
40. Rajipriyadharshini P, Venkatakrishnan V, Suganya S, Masanam A (2014) Vampire attacks deploying resources in wireless sensor networks. Int J Comput Sci Inform Technol (IJCSIT) 5(3):2951–2953
41. Kwon Y, Wang X, Hwang T (2017) A game with randomly distributed eavesdroppers in wireless ad hoc networks: a secrecy EE perspective. IEEE Trans Vehic Technol 66(11):9916–9930
42. Turkanović M, Brumen B, Hölbl M (2014) A novel user authentication and key agreement scheme for heterogeneous ad hoc wireless sensor networks, based on the Internet of Things notion. Ad Hoc Netw 20:96–112


43. Movahedi Z, Hosseini Z, Bayan F, Pujolle G (2016) Trust-distortion resistant trust management frameworks on mobile ad hoc networks: a survey. IEEE Commun Surv Tutor 18(2):1287–1309
44. Dhananjayan G, Subbiah J (2016) T2AR: trust-aware ad-hoc routing protocol for MANET. Springerplus 5(1):995
45. Muthuramalingam S, Suba Nachiar T (2016) Enhancing the security for manet by identifying untrusted nodes using uncertainity rules. Indian J Sci Technol 9(4)
46. Ahmed MN, Abdullah AH, Chizari H, Kaiwartya O (2017) F3TM: flooding factor based trust management framework for secure data transmission in MANETs. J King Saud Univ-Comput Inform Sci 29(3):269–280
47. Ullah Z, Islam MH, Khan AA, Sarwar S (2016) Filtering dishonest trust recommendations in trust management systems in mobile ad hoc networks. Int J Commun Netw Inform Secur (IJCNIS) 8(1)
48. Wei Z, Tang H, Richard Yu F, Wang M, Mason PC (2014) Security enhancements for mobile ad hoc networks with trust management using uncertain reasoning. IEEE Trans Vehic Technol 63(9):4647–4658
49. Singh T, Singh J, Sharma S (2017) Energy efficient secured routing protocol for MANETs. Wireless Netw 23(4):1001–1009
50. Poongodi T, Karthikeyan M (2016) Localized secure routing architecture against cooperative black hole attack in mobile ad hoc networks. Wirel Pers Commun 90(2):1039–1050
51. Kaliappan M, Paramasivan B (2015) Enhancing secure routing in mobile ad hoc networks using a dynamic Bayesian signalling game model. Comput Electr Eng 41:301–313
52. Tan S, Li X, Dong Q (2016) A trust management system for securing data plane of ad-hoc networks. IEEE Trans Vehic Technol 65(9):7579–7592
53. Xu Y, Liu J, Shen Y, Jiang X, Shiratori N (2017) Physical layer security-aware routing and performance tradeoffs in ad hoc networks. Comput Netw 123:77–87
54. Adams M, Bhargava VK (2017) Using friendly jamming to improve route security and quality in ad hoc networks. In: 2017 IEEE 30th Canadian conference on electrical and computer engineering (CCECE). IEEE, pp 1–6
55. Faghihniya MJ, Hosseini SM, Tahmasebi M (2017) Security upgrade against RREQ flooding attack by using balance index on vehicular ad hoc network. Wirel Netw 23(6):1863–1874
56. Krundyshev V, Kalinin M, Zegzhda P (2018) Artificial swarm algorithm for VANET protection against routing attacks. In: 2018 IEEE industrial cyber-physical systems (ICPS). IEEE, pp 795–800
57. Li W, Song H (2016) ART: an attack-resistant trust management scheme for securing vehicular ad hoc networks. IEEE Trans Intell Transp Syst 17(4):960–969
58. Mokdad L, Ben-Othman J, Nguyen AT (2015) DJAVAN: detecting jamming attacks in vehicle ad hoc networks. Perform Eval 87:47–59
59. He D, Zeadally S, Xu B, Huang X (2015) An efficient identity-based conditional privacy-preserving authentication scheme for vehicular ad hoc networks. IEEE Trans Inform Forensics Secur 10(12):2681–2691
60. Anand J, Sivachandar K (2014) Vampire attack detection in wireless sensor network. Int J Eng Sci Innov Technol (IJESIT) 3

Message Propagation in Vehicular Ad Hoc Networks: A Review

G. Jeyaram1(B) and M. Madheswaran2

1 Department of Computer Science and Engineering, M.E.T Engineering College, Kanyakumari District, India
2 Department of Electrical and Communication Engineering, Muthayammal Engineering College, Namakkal, India

Abstract. In the near future, it is envisioned that vehicular ad hoc networks (VANETs) will utilize long-distance communication technologies, for instance cellular networks and Worldwide Interoperability for Microwave Access (WiMAX), to obtain prompt web access for enabling media exchange between vehicles and fixed roadside infrastructure. In addition, VANETs will utilize short-distance communication technologies, for instance Wireless Fidelity (Wi-Fi) and Dedicated Short-Range Communications (DSRC), to perform short-range communication among vehicles in an ad hoc manner. In this process a few issues are identified: the difficulty of guaranteeing delivery to all nodes, traffic, group enablement, communication authentication, low computational cost, and better dependability; furthermore, several issues exist with the current techniques. This work surveys some current algorithms, their advantages and limitations related to the existing works in information compression and transmission, and also gives directions for future research and improvement.

Keywords: Wireless fidelity · MANET · GPS · Propagations-GEDIR

1 Introduction
VANETs have attracted the attention of numerous researchers for their capability of providing diverse uses, for instance vehicle-safety-related services (Zhou et al. [1]). VANETs are a particular sort of short-range wireless communication mobile ad hoc network (MANET). The vehicles in VANETs act as nodes, connecting to one another to form a network; each participating vehicle has computing capabilities (Oubbati et al. [2]). Numerous vehicle manufacturers have equipped their vehicles with Global Positioning Systems (GPS) and wireless communication devices [3]. The number of vehicles has increased exponentially everywhere on the planet, causing serious traffic congestion. The congestion may result in an increase in street accidents and in the reaction times of emergency vehicles. Typically, emergency vehicles use sirens and lights to warn road users of their presence and approach. Pedestrians and vehicle drivers may wrongly interpret the approach direction of the emergency vehicle because of echo signals, thereby moving into its way and exacerbating
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_18


the traffic situation. In any case, even when they correctly interpret the direction of the emergency vehicle, they may be in heavy traffic congestion or waiting at an intersection and thus not in a position to give way [4]. Traffic accidents and jams cause deaths and waste fuel and productive hours [1]. These statistics could be reduced by spreading upcoming traffic data in a timely way using automated processes in vehicular ad hoc networks (VANETs) [5]. One of the most important objectives of VANETs is to provide safety applications to travelers. In addition, VANETs provide comfort applications to users (e.g., mobile internet access and weather data) [3]. Vehicles as mobile hosts can connect with one another directly if and only if their Euclidean distance is not longer than the radio propagation range; each vehicle is thereby treated as a host as well as a router. Using DSRC enables a wide assortment of driver-assistance applications such as vehicle-to-vehicle (V2V) and vehicle-to-roadside (VRC) messaging of traffic and accident data, as well as permitting timely and smart communication to improve road safety and traffic flow [6]. The key technologies for VANETs, called V2V communications, include vehicle networking and additional communication procedures. Power control is the means of maintaining good network links among devices in VANETs. However, unlike conventional mobile ad hoc networks, VANETs have many unique characteristics, for instance distribution, arbitrary node mobility, time-space uncertainty of broadcast, and obstruction [7]. Routing in VANETs is a challenging endeavor on account of rapid mobility and dynamic network topologies.
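The Euclidean-distance connectivity rule above (two vehicles communicate directly iff their distance does not exceed the radio range) can be sketched as a neighbour-set computation; the positions and range value below are illustrative assumptions.

```python
# Sketch: V2V connectivity by the Euclidean-distance rule described above.
# Two vehicles can communicate directly iff their distance does not exceed
# the radio propagation range. Positions and range are illustrative.
import math

def neighbours(positions: dict, node: str, radio_range: float) -> set:
    """Return the set of vehicles within radio range of `node`."""
    x0, y0 = positions[node]
    return {
        other for other, (x, y) in positions.items()
        if other != node and math.hypot(x - x0, y - y0) <= radio_range
    }

pos = {"a": (0, 0), "b": (120, 0), "c": (400, 0)}
print(neighbours(pos, "a", 250.0))   # -> {'b'}
```

A multi-hop path exists only along chains of such neighbour relations, which is why vehicles act as routers as well as hosts.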
Low dissemination latency and high dissemination ratio are two key targets in the design of routing schemes in VANETs, where the dissemination latency determines the time it takes for a message to reach the destination and the delivery ratio is the ratio of the number of successfully delivered messages to the total number of messages [8]. Recent Intelligent Transportation System (ITS) applications can be divided into significant classes, namely on-road travel safety applications and infotainment applications. Traffic safety applications comprise traffic monitoring through cooperative messaging, blind-curve road warning, collision prevention through automated electronic braking, on-road real-time traffic-based traffic light actuation, and traffic light information in read mode [9]. A fuzzy-based trust prediction method has been implemented for route development in vehicular ad hoc networks: each vehicle models the trustworthiness of its neighbours to select the relay nodes that can be used for information transmission and to elect the fitting route for routing in the vehicular ad hoc network environment [10].
A. Applications of VANETs
VANETs are characterized by a collection of promising functions and services. A few technical reports established by standards bodies and industry groups list many applications that will eventually be deployed. We recognize three classes of applications: safety-related, traffic-related, and infotainment. These classes are described as follows.


(1) Safety-Related Applications: The reduction in the number of individuals injured or killed on the streets is one of the primary motivations for the advancement and investigation of VANETs. This class holds every application that aims to improve road safety. These applications are intended to extend the driver's view of the surroundings, providing a driving aid: the driver can accordingly predict and react to make the driving experience more secure. For instance, it may be announced that a vehicle has run a red traffic light or that a pedestrian is crossing the street. (2) Traffic-Related Applications: This class comprises applications that use inter-vehicle communication to share traffic data among vehicles so as to augment the driver experience and streamline the traffic flow. Diverse scenarios can be conceived for this category, such as cooperation among vehicles to assist the passage of emergency vehicles. (3) Infotainment Applications: This class comprises all applications that furnish drivers with information, entertainment and advertisements on their journey, like convention data services, Internet access, and video streaming and file distribution. As they provide luxury services to drivers, these applications are not delay-sensitive and can tolerate postponement. Through contact-based management in vehicle networking, vehicles collaborating among themselves form amicable lanes to stay away from accidents.

2 Survey on Recent Works for Vehicular Ad Hoc Networks
A. Review on Message Propagation in VANETs
Zhou et al. [1] noted that VANETs employ periodic broadcast to disseminate warning messages. Broadcasting raises the issue of how to find the appropriate message duration for the spread of a warning message. To tackle this issue, the work adopts the following stages to examine the event-driven warning message propagation process and propose an appropriate communication period for the vehicle-to-vehicle (V2V) network. First, an analytical model was proposed to explore the event-driven warning message propagation process in a connected network, showing that it is hard to cover all nodes in the zone of relevance (ZOR) present in the same segment. Subsequently, an analytical model was proposed to examine the event-driven warning message propagation process in a partitioned network, and the likelihood of conveying warning messages to every node in the ZOR under diverse traffic circumstances is then determined. Moreover, to diminish the latency caused by multi-hop broadcast, a direction-aware broadcast protocol was proposed for the partitioned network according to the proposed analytical models. The event-driven warning message propagation process in VANETs was thus investigated. Zhou et al. [11] described how to analyse the warning message propagation process and recommend an appropriate communication duration for the V2V network, first designing analytical models to study the warning message spread procedure in the connected


and partitioned network. Then, in view of the proposed models, the delivery likelihood is determined under diverse traffic situations: vehicle density, speed and dangerous time. The warning message spread procedure in VANETs, which can occur in connected or partitioned networks, is examined, and with the proposed model the delivery probability under diverse traffic circumstances has been derived. Vijayakumar et al. [12] proposed a new CPAV authentication arrangement for secure vehicular messaging in VANETs formed through IoT. In the CPAV verification scheme, a Road Side Unit (RSU) can efficiently validate vehicles in an anonymous way by delivering security-related messages to vehicles. This scheme not only provides anonymous verification with low certificate and signature verification cost, but also provides well-organized conditional privacy, with a tracking method to uncover the genuine identity of a malicious vehicle for improving VANET efficiency. It also gives improved effectiveness regarding fast authentication of certificates and signatures compared with previously reported schemes. Sugumar et al. [13] built a trust-based verification scheme for cluster-based VANETs. Each vehicle is checked by a set of verifiers, and the messages are digitally signed by the sender and encrypted using a public/private key distributed by a trusted authority and decrypted by the end station. This confirms the identity of the sender as well as the recipient, thereby providing authentication to the scheme. Eckhoff and Sommer [14] introduced a model of a state-of-the-art attacker utilizing a multi-target tracking algorithm and applied this model in extensive computer simulations using simulated vehicle movement and genuine traces. The examination supports the finding that, under sensible assumptions concerning an adversary's capability, restricted confidentiality is neither essential nor can it be attained without compromising traffic safety.
Roca et al. [15] proposed a technique using the network state data (i.e., node mobility patterns and link quality) to overcome flimsy communication in the fog computing and SDN-based connected vehicle environment. A campus network was built using the SUMO simulator, and the mobile SDN environment was executed by modifying the Wi-Fi setup. Hassan et al. [16] introduced the Multi-metric Geographic Distance Routing (M-GEDIR) protocol for VANETs. M-GEDIR performs next-hop vehicle selection from the dynamic forwarding region in view of various metrics. The safety region and hazardous region were determined for ideal next-hop vehicle selection. The outage likelihood of safe and hazardous vehicles was assessed to avoid picking an unreachable vehicle, and the future point was evaluated for all hazardous vehicles to stay away from unstable vehicles. The use of weighting factors has enabled M-GEDIR to pick the ideal vehicle, resulting in elevated throughput. It moreover diminishes hop count without harming the quality of connectivity, resulting in a lower end-to-end delay. The precise routing choice of the proposed protocol diminishes the likelihood of link failure, resulting in a lower rate of path disconnection.
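M-GEDIR's weighted next-hop selection can be sketched as scoring each candidate in the forwarding region by a weighted sum of normalized metrics; the particular metrics, weights and values below are illustrative assumptions, not M-GEDIR's actual formulation.

```python
# Sketch: multi-metric weighted next-hop selection in the spirit of M-GEDIR.
# Each candidate in the forwarding region is scored by a weighted sum of
# normalized metrics in [0, 1]; metrics and weights here are illustrative.

def score(candidate: dict, weights: dict) -> float:
    return sum(weights[m] * candidate[m] for m in weights)

def best_next_hop(candidates: dict, weights: dict) -> str:
    return max(candidates, key=lambda v: score(candidates[v], weights))

# Illustrative metrics: progress toward the destination, link quality,
# and predicted residual time before the vehicle leaves the safety region.
weights = {"progress": 0.5, "link_quality": 0.3, "residual_time": 0.2}
candidates = {
    "v1": {"progress": 0.9, "link_quality": 0.4, "residual_time": 0.3},
    "v2": {"progress": 0.7, "link_quality": 0.8, "residual_time": 0.9},
}
print(best_next_hop(candidates, weights))   # -> 'v2'
```

Folding link quality and residual time into the score is what lets such a protocol reject a vehicle that offers the most geographic progress but is about to leave range.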


B. Features of VANETs
Liu et al. [17] examined VANET, a sub-class of MANET that is increasingly popular for advancing road safety and Smart City applications. In particular, better information forwarding was discussed, and a new cooperation approach based on Mobile Social Networking (MSN) was introduced alongside the conventional cooperation approaches; each cooperation approach has its own features, criteria and drawbacks. Tomar and Prakash Sharma [18] estimated the vehicle location and speed through GPS technology; the best estimate of the vehicle parameters is then made through a Kalman filter by combining the two probability density functions of the prediction and the measured vehicle data. Kumar et al. [19] discussed how vehicles are grouped with RHET, partitioning the region enclosed by relay nodes in RHETBP; relay nodes are determined using the formula WT in each region for the next hop, and a file formula can be used to limit link delay. Consequently, the number of relay nodes is reduced, and superfluous messaging is reduced as well, so that a road-calamity message can be speedily spread in diverse directions along diverse streets around the calamity node. RHETBP does not use cyclic beacons, nor does it keep a listing of neighbours for discovery of street junctions, which greatly diminishes network overhead, waiting delay and forwarding-node ratio, and improves the coverage rate. Shah et al. [20] described an RSU-based efficient channel access scheme for VANETs under high-traffic and high-mobility conditions: it dynamically adjusts the contention window of each vehicle based on its deadline of departure from the range of the RSU. Goudarzi et al. [21] proposed a means to stay away from channel overcrowding and to attain accuracy; here, parameters comprising traffic density, vehicle position and spot position are taken as inputs.
To cover the diverse conditions of traffic density, a model based on fuzzy logic was employed to discover the density of traffic. Balico et al. [22] observed that VANETs have unsteady wireless channel quality, influenced by several factors (e.g., road layout, road surroundings, vehicle category, and vehicle velocity). Two distinguishing features are noted: (i) each node enjoys a steady power supply provided by the engine, and vehicles can carry sizable antennas and supplementary communication gear as well as strong computing power and storage capacity; (ii) node movement follows a specific regularity, having only two-direction movement along a basic single road. Zheng [23] analysed a multi-source data fusion approach to identify fake emergency messages, in which every vehicle employs its on-board sensor data and received signal messages to observe the traffic situation and compute its belief in the validity of emergency

212

G. Jeyaram and M. Madheswaran

messages. Likewise, the proposed approach provides improved robustness against collusion attacks by designing an anomaly-detection system in which a clustering algorithm is applied to filter out colluders whose behavior deviates markedly from the others. Banani et al. [24] discussed using the location and heading of the transmitter, as well as proximity, to discard communications from vehicles that are unlikely to cause an accident. Compared with other existing schemes, the evaluation results illustrate that the proposed system can verify messages from nearby vehicles with lower inter-message delay and reduced packet loss, and thereby provides a higher level of awareness of nearby vehicles. Ji et al. [25] described a clustering algorithm based on the route information arranged by vehicular navigation systems. Incorporating route information into clustering is not trivial because of two concerns: (i) reliability is a property of time rather than position; (ii) route diversity may cause high re-clustering overhead at road junctions. To deal with the first concern, they propose a function to quantitatively estimate the overlap time between vehicles based on route data, from which a cluster-head selection metric is built. Gurung and Chauhan [26] depicted Balanced AODV (B-AODV), which expects all network nodes to behave normally; if nodes behave abnormally (an excessive number of route requests) they are identified as malicious. B-AODV has the following features: (1) use of an adaptive threshold according to network conditions and node behavior (balance index); (2) no extra routing packets to detect malicious nodes; (3) detection and prevention performed independently on each node; (4) detection and avoidance performed continuously; (5) no need for promiscuous mode. Medani et al. [27] proposed the OTRB time-synchronization broadcasting algorithm.
Here, as every node communicates with its neighborhood, a set of nodes is selected to propagate the time across the whole network. The proposed time-synchronization protocol adapts to arbitrary changes in network topology and high nodal speed while offering great accuracy and robustness against node failure and packet loss.
C. VANET Localization Strategies
Hassan et al. [9] portrayed inter-vehicle-distance (IVD) based connectivity-aware routing (IVD-Vehicle) for enhancing connectivity-aware data dissemination. The IVD scheme is robust and can efficiently handle sudden GPS failure. Two localization methods, namely cooperative localization and geometry-based localization, are devised. The standard deviation of consecutive IVDs of a forwarding path is computed, and the distribution of IVDs of a forwarding path is used for estimating connectivity; partition-vehicle-based next-hop vehicle selection is employed. Shams et al. [28] discussed a complete IDS for VANET that combines a modified promiscuous mode for data gathering with an SVM for data analysis to establish a mutual trust value for each vehicle on the network, called the Trust-Aware SVM-Based IDS (TSIDS).

Message Propagation in Vehicular Ad Hoc Networks: A Review

213

Zhang et al. [29] examined two cluster-based algorithms for target tracking in VANETs from their earlier works; these algorithms provide a dependable and stable platform for tracking a vehicle based on its visual features. Menouar et al. [30] analyzed an Efficient and QoS-supported Multichannel Medium Access Control (EQM-MAC) protocol for VANETs in a highway domain. The EQM-MAC protocol uses the service-channel resources for non-safety message transmissions during the whole synchronization interval, and it dynamically adapts to the traffic conditions.
D. Vehicular Communications Credentials and Detection
Latif et al. [31] noted that many data-dissemination schemes have been proposed in the literature, but most work only under sparse or dense traffic conditions, and these schemes do not effectively overcome the aforementioned concerns simultaneously. They proposed DDP4V, a data-dissemination protocol for VANETs that disseminates emergency messages in diverse situations under varying traffic conditions. In dense traffic, DDP4V partitions the transmission region around a vehicle in order to choose the most suitable next forwarding vehicle (NFV) and forwards the message to all neighbor vehicles, while in sparse traffic it uses implicit acknowledgments to assure message delivery. DDP4V is designed to overcome the broadcast storm, network partition, intermittently connected network, and optimal NFV selection problems. The proposed protocol demonstrates the ability to provide data dissemination in varied VANET circumstances with fluctuating traffic conditions; it offers sensible performance in three different evaluation scenarios: a highway scenario and two urban scenarios, with and without network partition. Under dense traffic, DDP4V prefers the vehicle(s) within the best segment of the transmission region to retransmit the data packet.
It reduces the data-packet delivery delay with increasing traffic density in all evaluated VANET scenarios, as vehicles within the best segment broadcast the data packet with the shortest waiting time. The cartwheel concept helps the DDP4V protocol choose the best vehicle as the NFV to carry on the dissemination procedure and alleviate the broadcast storm. Gong and Yu [32] discussed a content-downloading scheme assisted by roadside parked cars. The parked cars, which form a virtual cluster, take on the role of RSUs, downloading the content from the provider and delivering it to the downloader. The cluster constructs a plan to inform the downloader about how to get content chunks from which clusters, based on the estimated downloader trajectory. Nguyen et al. [33] described a novel scheme using a store-carry-forward (SCF) mechanism to tackle the network-partition and broadcast-storm problems, two significant challenges in VANETs. The proposed scheme adopts an SCF method to resolve both problems while maintaining high accuracy of neighbor information. The simulation results demonstrate that the SCF scheme outperforms other schemes at mitigating broadcast storms and achieves a high delivery ratio across different traffic


densities. This means the SCF scheme works well in both dense and sparse traffic. Additional analysis focused on reducing the latency introduced by the broadcast-suppression technique and the SCF mechanism. Kolandaisamy et al. [34] described the Multivariant Stream Analysis (MVSA) approach, which maintains multiple phases for detecting DDoS attacks. The multivariant flow analysis gives a distinct result based on the V2V messages passing through the RSU. The approach observes the traffic in different situations and time frames and maintains different rules for different traffic classes in different time windows. Each vehicle reads the network trace and computes the time-to-live, the average payload, and the frequency of each stream class at different time windows. Four features are estimated to create the rule set; the rule set is generated and the features are extracted from the packets received from the client. The method then computes the multivariant stream weight. The method was demonstrated to be competent at detecting VANET DDoS attacks and thereby reduced their impact on the VANET environment. Kadadha et al. [35] described a multi-leader multi-follower Stackelberg game model that motivates nodes to behave cooperatively as MPRs by increasing their reputation. Accumulated reputation gains are used to decide the set of followers that an MPR (leader) routes for, based on the nodes' reputation. Furthermore, the proposed protocol offers a comparable percentage of chosen MPRs and end-to-end delay with respect to the benchmarks. Singh and SimoFhom [36] described an anonymous-credential-based protocol for VANET that enables the detection and restriction of pseudonym misuse/overspending; revocation of the misbehaving vehicle can also be achieved through the proposed solution.
With the prototype implementation of the proposed protocols, it has been shown that the successful detection of misbehavior, i.e., pseudonym misuse, and the consequent revocation of certificates are possible in VANET. The restricted usage of the anonymous credential system (RU-ACS) protocol tackles the stated problem and enables the successful tracing and revocation of the defaulter vehicle; the proof of concept has been successfully demonstrated with a prototype based on the RU-ACS protocol. Liang et al. [37] discussed mechanisms to assess the false-alarm probability and detection probability based on the spatial correlation among vehicular users in a soft-decision-fusion cognitive-radio vehicular ad hoc network. Furthermore, an attempt was made to optimize the power allocation of the vehicular secondary users (VSUs). The energy-efficiency maximization problem is formulated under the constraints of a maximum transmit-power bound, interference to the primary receiver, and a minimum achievable data rate. To solve the resulting nonlinear and non-convex optimization problem, a parametric transformation is employed, and the power allocation to the VSUs satisfying the nonlinear constraint is handled by a proposed adaptive scheme based on the regularized normalized least-mean-square (NLMS) algorithm.


Hajlaoui et al. [38] analyzed the principal aspects that drive the development of ITS, so as to aid researchers in further advancements and implementations. Despite the diversity of solutions, future efforts must address efficiency as well as minimize the cost of proposed solutions, and provide general guidelines for choosing the appropriate simulator and metrics for each scenario. Chaudhary and Singh [39] clarified the idea of a smart city to improve quality of life. Smart cities are emerging to satisfy users' desire for safety and safe journeys in the urban environment by developing the smart-mobility concept. Vehicular ad hoc networks are widely accepted as a means to accomplish this idea by providing safety and non-safety applications. However, VANET has its own challenges, from node mobility to location privacy. The paper also discusses the application areas, security threats and their consequences for VANET in the smart city.
E. Review on Vehicular Ad Hoc Networks
A review of some existing works is tabulated in Table 1.

Table 1 Review of several existing works

| Technique | References | Advantages | Disadvantages |
| Analytical model | [1] | Reduces redundancy | Difficult to guarantee all nodes |
| Delivery probability | [11] | High network lifetime | Traffic |
| CPAV authentication scheme | [12] | Efficient conditional privacy | Enables group communication authentication at low computational cost |
| Trust-based authentication scheme | [13] | Reduces the authentication delay | Security |
| Power control algorithm | [7] | Reduces the interference | High outage probability |
| Decentralized learning-based relay selection algorithm | [40] | High energy | Better reliability |

3 Conclusion
In this work, we adopted a structured methodology toward developing a complete solution for location-privacy protection in VANETs. The advantages and limitations of the surveyed works are discussed; this review should encourage the research community to address the issues identified in VANET. Finally, a few guidelines are given on the open


challenges that have not yet been addressed or where more research is required. Future work will therefore focus on the issues identified here.
Acknowledgements. I thank God almighty for giving me the research opportunity. I thank my guide Dr. M. Madheswaran for his constant assistance toward the improvement of my research activity. I thank my family, friends and colleagues for their continuous support.

Conflict of Interest. The authors declare that there is no conflict of interest or funding.

References
1. Zhou H, Xu S, Ren D, Huang C, Zhang H (2017) Analysis of event-driven warning message propagation in vehicular ad hoc networks. Ad Hoc Netw 55:87–96
2. Oubbati OS, Lakas A, Lagraa N, Yagoubi MB (2016) UVAR: an intersection UAV-assisted VANET routing protocol. In: 2016 IEEE wireless communications and networking conference. IEEE, pp 1–6
3. Katsikogiannis G, Kallergis D, Garofalaki Z, Mitropoulos S, Douligeris C (2018) A policy-aware service oriented architecture for secure machine-to-machine communications. Ad Hoc Netw 80:70–80
4. Taleb AA (2018) VANET routing protocols and architectures: an overview. JCS 14(3):423–434
5. Das D, Misra R (2018) Improvised dynamic network connectivity model for Vehicular Ad-Hoc Networks (VANETs). J Netw Comput Appl 122:107–114
6. Seliem H, Shahidi R, Ahmed MH, Shehata MS (2018) Drone-based highway-VANET and DAS service. IEEE Access 6:20125–20137
7. Wu X, Sun S, Li Y, Tan Z, Huang W, Yao X (2018) A power control algorithm based on outage probability awareness in vehicular ad hoc networks. In: Advances in multimedia, pp 1–8
8. Sun G, Zhang Y, Liao D, Yu H, Du X, Guizani M (2018) Bus-trajectory-based street-centric routing for message delivery in urban vehicular ad hoc networks. IEEE Trans Vehic Technol 67(8):7550–7563
9. Hassan AN, Kaiwartya O, Abdullah AH, Sheet DK, Raw RS (2018) Inter vehicle distance based connectivity aware routing in vehicular adhoc networks. Wirel Pers Commun 98(1):33–54
10. Singh K, Verma AK (2018) A fuzzy-based trust model for flying ad hoc networks (FANETs). Int J Commun Syst 31(6):e3517
11. Zhou H, Xu S, Ren D, Huang C (2014) Reliable delivery of warning messages in partitioned vehicular ad hoc networks. In: 2014 IEEE 17th international conference on computational science and engineering. IEEE, pp 1417–1423
12. Vijayakumar P, Chang V, Jegatha Deborah L, Balusamy B, Shynu PG (2018) Computationally efficient privacy preserving anonymous mutual and batch authentication schemes for vehicular ad hoc networks. Fut Gen Comput Syst 78:943–955
13. Sugumar R, Rengarajan A, Jayakumar C (2018) Trust based authentication technique for cluster based vehicular ad hoc networks (VANET). Wirel Netw 24(2):373–382
14. Eckhoff D, Sommer C (2018) Readjusting the privacy goals in Vehicular Ad-Hoc Networks: a safety-preserving solution using non-overlapping time-slotted pseudonym pools. Comput Commun 122:118–128


15. Roca D, Milito R, Nemirovsky M, Valero M (2018) Tackling IoT ultra large scale systems: fog computing in support of hierarchical emergent behaviors. In: Fog computing in the internet of things. Springer, Cham, pp 33–48
16. Hassan AN, Abdullah AH, Kaiwartya O, Cao Y, Sheet DK (2018) Multi-metric geographic routing for vehicular ad hoc networks. Wirel Netw 24(7):2763–2779
17. Liu J, Zhong N, Li D, Liu H (2018) BMCGM: a behavior economics-based message transmission cooperation guarantee mechanism in vehicular ad-hoc networks. Sensors 18(10):3316
18. Tomar RS, Prakash Sharma MS (2018) One dimensional vehicle tracking analysis in vehicular ad hoc networks. In: Innovative computing, optimization and its applications. Springer, Cham, pp 255–270
19. Kumar M, Nigam AK, Sivakumar T (2018) A survey on topology and position based routing protocols in vehicular ad hoc network (VANET). Int J Fut Revol Comput Sci Commun Eng 4(2):432–440
20. Shah SAA, Ahmed E, Xia F, Karim A, Shiraz M, Noor RM (2016) Adaptive beaconing approaches for vehicular ad hoc networks: a survey. IEEE Syst J 12(2):1263–1277
21. Goudarzi S, Kama MN, Anisi MH, Soleymani SA, Doctor F (2018) Self-organizing traffic flow prediction with an optimized deep belief network for internet of vehicles. Sensors 18(10):3459
22. Balico LN, Loureiro AAF, Nakamura EF, Barreto RS, Pazzi RW, Oliveira HABF (2018) Localization prediction in vehicular ad hoc networks. IEEE Commun Surv Tutor 20(4):2784–2803
23. Zheng Q (2018) Detecting bogus messages in vehicular ad-hoc networks: an information fusion approach. In: Wireless sensor networks: 11th China wireless sensor network conference, CWSN 2017, Tianjin, China, October 12–14, 2017, revised selected papers, vol 812. Springer
24. Banani S, Gordon S, Thiemjarus S, Kittipiyakul S (2018) Verifying safety messages using relative-time and zone priority in vehicular ad hoc networks. Sensors 18(4):1195
25. Ji X, Yu H, Fan G, Sun H, Chen L (2018) Efficient and reliable cluster-based data transmission for vehicular ad hoc networks. Mob Inform Syst 1–15
26. Gurung S, Chauhan S (2018) A novel approach for mitigating route request flooding attack in MANET. Wirel Netw 24(8):2899–2914
27. Medani K, Aliouat M, Aliouat Z (2018) Impact of clustering stability on the improvement of time synchronization in VANETs. In: Computational intelligence and its applications: 6th IFIP TC 5 international conference, CIIA 2018, Oran, Algeria, May 8–10, 2018, proceedings 6. Springer International Publishing
28. Shams EA, Rizaner A, Ulusoy AH (2018) Trust aware support vector machine intrusion detection and prevention system in vehicular ad hoc networks. Comput Secur 78:245–254
29. Zhang D, Ge H, Zhang T, Cui Y-Y, Liu X, Mao G (2018) New multi-hop clustering algorithm for vehicular ad hoc networks. IEEE Trans Intell Transp Syst 20(4):1517–1530
30. Menouar H, Filali F, Lenardi M (2018) An extensive survey and taxonomy of MAC protocols for vehicular wireless networks. Adapt Cross Layer Des Wirel Netw 6
31. Latif S, Mahfooz S, Ahmad N, Jan B, Farman H, Khan M, Han K (2018) Industrial internet of things based efficient and reliable data dissemination solution for vehicular ad hoc networks. Wirel Commun Mob Comput 1–16
32. Gong H, Yu L (2017) Content downloading with the assistance of roadside cars for vehicular ad hoc networks. Mob Inform Syst 1–9
33. Nguyen TDT, Le T-V, Pham H-A (2017) Novel store–carry–forward scheme for message dissemination in vehicular ad-hoc networks. ICT Exp 3(4):193–198
34. Kolandaisamy R, Noor RM, Ahmedy I, Ahmad I, Reza Z'aba M, Imran M, Alnuem M (2018) A multivariant stream analysis approach to detect and mitigate DDoS attacks in vehicular ad hoc networks. Wirel Commun Mob Comput 1–13


35. Kadadha M, Otrok H, Barada H, Al-Qutayri M, Al-Hammadi Y (2018) A Stackelberg game for street-centric QoS-OLSR protocol in urban vehicular ad hoc networks. Vehic Commun 13:64–77
36. Singh A, SimoFhom HC (2017) Restricted usage of anonymous credentials in vehicular ad hoc networks for misbehavior detection. Int J Inf Secur 16(2):195–211
37. Liang W, Li Z, Zhang H, Wang S, Bie R (2015) Vehicular ad hoc networks: architectures, research issues, methodologies, challenges, and trends. Int J Distrib Sens Netw 11(8):745303
38. Hajlaoui R, Moulahi T, Guyennet H (2018) Vehicular ad hoc networks: from simulations to real-life scenarios. J Fundamental Appl Sci 10(4S):632–637
39. Chaudhary B, Singh S (2017) Vehicular ad-hoc network for smart cities. In: Proceedings of the first international conference on information technology and knowledge management 14:47–51
40. Tian D, Zhou J, Chen M, Sheng Z, Ni Q, Leung VCM (2018) Cooperative content transmission for vehicular ad hoc networks using robust optimization. In: IEEE INFOCOM 2018-IEEE conference on computer communications. IEEE, pp 90–98

A Comparative Study of Clustering Algorithm

Khyaati Shrikant(B), Vaishnavi Gupta, Anand Khandare, and Palak Furia

Department of Computer Science, Thakur College of Engineering and Technology, Mumbai, India
[email protected]

Abstract. Clustering is a technique for segregating data points into clusters in such a manner that the data points in a cluster are more similar to the other data points in that cluster than to the data points segregated into different groups. In layman's terms, the purpose is to form distinct groups of data points with alike characteristics and allot each point to a group. Clustering algorithms are advocated for the task of classifying spatial datasets into sets of groups. Generally, clustering can be divided into two types: hard clustering, where every data object belongs entirely to one group or cluster, and soft clustering, where each data object is assigned a probability of belonging to each cluster rather than being placed in a single distinct cluster. This paper presents a comparative study of different clustering algorithms applied in data mining. Keywords: Clustering · K-means · Hierarchical · DBSCAN · Silhouette coefficient index · Davies–Bouldin index · Calinski–Harabasz index

1 Introduction
Clustering is primarily defined as the segregation of data objects into collections of related objects. Such a collection is called a cluster and consists of points that are similar to one another and dissimilar to the points of other groups. Clustering algorithms are the most widely used type of unsupervised machine learning algorithm [1]. In this respect, we compare different types of clustering and found it most appropriate to place the following algorithms under discussion: the K-means algorithm, the agglomerative hierarchical [2] clustering algorithm and the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [3] algorithm. They are generally used to find out how subjects are similar on a variety of different variables. The algorithms are not eager learners; rather, they learn directly from the training instances. The most frequently used clustering process consists of several steps: feature extraction and feature selection, which extract and select the most representative attributes from the initial data set; clustering algorithm design, which designs the algorithm according to the characteristics of the problem; result evaluation, in which we appraise the algorithm's result and estimate the validity of the algorithm; and result explanation, which gives a concrete elucidation of the result. It is also witnessed
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_19

220

K. Shrikant et al.

that the application to large spatial [3] datasets imposes the following requirements on clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, the discovery of clusters with arbitrary shape, and good efficiency on large datasets.
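The four-step clustering process outlined in the introduction (feature extraction/selection, algorithm design, result evaluation, result explanation) can be sketched end to end in Python. Note that the synthetic data, the choice of k-means, and k = 4 below are illustrative assumptions, not this paper's actual setup:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Step 1: feature extraction/selection (here: standardize synthetic features)
X_raw, _ = make_blobs(n_samples=300, centers=4, random_state=7)
X = StandardScaler().fit_transform(X_raw)

# Step 2: clustering algorithm design (k-means with an assumed k = 4)
labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

# Step 3: result evaluation (silhouette coefficient; higher is better)
score = silhouette_score(X, labels)

# Step 4: result explanation (cluster sizes as a simple summary)
sizes = np.bincount(labels)
print(score, sizes)
```

The same skeleton applies whichever algorithm and validity index are plugged into steps 2 and 3.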

2 Related Work
Chitra and Maheswari [4] present a keen study of different clustering algorithms in data mining. A brief overview of various clustering algorithms is analysed, with the conclusion that these algorithms play a significant role in data analysis and data mining applications. Clustering is essentially the task of combining a set of objects in such a manner that objects in the same group are more related to each other than to those in other groups. They elucidate that clustering algorithms can be classified into partition-based, hierarchical, density-based and grid-based algorithms [4]. Cluster analysis can be exercised by finding similarities between data according to the characteristics found in them and integrating similar data objects into clusters. Clustering has an extensive and prosperous record in a range of scientific fields such as image segmentation, information retrieval and web data mining.
Ahmad and Dang [5] observe that, with the ongoing elevation of technology, cluster analysis plays a significant role in examining text mining techniques. It segregates the dataset into several meaningful clusters to exhibit the dataset's natural structure. In their study, they examine four major clustering algorithms, namely Simple K-means, DBSCAN, HCA and MDBCA [5], with a detailed study of their performance. The behaviour of these four techniques is evaluated in depth and interpreted using the clustering tool WEKA. The results are evaluated on different datasets, namely the Abalone, Bankdata, Router, SMS and Webtk datasets, using the WEKA interface, recording the instances, the attributes and the time taken to build each model. They also highlight the advantages, disadvantages and applications of each clustering technique.
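WEKA is a Java workbench; as a rough analogue of the build-time comparison described above, one might time a few scikit-learn algorithms on toy data. The algorithms, data and parameters below are assumptions for illustration, not the datasets or settings used in [5]:

```python
import time
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# Toy dataset standing in for Abalone/Bankdata/etc.
X, _ = make_blobs(n_samples=1000, centers=5, cluster_std=1.0, random_state=0)

algorithms = {
    "k-means": KMeans(n_clusters=5, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.9, min_samples=5),
    "hierarchical": AgglomerativeClustering(n_clusters=5),
}

build_times = {}
for name, algo in algorithms.items():
    start = time.perf_counter()
    algo.fit(X)                      # "time taken to build the model"
    build_times[name] = time.perf_counter() - start
    print(f"{name}: {build_times[name]:.4f}s")
```

Measured times depend heavily on dataset size and hardware, so only relative comparisons on the same machine are meaningful.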
Khandare and Alvi [1] explored several reports on improved k-means algorithms, abstracted their shortcomings and identified scope for future enhancement to make k-means more scalable and efficient for large data. Clustering algorithms are conventional algorithms used in various fields of engineering, science and technology. K-means is an exemplar of an unsupervised clustering algorithm that is often implemented in applications such as medical image clustering and gene data clustering. In spite of extensive research on the basic k-means clustering algorithm, researchers have focused only on certain of its limitations. The paper studied distance, validity and stability measures, algorithms for initial centroid selection and algorithms to decide the value of k. Further, the authors proposed objectives and guidelines for enhanced scalable clustering algorithms, and various methods were suggested to avoid outliers using concepts of semantic analysis and AI [1].
Na, Xumin and Yong [6] note that clustering analysis is one of the main analytical [6] methods in data mining, and the chosen method has a direct influence on the clustering results. This

A Comparative Study of Clustering Algorithm

221

paper deliberates on the standard k-means clustering algorithm and examines its limitations. The standard k-means algorithm computes the distance between each data object and all cluster centers in every iteration, which lowers the efficiency of clustering. The paper therefore puts forward an improved k-means algorithm that uses a simple data structure to store certain information in every iteration that can be reused in the next iteration. The improved method avoids repeatedly computing the distance of each data object to the cluster centers, thereby saving running time. The study shows that this methodology can effectively improve the speed of clustering with better accuracy while reducing the computational complexity of k-means.
Valarmathy and Krishnaveni [7] observe that data mining in the educational system has received great interest and has become an emerging research area. Universities and colleges [7] are generating huge volumes of data by conducting online exams and storing a lot of information for future purposes. These colossal amounts of data need data mining techniques to retrieve useful and meaningful information. Real success can be achieved only when a specialized task is applied so that it can be effective in that area. The paper surveys the application of data mining to traditional educational systems, various well-known clustering algorithms, their applications, advantages and disadvantages. It also evaluates the performance of some clustering algorithms on an educational dataset.
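Performance evaluation of the kind surveyed above usually relies on internal validity indices; the three named in this paper's keywords (silhouette coefficient, Davies–Bouldin, Calinski–Harabasz) are available in scikit-learn. A minimal sketch, with toy data and k chosen purely for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=400, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
db = davies_bouldin_score(X, labels)     # >= 0, lower is better
ch = calinski_harabasz_score(X, labels)  # >= 0, higher is better
print(sil, db, ch)
```

Because the indices disagree in direction (lower is better only for Davies–Bouldin), comparisons across algorithms must interpret each index accordingly.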

3 Proposed Work
In this comparison paper, we compare various distinct types of clustering algorithms: the K-means algorithm, the agglomerative hierarchical [2] clustering algorithm and the density-based spatial clustering of applications with noise [8] (DBSCAN) algorithm.
3.1 Technology and Tools
The comparison of the clustering algorithms is implemented in the Python language on the Google Colab platform. Colaboratory (Colab) supports the Python language along with pre-installed Python libraries and can be used from any browser with internet connectivity. It enables users to write and execute arbitrary Python programs and is especially well suited for machine learning [9], data analysis and education.
3.2 Dataset
The dataset used here is based on certain socio-economic and health factors that help determine the overall development of countries. The features of the dataset are: the name of the country; the child mortality rate, in terms of deaths of children under 5 years of age per 1000 live births; total health spending per capita as a percentage of [10] GDP per capita; the country's exports of goods and services per capita; imports of goods and services per capita; the net income per person; inflation, measured as the annual growth rate of total GDP; the life expectancy of the countrymen; the fertility rate of the country; and its GDP per capita. The dataset consists of information on the countries in 167 rows and 10 columns.
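A minimal sketch of preparing such a dataset for clustering follows; the CSV path passed in is a hypothetical placeholder, since the text does not give a file name. Standardizing matters here because features such as GDP per capita and child mortality are on very different scales and would otherwise dominate the distance computations:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def load_and_scale(path):
    """Read the country dataset and standardize its numeric columns.

    The non-numeric country-name column is excluded from the feature
    matrix but kept in the returned DataFrame for labeling clusters.
    """
    df = pd.read_csv(path)
    numeric = df.select_dtypes("number")      # drop the country-name column
    X = StandardScaler().fit_transform(numeric)
    return df, X
```

The returned `X` (zero mean, unit variance per column) is what the clustering algorithms in Sect. 3.5 would operate on.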


3.3 Clustering
Clustering techniques are among the foremost unsupervised methods of ML. These approaches are applied to determine similarities and patterns of relationships between data points [11] and then combine similar points into groups. Clustering is important because it establishes the internal grouping of unlabeled data. Every algorithm makes some assumptions about the data points to define their similarity, and each assumption creates different but equally acceptable clusters. Since the task is loosely specified, there are sundry ways to achieve it; the whole process follows a set of rules for defining 'similarity' between data points [11]. In fact, there are more than 100 known clustering algorithms, but only a small number are widely used:
• Connectivity models: As the name suggests, these models are centered on the notion that data points [11] closer together in the data space are more similar to each other than data points lying far apart. These models can follow two paths. In the first method, each data point starts in a separate cluster and clusters are merged as the distance decreases. In the second method, all data points [11] start in a single cluster, which is then split as the distance increases. The choice of a distance function is also subjective. These models are very easy to interpret but do not scale well to large databases. Examples of these models are hierarchical clustering algorithms and their variants.
• Centroid models: These are iterative clustering algorithms in which the notion of similarity is based on the proximity of a data point to the cluster centroids. The K-means clustering algorithm is a widespread algorithm in this category. In these models, the number of [7] required clusters must be specified in advance, which makes it important to have prior knowledge of the data. These models iterate to find a local optimum.
• Distribution models: These clustering models are grounded on the assumption that all data points [11] in a cluster follow the same distribution (for example, Gaussian). These models often suffer from overfitting. A popular example is the Expectation–Maximization algorithm, which uses a mixture of common distributions.

• Density models: These models search the data space for regions with a high density of data points [11]. They divide the space into regions of different density and assign the data points within each dense region to one cluster. Popular examples of density models are DBSCAN and OPTICS [12].

3.4 Cluster Formation Methods

It is not mandatory that the clusters formed after applying a technique will be spherical in shape. The following are various cluster formation methods:

• Density-based: In this technique, the clusters are formed as dense regions. The benefit is that it provides decent precision as well as a decent ability to merge two clusters. E.g., Ordering Points To Identify the Clustering Structure (OPTICS) [12], etc.

A Comparative Study of Clustering Algorithm

223

• Hierarchical-based: In this technique, the clusters are formed based on a hierarchy, in the form of a tree-type structure. There are two categories: Agglomerative, i.e. the bottom-up approach, and Divisive, i.e. the top-down approach. E.g., Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) [13], etc.

• Partitioning: The objects are partitioned into k clusters; the number of partitions equals the number of [7] clusters. E.g., the K-means algorithm, Clustering Large Applications based upon Randomized Search (CLARANS) [14].

• Grid: The clusters formed have a grid-like structure. The clustering operations done on grids are expeditious and independent of the number of [7] data points, which is an advantage of these methods. E.g., Clustering in QUEst (CLIQUE) [15].

3.5 Clustering Algorithms

The three discrete categories of clustering algorithms [4] considered here are K-means clustering, agglomerative hierarchical clustering and density-based spatial clustering (DBSCAN).

K Means Clustering. K-means is the simplest unsupervised learning algorithm that solves the well-known clustering problem. The method follows a straightforward approach to organise a particular dataset into a specific number of clusters (say k clusters) fixed a priori [16]. The primary idea is to define k centers, one for each cluster. A good choice is to place them as far away from each other as possible. The next stage is to take each point of the given dataset and associate it with the closest center. When no point is pending, the first step is finished. Next, the k new centroids are computed as the barycenters [16] of the clusters resulting from the previous step. After obtaining the k new centroids, a new binding is done between the same dataset points [11] and the nearest new center, generating a loop.
Through this loop, the k centers change their positions step by step until no changes remain, i.e., the centers do not move any more. The algorithm aims to minimize an objective function known as the squared error function, which is given by:

J(V) = Σ_{i=1}^{c} Σ_{j=1}^{c_i} (||x_i − v_j||)²   (1)

where ||x_i − v_j|| is the Euclidean distance [17] between x_i and v_j, c_i is the number of [7] data points in the ith cluster [17], and c is the number of cluster centers [17].

Agglomerative Hierarchical Clustering. Clustering types also encompass partitional clustering, which splits the dataset into a preferred number of [7] clusters. When the dataset is first assigned to a single cluster, which is then divided repeatedly until every cluster contains a single instance, the method is called divisive hierarchical clustering (DHC). In the opposite approach, each instance is initially assigned to a distinct cluster and the closest clusters are then merged until all instances are contained in a single cluster; this is called agglomerative hierarchical [2] clustering (Fig. 1).

Fig. 1 Agglomerative hierarchical clustering

The benefit of the algorithm is that it can produce an ordering of the objects, which may be informative for data display. The smaller clusters that are generated may be helpful for establishing the resemblance between prototypes and data points [11]. The constraints faced are that some constraint combinations, when applied with agglomerative algorithms, can cause the dendrogram to stop prematurely in a dead-end solution even though other feasible solutions with a smaller number of clusters exist. When constraints lead to efficiently solvable feasibility problems and do not give a dead-end solution, we can then illustrate the benefits of using constraints to enhance cluster purity and [18] average distortion.

DBSCAN. The DBSCAN algorithm, or Density-based spatial clustering of applications with noise [8], is a density-based method which can detect arbitrarily shaped clusters, where the clusters are defined as dense regions separated by low-density regions [19]. The method begins with an arbitrary object in the dataset and checks its neighbouring objects within a given radius (Eps) set by the user. If the neighbouring objects within that Eps number more than the [8] minimum number of objects required for a cluster, the object is marked as a core object, or core point [8]. If the neighbouring objects within the given Eps are fewer than the minimum number of [8] objects required, such points are marked as noise, or outliers.
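The Eps-neighbourhood rule just described (a point is a core point when at least a minimum number of objects, the point itself included, fall within radius Eps of it; otherwise it is a noise candidate) can be sketched as follows. This is an illustrative fragment, not a full DBSCAN implementation, since border points attached to a core point's neighbourhood are ignored:

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core' if its eps-neighbourhood (itself included)
    holds at least min_pts points, else 'noise' candidate. A full DBSCAN
    would additionally keep border points that fall inside a core point's
    neighbourhood instead of marking them as noise."""
    # Pairwise Euclidean distance matrix.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return np.where((d <= eps).sum(axis=1) >= min_pts, "core", "noise")
```

For three mutually close points and one far-away point, with Eps = 1 and a minimum of 3 objects, the first three come out as core points and the isolated point as noise.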


Due to this, the algorithm can determine which points should be categorised as outliers. The DBSCAN algorithm can recognize clusters in large spatial datasets by observing the local density of data elements, using a single input parameter. Furthermore, the user gets a recommendation for an appropriate parameter value. An example of a software program that implements the DBSCAN algorithm is WEKA [8]. DBSCAN clustering can be categorised as follows.

Detection-based DBSCAN clustering: This technique treats detection as a problem that can be solved efficiently via an upper bound on a discretised likelihood function. It is a helpful algorithm as it recovers the maximum-likelihood number of sides and orientation at the locations of the most likely polygons. The first stage of detection is posed as a discrete Hough-based algorithm; the second stage uses an approximation to the full likelihood function to recover the orientation [20] and number of sides.

Hierarchical DBSCAN clustering: Hierarchical DBSCAN clustering is an approach to cluster analysis which builds a hierarchy of clusters. In hierarchical clustering the data are not partitioned into a particular cluster [20] in a single step. Instead, a series of partitions takes place, which may start from a single cluster encompassing all objects and end with n clusters, each containing a single object.

Spatial–temporal DBSCAN clustering: This is a newer clustering algorithm designed for storing and clustering a wide range of spatial–temporal data [20]. Distinctive functions were developed for data amalgamation, conversion, visualization, analysis and management. The data is indexed and retrieved according to spatial and time dimensions. A time period is attached to the spatial data to indicate its validity or storage time in the database. A temporal [20] database may support valid time, transaction time or both. The valid time signifies the period during which a statement is valid with respect to the real world; the transaction [20] time is the period during which a fact is stored in the database.

Partitioning-based DBSCAN clustering: This method of clustering generally yields a set of N clusters, each object belonging to one cluster. Each cluster is represented by a centroid; the precise form of this representation depends on the kind of entity being clustered. Where real-valued data is available, the arithmetic mean [20] of the attribute vectors of all objects within a cluster provides an apt representative. If the number of clusters is large, the centroids can be further clustered to produce a [20] hierarchy within the dataset.

Incremental DBSCAN clustering: This clustering algorithm handles dynamic databases. It can change the radius threshold [20] value dynamically. The algorithm limits the number of final clusters and reads the original dataset only once. At the same time, the algorithm incorporates the frequency information of the attribute values and can be used for categorical [20] data.

Grid-based DBSCAN clustering: This approach quantizes the dataset into a certain number of cells and operates on objects belonging to these cells. The method


does not displace the points but rather builds several hierarchical levels of groups of objects. The unification of grids, and subsequently of clusters, does not depend on a distance measure; it is determined by a predefined [20] parameter.

3.6 Different Comparison Metrics

See Table 1.

Table 1 Comparing metrics

Feature/Algorithm | K-means | Agglomerative clustering | DBSCAN
Parameters | Number of clusters | Number of clusters or distance threshold, linkage type, distance | Neighborhood size
Scalability | Very large n_samples, medium n_clusters | Large n_samples and n_clusters | Very large n_samples, medium n_clusters
Use case | General-purpose, even cluster size, flat geometry, not too many clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances | Non-flat geometry, uneven cluster sizes
Geometry (metric used) | Distances between points | Any pairwise distance | Distances between nearest points
Dataset | Numerical and symbolic | Numerical dataset | Numerical dataset
Scalability | Good | Good | Good
Efficiency | Normal | Normal | Normal

3.7 Comparing the Clustering Quality Measures

Silhouette Coefficient (S). The Silhouette index [21] validates the clustering based on the pairwise difference of between- and within-cluster distances. The Silhouette Coefficient s for a single sample is given as:

s(i) = (b(i) − a(i)) / max(a(i), b(i))   (2)

where a(i) is the average dissimilarity of the ith data point to all other points in the same [22] cluster;


b(i) is the minimum of the average [22] dissimilarities of the ith data point to all data points in each other cluster [22]. The Silhouette index [21] for a set of data points is defined as the mean of the Silhouette Coefficient [23] over each sample of the dataset. It tells whether the individual points are appropriately assigned to their clusters. The possible values of the Silhouette Coefficient (S) are interpreted as follows:

i. If the value of S is close to 1, the point belongs to the correct cluster.
ii. If the value of S is close to 0, the point lies in overlapping clusters.
iii. If the value of S is close to −1, the point belongs to the incorrect cluster and should be assigned to another cluster.

A higher Silhouette Coefficient score corresponds to a model with better defined and well-separated clusters.

Davies–Bouldin Index. The Davies–Bouldin index [22] (DB) is computed as follows: for each cluster C, the similarities between C and all other clusters are computed, and the highest value is assigned to C as its cluster similarity [24]. The DB index is then obtained by averaging all the cluster similarities. The smaller the value of the index, the better the clustering result. By minimizing this index, the clusters become maximally distinct from each other, thereby achieving the best partition [24]. The formula for the Davies–Bouldin index is given by:

DB = (1/NC) Σ_{i=1}^{NC} max_{j, j≠i} { [ (1/n_i) Σ_{x∈C_i} d(x, c_i) + (1/n_j) Σ_{x∈C_j} d(x, c_j) ] / d(c_i, c_j) }   (3)

where D is the data set [22]; n the number of objects in D; c the center of D [22]; C_i the ith cluster [22]; NC the number of clusters; n_i the number of objects in C_i; and c_i the center of C_i [22].

The minimum score of the Davies–Bouldin index is zero, with lower values signifying better clustering of the points. The advantages of this index are that its computation is more straightforward than that of Silhouette scores, and that it uses only quantities and features intrinsic to the dataset. One of its drawbacks is that the use of centroid distances limits the distance metric to Euclidean [17] space.


Calinski–Harabasz Index. The Calinski–Harabasz index relies on the idea that good clusters are compact and well separated from each other. The Calinski–Harabasz index (CH) [25] assesses cluster validity based on the average between-cluster and within-cluster sums of squares. The formula for this index is:

CH = [ Σ_{i=1}^{NC} n_i d²(c_i, c) / (NC − 1) ] / [ Σ_{i=1}^{NC} Σ_{x∈C_i} d²(x, c_i) / (N − NC) ]   (4)

where D is the data set; n the number of objects [22] in D; c the center of D; NC the number of clusters; C_i the ith cluster; n_i the number of objects in C_i [22]; and c_i the center of C_i. The higher the value of this index, the better the clustering model compared to others. One observed drawback of this index is that its value is in general higher for convex clusters than for other types of clusters, such as the density-based [8] clusters obtained with the DBSCAN algorithm.
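All three quality measures are available in scikit-learn's metrics module. The two-blob data below is a toy stand-in for illustration only (the paper applies the measures to the country dataset):

```python
import numpy as np
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Two compact, well-separated synthetic clusters.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

s = silhouette_score(X, labels)          # close to 1 is better
db = davies_bouldin_score(X, labels)     # close to 0 is better
ch = calinski_harabasz_score(X, labels)  # higher is better
```

For this labelling, the Silhouette score comes out near 1, the Davies–Bouldin index near 0, and the Calinski–Harabasz index very high, exactly the directions of "good" described above.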

4 Implementation

4.1 K-means Algorithm

K-means uses the Euclidean distance [17] to compute the distance between clusters. For the implementation, we first initialize k points, called means. Then we assign each data point to its closest mean and update the mean's coordinates, which are the averages of the data points assigned to that mean so far. We repeat the process for a given number of iterations and, at the end, obtain the required clusters. To decide how many clusters to consider, we can employ several methods such as the elbow [26] curve method, the BIC score with a Gaussian Mixture Model, etc. For our dataset, we used the basic and most widely used method, the Elbow Curve method. In this method, wherever we observe a 'knee'-like bend formed by the lines, we consider that number as the ideal number of clusters for the K-means algorithm (Fig. 2). Here, we plotted the sum of squared differences against the values of K to find the number of clusters. Afterwards, we fitted our scaled data to a K-means model with 3 clusters and labeled each data point (each record) with one of these 3 clusters. We then calculated the number of records in each cluster (Fig. 3).
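The Elbow Curve computation can be sketched with scikit-learn; the synthetic three-blob data here is an illustrative stand-in for the scaled country dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for the scaled data: three well-separated blobs.
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 4, 8)])

# Sum of squared distances (inertia) for k = 1..8; plotting sse against k
# and looking for the "knee" gives the ideal number of clusters.
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 9)]
drops = np.diff(sse)  # improvement gained by each extra cluster
```

For this data the improvement from 2 to 3 clusters is large while the improvement from 3 to 4 is marginal, so the elbow sits at k = 3, matching the choice made for the country data.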


Fig. 2 Elbow method

Fig. 3 Count of K-means cluster

Fig. 4 Child mortality versus gdpp

After that we evaluated our model using the Silhouette score, the Davies–Bouldin index and the Calinski–Harabasz index, and visualized the clusters formed between child mortality and gdpp (Fig. 4). From the graph, we can conclude that a country having high child mortality and low GDP [10] per capita (the total GDP divided by the population) is an under-developing country, while a country having low child mortality and high GDP per capita [10] is a developed country. Here, in the K-means algorithm, we conclude that 0 depicts an under-developing country, 1 depicts a developing country and 2 depicts a developed country.

4.2 Hierarchical Clustering

Divisive and Agglomerative are the two classes of hierarchical clustering. We used Agglomerative Hierarchical Clustering because it is the most generic algorithm used to group objects into clusters according to their similarities. This is a bottom-up approach [27]: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. A dendrogram [9] is a type of tree diagram that indicates the hierarchical relationships between different sets of data. We plotted the dendrogram of the dataset (Fig. 5).

Fig. 5 Dendrogram
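The dendrogram step can be reproduced with SciPy's hierarchy module; the toy data below is illustrative (the paper plots the dendrogram of the scaled country dataset):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in (0, 5, 10)])  # toy data

# Ward linkage builds the merge tree bottom-up (agglomerative);
# scipy.cluster.hierarchy.dendrogram(Z) would draw a tree like Fig. 5.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
```

Cutting the tree at 3 clusters recovers the three blobs, mirroring how the dendrogram is read off to choose the number of clusters.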

From the above dendrogram we can take the minimum number of [7] clusters as 2 and the maximum number of clusters as 5. The dendrogram suggests that 3 is the right number of clusters, so we take 3 clusters. We fitted the scaled data to an agglomerative hierarchical model and then labeled each data point. After that we visualized the result using a scatterplot between child mortality and GDPP (Fig. 6).

Fig. 6 Child mortality versus gdpp

Here, in the agglomerative hierarchical clustering algorithm, we conclude that 0 depicts a developed country, 1 depicts a developing country and 2 depicts an under-developing country.

4.3 DBSCAN Algorithm

DBSCAN is an abbreviation of "Density-based spatial clustering of applications with noise [8]". This algorithm groups data points that are close to each other based on Euclidean distance [17] and a minimum number of [7] data points, and marks points in low-density regions as noise (outliers). For the implementation, we dropped the country feature from the dataset and converted it into distance-matrix form using a numpy array. Then we applied data transformation by scaling the data using StandardScaler, fitting it and then transforming it. The DBSCAN algorithm [3] was implemented using eps = 1.38 and min_samples = 9. The data was fitted with the DBSCAN algorithm and we further calculated the number of records in each cluster (Fig. 7).
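The pipeline described (scale with StandardScaler, fit DBSCAN, count records per cluster with −1 marking outliers) can be sketched with scikit-learn. The data and the eps/min_samples values below are illustrative toys; the paper used eps = 1.38 and min_samples = 9 on its scaled country features:

```python
import numpy as np
from collections import Counter
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Toy numeric matrix standing in for the country features:
# two dense blobs plus one far-away outlier.
X = np.vstack([rng.normal(0, 0.2, (60, 2)),
               rng.normal(5, 0.2, (60, 2)),
               [[20.0, 20.0]]])

X_scaled = StandardScaler().fit_transform(X)   # scale, fit, then transform
labels = DBSCAN(eps=0.3, min_samples=9).fit_predict(X_scaled)

counts = Counter(labels)  # records per cluster; label -1 marks outliers/noise
```

The two dense blobs come out as clusters 0 and 1, and the isolated point is labelled −1, i.e. noise, which is how DBSCAN flags outliers in the country data as well.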

Fig. 7 Count DBSCAN

In the subsequent step, we calculated the core points (core samples), i.e. the data points that lie within a dense region of distance eps, and calculated the number of [7] clusters formed, keeping −1 as the outlier label. We visualized the dataset using a seaborn scatterplot between child mortality and gdpp (Fig. 8). Here, in the DBSCAN algorithm, we conclude that 2 depicts an under-developing country, 0 depicts a developing country and 1 depicts a developed country.

5 Result

On applying the various measures of clustering performance, namely the Silhouette score, the Davies–Bouldin index and the Calinski–Harabasz index, it is evident that the agglomerative hierarchical algorithm performs better than the K-means and DBSCAN algorithms. When comparing the algorithms on the basis of the Silhouette score alone, however, K-means outperforms the agglomerative clustering and DBSCAN algorithms (Table 2).


Fig. 8 Child mortality versus gdpp

Table 2 Comparing scores of the algorithms

Types of algorithm | Silhouette score | Davies–Bouldin | Calinski–Harabasz
K-means algorithm | 0.28329 | 1.27690 | 66.2347
H-cluster algorithm | 0.172565 | 0.67965 | 184.18768
DBSCAN algorithm | 0.156088 | 2.18342 | 22.5491

When compared on the Davies–Bouldin and Calinski–Harabasz indices, the agglomerative hierarchical [2] clustering algorithm outperforms the DBSCAN and K-means algorithms. For the Davies–Bouldin index, a minimum score, i.e. one near zero, indicates better clustering; for the Calinski–Harabasz index, the higher the score, the better the performance, with clusters that are dense and well separated (Fig. 9). To further support our outcome, we compared the three clusters of countries that were formed: developed, developing and under-developing countries. According to the United Nations [10], Botswana and Angola were developing countries in 2017. The agglomerative hierarchical [2] clustering algorithm and the K-means algorithm both categorise Botswana and Angola as under-developing countries, whereas the DBSCAN algorithm identifies them as outliers, so they do not fall under any of the three clusters.

Fig. 9 List of developed, developing and under developing countries

6 Future Scope

This paper was intended to compare certain data clustering algorithms: the K-means algorithm, the Agglomerative hierarchical [2] clustering algorithm and the DBSCAN algorithm. In this paper we have used a smaller dataset for now, so for future work we will work on a larger dataset to obtain efficient, high-quality clusters. Furthermore, the comparison among the aforesaid three clustering algorithms can be extended using factors other than those contemplated in this paper. To enhance the comparison, we will work on additional clustering algorithms such as the K-medoids clustering algorithm, the Divisive Hierarchical clustering algorithm, the Make density based clustering (MDBC) [5] algorithm, etc.

7 Conclusion

To conclude, the most widely applied clustering algorithms, namely K-means, the agglomerative hierarchical clustering algorithm and DBSCAN, were applied to the dataset. The performance of a particular clustering algorithm depends on the selected dataset. In our case, agglomerative hierarchical clustering worked more efficiently than the DBSCAN and K-means algorithms. Since K-means is a centroid-based algorithm, it is more sensitive to outliers, because a mean is easily influenced by extreme values. Our dataset contains data points [11] that form groups of varying density, which prevents the DBSCAN algorithm from clustering the data points [11] well. Since clustering in DBSCAN depends on the Eps and minimum number of points [8] parameters, these cannot be chosen separately for each cluster. A limitation of the country dataset is that it contains many outliers; but since the dataset is small (167 rows only), we did not remove the outliers, as that would further reduce the number of rows.


References

1. Khandare A, Alvi A (2016) Survey of improved k-means clustering algorithms: improvements, shortcomings and scope for further enhancement and scalability. In: Information systems design and intelligent applications, proceedings of third international conference, India, vol 2
2. Davidson I, Ravi SS (2009) Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Mining Knowl Discov 18(2):257–282, April 2009
3. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density based algorithm for discovering clusters in large spatial databases with noise. The Association for the Advancement of Artificial Intelligence (AAAI)
4. Chitra K, Maheswari D (2017) A comparative study of various clustering algorithms in data mining. Int J Comput Sci Mob Comput 6(8):109–115
5. Ahmad PH, Dang S (2015) Performance evaluation of clustering algorithm using different datasets. J Inform Eng Appl 5(1). ISSN 2224-5782 (print), ISSN 2225-0506 (online)
6. Na S, Xumin L, Yong G (2010) Research on k-means clustering algorithm: an improved k-means clustering algorithm. In: Third international symposium on intelligent information technology and security informatics, Jian, China, pp 63–67
7. Valarmathy N, Krishnaveni S (2019) Performance evaluation and comparison of clustering algorithms used in educational data mining. Int J Recent Technol Eng (IJRTE) 7(6S5), April 2019. ISSN 2277-3878
8. Tiwari KK, Raguvanshi V, Jain A (2016) DBSCAN: an assessment of density based clustering and its approaches. Int J Sci Res Eng Trends 2(5), September 2016
9. Patel D, Modi R, Sarvakar K (2014) A comparative study of clustering data mining: techniques and research challenges. Int J Latest Technol Eng Manag Appl Sci (IJLTEMAS) iii:67–70
10. Paprotny D (2021) Convergence between developed and developing countries: a centennial perspective. Soc Indic Res 153:193–225. https://doi.org/10.1007/s11205-020-02488-4
11. Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2:165–193. https://doi.org/10.1007/s40745-015-0040-1
12. Ankerst M, Breunig M, Kriegel H-P, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings ACM SIGMOD international conference on management of data, June 1–3, 1999, Philadelphia, Pennsylvania, USA
13. Zhang T, Ramakrishnan R, Livny M (1997) BIRCH: a new data clustering algorithm and its applications. Data Mining Knowl Discov 1:141–182. https://doi.org/10.1023/A:1009783824328
14. Ng R, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14:1003–1016. https://doi.org/10.1109/TKDE.2002.1033770
15. Yadav J, Kumar D (2014) Subspace clustering using CLIQUE: an exploratory study. Int J Adv Res Comput Eng Technol (IJARCET) 3(2), February 2014
16. Venkat Reddy M, Vivekananda M, Satish RUVN (2017) Divisive hierarchical clustering with k-means and agglomerative hierarchical clustering. Int J Comput Sci Trends Technol (IJCST) 5(5), September–October 2017
17. Rajurkar PP, Bhor AG, Rahane KK, Pathak NS, Chaudhari AN (2015) Efficient information retrieval through comparison of dimensionality reduction techniques with clustering approach. Int J Comput Appl (0975–8887) 129(4), November 2015
18. Davidson I, Ravi SS (2005) Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: Jorge AM, Torgo L, Brazdil P, Camacho R, Gama J (eds) Knowledge discovery in databases: PKDD 2005. Lecture notes in computer science, vol 3721. Springer, Berlin, Heidelberg


19. El-sonbaty Y, Ismail M, Farouk M (2004) An efficient density based clustering algorithm for large databases. In: Proceedings of the 16th IEEE international conference on tools with artificial intelligence (ICTAI 2004)
20. Suthar N, jeet Rajput I, Gupta VK (2013) A technical survey on DBSCAN clustering algorithm. Int J Sci Eng Res 4(5), May 2013
21. Rousseeuw P (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20(1):53–65
22. Liu Y, Li Z, Xiong H, Gao X, Wu J (2010) Understanding of internal clustering validation measures. In: 2010 IEEE international conference on data mining, pp 911–916. https://doi.org/10.1109/ICDM.2010.35
23. Aranganayagi S, Thangavel K (2007) Clustering categorical data using silhouette coefficient as a relocating measure. In: International conference on computational intelligence and multimedia applications (ICCIMA 2007), Sivakasi, India, pp 13–17. https://doi.org/10.1109/ICCIMA.2007.328
24. Davies D, Bouldin D (1979) A cluster separation measure. IEEE PAMI 1(2):224–227
25. Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat 3(1):1–27
26. Syakur M, Khotimah B, Rohman E, Dwi Satoto B (2018) Integration K-means clustering method and elbow method for identification of the best customer profile cluster. IOP Conf Ser: Mater Sci Eng 336:012017. https://doi.org/10.1088/1757-899X/336/1/012017
27. Ma X, Dhavala S (2018) Hierarchical clustering with prior knowledge. In: Proceedings of ACM conference (conference'17). ACM, New York, NY, USA, 9 p

Refactoring Faces Under Bounding Box Using Instance Segmentation Algorithms in Deep Learning for Replacement of Editing Tools Raunak M. Joshi(B) and Deven Shah Thakur College of Engineering and Technology, Mumbai, MH, India

Abstract. An immense amount of money and time is invested in video and photo editing, involving a large number of resources, both software and hardware. The software is handled by an entire editing team, while the hardware supports the editing tools. This resource exhaustion is detrimental to the reach of an organization. The work we present in this paper describes an approach to this problem. Deep Learning is already used as a replacement for much heavy, monotonous work, and the same idea can be extended to this specific problem: automating it using a Neural Network intended for Image Processing applications. This network directly does all the necessary work and refactors the entire image or video without using any specific tool or special effect. Keywords: Deep learning · Image processing · Image segmentation · Object detection

1 Introduction

Deep Learning has proved to be a successful development in Machine Learning and Statistics. Deep Learning [1] is a non-parametric predictive analysis approach that uses all aspects of Machine Learning as its building blocks. It is a blend of mathematical concepts such as Inferential Statistics, Linear Algebra and Calculus. A standard Deep Learning approach is performed using an Artificial Neural Network [2]. These networks are inspired by the biological neural networks of the human brain. Artificial Neural Networks are abbreviated as ANN and have become denser over time; dense networks have more activation layers, which promotes better learning. A standard Deep Neural Network operates in two phases: Forward Propagation [3] and Backward Propagation [4, 5]. Forward propagation is the process in which the network computes the parameters necessary for learning. Backward propagation is the process in which the network computes the loss and tries to make it converge. Any standard Neural Network consists of one input layer, one output layer and several hidden layers in between; these hidden layers differ in parameters with respect to the data. The input layer takes the input data in the form of a matrix. These values in the matrix

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022. V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_20


are the features of data. The input layer will arrange the matrix in a single long vector which holds values sequentially. The values in the vector will interact with all the neurons of the layers. These neurons are responsible for holding and learning the parameters. All of the neurons will be intertwined with the neurons in the preceding layer. The output layer is the last layer which gives a distinct value with respect to the values in the labels. Each and every single layer has the values from intertwined neurons in the form of weights. These weights are multiplied with feature values. An arbitrary seed value known as Bias is also added to it. The task of Bias is to maintain an absolute value of the equation as Neural Networks do not perform efficiently with negative values. This entire function is inference of every single neuron. It is later given to an Activation Function. It is used to generate the values of every single hidden layer. A standard activation function is a high level mathematical function like Rectified Linear Unit, Tangent Hyperbolic or Sigmoid function which is responsible for learning all the parameters with respect to that layer. These values are then given as inputs to next layers. The entire procedure that you read so far now is the forward propagation. So backward propagation is another side of coin. The neural network often does learn all the parameters in the first pass. So, there is a significant loss of information which can be retained using the backward propagation. The backward propagation calculates the loss [6] using metrics and then tries to apply loss optimizers. These loss optimizers [7] are high end mathematical functions designed for converging the loss. These loss optimizers are known as converging algorithms. The most common type of loss optimizer in deep learning is Gradient Descent [8]. Calculus is the basis of this algorithm. 
The backward propagation uses loss optimizers to converge the loss generated by the earlier activation layers. The entire neural network then reiterates to generate the desired output. A single pass of one forward propagation and one backward propagation is considered one epoch; a standard state-of-the-art neural network trains for many epochs.
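As an illustration of the two phases described above, the following sketch trains a tiny fully connected network with NumPy. The layer sizes, learning rate, and toy data are our own illustrative choices, not values taken from any network discussed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                       # 8 samples, 3 features
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)

W1, b1 = rng.normal(size=(3, 4)) * 0.5, np.zeros(4)   # hidden layer
W2, b2 = rng.normal(size=(4, 1)) * 0.5, np.zeros(1)   # output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for epoch in range(300):                          # one epoch = forward + backward
    # Forward propagation: weights * features + bias, then activation
    h = np.maximum(0, X @ W1 + b1)                # ReLU hidden layer
    p = sigmoid(h @ W2 + b2)                      # sigmoid output layer
    losses.append(float(np.mean((p - y) ** 2)))   # mean squared error loss

    # Backward propagation: gradients of the loss, then gradient descent
    dp = 2 * (p - y) * p * (1 - p) / len(X)
    dW2, db2 = h.T @ dp, dp.sum(axis=0)
    dh = (dp @ W2.T) * (h > 0)
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= 0.5 * grad                       # learning rate is illustrative

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Watching the loss shrink over epochs is exactly the convergence behaviour the optimizers in the text are responsible for.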

2 Methodology

The methodology we prescribe in this paper is a concise study of the methods we used to solve the problem statement. The problem statement is to design a Deep Learning network that substitutes the traditional editing procedures for noise induction, preventing resource exhaustion of both software and hardware. The methodology concentrates on the area of Image Processing. Image Processing applications in Deep Learning are typically solved using Convolutional Neural Networks [9–11]. A Convolutional Neural Network is a fully functional neural network that takes images as input and produces the desired output; it works like a typical neural network, with convolutions and max pooling as the feature extraction mechanism feeding the activation layers. This discovery was groundbreaking, yet our system requires something more precise (Fig. 1).

R. M. Joshi and D. Shah

Fig. 1 Convolution neural network (Source Sumit Saha—a comprehensive guide to convolutional neural networks—the ELI5 way)

The first approach we considered was an Object Detection method formulated specifically for Deep Learning. Object Detection was a good start, as we wanted something that could detect faces in videos or images. Object detection works on the dynamics of bounding boxes, which detect the objects with cascades. We were fortunate to find an object detection algorithm designed specifically for Deep Learning: You Only Look Once, abbreviated YOLO [12]. The bounding boxes scan for objects in every single frame, and overlapping boxes are resolved using Intersection over Union (IoU). YOLO was an appropriate starting algorithm for our problem, as we needed to detect the faces in our video frames (Fig. 2).
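The Intersection over Union test that YOLO uses to score overlapping bounding boxes is simple to state in code. This is a generic sketch with an assumed (x1, y1, x2, y2) corner convention, not the paper's implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)        # union = sum - intersection

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # partial overlap: 25 / 175
```

Boxes whose IoU exceeds some threshold are treated as duplicate detections of the same object and suppressed.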

Fig. 2 YOLO object detection network (Source you only look once: unified, real-time object detection, 2016 IEEE conference on computer vision and pattern recognition (CVPR))

Object Detection only provides a detection mechanism, but our problem demanded something more precise. Image Segmentation was a better approach, as we wanted to extract specific units from the input. Image Segmentation differs considerably from detection: detection only spots an object in a frame and reports it, whereas we specifically wanted to separate a region from the entire portion of the video and induce noise in it. Before looking for an amalgamation of object detection and image segmentation, we decided to search for a standard image segmentation algorithm. The first algorithm we came across was the Deep Residual Network [13], commonly known as ResNet. ResNet is built on residual representations and shortcut connections; the later DenseNet [14] builds on related ideas. ResNet certainly has limitations, but it was our premise for moving on to a more developed form of Image Segmentation Deep Learning algorithm (Fig. 3). In search of a better-curated solution we also took autoencoders into consideration. An advanced network with many stacked layers is not always reasonable and can have poor space and time complexity. With this in mind we extended our search to autoencoders and came across U-Net [15]. It was designed for biomedical image segmentation and is very powerful. It is not a traditional Convolutional Neural Network but a very dense convolutional autoencoder; it uses skip connections and has over 30 million parameters (Fig. 4). Although the search for an improved image segmentation algorithm was satisfactory after U-Net, we had failed to foresee an important point: driven by the image segmentation algorithms, we had lost the object detection basis. U-Net is an image segmentation algorithm, but it performs semantic segmentation, whereas we wanted something that could separate individual instances in the input frame. This is viable only with an instance segmentation approach. The distinction is thoroughly examined in Conditional Random Fields Meet Deep Neural Networks for Semantic Segmentation [16], which points out the major differences between semantic and instance segmentation (Fig. 5). The first candidate algorithm we then came across was R-CNN [17].
It is an amalgamation of object detection and image segmentation. The issue was that it is a semantic segmentation algorithm, but since focusing only on instance segmentation would have been too restrictive, we also considered algorithms designed to handle both tasks while keeping our main motive in view. R-CNN is a high-dimensional Convolutional Neural Network that operates on region proposals and localizes objects using bounding box regression. It is a very mature and extensive network, which gave us a new vision for our problem (Fig. 6). We had an intuition that R-CNN would not provide everything we were looking for: its computational time is very high. The algorithm was certainly a fresh start for overcoming our problems, but it requires well-designed optimization for efficiency, and its computational efficiency is quite poor compared to YOLO. We wanted an algorithm that would offer computational ability comparable to YOLO while still getting the job done. Along the way we found SPP-Net, which improves on R-CNN, but it is a visual recognition algorithm, and adopting it would again digress from the image segmentation goal. After a thorough search we found Fast R-CNN [18], designed by the same author who contributed R-CNN. Fast R-CNN replaces R-CNN's multi-stage pipeline with single-stage training built around a Region of Interest (RoI) [19] pooling layer, and uses a truncated SVD (Singular Value Decomposition [20]) for faster detection. It trains a very deep VGG-16 [21] network 9× faster than traditional R-CNN.
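The truncated SVD speed-up mentioned above can be illustrated in a few lines of NumPy: a fully connected layer's weight matrix W is factored, only the top-k singular values are kept, and one large matrix multiply is replaced by two smaller ones. The matrix size and the rank k below are illustrative, not those of VGG-16.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(256, 128))          # stand-in for an fc weight matrix
U, s, Vt = np.linalg.svd(W, full_matrices=False)

k = 32                                   # keep only the top-k singular values
W1 = U[:, :k] * s[:k]                    # shape (256, k)
W2 = Vt[:k]                              # shape (k, 128)

x = rng.normal(size=128)
full = W @ x                             # original layer: 256*128 multiplies
approx = W1 @ (W2 @ x)                   # two cheaper layers: k*(256+128)

rel_err = np.linalg.norm(full - approx) / np.linalg.norm(full)
print(f"relative error at rank {k}: {rel_err:.2f}")
```

The factorization trades a small approximation error for a large reduction in multiply count, which is why Fast R-CNN applies it only at test time.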


Fig. 3 Deep residual network (Source deep residual learning for image recognition—2016 IEEE conference on computer vision and pattern recognition (CVPR))

Later, our search for a more advanced network led us to Faster R-CNN [22]. It is the fastest of the R-CNN family, faster even than Fast R-CNN. The Region Proposal Network is the core principle of Faster R-CNN: the authors successfully merged a Region Proposal Network (RPN) with Fast R-CNN, so the network simultaneously proposes object regions and their respective probabilities (Fig. 7). The next thing we wanted was real-time instance segmentation, which we found in the Yolact algorithm [23]. It was the most appropriate algorithm for our problem because it covered all the criteria. It produces robust, high-quality masks; prototypes for masking are built into this network. The prototype generation branch, also known as protonet, predicts a set of prototype masks for the entire input. It was the last network we came across, and it matched our problem statement (Fig. 8).

Fig. 4 U-Net (Source U-net: convolutional networks for biomedical image segmentation, medical image computing and computer-assisted intervention—MICCAI 2015. Lecture notes in computer science, vol 9351. Springer, Cham)

Fig. 5 Semantic Segmentation versus Instance Segmentation (Source conditional random fields meet deep neural networks for semantic segmentation: combining probabilistic graphical models with deep learning for structured prediction, IEEE signal processing magazine (volume: 35, issue: 1, Jan. 2018))


Fig. 6 R-CNN (Source Rich feature hierarchies for accurate object detection and semantic segmentation, 2014 IEEE conference on computer vision and pattern recognition)

Fig. 7 Comparison of testing time (Source Rohith Gandhi—understanding object detection algorithms—towards data science)

Fig. 8 Yolact (Source YOLACT: real-time instance segmentation, 2016 IEEE conference on computer vision and pattern recognition (CVPR))

3 Implementation

Since we had found a network to serve as our basis, designing a network from scratch was unnecessary, but the task of noise induction remained. We started with the method of decolorization: Yolact creates high-quality masks that are separated by color codes according to the detected instances and objects, so the idea was to decolorize the masks by setting their RGB codes to zero. Decolorization worked, but we wanted faces specifically. An extension of this approach was achieved by reducing the box size and removing masks. This approach differs somewhat from traditional transfer learning: since we wanted to reduce computational complexity, we reuse the transferred weights and make changes only in the last layer. Noise induction has many techniques, but we chose the most basic, Gaussian Noise [24], a statistical noise with a Gaussian probability density function. We used the Yolact model pretrained on the COCO dataset [25], since training the current approach on ImageNet [26] is nearly impossible with our computational power; the weights are provided by the author. The backbone we used for the weights is ResNet50-FPN [27]. Transfer learning [28–30] was performed by adding Gaussian blur in the output layer: the mask was compressed specifically to the face, and the noise was then added (Figs. 9, 10, 11, 12, 13 and 14).
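The noise-induction step can be sketched with NumPy alone: Gaussian noise is added only to the pixels inside the face mask. The mask below is a synthetic stand-in for the compressed Yolact face mask, and the noise standard deviation is an assumed value, not one from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
frame = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float32)

mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True               # stand-in for a detected face region

noise = rng.normal(0.0, 25.0, size=frame.shape)   # sigma = 25 is an assumption
blurred = frame.copy()
blurred[mask] = np.clip(frame[mask] + noise[mask], 0, 255)

print(np.array_equal(blurred[~mask], frame[~mask]))  # True: only the mask changed
```

Because the perturbation is restricted to the mask, everything outside the face is returned untouched, which is what lets this substitute for a manual blur in an editing tool.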

Fig. 9 Input image

Fig. 10 Output image where the face is blurred


Fig. 11 Results on multiple faces input image

Fig. 12 Results on multiple faces output image

4 Conclusion

In this paper we found an approach to tackle the problem of replacing editing tools for the blur effect. This paper presents a curated study of various deep neural networks that can help one reach the desired outcome, and our approach creates a pathway for readers to solve the problem for images as well as videos. The project uses an instance segmentation algorithm with a transfer learning approach to induce Gaussian noise, and it is capable of replacing any heavy image/video editing tool. This application has much more scope in the future.

Fig. 13 Frames of input video


Fig. 14 Result on output video

References

1. Goodfellow IJ, Bengio Y, Courville AC (2015) Deep learning. Nature 521:436–444
2. Mishra M, Srivastava M (2014) A view of artificial neural network. In: 2014 international conference on advances in engineering technology research (ICAETR-2014), pp 1–3
3. Hirasawa K, Ohbayashi M, Koga M, Harada M (1996) Forward propagation universal learning network. In: Proceedings of international conference on neural networks (ICNN'96), vol 1, pp 353–358
4. Hecht-Nielsen R (1989) Theory of the backpropagation neural network. In: International 1989 joint conference on neural networks, vol 1, pp 593–605
5. Rumelhart D, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536
6. Janocha K, Czarnecki W (2017) On loss functions for deep neural networks in classification. ArXiv abs/1702.05659
7. Sun S, Cao Z, Zhu H, Zhao J (2020) A survey of optimization methods from a machine learning perspective. IEEE Trans Cybern 50:3668–3681
8. Ruder S (2016) An overview of gradient descent optimization algorithms. ArXiv abs/1609.04747
9. LeCun Y, Haffner P, Bottou L, Bengio Y (1999) Object recognition with gradient-based learning. In: Shape, contour and grouping in computer vision
10. Yu S, Wickstrøm K, Jenssen R, Príncipe J (2021) Understanding convolutional neural networks with information theory: an initial exploration. IEEE Trans Neural Netw Learn Syst 32:435–442
11. Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, Liu T, Wang X, Wang G, Cai J, Chen T (2018) Recent advances in convolutional neural networks. Pattern Recognit 77:354–377
12. Redmon J, Divvala S, Girshick RB, Farhadi A (2016) You only look once: unified, real-time object detection. In: IEEE conference on computer vision and pattern recognition (CVPR) 2016, pp 779–788
13. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition (CVPR) 2016, pp 770–778
14. Huang G, Liu Z, Weinberger KQ (2017) Densely connected convolutional networks. In: IEEE conference on computer vision and pattern recognition (CVPR) 2017, pp 2261–2269
15. Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. ArXiv abs/1505.04597
16. Arnab A, Zheng S, Jayasumana S, Romera-Paredes B, Larsson M, Kirillov A, Savchynskyy B, Rother C, Kahl F, Torr P (2018) Conditional random fields meet deep neural networks for semantic segmentation: combining probabilistic graphical models with deep learning for structured prediction. IEEE Signal Process Mag 35:37–52
17. Girshick RB, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE conference on computer vision and pattern recognition 2014, pp 580–587
18. Girshick RB (2015) Fast R-CNN. In: IEEE international conference on computer vision (ICCV) 2015, pp 1440–1448
19. Lin H, Si J, Abousleman G (2007) Region-of-interest detection and its application to image segmentation and compression. In: International conference on integration of knowledge intensive multi-agent systems 2007, pp 306–311
20. Klema V, Laub A (1980) The singular value decomposition: its computation and some applications. IEEE Trans Autom Control 25:164–176
21. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556
22. Ren S, He K, Girshick RB, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39:1137–1149
23. Bolya D, Zhou C, Xiao F, Lee Y (2019) YOLACT: real-time instance segmentation. In: IEEE/CVF international conference on computer vision (ICCV) 2019, pp 9156–9165
24. Boyat A, Joshi B (2015) A review paper: noise models in digital image processing. ArXiv abs/1505.03489
25. Lin T, Maire M, Belongie SJ, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: ECCV
26. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein MS, Berg A, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vision 115:211–252
27. Lin T, Dollár P, Girshick RB, He K, Hariharan B, Belongie SJ (2017) Feature pyramid networks for object detection. In: IEEE conference on computer vision and pattern recognition (CVPR) 2017, pp 936–944
28. Tan C, Sun F, Kong T, Zhang W, Yang C, Liu C (2018) A survey on deep transfer learning. In: ICANN
29. Mahbub H, Jordan B, Diego F (2018) A study on CNN transfer learning for image classification. In: UKCI 2018: 18th annual UK workshop on computational intelligence, Nottingham
30. Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? In: NIPS

Motion Detection and Alert System

M. D. N. Akash, CH. Mahesh Kumar, G. Bhageerath Chakravorthy, and Rajanikanth Aluvalu

Department of CSE, Vardhaman College of Engineering, Hyderabad, India

Abstract. In the era of the internet, with an expanding number of individuals as its end users, countless trespassers appear every day. In 2016, four Pakistani terrorists illegally entered an Indian army base and carried out grenade attacks on the security forces. Tracking the movement and location of trespassers, and detecting the intervention of unknown or unauthorized persons with the assistance of Intruder Detection Systems, is the work proposed in this paper. Intruder Detection Systems are a developing research trend in today's technological world. Existing investigations and their use of technology show the limited capability of current AI approaches in dealing with Intruder Detection Systems. In this paper, we aim to clarify and improve the identification rate of an Intruder Detection System by utilizing OpenCV, an open-source computer vision library available in Python. Keywords: OpenCV · Image thresholding · Image smoothing · Contours

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022. V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_21

1 Introduction

Nowadays, human behavior and activity are central to surveillance, and identifying and tracking human behavior is the main task of a video surveillance system. Video surveillance is used mainly to identify persons entering prohibited or restricted places and to identify activities happening in crowded places, because a single person cannot observe everything happening in a crowded or restricted place and cannot monitor it all the time. These days we have CCTV cameras everywhere for security purposes, but there is no alert system if an intruder enters a restricted area without permission or access. CCTV cameras just record video, and a person cannot sit in front of the screen all the time; the detection system addresses this, so that if an intruder enters an unauthorized place, it sends an alert and the authorized person knows that an intruder has entered. The main motivation behind this paper is the alarming crime rate in India. When intruders enter prohibited or restricted places, we cannot easily find out what exactly they are doing there. To address this problem we propose a motion detection and alert system: if an intruder enters any restricted or prohibited area, it immediately sends an alert to the registered authorized persons. It can also help control the crime rate. It is implemented using OpenCV with Python. With the rapid increase in technology, it is easy to identify intruders or any suspicious


activities through a video surveillance system, and it reduces manual work. For a person, it is very difficult to identify and analyze human behavior, and difficult to distinguish a usual person from an unusual one; a video surveillance system makes this identification easier. The main goal is to reduce crimes and threatening events; the task is to ensure that unusual events can be located by using this surveillance system, which can be manual or automatic. "Motion Detection and Alert System" is a software application used to stop intruders and unauthorized persons from entering restricted or prohibited places. We detect intruder entry by detecting the motion of any person entering the area. If a person is detected, the system sends an alert that an intruder is detected, so the authorized persons get the information, are alerted, and can take the necessary steps to prevent unusual events at an early stage. We can store the data of persons entering prohibited areas, so that if a person later enters another prohibited place, it is easy to identify him. This is done using OpenCV with Python, and it can be very accurate in identifying intruders entering restricted or prohibited areas.

2 Related Study

The existing systems use thermal cameras for intruder detection [1, 2]. For the most unfavorable weather conditions, a tool has emerged for seeing in the darkness: Thermal Imaging. It detects the thermal energy released from an object. Thermal imaging cameras provide images of otherwise invisible infrared radiation. Based on the temperature differences between objects, thermal imaging produces a crisp image on which the smallest details can be displayed [3]. Thermal imaging cameras work during both night-time and daytime. Umamaheswari et al. (2020) proposed maximum edge patterns for facial expression analysis [4].

2.1 Advantages

1. It can be used in the daytime and at night.
2. It uses data from PIR sensors.
3. It works in almost all weather conditions.
4. It can detect objects even when they are shielded by other physical objects.
5. It can see within light fog, smoke, and rain.

2.2 Disadvantages

1. It does not provide any physical barrier.
2. It cannot identify intruders, but it can spot them [5, 6].
3. Thermal images cannot be captured through certain materials like glass and water.
4. High cost.
5. There are several techniques by which image capturing can be restricted.
6. Using the blanket technique, the temperature recordings can be bypassed directly.


3 Proposed Model

The proposed system uses OpenCV to identify and detect any unusual entry of intruders into the restricted area. This technology is cost-effective. The input video is passed through the OpenCV code developed for intruder detection and alerting the authority. The input video first undergoes pre-processing, and then the features are extracted [7]. At this point, if an intruder is detected, the OpenCV code sends an alert mail to the authority, and the output video is generated and stored in local storage. The output video is also generated if no intruder is detected (Figs. 1 and 2).

Fig. 1 Architecture diagram

3.1 Proposed Model Objectives

1. To detect intruders entering the defined/restricted premises.
2. To warn the authorized person about the detection of intruders through email.
3. To show the detection status in the generated output video.

3.2 Proposed Model Outcomes

1. Accepting dynamic input and producing an output video.
2. If a person is detected in the prohibited area, an alert mail is sent immediately to the concerned team.
3. The date and time stamp are recorded and attached to the alert message.


Fig. 2 Class diagram

3.3 Proposed Model Advantages

1. It is very cost-effective.
2. It works in all climatic conditions.
3. It can see within fog, smoke, and rain.
4. Intruders are easily detected and identified.

4 Experimental Setup

4.1 Sensors from Which Data is Taken

The data for the project is taken from CCTV cameras, other video recording devices, or visible-light sensors.

4.2 Taking Video Input

In order to detect motion, we take video as input. Upon starting the program, the user is prompted to select the video to be given as input.


4.3 Motion Detection

The main objective is to detect motion in the video. Motion is detected by calculating the absolute difference between the frames read by the program; if the absolute difference between the frames is more than the experimentally determined value, we say that motion is detected in the video [8, 9].

4.4 Sending Mails

After detecting motion in the given video, the user is alerted via email. In order to send a mail using the smtplib library, we provide the mail id, password, and port number to the methods provided by the library.

4.5 Local Storage

The video is stored on the local system, and when motion is detected, the status is shown on the video screen and boundaries are marked around the object (Figs. 3 and 4).
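The absolute-difference test of Sect. 4.3 can be sketched without OpenCV using plain NumPy. The per-pixel threshold and the minimum changed-pixel count below are illustrative stand-ins for the experimentally determined value.

```python
import numpy as np

def motion_detected(prev_frame, curr_frame, pixel_thresh=25, min_pixels=50):
    # Absolute per-pixel difference between consecutive grayscale frames.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    # Motion is declared when enough pixels change by more than the threshold.
    changed = int(np.count_nonzero(diff > pixel_thresh))
    return changed >= min_pixels

static = np.full((120, 160), 100, dtype=np.uint8)
moved = static.copy()
moved[40:80, 60:100] = 200               # an "object" entering the frame

print(motion_detected(static, static))   # False: frames are identical
print(motion_detected(static, moved))    # True: a block of pixels changed
```

The cast to int16 matters: subtracting uint8 frames directly would wrap around instead of producing a signed difference.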

Fig. 3 Use case diagram


Fig. 4 Comparison chart (scaling of visible versus thermal detection)

5 Proposed Algorithm

Step 1: Pre-processing.
Step 2: Simple thresholding.
Step 3: Smoothing.
Step 4: Finding contours.
Step 5: Contour approximation.
Step 6: Contour area.
Step 7: Drawing rectangles and output.
Step 8: Sending mails.

6 Experimental Procedure

This is a real-world framework for human motion detection and tracking. The module mainly requires functions and algorithms from Intel's OpenCV library. On the hardware side, we used CCTV camera recordings for testing and for implementing motion detection. The experimental procedure follows several stages.


6.1 Pre-processing

The OpenCV function VideoCapture imports video from a connected camera. Each input frame has 640 × 480 pixels with a 3-channel (red, green, blue) RGB color scheme. The system converts the captured image from the RGB color format to a grayscale space that has just a single value per pixel. This step reduces the amount of computation and accordingly enables real-time use [10, 11].

6.2 Simple Thresholding

For every pixel, the same threshold value is applied. If the pixel value is smaller than the threshold, it is set to 0; otherwise it is set to the maximum value [12, 13]. The function cv.threshold is used to apply the thresholding. The first argument is the source image, which should be a grayscale image. The second argument is the threshold value used to classify the pixel values. The third argument is the maximum value, which is assigned to pixel values exceeding the threshold [14]. The purpose of thresholding is to simplify visual information for analysis. First, you may convert to grayscale, but grayscale still has up to 255 values. What thresholding does, at the most basic level, is convert everything to white or black based on a threshold value. Say we want the threshold to be 125 (out of 255); then everything at 125 and below is converted to 0, or black, and everything above 125 is converted to 255, or white. If you convert to grayscale first, as you typically will, you get white and black. If you do not convert to grayscale, you still get thresholded images, but with color [15, 16].

6.3 Smoothing

As with any other signal, images can also contain various kinds of noise, particularly from the source (camera sensor). Image smoothing methods help in reducing the noise.
In OpenCV, image smoothing (also called blurring) can be done in many ways. Here we use the Gaussian filter for image smoothing [17]. Gaussian filters have the property of producing no overshoot in response to a step-function input while minimizing the rise and fall time [18]. In terms of image processing, any sharp edges in images are smoothed while excessive blurring is limited. Smoothing helps in removing noise, separating individual elements and joining disparate elements in an image, as well as finding boundaries or holes in an image [19, 20].

6.4 Finding Contours

Contours can be explained simply as a curve joining all the continuous points (along the boundary) having the same color or intensity. Contours are a useful tool for shape analysis and object detection and recognition. For better accuracy, use binary images [21]. So, prior to finding contours, apply thresholding or Canny edge detection [18, 22]. In OpenCV, finding contours is like finding a white object on a dark background [23].


6.5 Contour Approximation

This approximates a contour shape to another shape with fewer vertices, depending on the accuracy we specify. It is an implementation of the Douglas-Peucker algorithm [13, 24].

6.6 Contour Area

The contour area is given by the function cv.contourArea() [25].

6.7 Drawing Rectangles and Output

After finding the contours and retrieving the NumPy array of contour coordinates, we calculate the area occupied by those contours and keep a record of these areas. After several experiments, we concluded that if the area is more than 900, motion is detected in the given frame. Rectangles are drawn around the identified objects [26]. After detecting motion, the status 'Intruder detected' is written on the frame to indicate that motion was detected, and the output video is stored in local storage [27].

6.8 Sending Mails

The smtplib module defines a Simple Mail Transfer Protocol client session object which can be used to send email to any Internet machine with an ESMTP or SMTP listener daemon. Mails are sent to the recipient using SMTP.sendmail() [28].

6.9 Experimental Data Used

There are several types of experimental data: raw frames, pre-processed frames, the contour array, contour areas, and email credentials.

1. The main source of data for the project is the feed taken from CCTV cameras and other types of video recording devices.
2. The experimental data is in the form of frames, which are transmitted to the various modules during execution; first, the data is read in the form of frames.
3. These frames are first pre-processed and converted to readable frames.
4. Each pre-processed frame is sent to thresholding to maintain the pixel balance.
5. The frames are dilated and sent for contour retrieval.
6. The contour data is stored in arrays and sent for contour approximation.
7. After successful approximation, the contour area is calculated, which is used to determine motion.
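The alert mail of Sect. 6.8 can be sketched with the standard library. Building the message is shown in full; the actual smtplib send is left commented out, since it needs real credentials, a server address, and a port. All addresses below are placeholders, not the paper's.

```python
from email.message import EmailMessage

def build_alert(timestamp, sender="camera@example.com",
                recipient="security@example.com"):
    # Compose the alert with the date and time stamp attached (Sect. 3.2).
    msg = EmailMessage()
    msg["Subject"] = "Intruder detected"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"Motion detected at {timestamp}. "
                    "The output video has been saved to local storage.")
    return msg

alert = build_alert("2021-01-01 12:00:00")
print(alert["Subject"])  # Intruder detected

# Sending the message (sketch; requires a real server and credentials):
# import smtplib
# with smtplib.SMTP("smtp.example.com", 587) as s:
#     s.starttls()
#     s.login("camera@example.com", "password")
#     s.send_message(alert)
```

Separating message construction from transport keeps the alert testable without network access.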


6.10 Control Data

The calculated contour area is the control data of our project. Motion detection is decided based on the variation of this contour data: as per our calculations, if the contour area is greater than 900, then motion is detected.

7 Results and Discussions

1. The output data is in the video format specified by the user, and the video is stored in the location specified by the user; the output video can also be used for further purposes in the future, such as face recognition.
2. No moving object can escape the proposed motion detection system, because visible cameras cannot be blocked directly using any material, visible detection is very powerful during the daytime, and there is no loss of image during transmission.
3. On detecting motion, the proposed system sends an alert mail to the registered person, which can reach anywhere in the world, unlike conventional systems that produce mechanical sounds or an alarm, which may not be effective for long-distance alerting.
4. For thermal-imaging-based motion detection systems, the output data is in the form of thermal images, because the images are captured through PIR sensors.
5. A straightforward strategy to block IR is a conventional 'space blanket', 'emergency blanket', or thermal blanket. They are made of Mylar foil materials, which block IR and cause the cameras to fail to capture the readings [29].
6. Conventional motion detection systems which use thermal imaging are mostly dependent on hardware and hence produce signals to the antenna, which may have limited range and may not be effective.
7. The result obtained will be of the quality and format specified by the user while saving the video.
8. The video containing the intruder is automatically saved on the local machine on which the program is being executed.

7.1 Existing System Outcomes

1. The output data is in the format of thermal imaging.
2. A straightforward strategy to block IR is a conventional 'space blanket', 'emergency blanket', or thermal blanket. They are made of Mylar foil materials and will block IR imagery.
3. Conventional motion detection systems which use thermal imaging are mostly dependent on hardware and hence produce signals to the antenna, which may have limited range.

Motion Detection and Alert System


7.2 Proposed System Outcomes

1. The output data is a video in the mode specified by the user.
2. No moving object can escape the proposed motion detection system, because visible cameras cannot be blocked directly with any material.
3. On detecting motion, the proposed system sends an alert mail to the registered person, anywhere in the world.
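A minimal sketch of the mail alert in outcome 3, using Python's standard smtplib and email modules. The addresses, subject line, SMTP server and port are placeholders, not values from the paper, and credentials are omitted:

```python
import smtplib
from datetime import datetime
from email.message import EmailMessage

def build_alert(sender: str, recipient: str) -> EmailMessage:
    """Compose the intrusion-alert mail with a timestamp, as the proposed system does."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Motion detected: intruder alert"
    msg.set_content(
        f"Motion was detected at {datetime.now().isoformat(timespec='seconds')}. "
        "The recorded video has been saved on the local machine."
    )
    return msg

def send_alert(msg: EmailMessage, server: str = "smtp.example.com", port: int = 587) -> None:
    """Deliver the alert over SMTP with STARTTLS (login step omitted)."""
    with smtplib.SMTP(server, port) as smtp:
        smtp.starttls()
        smtp.send_message(msg)

alert = build_alert("camera@example.com", "owner@example.com")
print(alert["Subject"])   # Motion detected: intruder alert
```

Because SMTP reaches any mailbox in the world, this replaces the short-range mechanical alarm of conventional systems, as the text notes.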

8 Future Scope

There is good scope for motion detection and alert systems. The system can be deployed in restricted places such as military areas: sectors where entry is prohibited and where, at night, it is difficult to identify who has entered. With this system in place, intruders are detected easily and an alert is sent immediately to the authorized persons, so that they can respond at once and prevent entry into the area. The system can be extended with image processing and IoT: with image processing it can detect and differentiate authorized and unauthorized persons and report who has entered the region. Beyond military areas, it is also very useful in museums and industrial compounds.

9 Conclusion

This paper addresses the problem of detecting intruders entering prohibited places, that is, abnormal behaviour and motion detection. Intruders recorded by a CCTV camera are detected, which can be used for security purposes and to monitor restricted areas more effectively. If any intruder enters a restricted or prohibited area, or any motion is detected, a warning mail with a timestamp is sent to the registered authority, and the output video in which the motion was detected is stored on the system for future reference. The detection system is secure to use and detects intruders easily. There is good scope for intruder detection using visible cameras. Although thermal cameras are used for intruder detection, thermal imaging has flaws: a thermal blanket less than one millimetre thick absorbed 94% of the infrared light it encountered, and capturing that much light means warm objects beneath the cloaking material become undetectable to infrared detectors [30], so visible detection comes to the rescue. As technology advances, a face detection system can be integrated to make the system more effective: multiple persons can be detected, the count of persons entering a surveillance area can be tracked, and authorized and unauthorized persons can be separated by saving their data in a database. In the future, a database of persons can be maintained so that if a person re-enters a prohibited place, their record is shown and an alert is sent to everyone, so that they can be restricted from doing anything unusual.


M. D. N. Akash et al.

References

1. Mukherjee S, Das K (2013) A novel equation based classifier for detecting human in images. arXiv preprint arXiv:1307.5591
2. Wong WK et al (2009) An effective surveillance system using thermal camera. In: 2009 International conference on signal acquisition and processing. IEEE
3. Abdelrahman Y et al (2017) Stay cool! Understanding thermal attacks on mobile-based user authentication. In: Proceedings of the 2017 CHI conference on human factors in computing systems
4. Uma MV, Varaprasad G, Raju SV (2020) Local directional maximum edge patterns for facial expression recognition. J Ambient Intell Humanized Comput 1–9
5. Wilson PI, Fernandez J (2006) Facial feature detection using Haar classifiers. J Comput Sci Colleges 21(4):127–133
6. Hasan M et al (2019) A smart semi-automated multifarious surveillance bot for outdoor security using thermal image processing. Adv Netw 7(2):21–28
7. Arya A, Farhadi-Niaki F (2010) An implementation on object move detection using OpenCV. Carleton University, Ottawa, Department of Systems and Computer Engineering
8. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition (CVPR 2001), vol 1. IEEE
9. Benezeth Y, Jodoin PM, Emile B, Laurent H, Rosenberger C (2010) Comparative study of background subtraction algorithms. J Electron Imaging 19
10. Traisuwan A, Tandayya P, Limna T (2015) Workflow translation and dynamic invocation for image processing based on OpenCV, 319–324. https://doi.org/10.1109/JCSSE.2015.7219817
11. Solak S, Emine DB (2013) Real time industrial application of single board computer based color detection system. In: 2013 8th International conference on electrical and electronics engineering (ELECO). IEEE
12. Piccardi M (2004) Background subtraction techniques: a review. In: 2004 IEEE International conference on systems, man and cybernetics, vol 4. IEEE
13. Selvakumar K, Ray BK (2013) Survey on polygonal approximation techniques for digital planar curves. Int J Information Technol, Model Comput (IJITMC) 1
14. Pujara H, Prasad KMVV (2013) Image segmentation using learning vector quantization of artificial neural network. Image 2(7)
15. Ulloa R (2013) Kivy: interactive applications in Python. Packt Publishing Ltd
16. Zheng J, Li B, Xin M et al (2019) Structured fragment-based object tracking using discrimination, uniqueness, and validity selection. Multimedia Syst 25:487–511
17. Developers G (2012) Google for Education: Python regular expressions
18. Pandagre KN (2020) Detection of arrhythmia disease in ECG signal using optimal features. Int J Inf Technol (IJIT) 6(5)
19. Uma MV, Raju SV, Sridhar Reddy K (2019) Local directional weighted threshold patterns (LDWTP) for facial expression recognition. In: 2019 Fifth international conference on image information processing (ICIIP). IEEE
20. Naveenkumar M, Vadivel A (2015) OpenCV for computer vision applications. In: National conference on big data and cloud computing (NCBDC'15)
21. Uma MK, Chaithanya JK (2014) An ARM based door phone embedded system for voice and face identification and verification by OpenCV and Qt GUI framework. Int J Comput Appl 91(13)
22. Ogale NA (2006) A survey of techniques for human detection from video. Survey, University of Maryland 125(133):19


23. Awcock GJ, Ray T (1995) Applied image processing. Macmillan International Higher Education
24. Tienaah T, Stefanakis E, Coleman D (2015) Contextual Douglas-Peucker simplification. Geomatica 69(3):327–338
25. Graham RL, Yao FF (1983) Finding the convex hull of a simple polygon. J Algorithms 4(4):324–331
26. Prasad GV, Raju SV (2018) A survey on local textural patterns for facial feature extraction. Int J Comput Vis Image Process (IJCVIP) 8(2):1–26
27. Chikurtev D (2017) Vision system for recognizing objects using open source computer vision (OpenCV) and robot operating system (ROS). Probl Eng Cybern Robot 68
28. Tzerefos P et al (1997) A comparative study of simple mail transfer protocol (SMTP), post office protocol (POP) and X.400 electronic mail protocols. In: Proceedings of 22nd annual conference on local computer networks. IEEE
29. Isser M et al (2019) High-energy visible light transparency and ultraviolet ray transmission of metallized rescue sheets. Sci Rep 9(1):1–6
30. Kumar N, Sampat RV (2017) Stealth materials and technology for airborne systems. In: Aerospace materials and material technologies. Springer, Singapore, 519–537

Feasibility Study for Local Hyperthermia of Breast Tumors: A 2D Modeling Approach

Jaswantsing Rajput1(B), Anil Nandgaonkar1, Sanjay Nalbalwar1, Abhay Wagh2, and Nagraj Huilgol3

1 Dr. Babasaheb Ambedkar Technological University, Lonere 402104, India

[email protected]

2 Directorate of Technical Education, Mumbai, Maharashtra 400001, India 3 Dr. Balabhai Nanavati Hospital, Mumbai 400056, India

Abstract. This communication presents efficient focusing of radiofrequency for non-invasive local hyperthermia treatment (HT) of breast tumors. The hyperthermia technique is used to raise the tumor temperature to 42–45 °C. In this combinational therapy, controlled heating at the tumor site plays a vital role in the success of HT; otherwise it creates hotspots in the area surrounding the treatment site. We present a mathematical analysis of the temperature distribution during HT, the computational complexity of effective focusing of radio waves on tumor sites, and 2D modeling of a breast model to test the feasibility of HT. The Pennes bio-heat equation is used for thermal analysis of tumor and healthy tissue in the MATLAB environment, whereas the thermal conduction parameters of the 2D breast model are obtained by the finite element method under radiation and convection boundary conditions. Heat flow modeling and electrostatic modeling of the 2D breast model with two tumors are carried out. The 2D modeling results show that HT is safe and reliable if the temperature parameters are maintained within the safe limit; it avoids hotspots on healthy surrounding tissue and greatly reduces toxicity. The obtained results for the magnitude of electric flux density and the heat flow in the tumors also show high efficiency with minimized hotspots.

Keywords: Antenna · Cancer · Hyperthermia treatment (HT) · Electrostatic and heat flow modeling · Radiofrequency (RF) · Tissue · Tumor

1 Introduction

Cancer is a disease in which cells of the body grow in an uncontrolled manner. It spreads very fast to other normal cells, and its uncontrolled spread can result in death. Hyperthermia treatment (HT) is used to elevate the temperature of tumor tissue artificially. Of late, researchers have been prompted to probe this noninvasive technique, which involves heat irradiation, for the destruction of tumor cells [1]. The efficacy of hyperthermia treatment has been demonstrated for various types of cancer [1]. In HT, the temperature at a tumor site is kept in the range of 41–45 °C for a duration of one hour or more. HT is used to damage and kill cancer cells. It helps

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2_22


to reduce the number of radiation treatments needed to cure the tumor. Clinically, HT also improves the effect of other cancer treatments such as chemotherapy or radiotherapy when used in combination on cancer patients [2, 3]. Before this treatment, a CT scan is performed to locate the tumor region precisely. Depending on the tumor's location and the stage of cancer, local, regional or whole-body HT is used to deliver heat to the tumor [4]. During local HT, a tumor area is heated superficially by applicators. Superficial applicators of different shapes and types are positioned on the surface of superficial tumors with a contacting layer called a bolus. To avoid side effects from an excessive temperature rise in the skin near the tumor, water flow through these boluses is used concurrently to keep the normal skin tissue near the tumor at about 37 °C. Electromagnetic coupling between the applicators and the tumor tissue is provided through these boluses attached to the applicator. Thermometers are connected through tubes or needles to observe the temperature of the anesthetized tumor tissue [5]. Breast cancer, one of the most common and highly prevalent forms of cancer, leads to over 500,000 deaths worldwide every year [6]. This motivated many researchers to investigate noninvasive alternatives for breast cancer treatment. From the experimental investigation and validation by Nguyen et al. [7–12], HT with controlled focusing of microwave energy at the tumor position is the best option for breast cancer patients [7]. In the implemented breast cancer model [8], antenna excitation amplitudes and phases are controlled using thermal analysis of the tissue temperature to avoid hotspots in healthy tissue [9]. 3D focusing techniques for microwave energy give better focusing effects than 2D techniques [9]. A fabricated breast phantom model [10] is used for experimental implementation and validation of HT. Coupled simulation of MATLAB and CST Microwave Studio has been carried out using the particle swarm optimization technique on a woman's breast phantom fabricated with realistic material properties [11, 12]. Modern RF and microwave heating devices with phased antenna arrays have advanced settings to control the optimized focusing of microwave power on the tumor and minimize hotspots. Treatment planning is an important aspect of getting the best results from locoregional HT [13]. Figure 1 shows the block arrangement of an HT system.

Fig. 1 Block diagram of hyperthermia system


In this article, the mathematical analysis of hyperthermia, the equations for effective temperature distribution in the tumor, and heat flow modeling are explained in Sect. 2. The MATLAB simulation results and the heat flow modeling of breast tumors using the COMSOL Multiphysics solver are discussed in Sect. 3. Section 4 concludes the article with the optimum temperature distribution at the tumor sites.

2 Mathematical Modeling and Analysis for Hyperthermia

HT requires a deep understanding of the thermal, electromagnetic and biological characteristics of human tissue. This coupled mathematical analysis is used to model HT in a simulation environment.

2.1 Equations for Effective Temperature Distribution

The bio-heat equation for the tumor domain is given by Eq. 1 [14]:

ρc ∂T/∂t = ∇·(k∇T) − ω_b c_b (T − T_b) + Q_m + αP    (1)

where α is a correction parameter that compensates for the velocity difference between the aqueous solution and the intracellular medium, P is the power dissipated by the particles per unit volume, T is the absolute temperature of the body tissue, ρ is the tissue density, c is the specific heat of the body tissue, k is the thermal conductivity of the body tissue, ρ_b is the blood density, c_b is the blood specific heat, ω_b is the blood perfusion rate, T_b is the blood temperature, Q_m is the metabolic heat per unit volume, and αP is the effective power dissipation inside the tumor region.

The effective area percentage (R) of the tumor treatment can be calculated using Eq. 2:

%R = (Exposed tumor area after HT) / (Total tumor area)    (2)

The HT-exposed tumor area has a temperature in the range of 41–45 °C. Under closed boundary conditions, a 15 square inch effective bounded area is divided into finite elements of maximum node size, as described in Sect. 2.3. This numerical approximation gives high modeling accuracy with minimum error in the temperature variation. Equation 3 represents the sum of the squared temperature errors:

Error = (1/C_T) Σ_{n=1}^{N} (T_n − T_th)²    (3)

where T_n is the temperature of tumor node n = 1, 2, 3, …, N, N is the total number of nodes, T_th is the threshold temperature, and C_T is a temperature constant representing the difference between the actual body temperature and the threshold temperature. The modeling is performed with focused RF hyperthermia in two dimensions of the tumor. To focus higher power on the tumor, the phase differences of the antennas for the planar 2D geometry are calculated by Eq. 4:

(ϕ − ξ)·c = 2π·f·d    (4)

where ϕ is the field distribution within the studied tumor object, d is the distance between the antenna elements, and the ξ terms of the antenna array are the ones to be optimized according to Eq. 4. For optimized focusing of radio frequency using four antennas, the randomized values can be obtained by Eq. 5:

ξ = (ξ₁, ξ₂, ξ₃, ξ₄)    (5)
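To make Eq. 1 concrete, the sketch below integrates a one-dimensional Pennes equation in time with an explicit finite-difference scheme. All tissue, perfusion and power-deposition values are illustrative assumptions for a soft-tissue-like medium, not parameters taken from this paper:

```python
import numpy as np

# Illustrative soft-tissue parameters (assumed, not from the paper)
rho, c = 1050.0, 3750.0      # density [kg/m^3], specific heat [J/(kg K)]
k      = 0.55                # thermal conductivity [W/(m K)]
wb, cb = 8.0, 3617.0         # blood perfusion [kg/(m^3 s)], blood specific heat
Tb     = 37.0                # arterial blood temperature [deg C]
Qm     = 420.0               # metabolic heat [W/m^3]
aP     = 5.0e5               # effective absorbed RF power density alpha*P [W/m^3]

n, dx, dt = 101, 1e-3, 0.05  # grid points, spacing [m], time step [s]
T = np.full(n, 37.0)         # start at body temperature

# Explicit update of rho*c*dT/dt = k*d2T/dx2 - wb*cb*(T - Tb) + Qm + aP
for _ in range(2000):        # 100 s of simulated heating
    lap = np.zeros(n)
    lap[1:-1] = (T[2:] - 2 * T[1:-1] + T[:-2]) / dx**2
    power = np.zeros(n)
    power[45:56] = aP        # RF power deposited only in the central "tumor" nodes
    T = T + dt / (rho * c) * (k * lap - wb * cb * (T - Tb) + Qm + power)
    T[0] = T[-1] = 37.0      # boundaries held at body temperature

print(round(float(T[50]), 1))  # peak temperature at the tumor centre
```

The perfusion term acts as a heat sink that pulls tissue back toward blood temperature, which is why the heated region saturates rather than rising without bound.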

The fitness factor is the ratio of the power dissipated inside the tumor, Q_t(ϕ⃗), to the power dissipated inside healthy tissue, Q_h(ϕ⃗), per unit volume, as produced by the antenna array. This ratio should be maximized to achieve higher power dissipation inside the tumor region, as given in Eq. 6:

max f(ϕ⃗) = Q_t(ϕ⃗) / Q_h(ϕ⃗)    (6)

The field distribution within the closed boundary of the four-antenna array is a set of values given in Eq. 7:

ϕ⃗ = (ϕ₁, ϕ₂, ϕ₃, ϕ₄)    (7)

To achieve better exposure of the tumor to radio frequency, the full volume of the tumor must be covered at 42 °C or more. The difference between the total tumor volume (V_tot) and the volume inside the tumor at which the temperature reaches 42 °C should be minimal, as given in Eq. 8:

min f(α⃗) = V_tot − V_tumor^{42 °C}(α⃗)    (8)

This optimization parameter is scaled for the four-antenna array as shown in Eq. 9:

α⃗ = (α₁, α₂, α₃, α₄)    (9)

2.2 Equations for Computational Optimization

This section discusses the optimization of the computational load. The iterative Alternating Direction Implicit (ADI) method is used to solve the partial differential bio-heat equation for different heating and cooling time variations. The rate of change of temperature in the tumor is described as a spatial distribution, represented by Eq. 10 [15]:

∂T/∂t = D (∂²T/∂x² + ∂²T/∂y² + ∂²T/∂z²)    (10)

The Crank-Nicolson (CN) method requires more computational time to solve 3-D sparse matrices; the ADI method is therefore the best option for computationally heavy 3-D sparse matrices. Two intermediate temperatures (T^first, T^second) are used to obtain the new temperature (T^new), applying the CN method in one direction at a time. Equations 11–17 depict the sequential implementation of the ADI method in the x, y and z directions, respectively [15]. The full CN discretization of Eq. 10 is

(T^new − T)/Δt = D [ (T_−x − 2T + T_+x)/(2Δx²) + (T^new_−x − 2T^new + T^new_+x)/(2Δx²)
                   + (T_−y − 2T + T_+y)/(2Δy²) + (T^new_−y − 2T^new + T^new_+y)/(2Δy²)
                   + (T_−z − 2T + T_+z)/(2Δz²) + (T^new_−z − 2T^new + T^new_+z)/(2Δz²) ]    (11)

The difference between the current temperature (T) and the new temperature (T^new) over the temporal resolution (Δt) is proportional to the second differences over the spatial resolutions (Δx, Δy and Δz) at the respective temperature locations. With r₁ = DΔt/Δx², r₂ = DΔt/Δy² and r₃ = DΔt/Δz², the first (x-implicit) step is

T^first − T = (r₁/2)[(T_−x − 2T + T_+x) + (T^first_−x − 2T^first + T^first_+x)]
            + r₂ (T_−y − 2T + T_+y) + r₃ (T_−z − 2T + T_+z)    (12)

−(r₁/2) T^first_−x + (1 + r₁) T^first − (r₁/2) T^first_+x
    = (r₁/2) T_−x + (r₁/2) T_+x + r₂ T_−y + r₂ T_+y + r₃ T_−z + r₃ T_+z + (1 − r₁ − 2r₂ − 2r₃) T    (13)

where T^first_−x, T^first and T^first_+x are the unknown temperature values. The second (y-implicit) step is

T^second − T = (r₁/2)[(T_−x − 2T + T_+x) + (T^first_−x − 2T^first + T^first_+x)]
             + (r₂/2)[(T_−y − 2T + T_+y) + (T^second_−y − 2T^second + T^second_+y)]
             + r₃ (T_−z − 2T + T_+z)    (14)

−(r₂/2) T^second_−y + (1 + r₂) T^second − (r₂/2) T^second_+y
    = (r₁/2)(T_−x + T_+x) + (r₁/2)(T^first_−x + T^first_+x) − r₁ T^first
    + (r₂/2)(T_−y + T_+y) + r₃ (T_−z + T_+z) + (1 − r₁ − r₂ − 2r₃) T    (15)

The third (z-implicit) step is

T^new − T = (r₁/2)[(T_−x − 2T + T_+x) + (T^first_−x − 2T^first + T^first_+x)]
          + (r₂/2)[(T_−y − 2T + T_+y) + (T^second_−y − 2T^second + T^second_+y)]
          + (r₃/2)[(T_−z − 2T + T_+z) + (T^new_−z − 2T^new + T^new_+z)]    (16)

−(r₃/2) T^new_−z + (1 + r₃) T^new − (r₃/2) T^new_+z
    = (r₁/2)(T_−x + T_+x) + (r₁/2)(T^first_−x + T^first_+x) − r₁ T^first
    + (r₂/2)(T_−y + T_+y) + (r₂/2)(T^second_−y + T^second_+y) − r₂ T^second
    + (r₃/2)(T_−z + T_+z) + (1 − r₁ − r₂ − r₃) T    (17)

The algorithm is scripted in MATLAB based on Eqs. 11–17. The output of the algorithm is in good agreement with the analytical solution [16]:

∂T_Rh/∂t = (2αI₀ / (ρc[1 + 4kt/β])) · e^(−r²/(4kt+β))    (18)

where the rate of heating (Rh) is a temperature-dependent parameter: the product of the absorption coefficient α and the intensity I₀ is directly proportional to the rate of heating, β is the Gaussian variance, the product of the measured density ρ and the specific heat c is inversely proportional to the rate of change of heat, and k is the thermal diffusivity along the cylindrical radial coordinate r. Table 1 shows the computational latency of the CN and ADI methods [17].

Table 1 Computational latency of the CN and ADI methods

Method      | Computational time (s) | Self-computational time (s)
CN method   | 2.952                  | 0.64
ADI method  | 0.001                  | 0.001
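The latency gap in Table 1 reflects that ADI replaces one large sparse solve by a sequence of cheap tridiagonal solves, one spatial direction at a time. Below is a minimal one-dimensional sketch of that building block, an implicit diffusion step solved with the Thomas algorithm; the grid size, the value of r and the initial condition are illustrative:

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system: sub-diagonal a, diagonal b, super-diagonal c, RHS d."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def implicit_step(T, r):
    """One implicit diffusion step with fixed (Dirichlet) boundaries.
    In 3-D ADI, a solve like this is applied along x, then y, then z."""
    n = len(T)
    a = np.full(n, -r); b = np.full(n, 1 + 2 * r); c = np.full(n, -r)
    a[0] = c[0] = a[-1] = c[-1] = 0.0   # boundary rows reduce to T_new = T_old
    b[0] = b[-1] = 1.0
    return thomas(a, b, c, T.copy())

T = np.zeros(51)
T[25] = 100.0                      # a hot spike in the middle of the domain
for _ in range(100):
    T = implicit_step(T, r=0.5)    # implicit solves stay stable for any r
print(round(float(T[25]), 3))      # the peak has diffused outward
```

Each step costs O(n), which is why the direction-by-direction sweeps of ADI are so much cheaper than inverting one coupled 3-D system.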

2.3 Heat Flow Modeling for the Breast Tumor

The heat flow rate in tissue under thermal conduction, and the radiation and convection boundary conditions, are given by Eqs. 19–21, respectively:

f = k∇T    (19)

where f denotes the heat flux vector, k represents the thermal conductivity coefficient, and ∇T is the temperature gradient vector.

k·∂T/∂n + h(T − T₀) = 0    (20)

k·∂T/∂n + βk_sb(T⁴ − T₀⁴) = 0    (21)

where ∂T/∂n is the heat emission density, k_sb is the Stefan-Boltzmann constant, β is an emissivity coefficient, and T₀ is the ambient radiation temperature. The parameters β and T₀ may differ from one part of the boundary to another. The tumor tissue surface temperature is 37–42 °C under the applied hyperthermia boundary condition. The tumor tissue is modeled with a specific heat capacity of 3.75 kJ/(kg·°C), a thermal conductivity of 0.55 W/(m·K) and a tissue density of 1050 kg/m³. A 2D breast model with realistic accuracy has been created, two 10 mm tumors are inserted in the breast model, and the finite element method (FEM) is used for electrostatic and heat flow modeling of the 2D breast model with tumors. In this 2D modeling, convection and radiation boundary conditions are used to measure the heat flow and the temperature in the tumors of the created 2D breast model [18]. Heat conduction in these tumors during hyperthermia is observed. The primary purpose of this modeling is to obtain the heat flow inside the tumor tissue using the radiation condition of Eq. 21 with the integrated physics of the tumor in realistic dimensions. The simulation and modeling are carried out with the COMSOL Multiphysics solver. Table 2 shows the electrical properties considered for electrostatic modeling at a frequency of 2.75 GHz (accessed December 3, 2020).

Table 2 Electrical properties of human tissue and tumor [18]

Material             | Relative permittivity | Conductivity (S/m)
Fat                  | 10.71999              | 0.30838
Muscle               | 52.36191              | 1.95145
Skin (dry)           | 37.69119              | 1.61018
Coolant: water bolus | 83.03104              | 1.70443
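The convective condition of Eq. 20 can be exercised on a small steady-state example: a 1-D tissue slab held at 42 °C on the inner face and cooled through a Robin (convective) boundary by a water bolus on the outer face. The convective coefficient h and the coolant temperature are illustrative assumptions; only the conductivity is taken from the text:

```python
import numpy as np

k  = 0.55    # thermal conductivity [W/(m K)], as in the text
h  = 50.0    # convective coefficient of the water bolus [W/(m^2 K)] (assumed)
T0 = 20.0    # coolant temperature [deg C] (assumed)
Ti = 42.0    # fixed temperature on the inner face [deg C]
L, n = 0.02, 21
dx = L / (n - 1)

# Steady 1-D conduction: T'' = 0 inside, T(0) = Ti, and at x = L the Robin
# condition of Eq. 20: k*dT/dn + h*(T - T0) = 0
A = np.zeros((n, n)); b = np.zeros(n)
A[0, 0] = 1.0; b[0] = Ti                      # Dirichlet row at the inner face
for i in range(1, n - 1):                     # interior second-difference rows
    A[i, i - 1], A[i, i], A[i, i + 1] = 1.0, -2.0, 1.0
A[-1, -2] = -k / dx                           # one-sided dT/dn at the outer face
A[-1, -1] = k / dx + h
b[-1] = h * T0

T = np.linalg.solve(A, b)
print(round(float(T[-1]), 2))   # surface temperature under the bolus -> 27.81
```

Because the exact steady profile is linear, the discrete answer matches the analytic value T(L) = Ti − h(Ti − T0)L/(k + hL), illustrating how a bolus keeps the skin well below tumor temperature.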

3 Results and Discussion

The temperature distribution along the radius of the tumor is shown in Fig. 2; the numerical (ADI) and analytical methods are in good agreement. The tumor is heated for 5 seconds and cooled for 100 seconds, as shown in Figs. 3 and 4.


Fig. 2 Distribution of temperature along the radius for a heating process of 10 seconds

Fig. 3 Distribution of temperature at the origin over a duration of 5 seconds

The optimum and deep temperature distribution around the tumor is shown by the temperature gradient distribution plot in Fig. 5. The 2D plot in Fig. 6 shows the distribution of the highest heat flux density (W/m²) inside the breast tumors. It clearly shows that a temperature rise to 42–44 °C has been achieved in the tumor regions, while the other surrounding tissues remain in the safe temperature range (below 40 °C). Figure 7 shows that the temperature at the tumor sites varies in the range of 42–44 °C, and that the heat flux density varies from 20.73 W/m² to 31.89 W/m² in the tumors. Since the operating frequency is high, deep penetration into the tumors is limited.

Fig. 4 Distribution of temperature along the radius for a cooling process of 100 seconds

Fig. 5 Magnitude of the temperature gradient (K/m)

Fig. 6 Distribution of electric flux density inside the tumors

4 Conclusion

This article concludes that the computational complexity has been reduced by the iterative alternating direction implicit method, and that the numerical, analytical and 2D-modeling results for the breast tumors are in good agreement. The 2D modeling results show efficient focusing of radio frequency waves on the tumor sites of the breast model, and the optimum distribution of heat flux density and temperature inside the tumor has also been demonstrated successfully. Hyperthermia treatment can be optimized with respect to the heat application method and the tumor site to avoid hotspots in the healthy tissue around the tumor. We have simulated and modeled a 2D breast model to assess the accuracy of HT and the stability of the various treatment parameters. 3D modeling can also be carried out for different fatty and dense-fatty breast phantom models, and these modeling results leave scope for experimental validation in a realistic environment.


Fig. 7 Temperature distribution in the two tumors

Acknowledgment. The authors would like to thank the management and Principal, Ramrao Adik Institute of Technology, Nerul, Navi Mumbai, and Dr. Babasaheb Ambedkar Technological University for the research facility and support provided for this research.

References

1. van der Zee J (2002) Heating the patient: a promising approach? Ann Oncol 13:1173–1184
2. Wust P, Hildebrandt B, Sreenivasa G et al (2002) Hyperthermia in combined treatment of cancer. Lancet Oncol 3(8):487–497
3. Dewhirst MW, Viglianti BL, Lora-Michiels M, Hanson M, Hoopes PJ (2003) Basic principles of thermal dosimetry and thermal thresholds for tissue damage from hyperthermia. Int J Hyperth 19:267–294
4. Kok H et al (2015) Current state of the art of regional hyperthermia treatment planning: a review. Radiat Oncol 10
5. Falk MH, Issels RD (2001) Hyperthermia in oncology. Int J Hyperth 17:267–294
6. Cancer facts and figures 2018. American Cancer Society, Atlanta, GA, USA, 1–10, 2018
7. Nguyen PT, Abbosh A, Crozier S (2017) 3-D focused microwave hyperthermia for breast cancer treatment with experimental validation. IEEE Trans Antennas Propag 65:3489–3499
8. Nguyen PT, Crozier S, Abbosh A (2014) Realistic simulation environment to test microwave hyperthermia treatment of breast cancer. In: IEEE Antennas and propagation society international symposium (APSURSI), 1188–1189
9. Nguyen PT, Abbosh A, Crozier S (2015) Microwave hyperthermia for breast cancer treatment using electromagnetic and thermal focusing tested on realistic breast models and antenna arrays. IEEE Trans Antennas Propag 63(10):4426–4434
10. Nguyen PT, Abbosh AM (2015) Focusing techniques in breast cancer treatment using non-invasive microwave hyperthermia. In: 2015 International symposium on antennas and propagation (ISAP), Hobart, TAS, 1–3
11. Nguyen PT, Crozier S, Abbosh A (2015) Thermo-dielectric breast phantom for experimental studies of microwave hyperthermia. IEEE Antennas Wirel Propag Lett, 476–479
12. Nguyen PT, Crozier S, Abbosh A (2017) Three-dimensional microwave hyperthermia for breast cancer in a realistic environment using particle swarm optimization. IEEE Trans Biomed Eng 64(6):1335–1344
13. Huilgol NG, Gupta S, Sridhar CR (2010) Hyperthermia with radiation in the treatment of locally advanced head and neck cancer: a report of randomized trial. J Cancer Res Ther 6(4):492–496. https://doi.org/10.4103/0973-1482.77101
14. Pennes HH (1948) Analysis of tissue and arterial blood temperatures in the resting human forearm. J Appl Physiol 85:5–34
15. Patankar SV (1980) Numerical heat transfer and fluid flow. CRC Press, 137–139
16. Parker KJ (1985) Effects of heat conduction and sample size on ultrasonic absorption measurements. J Acoust Soc Am 77:719–725. https://doi.org/10.1121/1.392340
17. Rajput JL, Nandgaonkar AB, Nalbalwar SL, Wagh AE (2019) Design study and feasibility of hyperthermia technique. In: Computing in engineering and technology, vol 1025, 721–732. https://doi.org/10.1007/978-981-32-9515-5_68
18. Dielectric properties of body tissues. https://itis.swiss/virtualpopulation/tissue-properties/database/dielectric-properties

Author Index

A
Akash, M. D. N., 257
Aluvalu, Rajanikanth, 257

B
Bhirud, S. G., 151

C
Chakravorthy, G. Bhageerath, 257
da Cruz, Mauro A. A., 1

D
Dalip, 141

F
Furia, Palak, 227

G
Goutham Kumar, M., 55
Gulhane, Vijay, 13
Gupta, Vaishnavi, 227

H
Hewei, Guan, 77
Huilgol, Nagraj, 269

I
Illuri, Babu, 115

J
Janjua, Juhi, 65
Jeyaram, G., 215
Joshi, Raunak M., 245

K
Kadam, Vinod, 163
Kanaparthi, Sowbhagya Hepsiba, 129
Kane, Aminata, 25
Katla, Nithin, 55
Khan, Anam, 189
Khandare, Anand, 89, 103, 227
Konate, Karim, 25
Kumar, CH. Mahesh, 257

M
Madheswaran, M., 201, 215
Mafra, Samuel B., 1
Mali, Amrat, 179
Malve, Pravin, 13
Meshram, B. B., 45

N
Nalbalwar, Sanjay, 269
Nandgaonkar, Anil, 269
Nivelkar, Mukta, 151

P
Patankar, Archana, 65
Patil, Megharani, 189
Pawar, Rutika, 89
Pidugu, Rohithraj, 55

R
Rajput, Jaswantsing, 269
Rodrigues, Joel J. P. C., 1, 25

S
Sadiq, Ali Safaa, 77
Sathis Kumar, A. E., 115
Sedamkar, R. R., 179
Shah, Deven, 245
Shaikh, Akram Harun, 45
Sharma, Pooja Ravindrakumar, 103
Sharma, Sonia, 141
Shrikant, Khyaati, 227
da Silveira, Werner Augusto A. N., 1
Swapna, M., 129

T
Tahir, Mohammed Adam, 77
Teixeira, Eduardo H., 1

V
Vhatkar, Sangeeta, 163
Vidhya, V., 201

W
Wagh, Abhay, 269

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 V. E. Balas et al. (eds.), Intelligent Computing and Networking, Lecture Notes in Networks and Systems 301, https://doi.org/10.1007/978-981-16-4863-2